Version 0.17.1
February 18, 2016
Changelog
Bug fixes
- Upgraded vendored joblib to version 0.9.4, which fixes an important bug in joblib.Parallel that can silently yield wrong results when working on datasets larger than 1MB: https://github.com/joblib/joblib/blob/0.9.4/CHANGES.rst
- Fixed reading of Bunch pickles generated with scikit-learn version <= 0.16. This can affect users who have already downloaded a dataset with scikit-learn 0.16 and are loading it with scikit-learn 0.17. See #6196 for how this affected datasets.fetch_20newsgroups. By Loic Esteve.
- Fixed a bug that prevented using ROC AUC score to perform grid search on several CPU / cores on large arrays. See #6147. By Olivier Grisel.
- Fixed a bug that prevented the presort parameter from being set properly in ensemble.GradientBoostingRegressor. See #5857. By Andrew McCulloh.
- Fixed a joblib error when evaluating the perplexity of a decomposition.LatentDirichletAllocation model. See #6258. By Chyi-Kwei Yau.
Version 0.17
November 5, 2015
Changelog
New features
- All the Scaler classes except preprocessing.RobustScaler can be fitted online by calling partial_fit. By Giorgio Patrini.
- The new class ensemble.VotingClassifier implements a “majority rule” / “soft voting” ensemble classifier to combine estimators for classification. By Sebastian Raschka.
- The new class preprocessing.RobustScaler provides an alternative to preprocessing.StandardScaler for feature-wise centering and range normalization that is robust to outliers. By Thomas Unterthiner.
- The new class preprocessing.MaxAbsScaler provides an alternative to preprocessing.MinMaxScaler for feature-wise range normalization when the data is already centered or sparse. By Thomas Unterthiner.
- The new class preprocessing.FunctionTransformer turns a Python function into a Pipeline-compatible transformer object. By Joe Jevnik.
- The new classes cross_validation.LabelKFold and cross_validation.LabelShuffleSplit generate train-test folds, respectively similar to cross_validation.KFold and cross_validation.ShuffleSplit, except that the folds are conditioned on a label array. By Brian McFee, Jean Kossaifi and Gilles Louppe.
- decomposition.LatentDirichletAllocation implements the Latent Dirichlet Allocation topic model with online variational inference. By Chyi-Kwei Yau, with code based on an implementation by Matt Hoffman. (#3659)
- The new solver sag implements a Stochastic Average Gradient descent and is available in both linear_model.LogisticRegression and linear_model.Ridge. This solver is very efficient for large datasets. By Danny Sullivan and Tom Dupre la Tour. (#4738)
- The new solver cd implements a Coordinate Descent in decomposition.NMF. The previous solver based on Projected Gradient is still available by setting the new parameter solver to pg, but it is deprecated and will be removed in 0.19, along with decomposition.ProjectedGradientNMF and the parameters sparseness, eta, beta and nls_max_iter. The new parameters alpha and l1_ratio control L1 and L2 regularization, and shuffle adds a shuffling step in the cd solver. By Tom Dupre la Tour and Mathieu Blondel.
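As a rough illustration of the scaler entries above, the following sketch fits a StandardScaler in two batches with partial_fit and compares it with a one-shot fit, then shows how RobustScaler and MaxAbsScaler treat an outlier. The toy array and variable names are made up for this example, and attribute names such as center_ are assumed to match the installed scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, RobustScaler, StandardScaler

# Toy data (hypothetical): one feature with a large outlier in the last row.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Online fitting: feed the data in two batches via partial_fit.
online = StandardScaler()
online.partial_fit(X[:2])
online.partial_fit(X[2:])

# Batch fitting on the full array, for comparison.
batch = StandardScaler().fit(X)
means_match = np.allclose(online.mean_, batch.mean_)
scales_match = np.allclose(online.scale_, batch.scale_)

# RobustScaler centers on the median (3.0 here), so the outlier does not
# drag the center the way it drags the mean (22.0).
robust = RobustScaler().fit(X)
robust_center = robust.center_[0]

# MaxAbsScaler divides each feature by its largest absolute value.
max_abs_scaled = MaxAbsScaler().fit_transform(X)
```

The incremental and batch statistics should agree up to floating-point error, which is what makes partial_fit usable on data that does not fit in memory.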
Enhancements
- manifold.TSNE now supports approximate optimization via the Barnes-Hut method, leading to much faster fitting. By Christopher Erick Moody. (#4025)
- cluster.mean_shift_.MeanShift now supports parallel execution, as implemented in the mean_shift function. By Martino Sorbaro.
- naive_bayes.GaussianNB now supports fitting with sample_weight. By Jan Hendrik Metzen.
- dummy.DummyClassifier now supports a prior fitting strategy. By Arnaud Joly.
- Added a fit_predict method for mixture.GMM and subclasses. By Cory Lorenz.
- Added the metrics.label_ranking_loss metric. By Arnaud Joly.
- Added the metrics.cohen_kappa_score metric.
- Added a warm_start constructor parameter to the bagging ensemble models to increase the size of the ensemble. By Tim Head.
- Added option to use multi-output regression metrics without averaging. By Konstantin Shmelkov and Michael Eickenberg.
- Added stratify option to cross_validation.train_test_split for stratified splitting. By Miroslav Batchkarov.
- The tree.export_graphviz function now supports aesthetic improvements for tree.DecisionTreeClassifier and tree.DecisionTreeRegressor, including options for coloring nodes by their majority class or impurity, showing variable names, and using node proportions instead of raw sample counts. By Trevor Stephens.
- Improved speed of the newton-cg solver in linear_model.LogisticRegression, by avoiding loss computation. By Mathieu Blondel and Tom Dupre la Tour.
- The class_weight="auto" heuristic in classifiers supporting class_weight was deprecated and replaced by the class_weight="balanced" option, which has a simpler formula and interpretation. By Hanna Wallach and Andreas Müller.
- Add class_weight parameter to automatically weight samples by class frequency for linear_model.PassiveAggressiveClassifier. By Trevor Stephens.
- Added backlinks from the API reference pages to the user guide. By Andreas Müller.
- The labels parameter to sklearn.metrics.f1_score, sklearn.metrics.fbeta_score, sklearn.metrics.recall_score and sklearn.metrics.precision_score has been extended. It is now possible to ignore one or more labels, such as where a multiclass problem has a majority class to ignore. By Joel Nothman.
- Add sample_weight support to linear_model.RidgeClassifier. By Trevor Stephens.
- Provide an option for sparse output from sklearn.metrics.pairwise.cosine_similarity. By Jaidev Deshpande.
- Add minmax_scale to provide a function interface for MinMaxScaler. By Thomas Unterthiner.
- dump_svmlight_file now handles multi-label datasets. By Chih-Wei Chang.
- RCV1 dataset loader (sklearn.datasets.fetch_rcv1). By Tom Dupre la Tour.
- The “Wisconsin Breast Cancer” classical two-class classification dataset is now included in scikit-learn, available with sklearn.datasets.load_breast_cancer.
- Upgraded to joblib 0.9.3 to benefit from the new automatic batching of short tasks. This makes it possible for scikit-learn to benefit from parallelism when many very short tasks are executed in parallel, for instance by the grid_search.GridSearchCV meta-estimator with n_jobs > 1 used with a large grid of parameters on a small dataset. By Vlad Niculae, Olivier Grisel and Loic Esteve.
- For more details about changes in joblib 0.9.3 see the release notes: https://github.com/joblib/joblib/blob/master/CHANGES.rst#release-093
- Improved speed (3 times faster per iteration) of decomposition.DictionaryLearning with the coordinate descent method from linear_model.Lasso. By Arthur Mensch.
- Parallel processing (threaded) for queries of nearest neighbors (using the ball-tree). By Nikolay Mayorov.
- Allow datasets.make_multilabel_classification to output a sparse y. By Kashif Rasul.
- cluster.DBSCAN now accepts a sparse matrix of precomputed distances, allowing memory-efficient distance precomputation. By Joel Nothman.
- tree.DecisionTreeClassifier now exposes an apply method for retrieving the leaf indices samples are predicted as. By Daniel Galvez and Gilles Louppe.
- Speed up decision tree regressors, random forest regressors, extra trees regressors and gradient boosting estimators by computing a proxy of the impurity improvement during the tree growth. The proxy quantity is such that the split that maximizes this value also maximizes the impurity improvement. By Arnaud Joly, Jacob Schreiber and Gilles Louppe.
- Speed up tree based methods by reducing the number of computations needed when computing the impurity measure taking into account linear relationship of the computed statistics. The effect is particularly visible with extra trees and on datasets with categorical or sparse features. By Arnaud Joly.
- ensemble.GradientBoostingRegressor and ensemble.GradientBoostingClassifier now expose an apply method for retrieving the leaf indices each sample ends up in under each tree. By Jacob Schreiber.
- Add sample_weight support to linear_model.LinearRegression. By Sonny Hu. (#4881)
- Add n_iter_without_progress to manifold.TSNE to control the stopping criterion. By Santi Villalba. (#5186)
- Added optional parameter random_state in linear_model.Ridge, to set the seed of the pseudo random generator used in the sag solver. By Tom Dupre la Tour.
- Added optional parameter warm_start in linear_model.LogisticRegression. If set to True, the solvers lbfgs, newton-cg and sag will be initialized with the coefficients computed in the previous fit. By Tom Dupre la Tour.
- Added sample_weight support to linear_model.LogisticRegression for the lbfgs, newton-cg, and sag solvers. By Valentin Stolbunov. Support added to the liblinear solver. By Manoj Kumar.
- Added optional parameter presort to ensemble.GradientBoostingRegressor and ensemble.GradientBoostingClassifier, keeping default behavior the same. This allows gradient boosters to turn off presorting when building deep trees or using sparse data. By Jacob Schreiber.
- Altered metrics.roc_curve to drop unnecessary thresholds by default. By Graham Clenaghan.
- Added feature_selection.SelectFromModel meta-transformer, which can be used along with estimators that have a coef_ or feature_importances_ attribute to select important features of the input data. By Maheshakya Wijewardena, Joel Nothman and Manoj Kumar.
- Added metrics.pairwise.laplacian_kernel. By Clyde Fare.
- covariance.GraphLasso allows separate control of the convergence criterion for the Elastic-Net subproblem via the enet_tol parameter.
- Improved verbosity in decomposition.DictionaryLearning.
- ensemble.RandomForestClassifier and ensemble.RandomForestRegressor no longer explicitly store the samples used in bagging, resulting in a much reduced memory footprint for storing random forest models.
- Added positive option to linear_model.Lars and linear_model.lars_path to force coefficients to be positive. (#5131)
- Added the X_norm_squared parameter to metrics.pairwise.euclidean_distances to provide precomputed squared norms for X.
- Added the fit_predict method to pipeline.Pipeline.
- Added the preprocessing.minmax_scale function.
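To make the class_weight="balanced" entry above more concrete, here is a minimal sketch of the heuristic as it is commonly documented, n_samples / (n_classes * np.bincount(y)), compared with scikit-learn's compute_class_weight helper. The label array is invented for the example, and the keyword-style call to the helper is an assumption about the installed version.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: six samples of class 0, two of class 1.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y)

# "balanced" heuristic: n_samples / (n_classes * count of each class).
manual = len(y) / (len(classes) * np.bincount(y).astype(float))

# The same weights as computed by scikit-learn's helper.
from_sklearn = compute_class_weight("balanced", classes=classes, y=y)
```

The rarer class receives the larger weight, so each class contributes roughly equally to the loss regardless of its frequency.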
Bug fixes
- Fixed non-determinism in dummy.DummyClassifier with sparse multi-label output. By Andreas Müller.
- Fixed the output shape of linear_model.RANSACRegressor to (n_samples,). By Andreas Müller.
- Fixed bug in decomposition.DictionaryLearning when n_jobs < 0. By Andreas Müller.
- Fixed bug where grid_search.RandomizedSearchCV could consume a lot of memory for large discrete grids. By Joel Nothman.
- Fixed bug in linear_model.LogisticRegressionCV where penalty was ignored in the final fit. By Manoj Kumar.
- Fixed bug in ensemble.forest.ForestClassifier while computing oob_score when X is a sparse.csc_matrix. By Ankur Ankan.
- All regressors now consistently handle and warn when given y that is of shape (n_samples, 1). By Andreas Müller and Henry Lin. (#5431)
- Fix in cluster.KMeans cluster reassignment for sparse input. By Lars Buitinck.
- Fixed a bug in lda.LDA that could cause asymmetric covariance matrices when using shrinkage. By Martin Billinger.
- Fixed cross_validation.cross_val_predict for estimators with sparse predictions. By Buddha Prakash.
- Fixed the predict_proba method of linear_model.LogisticRegression to use soft-max instead of one-vs-rest normalization. By Manoj Kumar. (#5182)
- Fixed the partial_fit method of linear_model.SGDClassifier when called with average=True. By Andrew Lamb. (#5282)
- Dataset fetchers use different filenames under Python 2 and Python 3 to avoid pickling compatibility issues. By Olivier Grisel. (#5355)
- Fixed a bug in naive_bayes.GaussianNB which caused classification results to depend on scale. By Jake Vanderplas.
- Temporarily fixed linear_model.Ridge, which was incorrect when fitting the intercept in the case of sparse data. The fix automatically changes the solver to ‘sag’ in this case. #5360 by Tom Dupre la Tour.
- Fixed a performance bug in decomposition.RandomizedPCA on data with a large number of features and fewer samples. (#4478) By Andreas Müller, Loic Esteve and Giorgio Patrini.
- Fixed bug in cross_decomposition.PLS that yielded unstable and platform dependent output, and failed on fit_transform. By Arthur Mensch.
- Fixes to the Bunch class used to store datasets.
- Fixed ensemble.plot_partial_dependence ignoring the percentiles parameter.
- Providing a set as vocabulary in CountVectorizer no longer leads to inconsistent results when pickling.
- Fixed the conditions on when a precomputed Gram matrix needs to be recomputed in linear_model.LinearRegression, linear_model.OrthogonalMatchingPursuit, linear_model.Lasso and linear_model.ElasticNet.
- Fixed inconsistent memory layout in the coordinate descent solver that affected decomposition.DictionaryLearning and covariance.GraphLasso. (#5337) By Olivier Grisel.
- manifold.LocallyLinearEmbedding no longer ignores the reg parameter.
- Nearest Neighbor estimators with custom distance metrics can now be pickled. (#4362)
- Fixed a bug in pipeline.FeatureUnion where transformer_weights were not properly handled when performing grid-searches.
- Fixed a bug in linear_model.LogisticRegression and linear_model.LogisticRegressionCV when using class_weight='balanced' or class_weight='auto'. By Tom Dupre la Tour.
- Fixed bug #5495 when doing OVR(SVC(decision_function_shape="ovr")). Fixed by Elvis Dohmatob.
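The predict_proba fix listed above (soft-max instead of one-vs-rest normalization) can be illustrated with plain NumPy. This is only a sketch of the two normalizations, not the library's implementation, and the score values are invented for the example.

```python
import numpy as np

# Hypothetical decision scores for one sample and three classes.
scores = np.array([[2.0, 1.0, -1.0]])

# Soft-max: exponentiate (shifted for numerical stability), then normalize.
shifted = scores - scores.max(axis=1, keepdims=True)
softmax = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# One-vs-rest: per-class logistic sigmoid, then normalize rows to sum to 1.
sigmoid = 1.0 / (1.0 + np.exp(-scores))
ovr = sigmoid / sigmoid.sum(axis=1, keepdims=True)
```

Both rows sum to one and rank the classes identically, but the probability values differ, which matters for anything that consumes calibrated probabilities such as log-loss.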
API changes summary
- Attributes data_min, data_max and data_range in preprocessing.MinMaxScaler are deprecated and won’t be available from 0.19. Instead, the class now exposes data_min_, data_max_ and data_range_. By Giorgio Patrini.
- All Scaler classes now have a scale_ attribute, the feature-wise rescaling applied by their transform methods. The old attribute std_ in preprocessing.StandardScaler is deprecated and superseded by scale_; it won’t be available in 0.19. By Giorgio Patrini.
- svm.SVC and svm.NuSVC now have a decision_function_shape parameter to make their decision function of shape (n_samples, n_classes) by setting decision_function_shape='ovr'. This will be the default behavior starting in 0.19. By Andreas Müller.
- Passing 1D data arrays as input to estimators is now deprecated as it caused confusion in how the array elements should be interpreted as features or as samples. All data arrays are now expected to be explicitly shaped (n_samples, n_features). By Vighnesh Birodkar.
- lda.LDA and qda.QDA have been moved to discriminant_analysis.LinearDiscriminantAnalysis and discriminant_analysis.QuadraticDiscriminantAnalysis.
- The store_covariance and tol parameters have been moved from the fit method to the constructor in discriminant_analysis.LinearDiscriminantAnalysis and the store_covariances and tol parameters have been moved from the fit method to the constructor in discriminant_analysis.QuadraticDiscriminantAnalysis.
- Models inheriting from _LearntSelectorMixin will no longer support the transform methods (i.e., RandomForests, GradientBoosting, LogisticRegression, DecisionTrees, SVMs and SGD related models). Instead, wrap these models in the meta-transformer feature_selection.SelectFromModel to remove features (according to coef_ or feature_importances_) which are below a certain threshold value.
- cluster.KMeans re-runs cluster-assignments in case of non-convergence, to ensure consistency of predict(X) and labels_. By Vighnesh Birodkar.
- Classifier and Regressor models are now tagged as such using the _estimator_type attribute.
- Cross-validation iterators always provide indices into training and test set, not boolean masks.
- The decision_function on all regressors was deprecated and will be removed in 0.19. Use predict instead.
- datasets.load_lfw_pairs is deprecated and will be removed in 0.19. Use datasets.fetch_lfw_pairs instead.
- The deprecated hmm module was removed.
- The deprecated Bootstrap cross-validation iterator was removed.
- The deprecated Ward and WardAgglomerative classes have been removed. Use cluster.AgglomerativeClustering instead.
- cross_validation.check_cv is now a public function.
- The property residues_ of linear_model.LinearRegression is deprecated and will be removed in 0.19.
- The deprecated n_jobs parameter of linear_model.LinearRegression has been moved to the constructor.
- Removed deprecated class_weight parameter from linear_model.SGDClassifier’s fit method. Use the constructor parameter instead.
- The deprecated support for the sequence of sequences (or list of lists) multilabel format was removed. To convert to and from the supported binary indicator matrix format, use MultiLabelBinarizer.
- The behavior of calling the inverse_transform method of pipeline.Pipeline will change in 0.19. It will no longer reshape one-dimensional input to two-dimensional input.
- The deprecated attributes indicator_matrix_, multilabel_ and classes_ of preprocessing.LabelBinarizer were removed.
- Using gamma=0 in svm.SVC and svm.SVR to automatically set the gamma to 1. / n_features is deprecated and will be removed in 0.19. Use gamma="auto" instead.
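Two of the API changes above, the explicit (n_samples, n_features) shape and the decision_function_shape parameter of svm.SVC, can be sketched together as follows. The data is a made-up four-class toy problem, and a linear kernel is chosen here only to keep gamma out of the picture.

```python
import numpy as np
from sklearn.svm import SVC

# A single feature measured on eight samples (hypothetical values).
x = np.array([0.0, 0.2, 1.0, 1.2, 2.0, 2.2, 3.0, 3.2])
y = np.array([0, 0, 1, 1, 2, 2, 3, 3])

# Reshape the 1D array explicitly to (n_samples, n_features) = (8, 1).
X = x.reshape(-1, 1)

# 'ovr' yields one decision column per class: (n_samples, n_classes).
clf_ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
ovr_shape = clf_ovr.decision_function(X).shape

# 'ovo' yields one column per pair of classes:
# (n_samples, n_classes * (n_classes - 1) / 2).
clf_ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovo_shape = clf_ovo.decision_function(X).shape
```

With four classes the two shapes differ (4 columns versus 6), which is why code that indexes decision_function output by class should request 'ovr' explicitly. A single sample with several features would instead be reshaped with reshape(1, -1).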
Code Contributors
Aaron Schumacher, Adithya Ganesh, akitty, Alexandre Gramfort, Alexey Grigorev, Ali Baharev, Allen Riddell, Ando Saabas, Andreas Mueller, Andrew Lamb, Anish Shah, Ankur Ankan, Anthony Erlinger, Ari Rouvinen, Arnaud Joly, Arnaud Rachez, Arthur Mensch, banilo, Barmaley.exe, benjaminirving, Boyuan Deng, Brett Naul, Brian McFee, Buddha Prakash, Chi Zhang, Chih-Wei Chang, Christof Angermueller, Christoph Gohlke, Christophe Bourguignat, Christopher Erick Moody, Chyi-Kwei Yau, Cindy Sridharan, CJ Carey, Clyde-fare, Cory Lorenz, Dan Blanchard, Daniel Galvez, Daniel Kronovet, Danny Sullivan, Data1010, David, David D Lowe, David Dotson, djipey, Dmitry Spikhalskiy, Donne Martin, Dougal J. Sutherland, Dougal Sutherland, edson duarte, Eduardo Caro, Eric Larson, Eric Martin, Erich Schubert, Fernando Carrillo, Frank C. Eckert, Frank Zalkow, Gael Varoquaux, Ganiev Ibraim, Gilles Louppe, Giorgio Patrini, giorgiop, Graham Clenaghan, Gryllos Prokopis, gwulfs, Henry Lin, Hsuan-Tien Lin, Immanuel Bayer, Ishank Gulati, Jack Martin, Jacob Schreiber, Jaidev Deshpande, Jake Vanderplas, Jan Hendrik Metzen, Jean Kossaifi, Jeffrey04, Jeremy, jfraj, Jiali Mei, Joe Jevnik, Joel Nothman, John Kirkham, John Wittenauer, Joseph, Joshua Loyal, Jungkook Park, KamalakerDadi, Kashif Rasul, Keith Goodman, Kian Ho, Konstantin Shmelkov, Kyler Brown, Lars Buitinck, Lilian Besson, Loic Esteve, Louis Tiao, maheshakya, Maheshakya Wijewardena, Manoj Kumar, MarkTab marktab.net, Martin Ku, Martin Spacek, MartinBpr, martinosorb, MaryanMorel, Masafumi Oyamada, Mathieu Blondel, Matt Krump, Matti Lyra, Maxim Kolganov, mbillinger, mhg, Michael Heilman, Michael Patterson, Miroslav Batchkarov, Nelle Varoquaux, Nicolas, Nikolay Mayorov, Olivier Grisel, Omer Katz, Óscar Nájera, Pauli Virtanen, Peter Fischer, Peter Prettenhofer, Phil Roth, pianomania, Preston Parry, Raghav RV, Rob Zinkov, Robert Layton, Rohan Ramanath, Saket Choudhary, Sam Zhang, santi, saurabh.bansod, scls19fr, Sebastian Raschka, Sebastian Saeger, Shivan 
Sornarajah, SimonPL, sinhrks, Skipper Seabold, Sonny Hu, sseg, Stephen Hoover, Steven De Gryze, Steven Seguin, Theodore Vasiloudis, Thomas Unterthiner, Tiago Freitas Pereira, Tian Wang, Tim Head, Timothy Hopper, tokoroten, Tom Dupré la Tour, Trevor Stephens, Valentin Stolbunov, Vighnesh Birodkar, Vinayak Mehta, Vincent, Vincent Michel, vstolbunov, wangz10, Wei Xue, Yucheng Low, Yury Zhauniarovich, Zac Stewart, zhai_pro, Zichen Wang