Version 0.16.1
April 14, 2015
Changelog
Bug fixes
- Allow input data larger than block_size in covariance.LedoitWolf by Andreas Müller.
- Fix a bug in isotonic.IsotonicRegression deduplication that caused unstable results in calibration.CalibratedClassifierCV by Jan Hendrik Metzen.
- Fix sorting of labels in preprocessing.label_binarize by Michael Heilman.
- Fix several stability and convergence issues in cross_decomposition.CCA and cross_decomposition.PLSCanonical by Andreas Müller.
- Fix a bug in cluster.KMeans when precompute_distances=False on Fortran-ordered data.
- Fix a speed regression in ensemble.RandomForestClassifier’s predict and predict_proba by Andreas Müller.
- Fix a regression where utils.shuffle converted lists and dataframes to arrays, by Olivier Grisel.
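To illustrate the utils.shuffle fix listed above, here is a minimal sketch that checks a plain Python list comes back as a list rather than as a NumPy array. The variable names are illustrative only, and the behavior shown assumes scikit-learn 0.16.1 or later.

```python
from sklearn.utils import shuffle

# A plain Python list; the regression fixed above converted such input to an array.
labels = ["a", "b", "c", "d", "e"]

# random_state makes the permutation reproducible.
shuffled = shuffle(labels, random_state=0)

# The container type is preserved and no element is lost or duplicated.
print(type(shuffled).__name__)
print(sorted(shuffled) == sorted(labels))
```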
Version 0.16
March 26, 2015
Highlights
- Speed improvements (notably in cluster.DBSCAN), reduced memory requirements, bug-fixes and better default settings.
- Multinomial Logistic regression and a path algorithm in linear_model.LogisticRegressionCV.
- Out-of-core learning of PCA via decomposition.IncrementalPCA.
- Probability calibration of classifiers using calibration.CalibratedClassifierCV.
- cluster.Birch clustering method for large-scale datasets.
- Scalable approximate nearest neighbors search with Locality-sensitive hashing forests in neighbors.LSHForest.
- Improved error messages and better validation when using malformed input data.
- More robust integration with pandas dataframes.
Changelog
New features
- The new neighbors.LSHForest implements locality-sensitive hashing for approximate nearest neighbors search. By Maheshakya Wijewardena.
- Added svm.LinearSVR. This class uses the liblinear implementation of Support Vector Regression which is much faster for large sample sizes than svm.SVR with linear kernel. By Fabian Pedregosa and Qiang Luo.
- Incremental fit for GaussianNB.
- Added sample_weight support to dummy.DummyClassifier and dummy.DummyRegressor. By Arnaud Joly.
- Added the metrics.label_ranking_average_precision_score metric. By Arnaud Joly.
- Added the metrics.coverage_error metric. By Arnaud Joly.
- Added linear_model.LogisticRegressionCV. By Manoj Kumar, Fabian Pedregosa, Gael Varoquaux and Alexandre Gramfort.
- Added warm_start constructor parameter to make it possible for any trained forest model to grow additional trees incrementally. By Laurent Direr.
- Added sample_weight support to ensemble.GradientBoostingClassifier and ensemble.GradientBoostingRegressor. By Peter Prettenhofer.
- Added decomposition.IncrementalPCA, an implementation of the PCA algorithm that supports out-of-core learning with a partial_fit method. By Kyle Kastner.
- Averaged SGD for SGDClassifier and SGDRegressor. By Danny Sullivan.
- Added cross_val_predict function which computes cross-validated estimates. By Luis Pedro Coelho.
- Added linear_model.TheilSenRegressor, a robust generalized-median-based estimator. By Florian Wilhelm.
- Added metrics.median_absolute_error, a robust metric. By Gael Varoquaux and Florian Wilhelm.
- Added cluster.Birch, an online clustering algorithm. By Manoj Kumar, Alexandre Gramfort and Joel Nothman.
- Added shrinkage support to discriminant_analysis.LinearDiscriminantAnalysis using two new solvers. By Clemens Brunner and Martin Billinger.
- Added kernel_ridge.KernelRidge, an implementation of kernelized ridge regression. By Mathieu Blondel and Jan Hendrik Metzen.
- All solvers in linear_model.Ridge now support sample_weight. By Mathieu Blondel.
- Added cross_validation.PredefinedSplit cross-validation for fixed user-provided cross-validation folds. By Thomas Unterthiner.
- Added calibration.CalibratedClassifierCV, an approach for calibrating the predicted probabilities of a classifier. By Alexandre Gramfort, Jan Hendrik Metzen, Mathieu Blondel and Balazs Kegl.
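As a rough sketch of the out-of-core workflow that decomposition.IncrementalPCA enables, the example below feeds data to partial_fit in batches instead of loading everything at once. The synthetic data, the batch size of 50 and the variable names are assumptions made for illustration; in a real out-of-core setting each batch would be read from disk.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Synthetic stand-in for a dataset too large to fit in memory.
rng = np.random.RandomState(0)
X = rng.randn(200, 5)

ipca = IncrementalPCA(n_components=2)

# Update the model one batch at a time; each batch must contain
# at least n_components samples.
batch_size = 50
for start in range(0, X.shape[0], batch_size):
    ipca.partial_fit(X[start:start + batch_size])

# Project the data onto the components learned incrementally.
X_reduced = ipca.transform(X)
print(X_reduced.shape)
```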
Enhancements
- Add option return_distance in hierarchical.ward_tree to return distances between nodes for both structured and unstructured versions of the algorithm. By Matteo Visconti di Oleggio Castello. The same option was added in hierarchical.linkage_tree. By Manoj Kumar.
- Add support for sample weights in scorer objects. Metrics with sample weight support will automatically benefit from it. By Noel Dawe and Vlad Niculae.
- Added newton-cg and lbfgs solver support in linear_model.LogisticRegression. By Manoj Kumar.
- Add selection="random" parameter to implement stochastic coordinate descent for linear_model.Lasso, linear_model.ElasticNet and related. By Manoj Kumar.
- Add sample_weight parameter to metrics.jaccard_similarity_score and metrics.log_loss. By Jatin Shah.
- Support sparse multilabel indicator representation in preprocessing.LabelBinarizer and multiclass.OneVsRestClassifier (by Hamzeh Alsalhi with thanks to Rohit Sivaprasad), as well as evaluation metrics (by Joel Nothman).
- Add sample_weight parameter to metrics.jaccard_similarity_score. By Jatin Shah.
- Add support for multiclass in metrics.hinge_loss. Added labels=None as optional parameter. By Saurabh Jha.
- Add sample_weight parameter to metrics.hinge_loss. By Saurabh Jha.
- Add multi_class="multinomial" option in linear_model.LogisticRegression to implement a Logistic Regression solver that minimizes the cross-entropy or multinomial loss instead of the default One-vs-Rest setting. Supports lbfgs and newton-cg solvers. By Lars Buitinck and Manoj Kumar. Solver option newton-cg by Simon Wu.
- DictVectorizer can now perform fit_transform on an iterable in a single pass, when giving the option sort=False. By Dan Blanchard.
- GridSearchCV and RandomizedSearchCV can now be configured to work with estimators that may fail and raise errors on individual folds. This option is controlled by the error_score parameter. This does not affect errors raised on re-fit. By Michal Romaniuk.
- Add digits parameter to metrics.classification_report to allow report to show different precision of floating point numbers. By Ian Gilmore.
- Add a quantile prediction strategy to the dummy.DummyRegressor. By Aaron Staple.
- Add handle_unknown option to preprocessing.OneHotEncoder to handle unknown categorical features more gracefully during transform. By Manoj Kumar.
- Added support for sparse input data to decision trees and their ensembles. By Fares Hedyati and Arnaud Joly.
- Optimized cluster.AffinityPropagation by reducing the number of memory allocations of large temporary data-structures. By Antony Lee.
- Parallelization of the computation of feature importances in random forest. By Olivier Grisel and Arnaud Joly.
- Add n_iter_ attribute to estimators that accept a max_iter attribute in their constructor. By Manoj Kumar.
- Added decision function for multiclass.OneVsOneClassifier. By Raghav RV and Kyle Beauchamp.
- neighbors.kneighbors_graph and radius_neighbors_graph support non-Euclidean metrics. By Manoj Kumar.
- Parameter connectivity in cluster.AgglomerativeClustering and family now accepts callables that return a connectivity matrix. By Manoj Kumar.
- Sparse support for paired_distances. By Joel Nothman.
- cluster.DBSCAN now supports sparse input and sample weights and has been optimized: the inner loop has been rewritten in Cython and radius neighbors queries are now computed in batch. By Joel Nothman and Lars Buitinck.
- Add class_weight parameter to automatically weight samples by class frequency for ensemble.RandomForestClassifier, tree.DecisionTreeClassifier, ensemble.ExtraTreesClassifier and tree.ExtraTreeClassifier. By Trevor Stephens.
- grid_search.RandomizedSearchCV now does sampling without replacement if all parameters are given as lists. By Andreas Müller.
- Parallelized calculation of pairwise_distances is now supported for scipy metrics and custom callables. By Joel Nothman.
- Allow the fitting and scoring of all clustering algorithms in pipeline.Pipeline. By Andreas Müller.
- More robust seeding and improved error messages in cluster.MeanShift by Andreas Müller.
- Make the stopping criterion for mixture.GMM, mixture.DPGMM and mixture.VBGMM less dependent on the number of samples by thresholding the average log-likelihood change instead of its sum over all samples. By Hervé Bredin.
- The outcome of manifold.spectral_embedding was made deterministic by flipping the sign of eigenvectors. By Hasil Sharma.
- Significant performance and memory usage improvements in preprocessing.PolynomialFeatures. By Eric Martin.
- Numerical stability improvements for preprocessing.StandardScaler and preprocessing.scale. By Nicolas Goix.
- svm.SVC fitted on sparse input now implements decision_function. By Rob Zinkov and Andreas Müller.
- cross_validation.train_test_split now preserves the input type, instead of converting to numpy arrays.
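The change to the mixture model stopping criterion listed above can be illustrated with a small plain-Python sketch. It is not scikit-learn's implementation: the helper names, the tolerance and the per-sample gain are hypothetical values chosen to show why a threshold on the summed log-likelihood change depends on the number of samples while a threshold on the average does not.

```python
def sum_change(prev_ll, curr_ll):
    """Absolute change in the total log-likelihood over all samples."""
    return abs(sum(curr_ll) - sum(prev_ll))


def mean_change(prev_ll, curr_ll):
    """Absolute change in the average per-sample log-likelihood."""
    n = len(curr_ll)
    return abs(sum(curr_ll) / n - sum(prev_ll) / n)


tol = 1e-3   # hypothetical convergence tolerance
gain = 1e-4  # every sample improves by the same small amount per iteration

results = {}
for n_samples in (5, 5000):
    prev_ll = [-1.0] * n_samples
    curr_ll = [-1.0 + gain] * n_samples
    results[n_samples] = (
        sum_change(prev_ll, curr_ll) < tol,   # rule based on the sum
        mean_change(prev_ll, curr_ll) < tol,  # rule based on the average
    )

print(results)
```

With these numbers the summed rule reports convergence for 5 samples but not for 5000, even though each sample improves by the same amount, whereas the averaged rule gives the same answer for both dataset sizes.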
Documentation improvements
- Added example of using FeatureUnion for heterogeneous input. By Matt Terry.
- Documentation on scorers was improved, to highlight the handling of loss functions. By Matt Pico.
- A discrepancy between liblinear output and scikit-learn’s wrappers is now noted. By Manoj Kumar.
- Improved documentation generation: examples referring to a class or function are now shown in a gallery on the class/function’s API reference page. By Joel Nothman.
- More explicit documentation of sample generators and of data transformation. By Joel Nothman.
- sklearn.neighbors.BallTree and sklearn.neighbors.KDTree used to point to empty pages stating that they are aliases of BinaryTree. This has been fixed to show the correct class docs. By Manoj Kumar.
- Added silhouette plots for analysis of KMeans clustering using metrics.silhouette_samples and metrics.silhouette_score. See Selecting the number of clusters with silhouette analysis on KMeans clustering.
Bug fixes
- Metaestimators now support ducktyping for the presence of decision_function, predict_proba and other methods. This fixes behavior of grid_search.GridSearchCV, grid_search.RandomizedSearchCV, pipeline.Pipeline, feature_selection.RFE, feature_selection.RFECV when nested. By Joel Nothman.
- The scoring attribute of grid-search and cross-validation methods is no longer ignored when a grid_search.GridSearchCV is given as a base estimator or the base estimator doesn’t have predict.
- The function hierarchical.ward_tree now returns the children in the same order for both the structured and unstructured versions. By Matteo Visconti di Oleggio Castello.
- feature_selection.RFECV now correctly handles cases when step is not equal to 1. By Nikolay Mayorov.
- The decomposition.PCA now undoes whitening in its inverse_transform. Also, its components_ now always have unit length. By Michael Eickenberg.
- Fix incomplete download of the dataset when datasets.download_20newsgroups is called. By Manoj Kumar.
- Various fixes to the Gaussian processes subpackage by Vincent Dubourg and Jan Hendrik Metzen.
- Calling partial_fit with class_weight=='auto' throws an appropriate error message and suggests a workaround. By Danny Sullivan.
- RBFSampler with gamma=g formerly approximated rbf_kernel with gamma=g/2.; the definition of gamma is now consistent, which may substantially change your results if you use a fixed value. (If you cross-validated over gamma, it probably doesn’t matter too much.) By Dougal Sutherland.
- Pipeline objects now delegate the classes_ attribute to the underlying estimator. This allows, for instance, bagging of a pipeline object. By Arnaud Joly.
- neighbors.NearestCentroid now uses the median as the centroid when metric is set to manhattan. It was using the mean before. By Manoj Kumar.
- Fix numerical stability issues in linear_model.SGDClassifier and linear_model.SGDRegressor by clipping large gradients and ensuring that weight decay rescaling is always positive (for large l2 regularization and large learning rate values). By Olivier Grisel.
- When compute_full_tree was set to “auto” in cluster.AgglomerativeClustering (and friends), the full tree was built when n_clusters was high and construction stopped early when n_clusters was low, while the behavior should have been the reverse. This has been fixed. By Manoj Kumar.
- Fix lazy centering of data in linear_model.enet_path and linear_model.lasso_path. It was centered around one. It has been changed to be centered around the origin. By Manoj Kumar.
- Fix handling of precomputed affinity matrices in cluster.AgglomerativeClustering when using connectivity constraints. By Cathy Deng.
- Correct partial_fit handling of class_prior for sklearn.naive_bayes.MultinomialNB and sklearn.naive_bayes.BernoulliNB. By Trevor Stephens.
- Fixed a crash in metrics.precision_recall_fscore_support when using unsorted labels in the multi-label setting. By Andreas Müller.
- Avoid skipping the first nearest neighbor in the methods radius_neighbors, kneighbors, kneighbors_graph and radius_neighbors_graph in sklearn.neighbors.NearestNeighbors and family, when the query data is not the same as fit data. By Manoj Kumar.
- Fix log-density calculation in the mixture.GMM with tied covariance. By Will Dawson.
- Fixed a scaling error in feature_selection.SelectFdr where a factor n_features was missing. By Andrew Tulloch.
- Fix zero division in neighbors.KNeighborsRegressor and related classes when using distance weighting and having identical data points. By Garret-R.
- Fixed round off errors with non positive-definite covariance matrices in GMM. By Alexis Mignon.
- Fixed an error in the computation of conditional probabilities in naive_bayes.BernoulliNB. By Hanna Wallach.
- Make the method radius_neighbors of neighbors.NearestNeighbors return the samples lying on the boundary for algorithm='brute'. By Yan Yi.
- Flip sign of dual_coef_ of svm.SVC to make it consistent with the documentation and decision_function. By Artem Sobolev.
- Fixed handling of ties in isotonic.IsotonicRegression. We now use the weighted average of targets (secondary method). By Andreas Müller and Michael Bommarito.
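The neighbors.NearestCentroid fix listed above rests on the fact that, under the Manhattan (L1) distance, the per-feature median minimizes the total distance to a set of points while the mean does not. The plain-Python sketch below checks this on a tiny made-up class with one outlier; the helper name manhattan_cost and the data are illustrative assumptions, not part of scikit-learn.

```python
from statistics import mean, median


def manhattan_cost(points, center):
    """Sum of L1 distances from each point to the candidate centroid."""
    return sum(
        sum(abs(p - c) for p, c in zip(point, center))
        for point in points
    )


# One class of samples with an outlier in the first feature.
points = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (100.0, 5.0)]

# Per-feature mean and per-feature median as candidate centroids.
columns = list(zip(*points))
mean_center = tuple(mean(col) for col in columns)
median_center = tuple(median(col) for col in columns)

mean_cost = manhattan_cost(points, mean_center)
median_cost = manhattan_cost(points, median_center)

print(median_center, median_cost)
print(mean_center, mean_cost)
```

The median centroid stays close to the bulk of the samples, while the mean is pulled toward the outlier and yields a larger total Manhattan distance.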
API changes summary
- GridSearchCV and cross_val_score and other meta-estimators don’t convert pandas DataFrames into arrays any more, allowing DataFrame specific operations in custom estimators.
- multiclass.fit_ovr, multiclass.predict_ovr, predict_proba_ovr, multiclass.fit_ovo, multiclass.predict_ovo, multiclass.fit_ecoc and multiclass.predict_ecoc are deprecated. Use the underlying estimators instead.
- Nearest neighbors estimators used to take arbitrary keyword arguments and pass these to their distance metric. This will no longer be supported in scikit-learn 0.18; use the metric_params argument instead.
- The n_jobs parameter of the fit method was moved to the constructor of the LinearRegression class.
- The predict_proba method of multiclass.OneVsRestClassifier now returns two probabilities per sample in the multiclass case; this is consistent with other estimators and with the method’s documentation, but previous versions accidentally returned only the positive probability. Fixed by Will Lamond and Lars Buitinck.
- Change default value of precompute in ElasticNet and Lasso to False. Setting precompute to “auto” was found to be slower when n_samples > n_features, since computing the Gram matrix is expensive and outweighs the benefit of fitting the Gram for just one alpha. precompute="auto" is now deprecated and will be removed in 0.18. By Manoj Kumar.
- Expose positive option in linear_model.enet_path and linear_model.lasso_path which constrains coefficients to be positive. By Manoj Kumar.
- Users should now supply an explicit average parameter to sklearn.metrics.f1_score, sklearn.metrics.fbeta_score, sklearn.metrics.recall_score and sklearn.metrics.precision_score when performing multiclass or multilabel (i.e. not binary) classification. By Joel Nothman.
- The scoring parameter for cross validation now accepts ‘f1_micro’, ‘f1_macro’ or ‘f1_weighted’. ‘f1’ is now for binary classification only. Similar changes apply to ‘precision’ and ‘recall’. By Joel Nothman.
- The fit_intercept, normalize and return_models parameters in linear_model.enet_path and linear_model.lasso_path have been removed. They were deprecated since 0.14.
- From now onwards, all estimators will uniformly raise NotFittedError (utils.validation.NotFittedError), when any of the predict like methods are called before the model is fit. By Raghav RV.
- Input data validation was refactored for more consistent input validation. The check_arrays function was replaced by check_array and check_X_y. By Andreas Müller.
- Allow X=None in the methods radius_neighbors, kneighbors, kneighbors_graph and radius_neighbors_graph in sklearn.neighbors.NearestNeighbors and family. If set to None, then for every sample this avoids setting the sample itself as the first nearest neighbor. By Manoj Kumar.
- Add parameter include_self in neighbors.kneighbors_graph and neighbors.radius_neighbors_graph which has to be explicitly set by the user. If set to True, then the sample itself is considered as the first nearest neighbor.
- thresh parameter is deprecated in favor of new tol parameter in GMM, DPGMM and VBGMM. See Enhancements section for details. By Hervé Bredin.
- Estimators will treat input with dtype object as numeric when possible. By Andreas Müller.
- Estimators now raise ValueError consistently when fitted on empty data (less than 1 sample or less than 1 feature for 2D input). By Olivier Grisel.
- The shuffle option of linear_model.SGDClassifier, linear_model.SGDRegressor, linear_model.Perceptron, linear_model.PassiveAggressiveClassifier and linear_model.PassiveAggressiveRegressor now defaults to True.
- cluster.DBSCAN now uses a deterministic initialization. The random_state parameter is deprecated. By Erich Schubert.
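As a brief sketch of the explicit average parameter mentioned above for multiclass metrics, the example below computes the F1 score on a small made-up three-class problem with two averaging modes. The labels are invented for illustration; which averaging mode is appropriate depends on the application.

```python
from sklearn.metrics import f1_score

# Made-up three-class labels: class 0 is predicted perfectly,
# classes 1 and 2 are each confused once.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 2, 1]

# Macro: unweighted mean of the per-class F1 scores (1.0, 0.5, 0.5).
macro_f1 = f1_score(y_true, y_pred, average="macro")

# Micro: F1 computed from true/false positives pooled over all classes.
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(macro_f1, micro_f1)
```

Because class 0 dominates the sample, the micro average (0.75) is higher than the macro average (about 0.667), which is why the choice has to be stated explicitly.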
Code Contributors
A. Flaxman, Aaron Schumacher, Aaron Staple, abhishek thakur, Akshay, akshayah3, Aldrian Obaja, Alexander Fabisch, Alexandre Gramfort, Alexis Mignon, Anders Aagaard, Andreas Mueller, Andreas van Cranenburgh, Andrew Tulloch, Andrew Walker, Antony Lee, Arnaud Joly, banilo, Barmaley.exe, Ben Davies, Benedikt Koehler, bhsu, Boris Feld, Borja Ayerdi, Boyuan Deng, Brent Pedersen, Brian Wignall, Brooke Osborn, Calvin Giles, Cathy Deng, Celeo, cgohlke, chebee7i, Christian Stade-Schuldt, Christof Angermueller, Chyi-Kwei Yau, CJ Carey, Clemens Brunner, Daiki Aminaka, Dan Blanchard, danfrankj, Danny Sullivan, David Fletcher, Dmitrijs Milajevs, Dougal J. Sutherland, Erich Schubert, Fabian Pedregosa, Florian Wilhelm, floydsoft, Félix-Antoine Fortin, Gael Varoquaux, Garrett-R, Gilles Louppe, gpassino, gwulfs, Hampus Bengtsson, Hamzeh Alsalhi, Hanna Wallach, Harry Mavroforakis, Hasil Sharma, Helder, Herve Bredin, Hsiang-Fu Yu, Hugues SALAMIN, Ian Gilmore, Ilambharathi Kanniah, Imran Haque, isms, Jake VanderPlas, Jan Dlabal, Jan Hendrik Metzen, Jatin Shah, Javier López Peña, jdcaballero, Jean Kossaifi, Jeff Hammerbacher, Joel Nothman, Jonathan Helmus, Joseph, Kaicheng Zhang, Kevin Markham, Kyle Beauchamp, Kyle Kastner, Lagacherie Matthieu, Lars Buitinck, Laurent Direr, leepei, Loic Esteve, Luis Pedro Coelho, Lukas Michelbacher, maheshakya, Manoj Kumar, Manuel, Mario Michael Krell, Martin, Martin Billinger, Martin Ku, Mateusz Susik, Mathieu Blondel, Matt Pico, Matt Terry, Matteo Visconti dOC, Matti Lyra, Max Linke, Mehdi Cherti, Michael Bommarito, Michael Eickenberg, Michal Romaniuk, MLG, mr.Shu, Nelle Varoquaux, Nicola Montecchio, Nicolas, Nikolay Mayorov, Noel Dawe, Okal Billy, Olivier Grisel, Óscar Nájera, Paolo Puggioni, Peter Prettenhofer, Pratap Vardhan, pvnguyen, queqichao, Rafael Carrascosa, Raghav R V, Rahiel Kasim, Randall Mason, Rob Zinkov, Robert Bradshaw, Saket Choudhary, Sam Nicholls, Samuel Charron, Saurabh Jha, sethdandridge, sinhrks, snuderl, Stefan Otte, Stefan van der Walt, Steve Tjoa, swu, Sylvain Zimmer, tejesh95, terrycojones, Thomas Delteil, Thomas Unterthiner, Tomas Kazmar, trevorstephens, tttthomasssss, Tzu-Ming Kuo, ugurcaliskan, ugurthemaster, Vinayak Mehta, Vincent Dubourg, Vjacheslav Murashkin, Vlad Niculae, wadawson, Wei Xue, Will Lamond, Wu Jiang, x0l, Xinfan Meng, Yan Yi, Yu-Chin