Saving StandardScaler() model for use on new datasets

You can use joblib's dump function to save the fitted StandardScaler model. Here's a complete example for reference:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

data, target = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(data, target)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
```

If you want to save the sc StandardScaler …
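The answer is truncated, but the saving step it alludes to can be sketched as follows. This is a minimal, self-contained example; the filename scaler.joblib and the toy data are arbitrary choices for illustration:

```python
# Sketch: persisting a fitted StandardScaler with joblib and reloading it
# for use on new data (joblib ships as a scikit-learn dependency).
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
sc = StandardScaler()
sc.fit(X_train)

joblib.dump(sc, "scaler.joblib")          # save the fitted scaler to disk
sc_loaded = joblib.load("scaler.joblib")  # reload it later, e.g. in another script

# The reloaded scaler applies the *training-set* mean and std to new data,
# which is exactly what you want at inference time.
X_new = np.array([[2.0, 20.0]])
print(sc_loaded.transform(X_new))
```

Reloading the same fitted object (rather than re-fitting on new data) is what keeps the preprocessing consistent between training and inference.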

Under what parameters are SVC and LinearSVC in scikit-learn equivalent?

In a mathematical sense you need to set:

```python
SVC(kernel="linear", **kwargs)      # by default it uses the RBF kernel
```

and

```python
LinearSVC(loss="hinge", **kwargs)   # by default it uses the squared hinge loss
```

Another element, which cannot be easily fixed, is increasing intercept_scaling in LinearSVC, as in this implementation the bias is regularized (which is not true in SVC, nor should it be …

Model help using Scikit-learn when using GridSearch

GridSearchCV, as @Gauthier Feuillen said, is used to search for the best parameters of an estimator on the given data. Description of GridSearchCV:

```python
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)
```

1. clf_params will be expanded to get all possible combinations separately using ParameterGrid.
2. features will now be split into features_train and features_test using cv. The same goes for labels.
3. Now the grid search estimator …
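The flow above can be sketched end to end. The pipeline, parameter grid, and dataset below are hypothetical stand-ins for the pipe, clf_params, features, and labels referenced in the answer:

```python
# Sketch: a small pipeline whose hyperparameters are expanded via ParameterGrid
# internally by GridSearchCV, then evaluated with cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

features, labels = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
# 3 values of C x 2 kernels = 6 candidate combinations, each scored over 5 folds
clf_params = {"clf__C": [0.1, 1.0, 10.0], "clf__kernel": ["linear", "rbf"]}

gcv = GridSearchCV(pipe, clf_params, cv=5)
gcv.fit(features, labels)

print(gcv.best_params_)   # the winning combination
print(gcv.best_score_)    # its mean cross-validated score
```

After fitting, gcv itself acts as the refitted best estimator, so gcv.predict(...) uses the winning parameter combination trained on all of the data.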

How should I teach machine learning algorithm using data with big disproportion of classes? (SVM)

The most basic approach here is to use the so-called "class weighting scheme": in the classical SVM formulation there is a C parameter used to control the misclassification count. It can be changed into C1 and C2 parameters used for class 1 and class 2 respectively. The most common choice of C1 and C2 for a …
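In scikit-learn, the C1/C2 idea maps onto the class_weight parameter, which multiplies C per class. A sketch on hypothetical imbalanced toy data, using class_weight="balanced" (which sets each class's weight inversely proportional to its frequency):

```python
# Sketch: per-class C weighting for an imbalanced problem via class_weight.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# ~90/10 imbalanced two-class toy data
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

plain = SVC().fit(X, y)                           # same C for both classes
weighted = SVC(class_weight="balanced").fit(X, y) # C scaled up for the minority class

# Recall on the minority class; the weighted model typically recovers more of it.
print((plain.predict(X)[y == 1] == 1).mean())
print((weighted.predict(X)[y == 1] == 1).mean())
```

Effectively, misclassifying a minority-class sample now costs more in the objective, pushing the decision boundary away from the minority class.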