Early stopping with Keras and sklearn GridSearchCV cross-validation

[Answer after the question was edited & clarified:] Before rushing into implementation issues, it is always a good practice to take some time to think about the methodology and the task itself; arguably, intermingling early stopping with the cross validation procedure is not a good idea. Let’s make up an example to highlight the argument. … Read more

confused about random_state in decision tree of scikit learn

This is explained in the documentation: The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms … Read more
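Since the tree-building heuristic can break ties between equally good splits at random, fixing `random_state` is what makes fits reproducible. A minimal sketch (the dataset and parameters here are illustrative, not from the original answer):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two trees with the same random_state resolve ties between equally
# good splits identically, so they yield the same predictions.
tree_a = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=0).fit(X, y)

assert np.array_equal(tree_a.predict(X), tree_b.predict(X))
```

Without a fixed `random_state`, repeated fits on the same data may produce structurally different (though similarly accurate) trees whenever several candidate splits score equally.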

sklearn : TFIDF Transformer : How to get tf-idf values of given words in document

You can use TfidfVectorizer from sklearn: from sklearn.feature_extraction.text import TfidfVectorizer import numpy as np from scipy.sparse import csr_matrix # needed if you want to save tfidf_matrix tf = TfidfVectorizer(input="filename", analyzer="word", ngram_range=(1,6), min_df=0, stop_words="english", sublinear_tf=True) tfidf_matrix = tf.fit_transform(corpus) The above tfidf_matrix has the TF-IDF values of all the documents in the corpus. This is … Read more
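To look up the tf-idf value of a specific word in a given document, you can map the word to its column via the vectorizer's `vocabulary_` attribute. A minimal sketch with a small in-memory corpus (so `input="filename"` is not needed; the corpus and word are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

tf = TfidfVectorizer(stop_words="english", sublinear_tf=True)
tfidf_matrix = tf.fit_transform(corpus)

# vocabulary_ maps each term to its column index in the matrix.
col = tf.vocabulary_["cat"]

# tf-idf value of "cat" in document 0 (a positive float,
# since "cat" occurs in that document).
value = tfidf_matrix[0, col]
print(value)
```

The same lookup works with the `input="filename"` setup from the excerpt; only the way the corpus is fed to `fit_transform` changes.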

Tensorflow Precision / Recall / F1 score and Confusion matrix

You do not really need sklearn to calculate precision/recall/F1 score. You can easily express them in a TF-ish way by looking at the formulas: Now if you have your actual and predicted values as vectors of 0/1, you can calculate TP, TN, FP, FN using tf.count_nonzero: TP = tf.count_nonzero(predicted * actual) TN = tf.count_nonzero((predicted - 1) … Read more
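As a sanity check, the same counting trick can be sketched in plain NumPy (this mirrors the `tf.count_nonzero` expressions from the excerpt but runs without TensorFlow; the example vectors are made up):

```python
import numpy as np

predicted = np.array([1, 1, 0, 0, 1])
actual    = np.array([1, 0, 0, 1, 1])

# Confusion-matrix counts from 0/1 vectors: each product is nonzero
# only for the positions matching that cell of the matrix.
TP = np.count_nonzero(predicted * actual)              # both 1 -> 2
TN = np.count_nonzero((predicted - 1) * (actual - 1))  # both 0 -> 1
FP = np.count_nonzero(predicted * (actual - 1))        # pred 1, actual 0 -> 1
FN = np.count_nonzero((predicted - 1) * actual)        # pred 0, actual 1 -> 1

precision = TP / (TP + FP)                        # 2/3
recall = TP / (TP + FN)                           # 2/3
f1 = 2 * precision * recall / (precision + recall)  # 2/3
```

Replacing `np.count_nonzero` with `tf.count_nonzero` (on tensors) gives the TensorFlow version described in the answer.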