How should I teach machine learning algorithm using data with big disproportion of classes? (SVM)

The most basic approach here is to use a so-called “class weighting scheme”: the classical SVM formulation has a single C parameter that controls the penalty for misclassification. It can be split into two parameters, C1 and C2, used for class 1 and class 2 respectively. The most common choice of C1 and C2 for a …
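As a rough illustration of how per-class penalties can be derived, the sketch below scales a base C inversely to each class's frequency. This is the heuristic scikit-learn applies for class_weight="balanced" (weight = n_samples / (n_classes * class_count)); whether it matches the choice the truncated excerpt goes on to describe is an assumption. The function name per_class_C is hypothetical.

```python
from collections import Counter


def per_class_C(labels, base_C=1.0):
    """Return {class: C_k}, scaling base_C inversely to class frequency.

    Assumption: uses the "balanced" heuristic
    weight_k = n_samples / (n_classes * count_k).
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: base_C * n_samples / (n_classes * count)
        for cls, count in counts.items()
    }


# 90 samples of class 1 and 10 of class 2: the rare class gets the larger penalty.
labels = [1] * 90 + [2] * 10
C = per_class_C(labels, base_C=1.0)
```

With scikit-learn, a dictionary like this (computed with base_C=1.0) could be passed as the class_weight argument of sklearn.svm.SVC, which scales C per class.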

How to get most informative features for scikit-learn classifier for different class?

In the case of binary classification, it seems like the coefficient array has been flattened. Let’s try to relabel our data with only two labels:

    import codecs, re, time
    from itertools import chain

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    trainfile = "train.txt"

    # Vectorizing data.
    train = []
    word_vectorizer = CountVectorizer(analyzer="word")
    …
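To make the flattened-coefficient case concrete, here is a minimal sketch. It assumes a binary linear model exposes a single row with one coefficient per feature, where the lowest values favour the first class and the highest values favour the second. The helper most_informative_binary and the toy numbers are hypothetical and are not taken from the original answer.

```python
def most_informative_binary(coefs, feature_names, class_labels, n=2):
    """Split a single coefficient row into the top-n features per class.

    Assumption: the lowest coefficients favour class_labels[0]
    and the highest favour class_labels[1].
    """
    ranked = sorted(zip(coefs, feature_names))  # ascending by coefficient
    return {
        class_labels[0]: [name for _, name in ranked[:n]],
        class_labels[1]: [name for _, name in ranked[-n:][::-1]],
    }


# Toy coefficient row and vocabulary (made up for illustration).
coefs = [-1.5, 0.2, 2.0, -0.3, 0.9]
names = ["refund", "hello", "great", "late", "thanks"]
top = most_informative_binary(coefs, names, ["neg", "pos"], n=2)
```

With a fitted scikit-learn linear classifier, the inputs would typically come from the first row of its coef_ attribute and from the vectorizer's feature names.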

Attach a queue to a numpy array in tensorflow for data fetch instead of files?

Probably the easiest way to make your data work with the CNN example code is to make a modified version of read_cifar10() and use it instead:

Write out a binary file containing the contents of your numpy array.

    import numpy as np
    images_and_labels_array = np.array([[…], …],  # [[1,12,34,24,53,…,102],
                                                  #  [12,112,43,24,52,…,98],
                                                  #  …]
                                       dtype=np.uint8)
    images_and_labels_array.tofile("/tmp/images.bin")

This …
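The record layout that a fixed-length reader such as read_cifar10() consumes can be sketched with the standard library alone. The assumption here is one label byte followed by the pixel bytes in each record, which is the CIFAR-10 binary layout. The helper names, the temporary file, and the tiny 4-pixel images are made up for illustration. Calling tofile() on a C-ordered uint8 array with one such row per example should produce the same raw bytes.

```python
import os
import tempfile


def write_records(path, records):
    """Write (label, pixels) pairs as raw bytes: 1 label byte + pixel bytes."""
    with open(path, "wb") as f:
        for label, pixels in records:
            f.write(bytes([label]) + bytes(pixels))


def read_records(path, pixels_per_image):
    """Read the file back as fixed-length records."""
    record_bytes = 1 + pixels_per_image
    out = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(record_bytes)
            if len(chunk) < record_bytes:
                break  # end of file (or a trailing partial record)
            out.append((chunk[0], list(chunk[1:])))
    return out


# Two toy "images" of 4 pixels each, with labels 1 and 0.
records = [(1, [12, 34, 24, 53]), (0, [112, 43, 24, 52])]
path = os.path.join(tempfile.mkdtemp(), "images.bin")
write_records(path, records)
round_trip = read_records(path, pixels_per_image=4)
```

Because every record has the same length, a reader only needs the record size to split the file, which is why all values must fit the uint8 range.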

Machine learning – Linear regression using batch gradient descent

The error is very simple. Your delta declaration should be inside the first for loop. Every time you accumulate the weighted differences between the training sample and output, you should start accumulating from the beginning. By not doing this, you are accumulating the errors from the previous iteration, which takes the error of …
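The fix described above can be sketched as follows for a one-variable linear model. This is a minimal illustration, not the asker's code: the function name, learning rate, iteration count, and toy data are all assumptions. The important part is that the delta accumulators are reset at the top of every outer iteration.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Fit y = theta0 + theta1 * x with batch gradient descent."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Reset the accumulators here, inside the outer loop, so errors
        # from the previous iteration do not leak into this one.
        delta0, delta1 = 0.0, 0.0
        for x, y in zip(xs, ys):
            error = (theta0 + theta1 * x) - y
            delta0 += error
            delta1 += error * x
        # Update both parameters simultaneously from the averaged gradient.
        theta0 -= alpha * delta0 / m
        theta1 -= alpha * delta1 / m
    return theta0, theta1


# Toy data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta0, theta1 = batch_gradient_descent(xs, ys)
```

Moving the delta initialization above the outer loop would make each update include every earlier gradient, so the parameters would typically overshoot instead of settling.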

Scikit learn – fit_transform on the test set

You are not supposed to call fit_transform on your test data, only transform. Otherwise, you will get a different vectorization than the one used during training. For the memory issue, I recommend TfidfVectorizer, which has numerous options for reducing the dimensionality (by removing rare unigrams etc.). UPDATE: If the only problem is fitting test data, …
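To show why the vocabulary has to come from the training data, here is a small standard-library sketch. TinyCountVectorizer is a hypothetical stand-in that mimics the fit / transform / fit_transform pattern; it is not scikit-learn's implementation, and the documents are made up.

```python
class TinyCountVectorizer:
    """Hypothetical stand-in for a bag-of-words vectorizer."""

    def fit(self, docs):
        # Learn a word -> column index mapping from the given documents.
        words = sorted({w for doc in docs for w in doc.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count words using the already learned vocabulary.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                idx = self.vocabulary_.get(w)
                if idx is not None:  # words unseen during fit are dropped
                    row[idx] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


train_docs = ["good movie", "bad movie"]
test_docs = ["good good film"]

vec = TinyCountVectorizer()
X_train = vec.fit_transform(train_docs)  # learn the vocabulary on training data only
X_test = vec.transform(test_docs)        # reuse that vocabulary for the test data

# Refitting on the test set yields columns that no longer match training.
X_wrong = TinyCountVectorizer().fit_transform(test_docs)
```

Here X_test has the same three columns as X_train, while X_wrong has only two columns with different meanings, so a model trained on X_train could not use it.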