data-mining - w3toppers.com

Mixing categorial and continuous data in Naive Bayes classifier using scikit-learn

You have at least two options: Transform all your data into a categorical representation by computing percentiles for each continuous variables and then binning the continuous variables using the percentiles as bin boundaries. For instance for the height of a person create the following bins: “very small”, “small”, “regular”, “big”, “very big” ensuring that each … Read more

How to optimal K in K – Means Algorithm [duplicate]

The base idea is to evaluate cluster scoring on sample data, usally it is distance inside cluster and distance between clusters. The more this measure the better clustering, based on this mesure you can select best clustring paramters. One of metrics can be found here http://alias-i.com/lingpipe/docs/api/com/aliasi/cluster/ClusterScore.html

random unit vector in multi-dimensional space

One simple trick is to select each dimension from a gaussian distribution, then normalize: from random import gauss def make_rand_vector(dims): vec = [gauss(0, 1) for i in range(dims)] mag = sum(x**2 for x in vec) ** .5 return [x/mag for x in vec] For example, if you want a 7-dimensional random vector, select 7 random … Read more

What is the difference between linear regression and logistic regression? [closed]

Linear regression output as probabilities It’s tempting to use the linear regression output as probabilities but it’s a mistake because the output can be negative, and greater than 1 whereas probability can not. As regression might actually produce probabilities that could be less than 0, or even bigger than 1, logistic regression was introduced. Source: … Read more

Clustering values by their proximity in python (machine learning?) [duplicate]

Don’t use clustering for 1-dimensional data Clustering algorithms are designed for multivariate data. When you have 1-dimensional data, sort it, and look for the largest gaps. This is trivial and fast in 1d, and not possible in 2d. If you want something more advanced, use Kernel Density Estimation (KDE) and look for local minima to … Read more

Why does one hot encoding improve machine learning performance? [closed]

Many learning algorithms either learn a single weight per feature, or they use distances between samples. The former is the case for linear models such as logistic regression, which are easy to explain. Suppose you have a dataset having only a single categorical feature “nationality”, with values “UK”, “French” and “US”. Assume, without loss of … Read more

Can someone give an example of cosine similarity, in a very simple, graphical way? [closed]

Here are two very short texts to compare: Julie loves me more than Linda loves me Jane likes me more than Julie loves me We want to know how similar these texts are, purely in terms of word counts (and ignoring word order). We begin by making a list of the words from both texts: … Read more

Finding 2 & 3 word Phrases Using R TM Package

You can pass in a custom tokenizing function to tm‘s DocumentTermMatrix function, so if you have package tau installed it’s fairly straightforward. library(tm); library(tau); tokenize_ngrams <- function(x, n=3) return(rownames(as.data.frame(unclass(textcnt(x,method=”string”,n=n))))) texts <- c(“This is the first document.”, “This is the second file.”, “This is the third text.”) corpus <- Corpus(VectorSource(texts)) matrix <- DocumentTermMatrix(corpus,control=list(tokenize=tokenize_ngrams)) Where n in … Read more

Matlab – PCA analysis and reconstruction of multi dimensional data

Here’s a quick walkthrough. First we create a matrix of your hidden variables (or “factors”). It has 100 observations and there are two independent factors. >> factors = randn(100, 2); Now create a loadings matrix. This is going to map the hidden variables onto your observed variables. Say your observed variables have four features. Then … Read more

Speed-efficient classification in Matlab

Approach #1 For a N x 2 sized points/pixels array, you can avoid permute as suggested in the other solution by Luis, which could slow down things a bit, to have a kind of “permute-unrolled” version of it and also let’s bsxfun work towards a 2D array instead of a 3D array, which must be … Read more