Is Spark’s KMeans unable to handle bigdata?

I think the ‘hanging’ is because your executors keep dying. As I mentioned in a side conversation, this code runs fine for me, locally and on a cluster, in Pyspark and Scala. However, it takes a lot longer than it should. It is almost all time spent in k-means|| initialization. I opened https://issues.apache.org/jira/browse/SPARK-17389 to track … Read more

Cluster one-dimensional data optimally? [closed]

Univariate k-means clustering can be solved in O(kn) time (on already sorted input) based on theoretical results on Monge matrices, but the approach was not popular most likely due to numerical instability and also perhaps coding challenges. A better option is an O(knlgn) method that is now implemented in Ckmeans.1d.dp version 3.4.6. This implementation is … Read more

Clustering text documents using scikit-learn kmeans in Python

This is a simpler example: from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.cluster import KMeans from sklearn.metrics import adjusted_rand_score documents = [“Human machine interface for lab abc computer applications”, “A survey of user opinion of computer system response time”, “The EPS user interface management system”, “System and human system engineering testing of EPS”, “Relation of user perceived … Read more

Reading wav file in Java

The official Java Sound Programmer Guide walks through reading and writing audio files. This article by A Greensted: Reading and Writing Wav Files in java should be helpful. The WavFile class is very useful and it can be tweaked to return the entire data array instead of buffered fragments.

Is it possible to specify your own distance function using scikit-learn K-Means Clustering?

Here’s a small kmeans that uses any of the 20-odd distances in scipy.spatial.distance, or a user function. Comments would be welcome (this has had only one user so far, not enough); in particular, what are your N, dim, k, metric ? #!/usr/bin/env python # kmeans.py using any of the 20-odd metrics in scipy.spatial.distance # kmeanssample … Read more