topic-modeling
LDA model generates different topics every time I train on the same corpus
Why do the same LDA parameters and corpus generate different topics every time? Because LDA uses randomness in both the training and inference steps. And how do I stabilize the topic generation? By resetting the numpy.random seed to the same value every time a model is trained or inference is performed, with numpy.random.seed: SOME_FIXED_SEED = 42 # … Read more
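A minimal sketch of the seed-resetting idea, assuming an LDA library that draws from NumPy's global random state (the `train_lda` wrapper and its body are illustrative, not from the answer):

```python
import numpy as np

SOME_FIXED_SEED = 42

def seeded_draw():
    # Reset the global NumPy seed immediately before the randomized
    # step (here stood in for by a plain random draw), so the random
    # initialization is identical on every run.
    np.random.seed(SOME_FIXED_SEED)
    return np.random.rand(3)

# Two "training runs" now start from the same random state,
# so their random draws (and hence the learned topics) match.
a = seeded_draw()
b = seeded_draw()
```

Note that some libraries expose a cleaner per-model alternative (for example, gensim's LdaModel accepts a `random_state` argument), which avoids touching the global NumPy state.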
Remove empty documents from DocumentTermMatrix in R topicmodels?
“Each row of the input matrix needs to contain at least one non-zero entry” — the error means that the sparse matrix contains a row without entries (words). One idea is to compute the sum of words by row: rowTotals <- apply(dtm, 1, sum) # find the sum of words in each document dtm.new <- dtm[rowTotals > 0, ] … Read more
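The same filtering can be sketched in Python on a toy dense document-term matrix (the data and names here are illustrative, not from the R snippet):

```python
import numpy as np

# Toy document-term matrix: 3 documents x 3 terms.
# The second row is an empty document (all zeros).
dtm = np.array([[1, 0, 2],
                [0, 0, 0],
                [0, 3, 0]])

# Sum the word counts in each row (document).
row_totals = dtm.sum(axis=1)

# Keep only documents that contain at least one word.
dtm_new = dtm[row_totals > 0]
```

With a real sparse matrix the idea is identical: compute per-row totals and index with the boolean mask `row_totals > 0`.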
Spark MLlib LDA, how to infer the topics distribution of a new unseen document?
As of Spark 1.5 this functionality has not been implemented for the DistributedLDAModel. What you’re going to need to do is convert your model to a LocalLDAModel using the toLocal method and then call the topicDistributions(documents: RDD[(Long, Vector)]) method, where documents are the new (i.e. out-of-training) documents, something like this: newDocuments: RDD[(Long, Vector)] = … … Read more