Fast n-gram calculation

Since you didn’t indicate whether you want word-level or character-level n-grams, I’m just going to assume the former, without loss of generality. I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.

    def ngrams(tokens, MIN_N, MAX_N):
        n_tokens = len(tokens)
        for i in …
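The excerpt is cut off before the loop body, so what follows is only a sketch of one way such a function could continue, not the original answer's code. It assumes MIN_N and MAX_N are inclusive bounds on the n-gram length and that n-grams are returned as tuples; the name ngrams_in_range is a hypothetical one chosen here.

```python
def ngrams_in_range(tokens, min_n, max_n):
    """Yield every n-gram of length min_n..max_n (inclusive) as a tuple.

    Hypothetical helper: treating both bounds as inclusive is an assumption.
    """
    n_tokens = len(tokens)
    for i in range(n_tokens):
        # j is the exclusive end index; it cannot run past the end of the list.
        for j in range(i + min_n, min(n_tokens, i + max_n) + 1):
            yield tuple(tokens[i:j])


tokens = ["a", "b", "c"]
result = list(ngrams_in_range(tokens, 1, 2))
```

Writing it as a generator means a long token list does not have to be expanded into every n-gram at once.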

Hibernate Search | ngram analyzer with minGramSize 1

Updated answer for Hibernate Search 6

With Hibernate Search 6, you can define a second analyzer, identical to your "ngram" analyzer except that it does not have an ngram filter, and assign it as the searchAnalyzer for your field:

    public class Hospital {
        // …
        @FullTextField(analyzer = "ngram", searchAnalyzer = "my_analyzer_without_ngrams")
        private String name = …

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

Check out the NLTK package (http://www.nltk.org); it has everything you need.

For the cosine_similarity:

    import math
    import numpy

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is equal to
        u.v / |u||v|.
        """
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

For ngrams:

    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """
        …
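To connect the two pieces, here is a small sketch that turns two token lists into bigram count vectors and compares them with the u.v / |u||v| formula from the docstring. It uses only the standard library rather than NLTK or numpy, it works on raw counts rather than tf-idf weights, and the helper names bigram_counts and cosine_similarity are hypothetical.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs (word bigrams)."""
    return Counter(zip(tokens, tokens[1:]))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors: a.b / (|a||b|)."""
    # A Counter returns 0 for missing keys, so unshared bigrams add nothing.
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for empty inputs (an assumption).
    return dot / (norm_a * norm_b)


doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the mat".split()
similarity = cosine_similarity(bigram_counts(doc1), bigram_counts(doc2))
```

The two sentences share three of their five bigrams, so the similarity comes out at about 0.6.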

Computing N Grams using Python

A short Pythonesque solution from this blog:

    def find_ngrams(input_list, n):
        return zip(*[input_list[i:] for i in range(n)])

Usage:

    >>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    >>> find_ngrams(input_list, 1)
    [('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
    >>> find_ngrams(input_list, 2)
    [('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
    >>> find_ngrams(input_list, 3)
    [('all', 'this', 'happened'), ('this', …
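One caveat: the list output shown in this usage matches Python 2, where zip returns a list. In Python 3, zip returns a lazy iterator, so a variant like the sketch below wraps the call in list() to get the same result. The name find_ngrams_list is hypothetical and is not from the quoted blog.

```python
def find_ngrams_list(input_list, n):
    """Return all n-grams of input_list as a list of tuples (Python 3)."""
    # zip stops at the shortest slice, which drops incomplete trailing n-grams.
    return list(zip(*[input_list[i:] for i in range(n)]))


words = ['all', 'this', 'happened', 'more', 'or', 'less']
bigrams = find_ngrams_list(words, 2)
```

The slicing trick builds n shifted copies of the list, so it is convenient for short inputs but uses memory proportional to n times the input length.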

N-gram generation from a sentence

I believe this would do what you want:

    import java.util.*;

    public class Test {

        public static List<String> ngrams(int n, String str) {
            List<String> ngrams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                ngrams.add(concat(words, i, i+n));
            return ngrams;
        }

        public static String concat(String[] …

Filename search with ElasticSearch

You have various problems with what you pasted:

1) Incorrect mapping

When creating the index, you specify:

    "mappings": {
        "files": {

But your type is actually file, not files. If you checked the mapping, you would see that immediately:

    curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'

    # {
    #    "files" : {
    #       "files" : {
    #          "properties" : …