n-gram - w3toppers.com

Quick implementation of character n-grams for word

To generate bigrams:

    In [8]: b = 'student'
    In [9]: [b[i:i+2] for i in range(len(b)-1)]
    Out[9]: ['st', 'tu', 'ud', 'de', 'en', 'nt']

To generalize to a different n:

    In [10]: n = 4
    In [11]: [b[i:i+n] for i in range(len(b)-n+1)]
    Out[11]: ['stud', 'tude', 'uden', 'dent']

Fast n-gram calculation

Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality. I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.

    def ngrams(tokens, MIN_N, MAX_N):
        n_tokens = len(tokens)
        for i in …
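The excerpt is cut off, so the following is only a minimal sketch of the idea it describes (extracting every n-gram whose length lies between a minimum and a maximum), not the original answer's code. The name `ngram_ranges` and the lowercase parameters `min_n` and `max_n` are assumptions made for illustration.

```python
def ngram_ranges(tokens, min_n, max_n):
    """Yield every n-gram of length min_n..max_n as a tuple.

    Hypothetical helper; tokens is assumed to be a list of strings.
    """
    n_tokens = len(tokens)
    for start in range(n_tokens):
        # The end index may not run past the end of the token list.
        last_end = min(n_tokens, start + max_n)
        for end in range(start + min_n, last_end + 1):
            yield tuple(tokens[start:end])


if __name__ == "__main__":
    for gram in ngram_ranges(["all", "this", "happened"], 1, 2):
        print(gram)
```

Because the function is a generator, callers that only need to count or stream the n-grams never hold the whole list in memory.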

Hibernate Search | ngram analyzer with minGramSize 1

Updated answer for Hibernate Search 6

With Hibernate Search 6, you can define a second analyzer, identical to your "ngram" analyzer except that it does not have an ngram filter, and assign it as the searchAnalyzer for your field:

    public class Hospital {
        // …
        @FullTextField(analyzer = "ngram", searchAnalyzer = "my_analyzer_without_ngrams")
        private String name = …

Python: Reducing memory usage of dictionary

I cannot offer a complete strategy for reducing the memory footprint, but I believe it may help to analyse what exactly is taking so much memory. If you look at the Python implementation of dictionary (which is a relatively straightforward implementation of a hash table), as well as the implementation of the built-in string …
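One rough way to start that analysis is to compare the size of the dictionary's own hash table with the sizes of its keys and values. The sketch below uses `sys.getsizeof` from the standard library; the helper name `shallow_dict_footprint` and the sample data are assumptions for illustration. The numbers are shallow (nested containers are not followed) and differ between Python versions and platforms.

```python
import sys


def shallow_dict_footprint(d):
    """Return (container_bytes, keys_bytes, values_bytes) for dict d.

    Hypothetical helper. Objects shared between entries (for example
    small cached integers) are counted once per occurrence, so the key
    and value totals are upper bounds.
    """
    container = sys.getsizeof(d)  # the hash table itself, not its contents
    keys = sum(sys.getsizeof(k) for k in d)
    values = sum(sys.getsizeof(v) for v in d.values())
    return container, keys, values


if __name__ == "__main__":
    counts = {"st": 3, "tu": 1, "ud": 2}
    print(shallow_dict_footprint(counts))
```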

Fast/Optimize N-gram implementations in python

Some attempts with some profiling. I thought using generators could improve the speed here. But the improvement was not noticeable compared to a slight modification of the original. But if you don't need the full list at the same time, the generator functions should be faster.

    import timeit
    from itertools import tee, izip, islice

    def …
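The functions from that answer are not shown in the excerpt. As a sketch of the general generator technique it refers to, here is one lazy formulation built on `tee` and `islice`. Note that `itertools.izip` exists only in Python 2; in Python 3 the built-in `zip` is already lazy, so this version uses it instead. The name `iter_ngrams` is an assumption.

```python
from itertools import islice, tee


def iter_ngrams(iterable, n):
    """Lazily yield n-grams (as tuples) from any iterable.

    Hypothetical helper, written for Python 3.
    """
    copies = tee(iterable, n)
    # Advance the k-th copy by k items so the copies are staggered.
    shifted = [islice(copy, k, None) for k, copy in enumerate(copies)]
    # zip stops as soon as the most advanced copy is exhausted.
    return zip(*shifted)


if __name__ == "__main__":
    for gram in iter_ngrams("student", 3):
        print("".join(gram))
```

Since nothing is materialised until the caller iterates, this form also works on inputs that are not sequences, such as tokens read from a file.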

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

Check out the NLTK package: http://www.nltk.org. It has everything you need.

For the cosine_similarity:

    import math
    import numpy

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u. This is equal to
        u.v / |u||v|.
        """
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

For ngrams:

    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """ …
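To show how the two pieces fit together without NumPy or NLTK, here is a small standard-library sketch that compares two strings by the cosine of their character n-gram count vectors. The function names and the choice to return 0.0 for an empty vector are assumptions, not part of the original answer.

```python
import math
from collections import Counter


def char_ngram_counts(text, n):
    """Count the character n-grams in text (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    # Counter returns 0 for missing keys, so absent n-grams add nothing.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    left = char_ngram_counts("student", 2)
    right = char_ngram_counts("students", 2)
    print(cosine_similarity(left, right))
```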

Computing N Grams using Python

A short Pythonesque solution from this blog:

    def find_ngrams(input_list, n):
        return zip(*[input_list[i:] for i in range(n)])

Usage:

    >>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    >>> find_ngrams(input_list, 1)
    [('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
    >>> find_ngrams(input_list, 2)
    [('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
    >>> find_ngrams(input_list, 3)
    [('all', 'this', 'happened'), ('this', …
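The list output in that session is what Python 2 prints. In Python 3, `zip` returns a lazy iterator, so the same call would display a zip object instead. A minimal Python 3 variant, under the assumed name `find_ngrams_list`, wraps the result in `list`:

```python
def find_ngrams_list(input_list, n):
    """Return all n-grams of input_list as a list of tuples (Python 3).

    Hypothetical variant of the function above; only the list() call
    is new.
    """
    return list(zip(*(input_list[i:] for i in range(n))))


if __name__ == "__main__":
    words = ["all", "this", "happened", "more", "or", "less"]
    print(find_ngrams_list(words, 2))
```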

N-gram generation from a sentence

I believe this would do what you want:

    import java.util.*;

    public class Test {

        public static List<String> ngrams(int n, String str) {
            List<String> ngrams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                ngrams.add(concat(words, i, i+n));
            return ngrams;
        }

        public static String concat(String[] …

Filename search with ElasticSearch

You have various problems with what you pasted:

1) Incorrect mapping

When creating the index, you specify:

    "mappings": {
        "files": {

But your type is actually file, not files. If you checked the mapping, you would see that immediately:

    curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'
    # {
    #    "files" : {
    #       "files" : {
    #          "properties" : …

n-grams in python, four, five, six grams?

Great native Python-based answers have been given by other users. But here's the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library). There is an ngram module in nltk that people seldom use. It's not because it's hard to read ngrams, but training a model based on …