How does Lucene index documents?

In a nutshell, Lucene builds an inverted index on disk using skip lists, and then loads a mapping for the indexed terms into memory using a Finite State Transducer (FST). Note, however, that Lucene does not (necessarily) load all indexed terms into RAM, as described by Michael McCandless, the author of Lucene’s indexing system, himself. Note …
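To make the term "inverted index" concrete, here is a minimal in-memory sketch in plain Java. It is not Lucene's implementation: the class name ToyInvertedIndex, the tokenization rule, and the use of a TreeMap are all assumptions made for illustration, and the sketch has none of the on-disk skip lists or the FST term dictionary mentioned above. It only shows the core idea of mapping each term to the documents that contain it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy in-memory inverted index: term -> sorted set of document IDs.
// Hypothetical illustration only; Lucene's real index format is far more involved.
class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Lowercase the text, split on anything that is not a letter or digit,
    // and record docId in the posting list of every resulting term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Return the IDs of documents containing the term, in ascending order.
    List<Integer> search(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(1, "Lucene builds an inverted index");
        index.add(2, "An index maps terms to documents");
        System.out.println(index.search("index"));
        System.out.println(index.search("lucene"));
    }
}
```

A lookup is a single map access followed by a walk over the posting list, which is why term queries stay cheap even when the number of documents is large.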

Exact Phrase search using Lucene?

Try a PhraseQuery instead: PhraseQuery query = new PhraseQuery(); String[] words = sentence.split(" "); for (String word : words) { query.add(new Term("contents", word)); } booleanQuery.add(query, BooleanClause.Occur.MUST); Edit: I think you have a different problem. What other parts are there to your booleanQuery? Here’s a full working example of searching for a phrase: public class LucenePhraseQuery …
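The difference between requiring every term and requiring an exact phrase can be sketched without Lucene at all. The plain-Java example below is an assumption-laden illustration, not how PhraseQuery is implemented: ToyPhraseMatcher and its method names are hypothetical, and tokenization is a simple whitespace split. It shows that a phrase match needs the words to be adjacent and in order, while an "all terms" match does not.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch contrasting "all terms present" with "exact phrase".
class ToyPhraseMatcher {
    // Lowercase and split on whitespace, dropping empty tokens.
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // True if every word of the query occurs somewhere in the text, in any order.
    static boolean containsAllTerms(String text, String query) {
        return new HashSet<>(tokenize(text)).containsAll(tokenize(query));
    }

    // True only if the words of the phrase occur consecutively and in order.
    static boolean containsPhrase(String text, String phrase) {
        List<String> words = tokenize(phrase);
        if (words.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(tokenize(text), words) >= 0;
    }

    public static void main(String[] args) {
        String text = "the brown quick fox";
        System.out.println(containsAllTerms(text, "quick brown"));
        System.out.println(containsPhrase(text, "quick brown"));
    }
}
```

In the main method the text contains both words, so the all-terms check succeeds, but they appear in the wrong order, so the phrase check fails.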

How do I normalise a Solr/Lucene score?

To quote http://wiki.apache.org/lucene-java/ScoresAsPercentages: People frequently want to compute a “Percentage” from Lucene scores to determine what is a “100% perfect” match vs a “50%” match. This is also sometimes called a “normalized score”. Don’t do this. Seriously. Stop trying to think about your problem this way, it’s not going to end well. That page does …
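One way to see why the advice is so blunt is to try the most obvious normalization, dividing every score by the top score, on two result lists. The sketch below is plain Java with made-up score values; ScorePercentSketch and asPercentOfTop are hypothetical names, and nothing here calls Lucene or Solr.

```java
import java.util.Arrays;

// Hypothetical sketch: express each score as a percentage of the top score.
class ScorePercentSketch {
    static double[] asPercentOfTop(double[] scores) {
        double top = 0.0;
        for (double s : scores) {
            top = Math.max(top, s);
        }
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Guard against division by zero when there are no positive scores.
            out[i] = top == 0.0 ? 0.0 : 100.0 * scores[i] / top;
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up raw scores for two different queries.
        double[] strongMatches = {8.0, 4.0};
        double[] weakMatches = {0.5, 0.25};
        System.out.println(Arrays.toString(asPercentOfTop(strongMatches)));
        System.out.println(Arrays.toString(asPercentOfTop(weakMatches)));
    }
}
```

Both lists normalize to the same percentages, so the best hit is always "100%" no matter how poor it actually is, and the percentages say nothing about how results from different queries compare.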

What is the default list of stopwords used in Lucene’s StopFilter?

The default stop words set in StandardAnalyzer and EnglishAnalyzer is from StopAnalyzer.ENGLISH_STOP_WORDS_SET, as found in the source file: “a”, “an”, “and”, “are”, “as”, “at”, “be”, “but”, “by”, “for”, “if”, “in”, “into”, “is”, “it”, “no”, “not”, “of”, “on”, “or”, “such”, “that”, “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to”, “was”, “will”, “with” StopFilter itself defines no …
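As a rough illustration of what filtering with that list does to a token stream, here is a plain-Java sketch. It copies the words listed above into an ordinary Set and does not use Lucene's StopFilter or analyzers; the class name StopWordSketch and the tokenization rule are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of stop word removal using the list quoted above.
class StopWordSketch {
    static final Set<String> ENGLISH_STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
        "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these", "they", "this",
        "to", "was", "will", "with"));

    // Lowercase, split on non-letter/digit characters, and drop stop words.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !ENGLISH_STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopWords("The quick brown fox is in the garden"));
    }
}
```

Only the content-bearing tokens survive, which is the effect the stop word set has on what gets indexed and searched.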

Elasticsearch vs Cassandra vs Elasticsearch with Cassandra

One of our applications uses data that is stored in both Cassandra and Elasticsearch. We use Cassandra to access those records whenever we can, and have data duplicated into query tables designed to serve specific application-side requests. For searches more liberal than our query tables allow, Elasticsearch does the job nicely. We …

Get cosine similarity between two documents in Lucene

As Julia points out, Sujit Pal’s example is very useful, but the Lucene 4 API has substantial changes. Here is a version rewritten for Lucene 4. import java.io.IOException; import java.util.*; import org.apache.commons.math3.linear.*; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.core.SimpleAnalyzer; import org.apache.lucene.document.*; import org.apache.lucene.document.Field.Store; import org.apache.lucene.index.*; import org.apache.lucene.store.*; import org.apache.lucene.util.*; public class CosineDocumentSimilarity { public static final String CONTENT …
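Since the listing is cut off, the underlying calculation can be sketched independently of Lucene. The plain-Java example below is not the CosineDocumentSimilarity class from the answer: it builds term-frequency maps directly from strings instead of reading Lucene term vectors, and the names CosineSketch, termFrequencies and cosine are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: cosine similarity between two term-frequency vectors.
class CosineSketch {
    // Count how often each lowercase, whitespace-separated term occurs.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> tf) {
        double sumOfSquares = 0.0;
        for (int count : tf.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Dot product over shared terms divided by the product of the two norms.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequencies("lucene indexes documents");
        Map<String, Integer> d2 = termFrequencies("lucene searches documents");
        System.out.println(cosine(d1, d2));
    }
}
```

Documents with no terms in common score 0, and a document compared with itself scores 1 (up to floating-point rounding).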