Stemming English words with Lucene
SnowballAnalyzer is deprecated; you can use the Porter stemmer that ships with Lucene (the Snowball class org.tartarus.snowball.ext.PorterStemmer) instead: PorterStemmer stem = new PorterStemmer(); stem.setCurrent(word); stem.stem(); String result = stem.getCurrent(); Hope this helps!
I think there are two problems: first, the scripts should have “-utf8” in their name, e.g. cmd/tagger-chunker-german-utf8, because you downloaded the UTF-8 data. Second, tagging and chunking each require their own data file. See the homepage, which has the sections “Parameter files for PC” and “Chunker parameter files for PC” – download the files from both …
If you know Python, the Natural Language Toolkit (NLTK) has a very powerful lemmatizer that makes use of WordNet. Note that if you are using this lemmatizer for the first time, you must download the WordNet corpus before using it. This can be done with: >>> import nltk >>> nltk.download('wordnet') You only have to do …
Q1: “[..] are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English” Yes. Stemmers are much simpler, smaller and usually faster than lemmatizers, and for many applications their results are good enough. Using a lemmatizer for that is a waste of resources. Consider, for example, dimensionality reduction …
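To make the dimensionality-reduction point concrete, here is a minimal Python sketch. It uses a deliberately crude, hypothetical suffix stripper (toy_stem and its SUFFIXES list are illustrative assumptions, not the Porter algorithm and not any library's API) to show how stemming can collapse several surface forms into one vocabulary entry.

```python
from collections import Counter

# Hypothetical, deliberately crude suffix list. A real stemmer such as Porter
# applies ordered rewrite rules with conditions, not a flat list like this.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = (
    "connect connected connecting connects connection "
    "walk walked walking walks"
).split()

# Vocabulary (term -> count) before and after stemming.
raw_vocab = Counter(t.lower() for t in tokens)
stemmed_vocab = Counter(toy_stem(t) for t in tokens)

print(len(raw_vocab), "distinct terms before stemming")
print(len(stemmed_vocab), "distinct terms after stemming")
print(sorted(stemmed_vocab))
```

Note that "connection" is left untouched by this toy rule set, the kind of case that a real stemmer or a lemmatizer treats differently; whether that matters depends on the application.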