Stemmers vs Lemmatizers

Q1: “[..] are English stemmers any useful at all today? Since we have a plethora of lemmatization tools for English”

Yes. Stemmers are much simpler, smaller, and usually faster than lemmatizers, and for many applications their results are good enough; using a lemmatizer there is a waste of resources. Consider, for example, dimensionality reduction in Information Retrieval: you replace every drive/driving with driv in both the searched documents and the query. You do not care whether the result is drive or driv or x17a$, as long as it clusters inflectionally related words together.
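To make the Information Retrieval point concrete, here is a minimal sketch. The suffix list and the names toy_stem, build_index and search are all made up for illustration; this is not the Porter algorithm or any real stemmer, just a crude suffix stripper applied identically to documents and queries.

```python
from collections import defaultdict

# Hypothetical suffix list, checked in this order. A real stemmer such as
# Porter's has many more rules and conditions.
SUFFIXES = ("ing", "es", "ed", "e", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every stemmed query term."""
    result = None
    for token in query.split():
        ids = index.get(toy_stem(token), set())
        result = ids if result is None else result & ids
    return result or set()


docs = {1: "she was driving home", 2: "they drive fast", 3: "he walks home"}
index = build_index(docs)
print(toy_stem("drive"), toy_stem("driving"))  # both print as driv
print(search(index, "drives"))                 # matches documents 1 and 2
```

The stem driv is not a word, and that does not matter: the index only needs the same key for related forms. An irregular form such as drove would not be clustered by this sketch, which is one of the places where a lemmatizer earns its cost.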

Q2: “[..] how should we move on to build robust lemmatizers that can take on nounify, verbify, adjectify and adverbify preprocesses?”

What is your definition of a lemma? Does it include derivation (drive/driver) or only inflection (drive/drives/drove)? Does it take semantics into account?

If you want to include derivation (which most people would say includes verbing nouns etc.), then keep in mind that derivation is far more irregular than inflection. There are many idiosyncrasies, gaps, etc. Do you really want the verb to change (as in change trains) and the noun change (as in coins) to have the same lemma? If not, where do you draw the boundary? How about nerve/unnerve, earth/unearth/earthling, … It really depends on the application.

If you take semantics into account (bank would be labeled as bank-money or bank-river depending on context), how deep do you go (do you distinguish bank-institution from bank-building)? Some apps may not care about this at all, some might want to distinguish basic semantics, and some might want it fine-grained.

Q3: “How could the lemmatization task be easily scaled to other languages that have similar morphological structures as English?”

What do you mean by “similar morphological structures as English”? English has very little inflectional morphology. There are good lemmatizers for languages of other morphological types (truly inflectional, agglutinative, templatic, …).

With the possible exception of agglutinative languages, I would argue that a lookup table (say, a compressed trie) is the best solution, possibly with some backup rules for unknown words such as proper names. The lookup is followed by some kind of disambiguation (ranging from trivial, such as taking the first analysis or the first one consistent with the word's POS tag, to much more sophisticated). The more sophisticated disambiguators are usually supervised stochastic algorithms (e.g. TreeTagger or Faster), although combinations of machine learning and manually created rules have been used too (see e.g. this).
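As a rough sketch of the lookup-then-disambiguate idea, assuming a plain (uncompressed) trie built from nested dicts and only the trivial disambiguation strategies: the table contents, the tag names and the functions build_trie, lookup and lemmatize are all invented for illustration and do not reflect any particular tool.

```python
from typing import Dict, List, Optional, Tuple

Analysis = Tuple[str, str]  # (lemma, POS tag)
END = ""  # hypothetical end-of-word key; never collides with a one-character key


def build_trie(entries: Dict[str, List[Analysis]]) -> dict:
    """Store every word form character by character in nested dicts."""
    root: dict = {}
    for form, analyses in entries.items():
        node = root
        for ch in form:
            node = node.setdefault(ch, {})
        node[END] = list(analyses)
    return root


def lookup(trie: dict, form: str) -> List[Analysis]:
    """Return all analyses stored for a form, or an empty list."""
    node = trie
    for ch in form:
        node = node.get(ch)
        if node is None:
            return []
    return node.get(END, [])


def lemmatize(trie: dict, form: str, pos: Optional[str] = None) -> str:
    analyses = lookup(trie, form.lower())
    if not analyses:
        # Backup rule for unknown words (e.g. proper names):
        # treat the form as its own lemma.
        return form
    if pos is not None:
        # Trivial disambiguation: first analysis consistent with the POS tag.
        for lemma, tag in analyses:
            if tag == pos:
                return lemma
    # Even more trivial: take the first analysis.
    return analyses[0][0]


# Toy table; the entries and tags are assumptions made for this example.
TABLE = {
    "saw": [("see", "VERB"), ("saw", "NOUN")],
    "drove": [("drive", "VERB"), ("drove", "NOUN")],
    "drives": [("drive", "VERB"), ("drive", "NOUN")],
}
trie = build_trie(TABLE)
print(lemmatize(trie, "saw"))          # first analysis wins
print(lemmatize(trie, "saw", "NOUN"))  # POS tag selects the noun reading
print(lemmatize(trie, "Prague"))       # unknown word falls back to itself
```

A production system would compress the trie (for instance by sharing common suffixes) and replace the two trivial strategies with a trained disambiguator, but the control flow stays the same.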

Obviously, for most languages you do not want to create the lookup table by hand, but instead generate it from a description of the morphology of that language. For inflectional languages, you can go the engineering way of Hajic for Czech or Mikheev for Russian, or, if you are daring, you can use two-level morphology. Or you can do something in between, such as Hana (myself). (Note that these are all full morphological analyzers that include lemmatization.) Or you can learn the lemmatizer in an unsupervised manner à la Yarowsky and Wicentowski, possibly with manual post-processing to correct the most frequent words.
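To illustrate what generating the table from a description of morphology can look like in the simplest case, here is a sketch that expands stems with paradigm endings. It is a toy of my own and not the method of any of the systems named above; the paradigm names, tags, lexicon entries and the function generate_table are all assumptions.

```python
from collections import defaultdict

# Hypothetical paradigm descriptions: (ending, tag) pairs attached to a stem.
PARADIGMS = {
    "verb-e": [("e", "VB"), ("es", "VBZ"), ("ed", "VBD"), ("ing", "VBG")],
    "noun-s": [("", "NN"), ("s", "NNS")],
}

# Hypothetical lexicon: lemma -> (stem, paradigm name).
LEXICON = {
    "bake": ("bak", "verb-e"),
    "like": ("lik", "verb-e"),
    "cat": ("cat", "noun-s"),
}


def generate_table(lexicon, paradigms):
    """Expand every lexicon entry into its word forms.

    Returns a dict mapping each form to a list of (lemma, tag) analyses.
    """
    table = defaultdict(list)
    for lemma, (stem, paradigm) in lexicon.items():
        for ending, tag in paradigms[paradigm]:
            table[stem + ending].append((lemma, tag))
    return dict(table)


table = generate_table(LEXICON, PARADIGMS)
print(table["baking"])  # [('bake', 'VBG')]
print(table["cats"])    # [('cat', 'NNS')]
```

Irregular forms would have to be listed as explicit exceptions on top of this, and a real description of a richly inflected language needs many more paradigms plus rules for stem alternations.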

There are way too many options, and it really all depends on what you want to do with the results.
