CT: Rather than using a stemmer, you can use a lemmatizer, a tool from Natural Language Processing which does full morphological analysis to accurately identify the lemma for each word. Doing full morphological analysis produces at most very modest benefits for retrieval. It is hard to say more, because either form of normalization tends not to improve English information retrieval performance in aggregate – at least not by very much. While it helps a lot for some queries, it equally hurts performance a lot for others. Stemming increases recall while harming precision. As an example of what can go wrong, note that the Porter stemmer stems all of the following words:
- operate operating operates operation operative operatives operational
to oper. However, since operate in its various forms is a common verb, we would expect to lose considerable precision on queries such as the following with Porter stemming:
- operational AND research;
- operating AND system;
- operative AND dentistry.
For a case like this, moving to using a lemmatizer would not completely fix the problem because particular inflectional forms are used in particular collocations: a sentence with the words operate and system is not a good match for the query operating AND system. Getting better value from term normalization depends more on pragmatic issues of word use than on formal issues of linguistic morphology.
The situation is different for languages with much more morphology (such as Spanish, German, and Finnish). Results in the European CLEF evaluations have repeatedly shown quite large gains from the use of stemmers (and compound splitting for languages like German).
S: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html (last access: 19 December 2014)
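The precision loss described in the context above can be illustrated with a minimal sketch. It is not a Porter stemmer: the function toy_stem hard-codes only the one result quoted in the text (the seven forms of operate all map to oper) and leaves every other word unchanged. The three documents, the function names, and the tiny inverted index are hypothetical and exist only for this illustration.

```python
# Toy illustration only: the stem table hard-codes the single Porter result
# quoted in the text (seven forms -> "oper"). Every other word is left
# unchanged, which a real stemmer would not do.
OPER_FORMS = {
    "operate", "operating", "operates", "operation",
    "operative", "operatives", "operational",
}


def toy_stem(word):
    """Map the seven quoted forms to 'oper'; leave any other word as-is."""
    return "oper" if word in OPER_FORMS else word


def identity(word):
    """No normalization: index and query the exact word form."""
    return word


def build_index(docs, normalize):
    """Build an inverted index: normalized term -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index


def boolean_and(index, terms, normalize):
    """Return ids of documents that contain every normalized query term."""
    postings = [index.get(normalize(t), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical three-document collection.
DOCS = {
    1: "the operating system schedules processes",
    2: "surgeons operate on the nervous system",
    3: "operational research studies optimization",
}

query = ["operating", "system"]  # the query: operating AND system

exact_hits = boolean_and(build_index(DOCS, identity), query, identity)
stemmed_hits = boolean_and(build_index(DOCS, toy_stem), query, toy_stem)

print("without stemming:", sorted(exact_hits))
print("with toy stemming:", sorted(stemmed_hits))
```

With exact matching only document 1 is returned. With the toy stem, document 2 (operate ... system) is returned as well: one more document matches, but the extra match is not about operating systems, which is the recall-for-precision trade described in the context.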
N: 1. From lemmatize: New Latin lemmat-, lemma ("lemma") + English -ize. To lemmatize is to sort (words in a corpus) in order to group with a lemma all its variant and inflected forms.
2. For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
S: 1. MW – http://www.merriam-webster.com/dictionary/lemmatize (last access: 19 December 2014). 2. http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html (last access: 19 December 2014).
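The grouping described in notes 1 and 2 (collecting the variant forms of a word under one lemma) might be sketched as follows. This is a simple table lookup, not morphological analysis: LEMMA_TABLE is a hypothetical hand-written table covering only the inflected forms of organize cited in note 2, and the sample sentence is invented for the illustration. A real lemmatizer would derive such mappings from a vocabulary and an analysis of each word.

```python
from collections import defaultdict

# Hypothetical hand-written lemma table (an assumption for this sketch).
# It covers only the inflected forms of "organize" mentioned in the note.
LEMMA_TABLE = {
    "organize": "organize",
    "organizes": "organize",
    "organizing": "organize",
}


def lemmatize(word):
    """Return the lemma from the table, or the lowercased word if unknown."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


def group_by_lemma(tokens):
    """Group the distinct surface forms in tokens under their lemma."""
    groups = defaultdict(set)
    for token in tokens:
        groups[lemmatize(token)].add(token.lower())
    return {lemma: sorted(forms) for lemma, forms in groups.items()}


# Invented sample sentence.
tokens = "Organizing committees organize what the chair organizes".split()
groups = group_by_lemma(tokens)

print(groups["organize"])
```

Words missing from the table, including derivationally related words such as democracy and democratic, are returned unchanged and therefore stay in separate groups in this sketch.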