Word Stemming

Stemming consists of processing a word so that only its stem or root form is left. In Okapi, indexed keywords are held in stemmed form, and query words entered by the user are stemmed in the same way. Because both sides are reduced to the same form, a query is more likely to match relevant documents.

Suppose the user types impressionists as a search term. Without stemming, this would match only records containing the plural (impressionists) and not the singular (impressionist), so some potentially useful documents would be lost. With stemming, records containing either the singular or the plural form would be matched and, depending on the precise method used, perhaps records containing impressionistic and impressionism as well.
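
The matching behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: toy_stem is a hypothetical stand-in that strips a single final "s", not the stemmer Okapi uses, and build_index, search and the sample records are invented for the example.

```python
from collections import defaultdict


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: lowercase, then drop one final plural "s"."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def build_index(records: dict[int, str]) -> dict[str, set[int]]:
    """Map each stemmed keyword to the ids of the records that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for record_id, text in records.items():
        for word in text.split():
            index[toy_stem(word)].add(record_id)
    return index


def search(index: dict[str, set[int]], term: str) -> set[int]:
    """Stem the query term the same way as the index, then look it up."""
    return index.get(toy_stem(term), set())


# Invented sample records, for illustration only.
records = {
    1: "French impressionists",
    2: "An impressionist painter",
    3: "Cubist sculpture",
}
index = build_index(records)
matches = search(index, "impressionists")  # ids of records 1 and 2
```

Because the index and the query pass through the same function, the plural query reaches the record that contains only the singular. A term such as impressionism would still find nothing here, since this toy stemmer handles plurals only.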

The Okapi system uses a stemming algorithm developed by Martin Porter at Cambridge. It implements weak stemming, which removes common plural endings and other grammatical suffixes such as -ing and -ed. For Okapi '86, the extended version of this algorithm was used for strong stemming, which removes derivational suffixes such as -ent, -ence, and -ision.
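
The difference between weak and strong stemming can be illustrated with a deliberately crude suffix stripper. The sketch below is not Porter's algorithm and is not taken from Okapi: the rule tables, the MIN_STEM threshold and the function names are all assumptions chosen for the example, and the real algorithm applies more careful conditions before removing a suffix.

```python
# Illustrative rule tables (assumptions, not Okapi's or Porter's actual rules).
WEAK_RULES = [
    ("ies", "y"),    # queries -> query
    ("sses", "ss"),  # classes -> class
    ("ing", ""),     # matching -> match
    ("ed", ""),      # indexed -> index
    ("s", ""),       # impressionists -> impressionist
]
STRONG_SUFFIXES = ("ence", "ision", "ent")
MIN_STEM = 3  # assumed minimum number of letters that must remain


def weak_stem(word: str) -> str:
    """Strip one common inflectional ending; the first matching rule wins."""
    w = word.lower()
    for suffix, replacement in WEAK_RULES:
        stem_len = len(w) - len(suffix)
        if not w.endswith(suffix) or stem_len < MIN_STEM:
            continue
        if suffix == "s" and w.endswith("ss"):
            continue  # keep words like "glass" intact
        return w[:stem_len] + replacement
    return w


def strong_stem(word: str) -> str:
    """Apply weak stemming, then strip one derivational suffix if present."""
    w = weak_stem(word)
    for suffix in STRONG_SUFFIXES:
        stem_len = len(w) - len(suffix)
        if w.endswith(suffix) and stem_len >= MIN_STEM:
            return w[:stem_len]
    return w


weak_result = weak_stem("differences")      # "difference"
strong_result = strong_stem("differences")  # "differ"
```

Under these toy rules, weak stemming conflates differences with difference, while strong stemming also brings different and difference down to the same stem. A stripper this simple over-stems some words (string becomes str, for example).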