Okapi '86: Word stemming, spelling correction and cross reference tables

This version of Okapi, developed by Stephen Walker and Richard Jones, aimed to reduce the proportion of searches that failed because of miskeyings and spelling mistakes. They investigated how stemming, truncation, spelling correction and a cross reference table of synonyms could be used to minimise such failures. The project was based at the Polytechnic of Central London, now the University of Westminster.

Design

The Okapi catalogue file of 90,000 records was created from a subset of MARC data. Since results from the Okapi '84 evaluation experiments indicated that users wanted to do more extensive subject searching, MARC field 651 was added to the bibliographic files. This field comprises subject headings and contains form, content and geographic subdivisions.

User input was parsed so that words could be stemmed and their spelling normalised. Both weak and strong stemming were performed on query terms entered by the user. The system used a rule-based spelling standardisation process to conflate common variants such as colour and color, or medieval and mediaeval.
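
The two levels of stemming and the spelling standardisation step might be sketched as follows. This is a minimal illustration, not the Okapi '86 rule set, which is not given here: the suffix lists, the spelling rules and every function name are hypothetical. It assumes that weak stemming only conflates simple plural endings and that strong stemming additionally strips derivational suffixes.

```python
import re

# Hypothetical rule tables; the actual Okapi '86 rules are not reproduced here.
WEAK_SUFFIXES = [("ies", "y"), ("sses", "ss"), ("s", "")]
STRONG_SUFFIXES = ["isation", "ization", "ation", "ness", "ment"]
SPELLING_RULES = [
    (re.compile(r"(?<=\w{3})our$"), "or"),  # colour -> color
    (re.compile(r"ae"), "e"),               # mediaeval -> medieval
]


def weak_stem(word):
    """Conflate simple plural endings only."""
    word = word.lower()
    if word.endswith("ss"):  # leave words such as "glass" alone
        return word
    for suffix, replacement in WEAK_SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem + replacement
    return word


def strong_stem(word):
    """Apply weak stemming, then strip one derivational suffix."""
    word = weak_stem(word)
    for suffix in STRONG_SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word


def normalise_spelling(word):
    """Map variant spellings onto a single standard form."""
    for pattern, replacement in SPELLING_RULES:
        word = pattern.sub(replacement, word)
    return word


def process_term(word, strong=False):
    """Stem a query term, then standardise its spelling."""
    stem = strong_stem(word) if strong else weak_stem(word)
    return normalise_spelling(stem)
```

Under these assumed rules, process_term("Colours") and process_term("color") both give "color", so either spelling retrieves the same records provided the index terms are processed in the same way. With strong=True, "organisations" and "organs" both reduce to "organ", which illustrates how aggressive suffix stripping can merge unrelated words.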

Cross referencing of synonyms was implemented using a lookup table. This contained abbreviations (BBC, CND), noun/adjective pairs (Wales/Welsh), irregular plurals (wife, wives), alternatives (USSR, Soviet Union) and alternative spellings (tsar, tzar, czar, csar).
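
A lookup table of this kind can be sketched as a mapping from each term to its group of equivalents. The groups below are taken from the examples above; the data structure and the names CROSS_REFERENCES, build_lookup and expand_term are assumptions made for illustration, not a description of how Okapi stored its table.

```python
# Equivalence groups drawn from the examples in the text.
CROSS_REFERENCES = [
    frozenset({"wales", "welsh"}),
    frozenset({"wife", "wives"}),
    frozenset({"ussr", "soviet union"}),
    frozenset({"tsar", "tzar", "czar", "csar"}),
]


def build_lookup(groups):
    """Point every term at the group it belongs to."""
    lookup = {}
    for group in groups:
        for term in group:
            lookup[term] = group
    return lookup


LOOKUP = build_lookup(CROSS_REFERENCES)


def expand_term(term):
    """Return the term with all its cross-referenced equivalents, sorted."""
    key = term.strip().lower()
    return sorted(LOOKUP.get(key, {key}))
```

For instance, expand_term("Tzar") returns all four spellings, which could then be combined with OR in the search, while a term with no entry is returned unchanged. Each new group has to be added by hand, which is the maintenance overhead such a table carries.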

Spelling errors were dealt with by checking terms the system could not find against a Soundex-type index of candidate alternatives. If one of these matched the user's input closely enough, it was offered to the user as a replacement.
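
The general approach can be sketched as below. The exact coding scheme and closeness measure used in Okapi '86 are not specified here, so this example substitutes the standard four-character Soundex code for the "Soundex-type" key and Python's difflib similarity ratio for the closeness test. The 0.75 threshold and the names build_index and suggest are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Standard Soundex consonant classes.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the standard four-character Soundex code for a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def build_index(terms):
    """Group known index terms under their Soundex code."""
    index = defaultdict(list)
    for term in terms:
        index[soundex(term)].append(term)
    return index


def suggest(term, index, threshold=0.75):
    """Offer the closest same-code term, or None if nothing is close enough."""
    best, best_score = None, 0.0
    for candidate in index.get(soundex(term), []):
        score = SequenceMatcher(None, term.lower(), candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

With an index built from ["catalogue", "library", "spelling", "stemming"], suggest("speling", index) returns "spelling", which the system would then offer to the user rather than substitute silently. A phonetic key of this kind misses some errors: "libary" and "library" receive different standard Soundex codes, so that miskeying would not be caught by this sketch.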

Evaluation

A live evaluation was carried out to assess the effects of the different levels of stemming. One experimental condition used both weak and strong stemming, another condition used only weak stemming and the third condition used no stemming.

Results of re-running searches showed that weak stemming was beneficial, but that strong stemming was not always safe. The spelling correction feature was found to make corrections that the user had not intended. The lookup table for synonyms was neither conclusively helpful nor detrimental to searches, but it did relieve the user of the need to consider common synonyms. However, as with a thesaurus, it carried the overhead of maintenance.

The next project, Okapi '87, examined the relative effectiveness of two online catalogues: Libertas and Okapi.