Eliminating term duplication

The same term can be derived from more than one source. For example, a word initially entered by the user may turn out to be a one-word descriptor (thesaurus) term which, during the session, accumulates a high enough weight through occurrence in relevant documents to be added to the current query. Terms from different sources are treated slightly differently for searching, so each such duplicate must be represented separately in the underlying set object, but it would clearly be puzzling to show them all on the surface. Thus when the query layer assembles lists of terms for the interface script to display, it must check for and eliminate duplication. Similarly, if the user deletes a term from the current query list, all its occurrences, from whatever source, must be removed from the underlying set.
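As an illustration only, the behaviour described above might be sketched as follows. The class name TermSet, its methods and the source labels ("user", "thesaurus", "feedback") are hypothetical and are not taken from the system itself; the sketch simply keeps one entry per (term, source) pair, collapses them for display, and removes every entry for a term on deletion.

```python
class TermSet:
    """Hypothetical stand-in for the underlying set object."""

    def __init__(self):
        # One entry per (term, source) pair, in order of arrival.
        self.entries = []

    def add(self, term, source):
        # The same term may be stored once for each source it came from.
        if (term, source) not in self.entries:
            self.entries.append((term, source))

    def display_terms(self):
        # List assembled for the interface: each term appears once,
        # in the order it was first encountered.
        seen = set()
        shown = []
        for term, _source in self.entries:
            if term not in seen:
                seen.add(term)
                shown.append(term)
        return shown

    def delete(self, term):
        # Deleting a term removes all its occurrences, whatever the source.
        self.entries = [(t, s) for (t, s) in self.entries if t != term]


terms = TermSet()
terms.add("bank", "user")
terms.add("bank", "thesaurus")
terms.add("loan", "feedback")
print(terms.display_terms())
terms.delete("bank")
print(terms.entries)
```

Keeping the per-source entries internally and deduplicating only at display time means the search side can still treat each source differently, which is the reason given above for not merging the duplicates in the set itself.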

Another form of duplication arises when several different terms (e.g. bank, banks, banking) have the same stem. The general principle is to stem all query terms and to show in the query list only one unstemmed form, namely the first one encountered during the session. Problems arose during some of the early evaluation experiments when, after the user had deleted a term such as bank, the system extracted a grammatical variant such as banks and inserted it into the query. The query layer is now responsible for ensuring that whenever a term is deleted, any other term with the same stem is implicitly deleted as well.
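The stem-based rule can be sketched in the same spirit. Everything named here is an assumption made for illustration: toy_stem is a crude suffix stripper standing in for whatever stemmer the system actually uses, and StemmedQuery is a hypothetical class. The sketch also assumes that a deleted stem blocks any later insertion of a variant; the text does not say whether a user may deliberately re-enter such a term.

```python
def toy_stem(word):
    # Crude stand-in for a real stemmer: strips "ing" or a final "s"
    # when at least three characters remain.
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class StemmedQuery:
    """Hypothetical query list keyed by stem."""

    def __init__(self):
        self.first_form = {}   # stem -> first unstemmed form seen this session
        self.deleted = set()   # stems the user has deleted

    def add(self, term):
        stem = toy_stem(term)
        if stem in self.deleted:
            # A variant of a deleted term is not reinserted (assumption).
            return False
        # Only the first unstemmed form encountered is kept for display.
        self.first_form.setdefault(stem, term)
        return True

    def delete(self, term):
        # Deleting any form implicitly deletes every term with the same stem.
        stem = toy_stem(term)
        self.first_form.pop(stem, None)
        self.deleted.add(stem)

    def display(self):
        return list(self.first_form.values())


query = StemmedQuery()
query.add("banking")
query.add("bank")
query.add("loans")
print(query.display())
query.delete("bank")
print(query.add("banks"), query.display())
```

Recording deleted stems rather than deleted surface forms is what prevents the problem seen in the early experiments: once bank has been deleted, banks maps to the same stem and is refused.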

PAYNE A
Wed Jul 3 14:11:32 BST 1996