Next: Eliminating term duplication Up: Term Set Manipulation Previous: Term Set Manipulation

Suggested words from the thesaurus

When the user first enters a query, the system attempts to find thesaurus terms which might also be relevant. Candidate terms are weighted according to the number of words they have in common with the query and the overall frequency of those words in the thesaurus as a whole. This is a simplified version of the Okapi best-match document ranking algorithm used in the earlier CILKS research project. Candidates are shown in weight order; experiments showed that the top ten from the London Business School thesaurus and the top thirty from the INSPEC thesaurus provide a reasonable choice.
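The weighting described above can be illustrated with a minimal sketch. It is not the system's actual code: the function names, the whitespace tokenizer, and the logarithmic rarity weight are all assumptions chosen for illustration, in the spirit of a best-match ranking where a shared word counts for more when it is rare in the thesaurus.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lower-cased whitespace splitting stands in for the
    # real system's word extraction, which the text does not describe.
    return text.lower().split()


def rank_terms(query, thesaurus_terms, top_n=10):
    """Return up to top_n thesaurus terms sharing words with the query.

    Hypothetical weighting: each shared word contributes log((N + 1) / n_w),
    where N is the number of thesaurus terms and n_w is the number of
    terms containing the word, so rarer words contribute more.
    """
    term_words = {term: set(tokenize(term)) for term in thesaurus_terms}
    n_terms = len(term_words)

    # How many thesaurus terms contain each word.
    word_freq = Counter(w for words in term_words.values() for w in words)

    query_words = set(tokenize(query))
    scored = []
    for term, words in term_words.items():
        shared = query_words & words
        if not shared:
            continue  # no words in common, so not a candidate
        weight = sum(math.log((n_terms + 1) / word_freq[w]) for w in shared)
        scored.append((weight, term))

    # Highest weight first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_n]]


# Made-up sample terms, not taken from either thesaurus named in the text.
sample_thesaurus = [
    "information retrieval",
    "information systems",
    "information theory",
    "retrieval software",
    "database management",
]
suggestions = rank_terms("information retrieval", sample_thesaurus, top_n=3)
```

The `top_n` parameter corresponds to the cut-off discussed above (ten for one thesaurus, thirty for the other); a term sharing two query words outranks one sharing a single word, and a single rare word outranks a single common one.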

The thesaurus match routines must also handle the problem of ``lead-in'' terms: those which are not themselves used as document descriptors, but are (supposedly) synonyms of ``preferred'' terms which are. From the user's point of view, only the preferred terms will be useful additions to the query, but since the relationship between the preferred term and the original query is not always obvious, it is necessary to show the corresponding lead-in term to explain the connection.

Exploration of these and related issues produced the following rules to be followed by the query layer software in constructing the suggested words list:

Omitted from the suggestions list are preferred terms which do not index any documents, and lead-in terms whose preferred terms either are already in the suggestions list or the working query, or do not index any documents.

Clearly the above rules will not produce the ``best'' result in every case: they must be considered as heuristics which try to give users maximum benefit from the thesaurus without overloading them with irrelevant material. The developers considered further refinements to cover exactly-matching substrings longer than one word but shorter than a complete query or thesaurus term, but felt that the law of diminishing returns would apply: the additional complexity would not be justified by the number of real cases encountered.
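The omission rule stated above (drop preferred terms that index no documents, and drop lead-in terms whose preferred term is already listed, already in the working query, or indexes no documents) might be sketched as follows. The `Candidate` type, the `doc_counts` mapping, and the function name are hypothetical, and the sample terms are invented. The sketch also assumes that a lead-in term counts as putting its preferred term into the list, so a second lead-in for the same preferred term is dropped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    term: str
    # None means the term is itself a preferred term; otherwise this is
    # the preferred term that the lead-in term points to.
    preferred: Optional[str] = None


def filter_suggestions(candidates, doc_counts, working_query):
    """Apply the omission rule to candidates given in weight order.

    Returns (preferred_term, lead_in_or_None) pairs, so a preferred term
    reached through a lead-in can be shown with the lead-in beside it.
    """
    def indexes_documents(term):
        return doc_counts.get(term, 0) > 0

    # Preferred terms that will appear in the list in their own right.
    listed = {c.term for c in candidates
              if c.preferred is None and indexes_documents(c.term)}
    in_query = set(working_query)

    result = []
    for c in candidates:
        if c.preferred is None:
            if indexes_documents(c.term):
                result.append((c.term, None))
            continue
        target = c.preferred
        if target in listed or target in in_query:
            continue  # preferred term is already available to the user
        if not indexes_documents(target):
            continue  # preferred term would retrieve nothing
        listed.add(target)
        result.append((target, c.term))
    return result


# Invented sample data for illustration only.
sample_candidates = [
    Candidate("marketing"),
    Candidate("advertising research"),                  # indexes no documents
    Candidate("adverts", preferred="advertising"),
    Candidate("promos", preferred="advertising"),       # same preferred term
    Candidate("selling", preferred="marketing"),        # preferred term listed
    Candidate("finance news", preferred="financial reporting"),  # in query
    Candidate("obsolete lead", preferred="empty term"),  # no documents
]
sample_counts = {"marketing": 12, "advertising": 5, "financial reporting": 3}
shown = filter_suggestions(sample_candidates, sample_counts,
                           ["financial reporting"])
```

Returning the lead-in alongside its preferred term reflects the earlier point that the lead-in is needed to explain why the preferred term was suggested.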






PAYNE A
Wed Jul 3 14:11:32 BST 1996