CILKS Final Report: Probabilistic Retrieval

What is Probabilistic Retrieval?

In probabilistic retrieval, a numeric weight is assigned to each retrieved document as an estimate of the probability that it is relevant to the query. Documents are presented to the user in descending order of weight, so those considered to be the best match come at the top of the hit list.

The two most important factors in the weight calculation are:

the number of different query terms in the retrieved document,
the frequency of those terms in the database as a whole. Frequent terms are given a lower weight than rare terms, as they are assumed to be less specific, and therefore less useful for pinpointing relevant documents.
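The two factors above can be illustrated with a small sketch. It is not the OKAPI implementation: the weight log(N / n), where N is the number of documents and n the number containing the term, is an assumed stand-in for whatever formula the system actually used, and the function names are hypothetical.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    return df

def term_weight(num_docs, doc_freq):
    """Assumed weight log(N / n): rare terms score higher than frequent ones."""
    return math.log(num_docs / doc_freq)

def rank(query, docs):
    """Score each document by summing the weights of the distinct query
    terms it contains, then sort with the best match first."""
    df = document_frequencies(docs)
    num_docs = len(docs)
    scores = {}
    for doc_id, terms in docs.items():
        present = set(query) & set(terms)
        if present:
            scores[doc_id] = sum(term_weight(num_docs, df[t]) for t in present)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With this scheme a document matching both a rare and a common query term outranks one matching only the common term, and documents matching no query term are not retrieved at all.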

When retrieving from databases of long documents, two other factors are considered:

the number of times each query term occurs in the retrieved document,
the length of the document as a whole.
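One widely known weighting function that combines these four factors is BM25, which originated in the Okapi work. The sketch below is illustrative only: this report does not say which function or parameter values were used for CILKS, so the defaults k1 = 1.2 and b = 0.75 are assumptions, and the "+ 1" inside the logarithm is a common variant chosen here to keep weights non-negative.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq,
                    k1=1.2, b=0.75):
    """Contribution of one query term to a document's score.

    tf          -- occurrences of the term in the document
    doc_len     -- length of the document (e.g. in words)
    avg_doc_len -- average document length in the database
    num_docs    -- number of documents in the database
    doc_freq    -- number of documents containing the term
    k1, b       -- tuning constants (assumed conventional defaults)
    """
    # Rare terms get a larger weight than frequent ones.
    idf = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    # Longer-than-average documents need more occurrences for the same score.
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, num_docs, doc_freqs):
    """Sum the per-term contributions over the distinct query terms."""
    return sum(
        bm25_term_score(doc_tf.get(t, 0), doc_len, avg_doc_len,
                        num_docs, doc_freqs[t])
        for t in set(query_terms) if t in doc_freqs
    )
```

The term-frequency component saturates: each additional occurrence of a term adds less than the previous one, so a long document cannot dominate the ranking simply by repeating a query term.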

Documents retrieved by the original query are presented to the user, who is asked to supply relevance feedback about them, that is, to indicate which ones are relevant. Given this set of relevant documents, the original query terms are re-weighted, and additional terms are extracted from the relevant documents, in order to identify other similar documents. This is an effective method of query expansion.
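The feedback step can be sketched with the Robertson/Sparck Jones relevance weight, a standard formula in probabilistic retrieval. Whether CILKS used exactly this form is not stated here, so treat it as an assumption; the function names, the ranking of candidates by r times the weight, and the top_k cut-off are likewise illustrative choices.

```python
import math
from collections import Counter

def relevance_weight(r, n, R, N):
    """Robertson/Sparck Jones relevance weight for one term.

    r -- relevant documents containing the term
    n -- all documents containing the term
    R -- documents judged relevant
    N -- documents in the database
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def expansion_candidates(relevant_docs, doc_freq, num_docs, query_terms,
                         top_k=5):
    """Rank terms from the relevant documents that are not already in the
    query, scoring each by r * relevance_weight (an assumed selection value)."""
    R = len(relevant_docs)
    r_counts = Counter()
    for terms in relevant_docs:
        r_counts.update(set(terms))
    scored = []
    for term, r in r_counts.items():
        if term in query_terms:
            continue
        weight = relevance_weight(r, doc_freq[term], R, num_docs)
        scored.append((term, r * weight))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

When no feedback is available (R = 0 and r = 0) the weight reduces to a pure collection-frequency weight, so the same function can serve both the original query and the re-weighted one.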

The probabilistic retrieval system used for the CILKS project was OKAPI.