The Probabilistic Retrieval Model

This model was first developed by Steve Robertson (Head of Department, Information Science) and Karen Sparck Jones in the 1970s. The following description of the model is based on their 1994 paper, 'Simple, proven approaches to text retrieval', which was published as a Cambridge Computer Laboratory Technical Report.

The Simplest Probabilistic Approach

In the probabilistic model (in contrast to a Boolean retrieval system), the query a database user types to retrieve information is treated as an unstructured list of words or phrases. These terms are then matched against the documents in the database.

One probabilistic approach to the matching process would be to rank retrieved documents by their degree of match, that is, by how many of the query terms each document contains. For example, if the user entered the query "american art 20th century", the retrieved documents might be ranked as follows:

Degree of Match           Document
---------------           --------
4/4 words match           American Art in the 20th Century
4/4 words match           American Folk Art: 20th Century Blues
3/4 words match           Art in the 20th Century
2/4 words match           20th Century World History
1/4 words match           Modern Art

However, this is not very helpful when many documents contain the same number of user-entered terms, because all of those documents receive the same rank and nothing distinguishes between them.
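The simple ranking above can be sketched in a few lines of Python. This is only an illustration: the function names, the tokenizing rule (lowercase, split on non-alphanumeric characters) and the order of the input titles are assumptions made for the example, not part of the model as described.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_by_match_count(query, titles):
    """Rank titles by how many distinct query terms each one contains."""
    query_terms = set(tokenize(query))
    scored = []
    for title in titles:
        matched = query_terms & set(tokenize(title))
        if matched:  # keep only documents matching at least one term
            scored.append((len(matched), title))
    # Highest count first; the sort is stable, so tied documents
    # simply keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


titles = [
    "Modern Art",
    "American Art in the 20th Century",
    "20th Century World History",
    "American Folk Art: 20th Century Blues",
    "Art in the 20th Century",
]

ranking = rank_by_match_count("american art 20th century", titles)
for count, title in ranking:
    print(f"{count}/4 words match   {title}")
```

The two 4/4 documents come out in whatever order they were supplied, which is the weakness of this approach: the match count alone gives no basis for choosing between them.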

A Better Probabilistic Approach - Weighting Terms

A better way of matching a query to documents, and one which leads to far better performance, is to weight each term (or term-document combination) and to rank retrieved documents according to the sum of the weights of the query terms they contain. Ranking then becomes more sophisticated: documents which contain an equal number of the user's query terms are ordered according to the likely importance of those terms.

The idea behind term weighting is selectivity: what makes a term a good one is whether it can pick any of the few relevant documents from the many non-relevant ones.
(Robertson, S.E. & Sparck Jones, K. 1994)

The three sources of weighting data are:

Collection Frequency - Terms which occur in only a few documents are likely to be more useful than ones occurring in many. The collection frequency weight is a measure of this.
Term Frequency - The more frequently a term appears in a document, the more important it is likely to be for that document. The term frequency weight is a measure of this.
Document Length - A term that occurs the same number of times in a short document as in a long one is likely to be more important to the short document than it is to the long one. The document length weight is a measure of this.

These three weights are combined to give a score for each term-document combination, and the scores for all of the query terms found in a document are then summed to give a total score for each document which matches the query.
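One way to combine the three weights is sketched below. It uses a BM25-style formula of the kind associated with Okapi: the collection frequency weight is log(N/n), and term frequency is damped by a normalized document length. The function names, the toy collection and the tuning constants k1 and b are assumptions chosen for illustration; they are not taken from the description above.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def rank_by_weight(query, documents, k1=2.0, b=0.75):
    """Score each document by the sum of its combined term weights."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(doc_tokens)
    avg_len = sum(len(tokens) for tokens in doc_tokens) / n_docs

    # Collection frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in doc_tokens:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scored = []
    for doc, tokens in zip(documents, doc_tokens):
        term_freq = Counter(tokens)          # term frequency in this document
        norm_len = len(tokens) / avg_len     # normalized document length
        score, matched = 0.0, False
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            matched = True
            cfw = math.log(n_docs / doc_freq[term])  # rare terms weigh more
            score += cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * norm_len) + tf)
        if matched:
            scored.append((score, doc))
    scored.sort(reverse=True)                # highest total score first
    return scored


documents = [
    "folk art folk art folk music",
    "art history and a brief note on folk traditions in many other countries",
    "modern art",
    "world history",
]

ranked = rank_by_weight("folk art", documents)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

In this toy collection the first two documents both contain both query terms, so a simple match count would tie them. The weighted score separates them: the short document that uses the terms repeatedly ranks above the long document that mentions each term once.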

For more information:

Read the paper described above, or look at:-

Robertson,S.E. and Sparck Jones,K. 'Relevance weighting of search terms', Journal of the American Society for Information Science, 27, 1976, 129-146.

This paper is reprinted in:- Willett,P. (ed) Document retrieval systems, London: Taylor Graham, 1988.

The probabilistic model has been used extensively with the Okapi experimental retrieval system.