This model was first developed by Steve Robertson (Head of Department, Information Science) and Karen Sparck Jones in the 1970s. The following description of the model is based on their 1994 paper, 'Simple, proven approaches to text retrieval', which has appeared as a Cambridge Computer Laboratory Technical Report.
One probabilistic approach to the matching process would be to rank retrieved documents by their degree of match. For example, if the user entered the query:

    american art 20th century

the retrieved documents might be:
Degree of Match    Document
---------------    --------
4/4 words match    American Art in the 20th Century
4/4 words match    American Folk Art: 20th Century Blues
3/4 words match    Art in the 20th Century
2/4 words match    20th Century World History
1/4 words match    Modern Art
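The ranking above can be sketched in a few lines: score each document by how many distinct query terms it contains, then sort by that count. The tokenisation here is a crude lowercase-and-strip-punctuation step, assumed purely for illustration.

```python
def tokenize(text):
    # Crude normalisation: lowercase and strip surrounding punctuation.
    return [w.strip(":,.").lower() for w in text.split()]

def degree_of_match(query_terms, doc_terms):
    """Number of distinct query terms present in the document."""
    return len(set(query_terms) & set(doc_terms))

query = tokenize("american art 20th century")

# Document titles from the example table above.
documents = [
    "American Art in the 20th Century",
    "American Folk Art: 20th Century Blues",
    "Art in the 20th Century",
    "20th Century World History",
    "Modern Art",
]

# Rank by degree of match, best first.
ranked = sorted(documents,
                key=lambda d: degree_of_match(query, tokenize(d)),
                reverse=True)

for doc in ranked:
    print(f"{degree_of_match(query, tokenize(doc))}/{len(query)} words match  {doc}")
```

Running this reproduces the ordering in the table: the two 4/4 matches first, down to the single 1/4 match.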
However, if many documents contain the same number of the user's query terms, this ranking is not very helpful.
The idea behind term weighting is selectivity: what makes a term a good one is whether it can pick out the few relevant documents from the many non-relevant ones. The three sources of weighting data are collection frequency, term frequency within the document, and document length (Robertson, S.E. & Sparck Jones, K., 1994).
These weights are combined to give a score for each term-document combination, and the term scores are then combined to give a total score for each document that matches the query.
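The combination step can be sketched as summing, for each document, the weights of the query terms it contains. The weights below are made-up illustrative values, not output of any particular weighting scheme, and the simple sum stands in for the fuller combination used in practice.

```python
# Hypothetical per-term weights for the example query (illustrative values only).
term_weights = {"american": 1.2, "art": 0.9, "20th": 1.5, "century": 1.4}

def document_score(query_terms, doc_terms, weights):
    """Total score for a document: sum of weights of the query terms it contains."""
    return sum(weights.get(t, 0.0) for t in set(query_terms) & set(doc_terms))

query = ["american", "art", "20th", "century"]

full_match = document_score(query, ["american", "art", "in", "the", "20th", "century"],
                            term_weights)
partial = document_score(query, ["modern", "art"], term_weights)
```

With weighted terms, two documents that match the same number of query terms can now receive different scores, depending on how selective their matching terms are.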
For more information, read the paper described above, or look at:
Robertson, S.E. and Sparck Jones, K. 'Relevance weighting of search terms', Journal of the American Society for Information Science, 27, 1976, 129-146.
This paper is reprinted in: Willett, P. (ed.) Document Retrieval Systems, London: Taylor Graham, 1988.
The probabilistic model has been used extensively with the Okapi experimental retrieval system.