The TREC interactive task used 25 adhoc topics selected for this purpose by Donna Harman.
In what follows, it is assumed that (for the purposes of this task) each topic is the subject of an interactive search by a single searcher. The task has two parts: the primary task defines the interactive phase of the search; the secondary task (optional but recommended) involves the choice of a single search formulation and a subsequent off-line search. There is also an optional but recommended baseline task: a comparable non-interactive search.
Participants may need to refine the specification given, and/or provide additional guidance, in their instructions to their searchers. Any such refinement/guidance should form part of the report.
"Find as many documents as you can which address the given information problem, but without too much rubbish. You should complete the task in around 30 minutes or less."
It will be necessary for the system and/or the searcher to record and report on the progress and outcome of the search, in various ways. There follows a series of notes on specific items that need to be recorded; below is a partial specification of the reporting format.
Time taken: the elapsed (clock) time taken for the search, from the time the searcher first sees the topic until s/he declares the search to be finished, should be recorded. It is assumed that the interactive search takes place in one uninterrupted session. If a session is unavoidably interrupted, it is recommended that it be abandoned and the topic given to another searcher.
Sequence of events: all significant events in the course of the interaction should be recorded. The events listed below are those that seem to be fairly generally applicable to different systems and interactive environments; however, the list may need extending or modifying for specific systems.
Timing of events: it may be necessary to record the times of individual events in the interaction (see below).
Intermediate search formulations: if appropriate to the system, these should be recorded.
Documents viewed: "viewing" is taken to mean the searcher seeing a title or some other brief information about a document; these events should be recorded.
Documents seen: "seeing" is taken to mean the searcher seeing the text of a document, or a substantial section of text; these events should be recorded.
Terms entered by the searcher: if appropriate to the system, these should be recorded.
Terms seen (offered by the system): if appropriate to the system, these should be recorded.
Selection/rejection: documents or terms selected or rejected by the searcher for any further stage of the search (in addition to the final selection of documents).
"Sparse" format: a list of the identifiers of the selected documents for each topic, together with the elapsed (clock) time of the search.
"Rich" format: for each topic, the sequence of events as indicated above, and perhaps the times of events. A fuller specification of this rich format will be made at a later date; it is likely to require further interaction among the groups taking part, to ensure that all groups can comply.
Two further items should be reported. A full narrative description should be given of the interactive session for one designated topic (the topic will be specified at a later date). As indicated above, any refinement of the task specification and/or further guidance given to the searchers should also be reported.
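Purely by way of illustration of the "sparse" and "rich" formats described above (the rich format has yet to be specified, and every field name, event label, topic number and document identifier below is an assumption made for this example), the two formats might be rendered along the following lines, sketched here in Python:

    # Illustrative sketch only: field names, event labels, topic numbers and
    # document identifiers are placeholders, not part of the specification.

    # "Sparse" record for one topic: the identifiers of the selected documents
    # plus the elapsed (clock) time of the search.
    sparse_record = {
        "topic": 303,
        "elapsed_seconds": 1425,
        "selected": ["DOC-0042", "DOC-0101", "DOC-0317"],
    }

    # "Rich" record for one topic: the sequence of events, each with an
    # optional time (seconds from the start of the session).
    rich_record = {
        "topic": 303,
        "events": [
            {"t": 12,  "event": "FORMULATION",  "query": "example query terms"},
            {"t": 40,  "event": "VIEWED",       "doc": "DOC-0042"},  # title/brief info
            {"t": 55,  "event": "SEEN",         "doc": "DOC-0042"},  # full text
            {"t": 70,  "event": "SELECTED",     "doc": "DOC-0042"},
            {"t": 95,  "event": "TERM_SEEN",    "term": "offered-term"},
            {"t": 101, "event": "TERM_ENTERED", "term": "searcher-term"},
        ],
    }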
As secondary measures for the primary task, we will be looking for additional measures of performance, and also measures of the effort and complexity of the search, based on what actions and decisions the searcher takes. Measures of performance may include the utility measures being considered by the filtering track for set-retrieval evaluation. Effort measures may include the average search time per relevant selected, the number of documents viewed or seen, and the density of relevant documents selected as a proportion of documents viewed at each stage (a sort of local precision). Some such measures will be defined in association with the "rich" reporting format described above, but it is likely that different measures would be appropriate for different systems, and participants are invited to suggest their own ways of making such measurements and to present the results in their papers. The object would be not only to allow for between-system comparisons, but also to provide diagnostic information on any aspects of the search process.
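As a minimal sketch only (assuming the kind of counts that the hypothetical rich record above would yield, and that relevance judgements for the selected documents are available), the simpler of these measures might be computed as follows; overall figures are shown, though the per-stage "local precision" would use the same ratio computed stage by stage:

    def effort_measures(elapsed_seconds, viewed, seen, selected, relevant_selected):
        """Illustrative effort/performance measures; the names are assumptions.

        viewed, seen and selected are counts of the corresponding events;
        relevant_selected is the number of selected documents judged relevant.
        """
        return {
            "docs_viewed": viewed,
            "docs_seen": seen,
            "docs_selected": selected,
            # Average search time per relevant document selected.
            "seconds_per_relevant": (elapsed_seconds / relevant_selected
                                     if relevant_selected else None),
            # Relevant documents selected as a proportion of documents viewed
            # (here over the whole session rather than at each stage).
            "local_precision": (relevant_selected / viewed) if viewed else None,
        }

    print(effort_measures(elapsed_seconds=1425, viewed=60, seen=14,
                          selected=9, relevant_selected=7))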
At the end of the interactive session, the searcher is to generate a ranked list of 1000 documents. We envisage that this will be achieved by the searcher choosing an appropriate search formulation from those already tried, or defining a new one on the basis of the experience gained in the interactive session. The chosen formulation could then be run off-line. The target is to include as many relevant documents as possible, as high up as possible, in this ranked list.
The specification of this task to the searchers will depend somewhat on the nature of the system and therefore the manner in which the ranked list can be generated, and should be reported.
The relation between this ranked list and any documents which figured in the primary task will depend on the specific system, and therefore needs to be fully specified in the report. We envisage the following as an appropriate relation:
Generally, the report for this task should cover: the specification given to the searchers, and the relation between the ranked list and items encountered in the interactive session.
The ranked list of 1000 items will be evaluated using standard TREC evaluation methods, as used for the main TREC tasks. The results would be, at least at some level, comparable with the main adhoc TREC runs.
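As an illustration (a minimal sketch; the run tag, scores and file name are placeholders, and the track's own submission instructions take precedence), a ranked list in the usual TREC run format (topic number, "Q0", document identifier, rank, score, run tag) can be written as follows:

    # Append one topic's ranked list (up to 1000 documents) to a run file in
    # the usual TREC format: topic  Q0  docno  rank  score  run_tag
    def write_run(path, topic, ranked_docs, run_tag="EXAMPLE1"):
        with open(path, "a") as out:
            for rank, (docno, score) in enumerate(ranked_docs[:1000], start=1):
                out.write(f"{topic} Q0 {docno} {rank} {score:.4f} {run_tag}\n")

    write_run("secondary.run", topic=303,
              ranked_docs=[("DOC-0042", 17.3), ("DOC-0101", 15.8)])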
There are many possible ways one might set up such a baseline for comparison, though there may be systems for which no non-interactive baseline is possible. The idea is to make a run using essentially the same system as is used for the interactive run, but without interaction. The starting point might be the topics as given, or a manually derived query (constructed without reference to the documents).
The baseline task should produce a ranked list of 1000 documents, to be reported and evaluated in the usual TREC fashion.