OkapiNET: The creation of a portable search engine and Web interface

Summary
Aim and Objectives
Rationale and Benefits
The Current State of Okapi
Deliverables and Dissemination
- Okapi Search Engine
- Okapi Web Interface
Collaboration
References

Summary: The OkapiNET Project

The OkapiNET project is the latest in a long line of Okapi projects carried out by the Centre for Interactive Systems Research at City University. The project, which began in early January 1997 and will be completed in February 1998, is funded by the British Library's Research and Innovation Centre.
The Okapi experimental text retrieval system has become a well-established evaluation facility for advanced methods in information retrieval (IR). Okapi's success in the international Text REtrieval Conferences (TREC) over the past four years has brought recognition and stimulated much interest within the IR research community. The time now seems opportune for increasing Okapi's visibility and accessibility. The overall aim of the project is to make Okapi more widely available than at present, in order to demonstrate its capabilities, create new opportunities for IR research, and assist in the technology transfer process. The specific objectives are twofold: 1) to design a portable search engine which can be used more widely as a research tool by other IR researchers; 2) to create a Web interface which will make Okapi more accessible via the Internet and demonstrate its full search capabilities. The rationale is not to develop Okapi into a product but to develop it further as a research platform which can have an impact on product development in information retrieval systems and applications.

Aim and Objectives

The overall aim of the project is to make Okapi more widely available than at present, to demonstrate its capabilities, and in turn to stimulate new research which could promote the take-up of its proven probabilistic retrieval techniques in commercial applications.

There are two specific objectives to achieve that aim:

To make the Okapi software portable so that it can be used more widely as a research tool by other researchers in the IR community.
To make Okapi more readily accessible via the Internet through a Web interface which can support and demonstrate its full search capabilities.

Portability and Web access are considered complementary objectives, both in achieving the general aim of the project and in the resources they require. That is to say, meeting both objectives would need about the same level of effort as meeting only one or the other.

Rationale and Benefits

The rationale for the project is not to develop Okapi into a product, but to develop it further as a research platform which can have an impact on product development in information retrieval systems and applications. The development work can be viewed essentially as a more pro-active approach to the dissemination of our research and is considered to be a necessary way forward. This will benefit our research effort in several ways:

Firstly, our operational testing to date has been limited to library sites or to recruiting users on the local area network at City University or the London Business School. Widening access on the Internet will provide us with a global pool and greater variety of user groups. This in turn will enable us to study different sorts of information seeking activities and behaviours.

Secondly, access to data for building test collections or for carrying out live evaluations has always been problematic in IR research. The possibility of making use of different sorts of data already in the public domain on the Internet will be a great advantage.

Thirdly, the Okapi team has been repeatedly approached by other researchers to make the Okapi software available for their own research purposes, in the same way as the SMART system, developed at Cornell, has been available for experimental use for more than twenty years (3). Versions of Okapi have been used by researchers at Sheffield and UCLA, but this arrangement has not proved very satisfactory. In the light of the general success of the statistical methods at TREC, and in particular the competitive performance of Okapi's term weighting approach, there is now a strong case for making Okapi available as an alternative search engine and research tool to the vector space model of the SMART system (4).

Fourthly, the promotion of post-Boolean retrieval methods is paramount for the future development of information systems. Although some commercial products have adopted statistical methods, for the most part these have been based on the SMART vector space model. A World-Wide-Web implementation of Okapi would be an effective way of demonstrating its capability. The portability of the software as a set of development tools would go some way to promote its take-up by system and product developers.

The Current State of Okapi

At present (1997) an Okapi system comprises four layered software components:

A user interface program written in a high-level scripting language (TCL) which supports a GUI running under X-Windows. This has some degree of configurability, for example different scripts cater for short bibliographic records or for full text documents.
An intermediate Query Layer, providing a higher-level interface to the Basic Search System (BSS), and supporting additional functions such as incremental query expansion and passage retrieval.
A Basic Search System, providing a comprehensive set of low-level commands for searching an Okapi database.
A set of indexing routines which accept raw text files and create a database in a form suitable for searching.

The following sections consider each of these components in turn, indicating what developments and issues will be addressed to package them into a more standard set of tools to meet the dual objectives listed above. The work on the interface and query layers would contribute to the building of the web interface and that on the basic indexing and search layers would form the redesigned search engine.

The User Interface Layer

The original VT100 user interface of Okapi was accessible from dumb terminals. In some ways our more recent developments on workstations have in effect restricted access. The aim is to develop a more flexible user interface environment which will be compatible with the current developments on the Internet and accommodate both PC and Unix based platforms.

The Okapi GUIs in current use have been developed using the TCL scripting language, which provides a very flexible method for defining and changing the appearance and behaviour of interface objects (4, 5). The GUI has been used successfully across the Internet (at the London Business School), but since the interface program is executed on the server machine the response time is not acceptable over long distances. Moreover, the need to demonstrate Okapi functionality much more widely indicates that we should now be moving into full client-server mode and providing an interface over the World-Wide-Web.

For this purpose it will be necessary to look again at the current user interface and its relationship with the supporting query layer, and decide how to re-design or re-engineer it for the new environment. Some issues to be investigated are:

Separation of functions between client and server - how much of the query state should be held at the client end; is there a need to minimise interactions between client and server? The current GUI is highly interactive: any relevance judgement or term manipulation leads to a change in both the internal and external representation of the working query. It may be difficult to achieve acceptable response times for such immediate feedback in all circumstances - options for batching certain interactions may be needed in both the interface and the query layers.
The trade-off between accessibility and functionality - a traditional web interface based on HTML forms and the dynamic construction of HTML pages will have limited functionality but be usable with any web browser. More sophisticated functions (e.g. multi-frame pages, direct manipulation of interface objects) can be provided, but only to a sub-set of potential users. We should thus consider developing both a basic and an advanced version of the interface.
Possible development tools - their capabilities and limitations. The value of using a high-level interpreted language for interface-building has already been demonstrated with TCL, but at present no other language appears to provide quite this level of flexibility. Sun's Java offers a powerful combination of facilities, but it is closer to a conventional programming language like C++ than to a scripting language (6). It will be necessary to assess which web interface-building tools will best suit our purpose.
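
The batching option raised in the first issue above can be illustrated with a minimal sketch in Python. It is not part of the existing Okapi software: the class name InteractionBatcher, the event fields and the send callable are all hypothetical, and stand in for whatever transport and message format a Web interface might eventually adopt.

```python
class InteractionBatcher:
    """Collects interface events at the client and sends them as one message.

    Hypothetical illustration only; not an Okapi component.
    """

    def __init__(self, send, max_pending=5):
        self.send = send                # callable taking a list of events (assumed transport)
        self.max_pending = max_pending  # flush automatically once this many events are queued
        self.pending = []

    def record(self, kind, value):
        """Queue one event, e.g. a relevance judgement or a term removal."""
        self.pending.append({"kind": kind, "value": value})
        if len(self.pending) >= self.max_pending:
            self.flush()

    def flush(self):
        """Send all queued events in a single client-server interaction."""
        if self.pending:
            batch, self.pending = self.pending, []
            self.send(batch)
```

A client built along these lines would call record() for each relevance judgement or term manipulation and flush() when the user asks for a new search, so that several small interactions cost one round trip. The price is that the displayed query lags behind the user's actions until the batch is sent.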

The Intermediate (Query) Layer

The existing query layer was written to support the development of the ENQUIRE configurable GUI to Okapi (7). It manages a set of objects (term-sets, document-sets, etc.) defining the state of an interactive retrieval session. It hides much of the low-level detail of BSS operations, and provides additional functions, e.g. for interactive query manipulation, incremental query expansion, and transaction logging. The role of such an intermediate layer in a distributed system architecture in a network environment is complex, although the approach has been adopted elsewhere (8). Further development of this middleware would involve:

a. Incorporating additional functions into the query layer, e.g. long document display and possibly support for explicit adjacency searching and passage retrieval. The objective is to provide a common platform for the development of different user interfaces, with the variations between them handled via option settings.

b. Partitioning the software into a kernel set of functions, and a command interpreter to communicate with user interface programs which could be written in different languages. The format of messages passed to and from those programs will vary according to the syntax of the language used, so different mappings between supplied commands and internal functions will be needed.

c. Extending its current capabilities as an experimental tool, by generalising the transaction logging functions (9). The logging of user activities is an important aspect of evaluation but the functions could also be used to vary experimental conditions and provide facilities for blind testing. Researchers should be able to specify two or more ways of formulating a query with a given set of terms; the software would carry out multiple searches and merge the results before presenting them to the user, allowing controlled comparison and hypothesis testing.
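
The blind-testing idea in (c) can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the query layer's design: the function name blind_merge, the round-robin merging strategy and the provenance record are hypothetical, and the two input lists stand for the ranked results of two formulations of the same query.

```python
import random


def blind_merge(ranked_a, ranked_b, seed=None):
    """Interleave two ranked lists without revealing which produced each item.

    Returns (shown, provenance): the merged list for display, and a private
    record of which formulation ("A" or "B") first supplied each document.
    """
    rng = random.Random(seed)
    arms = [("A", list(ranked_a)), ("B", list(ranked_b))]
    rng.shuffle(arms)  # randomise which formulation contributes first
    positions = [0, 0]
    shown, provenance, seen = [], {}, set()
    while any(positions[i] < len(arms[i][1]) for i in range(2)):
        for i, (label, ranking) in enumerate(arms):
            # Skip documents already shown via the other formulation.
            while positions[i] < len(ranking) and ranking[positions[i]] in seen:
                positions[i] += 1
            if positions[i] < len(ranking):
                doc = ranking[positions[i]]
                positions[i] += 1
                seen.add(doc)
                shown.append(doc)
                provenance[doc] = label  # kept in the log, never displayed
    return shown, provenance
```

The user sees only the merged list, while the provenance record would go to the transaction log, so that relevance judgements can later be attributed to one formulation or the other. A document returned by both formulations is credited here to whichever supplied it first, which is one simplification among several possible.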

The Basic Search System

The BSS provides a set of low-level retrieval commands as a service to other Okapi software components. It is similar to the Z39.50 / SR protocol in supporting a continuous session with its clients and in creating and saving document sets for use in subsequent operations. It includes the standard Boolean operators. These are used behind the scenes in current Okapi projects, although any BSS interface built by researchers in the future could make them explicit, if required. However, the BSS is not limited to the Boolean functionality of the current standard, since it also supports term weighting and best-match document retrieval. New functions are added periodically to support new experiments, but the core functions are now stable, robust, and highly optimised.
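
As an illustration of term weighting and best-match retrieval, the following Python sketch ranks a small collection using a weighting function of the kind Okapi used at TREC, commonly known as BM25. It is a sketch under stated assumptions rather than the BSS implementation: the function name bm25_rank, the parameter defaults k1 = 1.2 and b = 0.75, the pre-tokenised documents, and the +1 inside the logarithm (a common variant that keeps every weight positive) are all choices made here for illustration.

```python
import math
from collections import Counter


def bm25_rank(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of tokens) against the query terms.

    Returns a list of (score, doc_index) pairs, best match first.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc_id, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Collection weight with 0.5 smoothing (no relevance information).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency component, normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((score, doc_id))
    # Highest score first; ties broken by document index.
    return sorted(scores, key=lambda s: (-s[0], s[1]))
```

Unlike a Boolean search, which only partitions the collection into matching and non-matching sets, a function of this kind orders every document by its estimated degree of match, so that documents containing a query term more often (relative to their length) appear higher in the list.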

The basic development effort here would involve documentation, and testing for portability across a range of hardware platforms.

Further changes would be required to make the BSS capable of operating in a fully-distributed environment. The possible extensions to the database creation facilities discussed below (e.g. multiple record types) could also lead to the introduction of new variations on the search commands.

Okapi Database Creation and Indexing

If other researchers and organisations are to use Okapi with their own data, they must be provided with indexing software in an appropriately packaged form. The existing routines are parameterised via set-up files, and the minimum work required to make them generally usable would be a command-, menu- or GUI-based front-end to create the set-up files and run the appropriate shell scripts. However, there are some extensions to consider:

a. At present an Okapi database contains only a single record type - handling records with different structures involves using cumbersome limit functions at search time. The possibility of supporting more than one record type definition per database should be investigated.

b. Much of the on-line material to be searched in future will originate on the World-Wide-Web. Techniques for web page harvesting and the construction of cross-site indexes are advancing very rapidly; some of the routines currently used for this purpose are freely available in source form and thus potentially adaptable for our own needs. We will need to ensure that the indexing operations for creating Okapi databases will take account of existing Web techniques and tools.

c. At present an Okapi database is created and indexed in a single operation - there is no provision for periodic updating. There are situations where dynamic database management is essential, for example, in a routing task where new incoming data is matched against an existing query. This capability would thus extend our test bed for experimentation.
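
The difference between single-operation indexing and the periodic updating discussed in (c) can be made concrete with a small Python sketch. It is a hypothetical illustration only: the class IncrementalIndex, its in-memory dictionaries and its whitespace tokenisation are assumptions made for this example and say nothing about how an Okapi database is actually stored.

```python
from collections import defaultdict


class IncrementalIndex:
    """A toy inverted index that accepts additions and removals after creation."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_terms = {}                # doc_id -> set of terms, used for removal

    def add(self, doc_id, text):
        """Index a new document, or re-index an existing one with new text."""
        if doc_id in self.doc_terms:
            self.remove(doc_id)
        terms = text.lower().split()
        for term in terms:
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1
        self.doc_terms[doc_id] = set(terms)

    def remove(self, doc_id):
        """Delete a document's postings without rebuilding the whole index."""
        for term in self.doc_terms.pop(doc_id, ()):
            del self.postings[term][doc_id]
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        """Return the ids of documents currently containing the term."""
        return sorted(self.postings.get(term.lower(), {}))
```

In a routing setting, each incoming document would be passed to add() as it arrives and the stored queries re-run against the updated index, rather than the whole database being re-created each time.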

Deliverables and Dissemination

The outcome of the project will be two deliverables.

Okapi Search Engine

The Okapi search engine will consist of a library of routines to support the basic indexing, searching and weighting functions. This will enable other bona fide researchers to use it for searching their own test collections. It is envisaged that researchers would develop their own user interfaces and perhaps add other functionality depending on the nature of their research. The packaged version of the BSS will be available only for Sun hardware running the Unix operating system: porting it to Windows95/NT is beyond the scope of the current project. However, it will be possible for researchers to build Windows front-ends to Okapi if they wish, using the BSS as an API.

The search engine will be provided under licence, for a minimal fee to cover administration and any support. The Centre will be responsible for promotion, distribution, and protecting the intellectual property rights.

The software will be delivered to researchers together with comprehensive documentation.

It is envisaged that the Okapi search engine will be available to some of the TREC participants for the 1997 round.

Okapi Web Interface

The development of the search engine will in turn allow us to implement the system on the World Wide Web. The Okapi Web interface will provide access to different types of databases which will demonstrate Okapi's retrieval capability. This will add a new version to the family of Okapi systems but will not constitute a Web search engine per se. Different sources of textual data will be sought to create an appropriate testbed. The Okapi Web site will serve two main purposes. Firstly, it can be used as a tool for teaching information retrieval by other LIS departments, thus fulfilling the same function as the previous VT100 version. Secondly, it will provide the Okapi team with access to a large, heterogeneous end-user population from which to draw test subjects. Although Internet users will be able to access it freely, we may also ask some users to register for specific parts of the service in order to carry out different experiments.

It should be emphasised that the search engine will not be used to 'harvest' miscellaneous material off the Web - there are already many search engines in existence which do this! The objective is instead to demonstrate the principles of probabilistic searching to the widest possible audience, using the following dedicated databases:

A database of selected "newswire" material from TREC;
Statutory documents (Acts and Bills) from the Office of the Parliamentary Counsel;
Indexed reviews/articles from the Athenaeum, a 19th century journal.

Others will be added as they become available.

Collaboration

Collaboration for testing the search engine will be sought from some of the researchers who have already expressed an interest in using the Okapi system for their research and with whom collaborative work has already been undertaken. These include:

Dr. Peter Willett (Sheffield University)
Dr. Alan Smeaton (Dublin City University)
Dr. Edie Rasmussen (Pittsburgh University)
Dr. Efthimis Efthimiadis (University of California, Los Angeles)

References

1. British Library Research and Development Department. Twenty years of the British Library Research and Development Department, 1974-1994. Continuity & change: the evolution of research programmes. London: BLRDD, 1994.

2. Robertson, S. E. et al. Okapi at TREC. In: D. Harman (ed). The Text REtrieval Conferences (TREC-1,2,3,4). Gaithersburg, MD: NIST, 1993, 1994, 1995, 1996.

3. Salton, G. (ed). The SMART Retrieval System - Experiments in automatic document processing. Englewood Cliffs, N.J.: Prentice-Hall, 1971.

4. Salton, G. and Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 1988, 513-523.

5. Ousterhout, J. K. Tcl and the Tk Toolkit. Addison-Wesley, 1994.

6. Beaulieu, M. et al. ENQUIRE Okapi Project. London: British Library, 1996. (British Library Research & Innovation Centre Report No.17)

7. Lemay, L. and Perkins, C. L. Teach yourself Java. Indianapolis, IN: Sams.net, 1996.

8. Jones, S. et al. Query modelling for IR interface design. The New Review of Document and Text Management, 1, 1995, 47-62.

9. Hendry, D. G. and Harper, D. J. A user-interface architecture for implementing extensible information-seeking environments. Paper to be presented at SIGIR 96, 19th International Conference on Research and Development in Information Retrieval, Zurich, August 1996.

10. Jones, S. Transaction logging. Special issue on Okapi in Journal of Documentation January 1997.