Members of the PCO drafting new legislation need to refer frequently to earlier Acts in order to maintain consistency, so an automatic method of identifying relevant sections and paragraphs is desirable. Initially, the WordPerfect files were converted to ASCII format and searched on a Unix system using fgrep. Later it was decided to build a more convenient system which would allow the texts to be searched and presented via the World Wide Web. This necessitated, as a first stage, the conversion of the WordPerfect / ASCII files into HTML.
The following are some of the issues that had to be addressed when designing the HTML documents and writing the conversion scripts:
Because the WordPerfect files had been created only with an eye to the printed appearance of the Acts (even though an SGML DTD for statutory material was specified by HMSO in 1990!), there were errors in the mark-up, so the conversion scripts had to perform syntax checking and produce error log files. The scripts made heavy use of pattern matching and replacement techniques, and were written in Perl, a more powerful successor language to awk and sed.
It was therefore decided to implement searching by sequential scanning rather than pre-indexing. This provided greater flexibility, although with the penalty of slower response. On receiving the search parameters (start year, end year, and search term(s)), the search script loops through the appropriate year directories and pipes each file in turn through fgrep to obtain a list of candidate lines. The relevant filename (in this case comprising year, chapter, and section or schedule) appears on each line of the result file, and the -n option is used to obtain the actual line numbers as well.
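The sequential scan described above can be sketched in a few lines. This is only an illustration in Python, not the original Perl/fgrep script: the function name scan_years, the directory layout (one directory per year) and filenames such as 1990/c43_s1.html are assumptions made for the example. The substring test stands in for fgrep's fixed-string matching, and the line counter for its -n option.

```python
import os

def scan_years(root, start_year, end_year, term):
    """Scan every file under each year directory for a fixed string.

    Returns (filename, line_number, line) tuples, roughly the information
    that fgrep -n provides. The layout root/<year>/<file> is assumed.
    """
    hits = []
    for year in range(start_year, end_year + 1):
        year_dir = os.path.join(root, str(year))
        if not os.path.isdir(year_dir):
            continue  # nothing stored for this year
        for name in sorted(os.listdir(year_dir)):
            path = os.path.join(year_dir, name)
            with open(path, encoding="latin-1") as handle:
                for line_no, line in enumerate(handle, start=1):
                    if term in line:  # case-sensitive fixed-string match
                        hits.append((f"{year}/{name}", line_no, line.rstrip("\n")))
    return hits
```

Because nothing is indexed in advance, every query rereads every file in the requested range of years, which is the source of the slower response noted above.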
The matching lines returned by fgrep are processed to produce a structured hitlist. They are checked for the AND and NOT conditions, bold tags are put around the query terms, and any unwanted HTML mark-up (e.g. internal HREF tags) is stripped out. The Act title is identified (using year and chapter as a "key" to a permanent look-up table), and so is the section or schedule number. These are printed as headers, marked up with HREF links to the appropriate file. The individual lines are then displayed, marked up with HREF links to a NAME target corresponding to a line number. This allows users to go directly to any selected line or paragraph, and see its immediate context.
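As a rough illustration of this post-processing, the sketch below takes records of the form file:line:text and produces grouped HTML lines. It is written in Python rather than Perl, and the function name build_hitlist, the record format, the filename pattern and the titles table are all assumptions for the example. It also strips internal links before testing the conditions, so that text inside tag attributes cannot satisfy a query; the original script may have ordered these steps differently.

```python
import re

ANCHOR_TAG = re.compile(r"</?A\b[^>]*>", re.IGNORECASE)

def build_hitlist(grep_lines, term, and_terms=(), not_terms=(), titles=None):
    """Turn 'file:line:text' records into grouped HTML hitlist lines."""
    titles = titles or {}  # hypothetical table keyed on (year, chapter)
    required = (term, *and_terms)
    # One alternation, longest term first, so each match is bolded once.
    wanted = sorted(set(required), key=len, reverse=True)
    highlight = re.compile("|".join(re.escape(t) for t in wanted))
    grouped = {}
    for record in grep_lines:
        filename, line_no, text = record.split(":", 2)
        text = ANCHOR_TAG.sub("", text)  # drop internal HREF mark-up
        if any(t not in text for t in required):
            continue  # an AND term is missing from the visible text
        if any(t in text for t in not_terms):
            continue  # a NOT term is present
        text = highlight.sub(lambda m: f"<B>{m.group(0)}</B>", text)
        grouped.setdefault(filename, []).append(
            f'<A HREF="{filename}#{line_no}">{line_no}</A> {text}')
    html = []
    for filename, hits in grouped.items():
        match = re.match(r"(\d{4})/(c\d+)", filename)  # assumed naming scheme
        key = match.groups() if match else None
        title = titles.get(key, "(title not found)")
        html.append(f'<H3><A HREF="{filename}">{title}</A></H3>')
        html.extend(hits)
    return html
```

Each hit links to a fragment named after its line number, which presumes that the converted files carry a matching NAME target on every line, as the text describes.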
An additional search function provides details of cross-references between Acts, and from current Bills to past Acts. These are found by scanning the converted documents for relevant HREF mark-up, and building a file with one (variable-length) record per referenced Act. The technique is rather similar to the one used to generate an inverted-list keyword file, but it exploits Perl's ability to map an internal associative array to a permanent Unix dbm file, and so provides genuine random access.
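The same idea can be sketched with Python's standard dbm module, which plays the role that a dbm-backed associative array plays in Perl. The function names build_xref and referrers, the HREF pattern, and the year/chapter key format (e.g. 1990/c43) are assumptions for illustration. Each record is a space-separated list of the files that refer to the Act, so its length varies with the number of references.

```python
import dbm
import re

# Assumed link form: HREF="<year>/c<chapter>..." pointing into another Act.
HREF = re.compile(r'HREF="(\d{4}/c\d+)[^"]*"', re.IGNORECASE)

def build_xref(db_path, documents):
    """Store one record per referenced Act, listing the files that cite it.

    documents maps a source filename to its HTML text.
    """
    with dbm.open(db_path, "c") as db:
        for source, html in documents.items():
            for target in set(HREF.findall(html)):
                key = target.encode()
                refs = db[key].split() if key in db else []
                if source.encode() not in refs:
                    refs.append(source.encode())
                db[key] = b" ".join(refs)  # variable-length record

def referrers(db_path, act):
    """Look up the files that refer to one Act, by key."""
    with dbm.open(db_path, "r") as db:
        key = act.encode()
        return db[key].decode().split() if key in db else []
```

A lookup such as referrers("xref", "1990/c43") fetches a single record by key rather than scanning the whole file, which is the random access the text refers to.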