Database of Acts and Bills

The aim of this project was to produce a database of Acts and Bills published by the Parliamentary Counsel Office, available over the Web. This involved heavy text processing to convert the Office's WordPerfect documents into HTML automatically. Scripts were also written to support keyword searching on either full documents or titles.

Processing the data

The Parliamentary Counsel Office drafts legislation for the British Parliament, using the PC WordPerfect package to generate machine-readable versions of new Bills and Acts for printing by HMSO. In principle these are documents with a well-defined logical structure, but the mark-up within the WordPerfect file does not reflect this. The start of each new Part, Section or Schedule of an Act is signalled only by bold and/or centred headings; inserted quotations (where one Act refers directly to the text of another) are indicated by various levels of indentation; and special characters are represented by non-standard ASCII codes "escaped" with a preceding backslash.

Members of the PCO drafting new legislation need to refer frequently to earlier Acts in order to maintain consistency, so an automatic method of identifying relevant sections and paragraphs is desirable. Initially, the WordPerfect files were converted to ASCII format and searched on a Unix system using fgrep. Later it was decided to build a more convenient system which would allow the texts to be searched and presented via the World Wide Web. This necessitated, as a first stage, the conversion of the WordPerfect/ASCII files into HTML.

The following are some of the issues that had to be addressed when designing the HTML documents and writing the conversion scripts:

Layout. The documents had to show multiple levels of indentation, to represent inserted quotations and multilevel lists. This was achieved by converting "left-margin shift" and "subparagraph" markers into BLOCKQUOTE and DL/DD tags.
Hierarchical structure. Section and Schedule headings were recognized from bold font/centring directives, and converted into appropriate H2 tags. Internal HREF links were set to and from the "Arrangement of Sections" (contents list) at the start of each Act. In the initial design, complete Acts, however long, were retained as complete Web documents, so the HREF links were internal. Later it was decided to improve response by making each section a separate HTML file, so additional next and previous links were set at the start and end of each Section or Schedule.
Tables. Many Schedules are in tabular form, e.g. there is often a table summarising the consequences of the new legislation for earlier Acts. The TABLE tag was used for this purpose, although there were some table formats which could not be reproduced with total accuracy.
Cross-references. References to earlier Acts were identified by scanning a stored look-up table of Act titles, and appropriate HREF links were set to the corresponding files. (For this and other purposes it was necessary to establish a very consistent file-naming system.)
Finally, NAME anchors were generated for each paragraph in the text, to serve as link targets following a search.
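
The cross-reference and anchor steps above can be illustrated with a minimal sketch. It is written in Python rather than the project's perl, and the look-up table entries, the file names and the `p1`, `p2` anchor format are assumptions made for illustration, not the project's actual naming scheme.

```python
import re

# Hypothetical look-up table: Act title -> converted file (illustrative names only).
ACT_FILES = {
    "Data Protection Act 1984": "1984/c35/index.html",
    "Copyright, Designs and Patents Act 1988": "1988/c48/index.html",
}

# Try the longest titles first so a long title wins over a shorter one it contains.
_TITLE_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(ACT_FILES, key=len, reverse=True))
)


def link_acts(text):
    """Wrap each known Act title in an HREF link to its file."""
    return _TITLE_RE.sub(
        lambda m: '<A HREF="%s">%s</A>' % (ACT_FILES[m.group(0)], m.group(0)),
        text,
    )


def add_anchors(paragraphs):
    """Prefix each paragraph with a numbered NAME anchor (assumed format)."""
    return ['<A NAME="p%d"></A>%s' % (i, p) for i, p in enumerate(paragraphs, 1)]
```

A consistent file-naming system matters here because the link target is derived purely from the table entry, with no check that the file exists.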

Because the WordPerfect files had been created only with an eye to the printed appearance of the Acts (even though an SGML DTD for statutory material was specified by HMSO in 1990!) there were errors in the mark-up, so the conversion scripts had to perform syntax checking and produce error log files. The scripts made heavy use of pattern matching and replacement techniques, and were written in Perl, a more powerful successor language to awk and sed.
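
The combination of pattern-based conversion and syntax checking might look roughly like the following Python sketch. The bracketed markers (`[INDENT]`, `[BOLD]`, `[CENTRE]`) are invented stand-ins, since the real WordPerfect codes are not described here, and the error messages are likewise illustrative.

```python
import re

# Hypothetical marker syntax: a heading is a line that is both bold and centred.
HEADING_RE = re.compile(r"^\[BOLD\]\[CENTRE\](.*?)\[/CENTRE\]\[/BOLD\]$")


def convert(lines):
    """Return (html_lines, errors) for a list of marked-up text lines."""
    html, errors, depth = [], [], 0
    for lineno, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if line == "[INDENT]":
            depth += 1
            html.append("<BLOCKQUOTE>")
        elif line == "[/INDENT]":
            if depth == 0:
                # Unbalanced close: log it and emit nothing.
                errors.append("line %d: [/INDENT] without [INDENT]" % lineno)
            else:
                depth -= 1
                html.append("</BLOCKQUOTE>")
        else:
            m = HEADING_RE.match(line)
            html.append(("<H2>%s</H2>" % m.group(1)) if m else line)
    if depth:
        # Unclosed indents: log them and close the tags so the HTML stays balanced.
        errors.append("end of file: %d unclosed [INDENT]" % depth)
        html.extend(["</BLOCKQUOTE>"] * depth)
    return html, errors
```

In this sketch the converter carries on after an error and reports it, on the assumption that a log to be reviewed later is more useful than an aborted run.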

Supporting keyword searches

The requirement was to support keyword searches on Act titles or full texts, in each case within a given range of years. Searching was to be case-insensitive, allowing word-truncation or multi-word phrases, and simple boolean operations. The full-text search would need to identify matching paragraphs and lines, rather than whole documents.

It was therefore decided to implement searching by sequential scanning rather than pre-indexing. This provided greater flexibility, although with a penalty of slower response. On receiving the search parameters - start year, end year, and search term(s) - the search script loops through the appropriate year directories, and pipes each file in turn through fgrep to obtain a list of candidate lines. The relevant filename (in this case comprising year, chapter, and section or schedule) appears on each line of the result-file, and the -n option is used to obtain the actual line-numbers as well.
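
The sequential scan can be sketched in Python as below. This is an in-process stand-in for the fgrep pipeline, not the project's script: it does the fixed-string, case-insensitive matching itself, and it assumes a layout of one directory per year under a common root, with a hypothetical file encoding of Latin-1.

```python
import os


def scan(root, start_year, end_year, term):
    """Yield (path, line_number, line) for each line containing term.

    Matching is a case-insensitive fixed-string test, roughly what
    `fgrep -i -n` would report for each file.
    """
    needle = term.lower()
    for year in range(start_year, end_year + 1):
        year_dir = os.path.join(root, str(year))
        if not os.path.isdir(year_dir):
            continue  # no legislation stored for this year
        for name in sorted(os.listdir(year_dir)):
            path = os.path.join(year_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="latin-1") as fh:
                for lineno, line in enumerate(fh, 1):
                    if needle in line.lower():
                        yield path, lineno, line.rstrip("\n")
```

Because every file in the year range is read on every query, the cost grows with the size of the collection, which is the slower response noted above.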

The matching lines returned by fgrep are processed to produce a structured hitlist. They are checked for the AND and NOT conditions, bold tags are put around the query terms, and any unwanted HTML mark-up (e.g. internal HREF tags) is stripped out. The Act title is identified (using year and chapter as a "key" to a permanent look-up table), and so is the section or schedule number. These are printed as headers, marked up with HREF links to the appropriate file. The individual lines are then displayed, marked up with HREF links to a NAME target corresponding to a line number. This allows users to go directly to any selected line or paragraph, and see its immediate context.
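
The per-line processing can be illustrated with a short Python sketch. The function name and its arguments are hypothetical. It strips anchor tags, applies the AND and NOT conditions as plain substring tests, and bolds the query terms; header generation and the links to NAME targets are left out.

```python
import re

# Matches opening and closing anchor tags, whatever their attributes.
A_TAG_RE = re.compile(r"</?A\b[^>]*>", re.IGNORECASE)


def format_hit(line, terms, not_terms=()):
    """Return the line marked up for display, or None if it is rejected.

    Every entry in terms must occur (AND); no entry in not_terms may occur (NOT).
    """
    text = A_TAG_RE.sub("", line)  # strip unwanted HREF/NAME mark-up
    lowered = text.lower()
    if not all(t.lower() in lowered for t in terms):
        return None
    if any(t.lower() in lowered for t in not_terms):
        return None
    if not terms:
        return text  # nothing to highlight
    pattern = re.compile(
        "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True)),
        re.IGNORECASE,
    )
    # Wrap each occurrence in bold tags, preserving its original case.
    return pattern.sub(lambda m: "<B>%s</B>" % m.group(0), text)
```

Stripping the anchor tags before testing the conditions keeps a query term from matching text inside an HREF attribute.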

An additional search function provides details of cross-references between Acts, and from current Bills to past Acts. These are found by scanning converted documents for relevant HREF mark-up, and building a file with one (variable-length) record per referenced Act. The technique is rather similar to the one used to generate an inverted-list keyword file, but it exploits Perl's ability to map an internal associative array to a permanent Unix dbm file, and so provide genuine random access.
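
A rough Python analogue of this cross-reference file uses the standard `dbm` module in place of a perl associative array tied to a dbm file. The record format (referring documents joined by newlines), the function names and the document names are assumptions for illustration.

```python
import dbm
import re

# Capture the file part of an HREF; purely internal links ("#...") do not match.
HREF_RE = re.compile(r'<A\s+HREF="([^"#]+)', re.IGNORECASE)


def index_references(db_path, documents):
    """Build one variable-length record per referenced file.

    documents maps a source document name to its HTML text.
    """
    with dbm.open(db_path, "c") as db:
        for source, html in documents.items():
            for target in set(HREF_RE.findall(html)):
                key = target.encode("utf-8")
                refs = db[key].decode("utf-8").split("\n") if key in db else []
                if source not in refs:
                    refs.append(source)
                    db[key] = "\n".join(refs).encode("utf-8")


def referring_documents(db_path, target):
    """Look up, by key, the documents that refer to target."""
    with dbm.open(db_path, "r") as db:
        key = target.encode("utf-8")
        return db[key].decode("utf-8").split("\n") if key in db else []
```

Each look-up retrieves a single record by key, so there is no need to scan the converted documents again at query time.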