Finding those key words

Automating the Lexicon - English for the Computer
April 11, 1997

If ever a book celebrated cooperation, the coming together of diverse people in an effort to learn from each other, that book is Automating the Lexicon. The publication arose from a workshop at Marina di Grosseto in Italy which, the editors say, was "a turning-point in the field" - a meeting that engendered a new spirit of interdisciplinary cooperation in areas which are notorious for disputes.

Three groups of people were represented at the Grosseto workshop. The largest group had a background in natural language processing (NLP), also known as computational linguistics. These are the people who have tried to teach computers to use human language, so that systems can understand human speech, talk back to us, and even translate automatically from one human language to another. The central shared assumptions in this field have been that the computer is a useful metaphor for the human mind and that modelling language in a computer is one part of modelling human intelligence in general.

The second group were from linguistics, a field where insights and theories about language are developed with little regard for computer applications. This group has a professional interest in stressing the uniqueness of language. The outcome has often been hostility to the assumptions that NLP holds dear: linguists tend not to accept the computer metaphor, seeing language as separate from general intelligence.

These two groups had long ignored each other, even attacked each other as misconceived. What brought them together at Grosseto was a dawning realisation that they had lots to learn from the third group - the lexicographers. This group has tended to avoid the academic limelight. But the NLP community had a palpable reason for taking an interest in dictionaries: with lexicography just beginning to enter the computer age, some dictionaries are now available in the form of computer files.

Any NLP researcher designing a computer program for processing English has to incorporate information about English words, to cover their spelling, grammatical properties and meanings. A good starting point might be a computerised dictionary (or, in the favoured jargon, a "machine-readable dictionary": computational linguists must be the only people who still use "machine" to mean "computer").

There are also deeper reasons for lexicographers now taking centre stage. Many systems for language processing by computer are "toy" systems which only work in restricted domains due to their limited vocabularies. If the software only has to handle 100 different words, it is possible to write an ad hoc routine for each one. Robust language-understanding software, however, must be able to cope efficiently with the huge number of words regularly used. Ad hockery is just not viable: a program with 100,000 special routines would be hopelessly unwieldy and take too long to write.

Real dictionaries have conventions about information that is common to lots of words: if a word is labelled verb, for instance, we can assume that it has a present participle ending in "ing", which does not need to be listed separately (unless the "ing" form has a life of its own: a thrashing is a noun, though we don't find a drinking or a snowing used as nouns). NLP workers can, it is hoped, study these conventions and work out ways to streamline their software. Similar remarks hold true for the grammatical theories proposed by linguists, which often have limited scope and tend to pile up masses of "exceptions" when extended to real language in all its messy glory.

Several of the papers in this book examine how information about words is systematised by lexicographers, and compare this with proposals from linguists and NLP workers. A central concern for many contributors is "re-usability": if a lexical resource developed by one research team is to be drawn on by another, it makes sense to agree on a format for encoding information about words, and also on exactly what information is stored. The guidelines of the Text Encoding Initiative have become a widely accepted format for such resources.

Another recurrent theme is the potential of computerised "corpora" - large collections of authentic text which are now used as basic tools in all the fields represented here.

It is a pity that this book was so long in appearing. The Grosseto workshop took place in 1986, and since then the world has moved on rapidly. Computer hardware has changed out of all recognition in the past ten years, large NLP projects such as Alvey and Eurotra have been and gone, and both lexicographical theory and practice have made giant strides.

Automating the Lexicon is an important stock-taking from the mid-1980s; but its relevance to the late-1990s is less clear. While many of its papers survey a wide range of different approaches and issues in computational linguistics, picking interesting ideas from here and there, the book English for the Computer is completely different in style. Sampson takes one issue - how to analyse and classify the words and phrases in a corpus of real English texts - and deals with it exhaustively. It may seem strange that no one has done this before, but that is precisely the point. The questions that arise if you take this issue seriously often look very different from the ones that usually preoccupy grammarians.

For example, take postal addresses. Is "5 Founthill Avenue, Saltdean, BN2 8SG" one noun phrase, or a series of noun phrases? Is this entire address in apposition to the personal name that might precede it on an envelope? If there were a comma after "5", would the structure change? Likewise, would the meaning be different if there were no comma after "Saltdean"? Grammars of English tend to avoid such questions in favour of weightier matters like verb tenses and sentence structure, but in real life addresses occur widely and a comprehensive analytic scheme has to include them.

Sampson deals with innumerable issues of this kind in painstaking detail. His book aims to give linguistics what Linnaeus gave biology: a systematic, thorough-going naming of parts, in which everything encountered is classified. Early evidence is that Sampson's book and associated computer files are indeed meeting a need. The material is already widely used and often mentioned by researchers in NLP.

As a linguist, my basic question is - what can we learn about language from the research efforts presented in these two books? Behind both of them lies the belief that we must look at large numbers of individual words in careful detail if we want to build theories about language that correspond to reality. The days of ambitious theories with pitifully small empirical foundations may be numbered - and so much the better for the study of language.

Raphael Salkie is principal lecturer in language studies, University of Brighton.

Automating the Lexicon: Research and Practice in a Multilingual Environment

Editors - Donald E. Walker, Antonio Zampolli and Nicoletta Calzolari
ISBN - 0 19 823950 5
Publisher - Oxford University Press
Price - £50.00
Pages - 413
