ITRI'S FOCUS AREAS

Our Institute's work concerns the interaction between people and computers, especially where natural language plays a role in this interaction. Within this wide remit, ITRI's present activities are mainly in the following interrelated areas of scientific research:
  1. Document generation
  2. Knowledge editing
  3. Lexical research
  4. Information extraction
These activities, which are detailed below, require us to have expertise in a number of areas. Linguistics, meaning representation and software development have been well represented in the Institute since its founding. Reflecting global trends, we are expanding our expertise in the other areas.

Document Generation

The generation of documents has been one of ITRI's main areas of expertise since the early nineties (see e.g. the DRAFTER project). Document generation involves a number of issues that are central to linguistics and language technology. For example, to design a Document Generation system well, one has to understand what distinguishes a well-written document from a bad (incoherent, confusing, unreadable) one, and what distinguishes one document 'style' from another. In addition, one would like to understand how authors think about documents, to allow natural, quick, and transparent interaction between the document, the author and the system.

Perhaps primarily, Document Generation systems generate natural language text, hence the importance of Natural Language Generation (NLG) for Document Generation. In recent years, however, document generation has also taken `nonlinguistic' issues involving punctuation, layout, and modality into account, because each of these can contribute to the readability of the document. Our activities in this area presently focus mainly on

In addition, there are incipient activities in the area of multimedia (e.g., 'text plus pictures') documents.

Knowledge Editing

Knowledge editing is the process whereby a person creates or modifies a knowledge base. The knowledge base can contain information of any kind, and this information may be represented in any machine-readable way. Knowledge editing is usually a difficult process that requires much expertise, not only about the domain itself but also about the formal language in which the knowledge is represented.

'What You See Is What You Meant' (WYSIWYM) editing is a novel approach to knowledge editing, born and raised at ITRI, that tries to make knowledge editing easier. Crucially, WYSIWYM uses natural language generation to facilitate the knowledge editing process: it constantly feeds back the content of the knowledge base to the user by means of automatically generated pieces of natural language.

WYSIWYM editing can be applied to many areas of human-machine communication. Presently, its strengths and limitations are being explored in connection with two applications. The first is Document Generation (for more information, see Technical Report ITRI 98-05), where the author of the document uses WYSIWYM to specify the content and form of the document. The second application is question answering, where WYSIWYM is used to enable someone to construct a query in a formal language. In this case, the query plays the role of the knowledge base. (For more information, go to our Technical Reports and download ITRI 98-11, or visit the WYSIWYM homepage.)
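The feedback idea behind WYSIWYM can be illustrated with a very small sketch. The code below is not ITRI's implementation: the slot names (`actor`, `action`, `object`), the sentence template and the bracketed placeholder convention are all assumptions made purely for illustration. It shows only the general principle that the current state of a knowledge base, including its unfilled parts, is rendered back to the user as generated text.

```python
# Toy illustration of text feedback from a partially filled knowledge base.
# Slot names and the sentence template are hypothetical, not taken from WYSIWYM.

def feedback_text(kb):
    """Render the knowledge base as a sentence.

    Slots that have not been filled yet are shown as bracketed
    placeholders, so the user can see what still needs specifying.
    """
    def show(slot):
        value = kb.get(slot)
        return value if value else f"[some {slot}]"

    return f"The {show('actor')} must {show('action')} the {show('object')}."


kb = {"actor": None, "action": None, "object": None}
print(feedback_text(kb))

# The user fills slots one at a time; the text is regenerated after each edit.
kb["actor"] = "patient"
print(feedback_text(kb))

kb["action"] = "shake"
kb["object"] = "inhaler"
print(feedback_text(kb))
```

In this sketch the user never sees the dictionary itself, only the regenerated sentence, which is the point the paragraph above makes about hiding the formal representation.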

Our main activities in the area of Knowledge Editing are related to

Please note that the first two activities in the list have important Document Generation as well as Knowledge Editing aspects, which is why they are common to areas (1) and (2).

Lexical Research

Language engineering needs lexicons: the lexicon contains information about the sound, structure, grammar and meaning of words, and this information is a core input for any language engineering system. The shift over the last decade from 'toy' to large-scale systems, alongside a theoretical impulse to merge grammar and lexicon, has focused attention worldwide on lexicon form and content, and ITRI has been in the vanguard of this movement. Recently, the group has been working on the application of the lexicon representation language DATR (developed by Evans, at ITRI and the University of Sussex, and Gazdar, at Sussex) to the description of grammar, valency, and multilingual lexicons; on the interface between lexicography and traditional dictionaries on the one hand and language engineering on the other; on the use of language corpora for the automatic acquisition of lexical information; and on the core lexical issue of the nature of word meanings and their disambiguation, that is, how the computer can tell which meaning of a word applies in a given context. The work of the lexicon group is focused on four key areas:

Lexical organisation
There are many reasons for being interested in how lexicons are or should be organised. For the theorist, the challenge is to construct a linguistically motivated organisation which concisely captures the 'right' generalisations about lexical phenomena. More practically, a well-organised lexicon is easier to understand, maintain and extend. And for the applied language engineer, a well-organised lexicon is the most appropriate basis for the development of 'hard-coded' modules that actually perform some particular lexical access task. Our research in this area, in collaboration with the University of Sussex, is concerned with fundamental issues of organisation, with a current focus on multilingual lexicons in particular. This work centres on the continuing development and use of the lexicon description language DATR, a non-monotonic knowledge representation language designed specifically for lexicons.
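To give a feel for what 'non-monotonic' organisation means here, the following Python sketch shows default inheritance with overrides: a general node states the regular pattern once, and an individual entry overrides it only where it is irregular. This is not DATR syntax and not ITRI code; the node names, feature names and the `{root}` template convention are assumptions chosen for the example.

```python
# Toy default-inheritance lexicon (illustrative only; not DATR).
# Each node names a parent and lists only the features it states locally.
LEXICON = {
    "VERB": {"parent": None,
             "feats": {"cat": "verb", "past": "{root}ed", "pres3sg": "{root}s"}},
    "walk": {"parent": "VERB", "feats": {"root": "walk"}},
    # 'sing' overrides the default past-tense rule inherited from VERB.
    "sing": {"parent": "VERB", "feats": {"root": "sing", "past": "sang"}},
}


def lookup(node, feature):
    """Return the value of `feature` for `node`.

    The search starts at the node itself and climbs to its ancestors,
    so a locally stated value always wins over an inherited default.
    """
    current = node
    while current is not None:
        entry = LEXICON[current]
        if feature in entry["feats"]:
            value = entry["feats"][feature]
            if "{root}" in value:
                # Templates are filled with the root of the node being queried,
                # not of the ancestor that supplied the template.
                return value.format(root=lookup(node, "root"))
            return value
        current = entry["parent"]
    raise KeyError(f"{feature!r} is not defined for {node!r}")


print(lookup("walk", "past"))     # regular form, inherited from VERB
print(lookup("sing", "past"))     # irregular form, stated locally
print(lookup("sing", "pres3sg"))  # regular form, inherited from VERB
```

The regular past-tense rule is written once at `VERB`; adding a new regular verb needs only its root, which is the kind of conciseness and maintainability the paragraph above argues for.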

Dictionaries and Lexicons
Real language processing systems need large-scale lexicons. As has often been noted, existing dictionaries, produced for people, are a good starting point for developing large lexicons.

The CONCEDE project aims to develop lexical knowledge bases for six Eastern European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene) drawing on standards developed in the Text Encoding Initiative.

Word sense disambiguation (WSD)
Open a dictionary at random: many words have more than one meaning. Usually, when a word is used, only one of those senses applies. Word sense disambiguation (WSD) is the task of determining which applies. ITRI organised SENSEVAL, the first open exercise for evaluating WSD programs, in 1998 and is now developing a WSD system in the EPSRC project WASPS.
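As a concrete illustration of the task, the sketch below picks the sense whose dictionary definition shares the most words with the context, a simplified gloss-overlap heuristic in the style of the Lesk algorithm. It is not the method used in SENSEVAL or WASPS; the two-sense inventory for 'bank', the stopword list and the function names are invented for the example.

```python
# Simplified gloss-overlap word sense disambiguation (illustrative only).
# The sense inventory and stopword list below are made up for this example.
SENSES = {
    "bank": {
        "finance": "an institution that accepts deposits and lends money",
        "river": "the sloping land beside a river or lake",
    }
}

STOPWORDS = {"a", "an", "the", "that", "and", "or", "of", "to", "in", "on", "at", "by"}


def tokens(text):
    """Lower-case, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def disambiguate(word, context):
    """Return the sense label whose gloss overlaps most with the context.

    Ties are resolved in favour of the sense listed first.
    """
    context_words = tokens(context)
    glosses = SENSES[word]
    return max(glosses, key=lambda sense: len(tokens(glosses[sense]) & context_words))


print(disambiguate("bank", "she sat on the bank of the river"))
print(disambiguate("bank", "the bank lends money to customers"))
```

A heuristic this crude fails whenever the context shares no words with any definition, which is one reason open evaluation exercises for WSD programs are useful.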

Corpora
Computer-readable text is available as never before. This makes it possible to study language in many new ways. Our research focuses on the characterisation of bodies of text, or corpora, according to: how homogeneous they are and how similar to each other; word- and word-class-frequency distributions, and what they tell us about language structure; automatic and semi-automatic acquisition of lexicons from corpora. We have been using the 100 million word British National Corpus for our investigations, and are currently leading further developments of this major national resource.
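The sketch below shows, on a very small scale, what a word-frequency distribution is and one possible way of scoring how similar two bodies of text are. Cosine similarity over raw word counts is used here only because it is simple; it is an assumption for the example, not the measure used in ITRI's research, and the function names are likewise hypothetical.

```python
# Word-frequency distributions and a simple similarity score (illustrative only).
from collections import Counter
import math


def freq_dist(text):
    """Count how often each whitespace-separated, lower-cased word occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two frequency distributions.

    Returns 0.0 when the texts share no words (or one is empty)
    and a value close to 1.0 when the distributions are proportional.
    """
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


news = freq_dist("the market fell as the bank raised rates")
finance = freq_dist("the bank raised rates and the market fell")
poetry = freq_dist("shall i compare thee to a summer day")

print(news.most_common(1))
print(cosine_similarity(news, finance))
print(cosine_similarity(news, poetry))
```

On a real corpus the same counts would be computed over millions of words, and a more careful measure would normally be preferred, since raw counts are dominated by very frequent function words.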

For further information, please contact Roger Evans (+44 1273 642902) - see our contact page for full contact details.

Information Extraction

This page is under construction.


Maintained by the ITRI webmaster (webmaster@itri.brighton.ac.uk).
Last updated 22 March 2000

©Information Technology Research Institute
