ITRI'S FOCUS AREAS
Our Institute's work concerns the interaction between people and computers,
especially where natural language plays a role in this interaction. Within
this wide remit, ITRI's present activities are mainly in the following
interrelated areas of scientific research:
- Document generation
- Knowledge editing
- Lexical research
- Information extraction
These activities, which are detailed below, require us to have expertise
in a number of areas, including
- Theoretical and computational linguistics,
- Meaning Representation,
- Software development and management,
- Document layout and multimedia,
- The study of language corpora,
- The evaluation of NLP systems.
Linguistics, Meaning Representation and software development have been
well-represented in the Institute since its founding. Reflecting global
trends, we are expanding our expertise in the other areas.
Document Generation
The generation of documents has been one of ITRI's main areas of expertise
since the early nineties (see, e.g., the DRAFTER project). Document
generation involves a number of issues that are central to linguistics
and language technology. For example, to design a Document Generation
system well, one has to understand what distinguishes a well-written
document from a bad (incoherent, confusing, unreadable) one, and what
distinguishes one document 'style' from another.
In addition, one would like to understand how authors think about documents,
to allow natural, quick, and transparent interaction between the
document, the author and the system.
First and foremost, Document Generation systems generate natural language
text, hence the importance of Natural Language Generation (NLG) for
Document Generation. In recent years, however, document generation has
also taken 'nonlinguistic' issues involving punctuation, layout, and
modality into account, because each of these can contribute to the
readability of the document.
Our activities in this area presently focus mainly on
- Architectures for NLG (RAGS project)
- Constraints on layout and style (ICONOCLAST project)
- Generation of nominals (GNOME project)
- Multilinguality (e.g. AGILE project)
In addition, there are incipient activities in the area of
multimedia (e.g., 'text plus pictures') documents.
Knowledge Editing
Knowledge editing is the process whereby a person creates or modifies
a knowledge base. The knowledge base can contain information of any
kind, and this information may be represented in any machine-readable
way. Knowledge editing is usually a difficult process that requires
considerable expertise, not only in the domain itself but also in the
formal language in which the knowledge is represented. 'What You See
Is What You Meant' (WYSIWYM) editing is a novel approach to knowledge
editing, developed at ITRI, that tries to make the process easier.
Crucially, WYSIWYM uses natural language generation to support the
author: it continually feeds back the content of the knowledge base
as automatically generated natural language text. WYSIWYM
editing can be applied to many areas of human-machine
communication. Presently, its strengths and limitations are being
explored in connection with two applications. The first is Document
Generation (for more information, see Technical Report ITRI 98-05),
where the author of the document uses WYSIWYM to specify the content
and form of the document. The second application is question
answering, where WYSIWYM is used to enable someone to construct a
query in a formal language. In this case, the query plays the role of
the knowledge base. (For more information, go to our Technical Reports
and download ITRI 98-11, or visit the WYSIWYM homepage.)
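The feedback loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ITRI's implementation: the concepts, templates, slot names, and the functions `new_node` and `feedback` are all hypothetical. The knowledge base is a small tree of nodes, and the generated text shows every unfilled slot as a bracketed anchor that the author could select and fill in next.

```python
# Hypothetical mini-ontology: each concept has a text template and a list of slots.
TEMPLATES = {
    "take": "{agent} takes {object}",
    "patient": "the patient",
    "tablet": "a tablet",
}
SLOTS = {"take": ["agent", "object"], "patient": [], "tablet": []}


def new_node(concept):
    """Create a knowledge-base node whose slots are all still unfilled."""
    return {"concept": concept, "slots": {name: None for name in SLOTS[concept]}}


def feedback(node):
    """Render a node as text; unfilled slots appear as bracketed anchors."""
    parts = {
        name: feedback(child) if child is not None else f"[some {name}]"
        for name, child in node["slots"].items()
    }
    return TEMPLATES[node["concept"]].format(**parts)


kb = new_node("take")
print(feedback(kb))                      # both slots shown as anchors
kb["slots"]["agent"] = new_node("patient")
print(feedback(kb))                      # the agent anchor is now replaced by text
```

In this sketch the author never sees the underlying representation. Each edit changes the knowledge base, and the regenerated text shows what has been specified so far and what is still missing.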
Our main activities in the area of Knowledge Editing are related to
- Constraints on layout, style, and modality (ICONOCLAST project)
- Question-answering (CLIME project)
Please note that the first two activities in the list have important Document
Generation as well as Knowledge Editing aspects, which is why they are
common to both of these areas.
Lexical Research
Language engineering needs lexicons: the lexicon contains information
about the sound, structure, grammar and meaning of words, which is a
core input for any language engineering system. The shift over the
last decade from 'toy' to large-scale systems, alongside a theoretical
impulse to merge grammar and lexicon, has focused attention worldwide
on lexicon form and content, and ITRI has been in the vanguard of this
movement.
Recently, the group has been applying the lexicon representation
language DATR (developed by Evans, at ITRI and the University of
Sussex, and Gazdar, at Sussex) to the description of grammar, valency,
and multilingual lexicons. It has also been working on the interface
between lexicography and traditional dictionaries on the one hand and
Language Engineering on the other; on the use of language corpora for
the automatic acquisition of lexical information; and on the core
lexical issue of the nature of word meanings and their disambiguation:
how the computer can tell which meaning of a word applies in a given
context.
The work of the lexicon group is focused on four key areas:
Lexical organisation
There are many reasons for being interested in how lexicons are or
should be organised. For the theorist, the challenge is to construct
a linguistically motivated organisation which concisely captures
the 'right' generalisations about lexical phenomena. More practically,
a well-organised lexicon is easier to understand, maintain and
extend. And for the applied language engineer, a well-organised
lexicon is the most appropriate basis for the development of
'hard-coded' modules that perform a particular lexical access
task. Our research in this area, in collaboration with the
University of Sussex, is concerned with fundamental issues of
organisation, with a current focus on multilingual lexicons in
particular. This work centres on the continuing development and
use of the lexicon description language DATR, a non-monotonic
knowledge representation language designed specifically for lexicons.
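The idea behind a non-monotonic lexicon, in which general statements hold by default but can be overridden by more specific entries, can be illustrated with a small Python sketch. This is not DATR syntax and does not reproduce DATR's semantics; the class `LexNode`, its `get` method, and the example entries are hypothetical and only show default inheritance with exceptions.

```python
class LexNode:
    """A lexicon node that inherits facts from its parent unless it overrides them."""

    def __init__(self, name, parent=None, **facts):
        self.name = name
        self.parent = parent
        self.facts = facts

    def get(self, attr):
        """Look up attr locally first, then walk up the inheritance chain."""
        node = self
        while node is not None:
            if attr in node.facts:
                value = node.facts[attr]
                # Callable values are rules, evaluated for the node that was queried.
                return value(self) if callable(value) else value
            node = node.parent
        raise KeyError(f"{self.name} has no value for {attr!r}")


# General statement: by default, the past tense is the stem plus "ed".
VERB = LexNode("VERB", cat="verb", past=lambda word: word.get("stem") + "ed")

# A regular verb states only its stem; everything else is inherited.
walk = LexNode("walk", parent=VERB, stem="walk")

# An irregular verb overrides the default past tense with its own value.
sing = LexNode("sing", parent=VERB, stem="sing", past="sang")

print(walk.get("past"), sing.get("past"), sing.get("cat"))
```

The generalisation about regular past tenses is stated once, on the `VERB` node, and the irregular entry records only what is exceptional about it. This is the kind of concise organisation the paragraph above describes.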
Dictionaries and Lexicons
Real language processing systems need large-scale lexicons. As has
often been noted, existing dictionaries, produced for people, are a
good starting point for developing large lexicons.
The CONCEDE project aims to
develop lexical knowledge bases for six Eastern European languages
(Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene) drawing on
standards developed in the Text Encoding Initiative.
Word sense disambiguation (WSD)
Open a dictionary at random: many words have more than one meaning, or
sense. Usually, when a word is used, only one of those senses applies.
Word sense disambiguation (WSD) is the task of determining which one.
ITRI organised SENSEVAL, the first open exercise for evaluating WSD
programs, in 1998 and is now developing a WSD system in the EPSRC
project WASPS.
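To make the task concrete, here is a minimal Python sketch of one simple baseline: a simplified Lesk-style heuristic that picks the sense whose dictionary gloss shares the most content words with the context. It is not the method used in WASPS or evaluated in SENSEVAL; the sense labels, glosses, stopword list, and the function `disambiguate` are assumptions made for illustration.

```python
# Hypothetical stopword list and sense inventory, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "on", "at", "that", "in", "to"}

SENSES = {
    "bank": {
        "bank/finance": "an institution that accepts deposits of money and makes loans",
        "bank/river": "the sloping land beside a river or stream",
    }
}


def content_words(text):
    """Lower-case the text, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def disambiguate(word, context, sense_inventory):
    """Return the sense whose gloss overlaps most with the context words."""
    context_set = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_inventory[word].items():
        overlap = len(context_set & content_words(gloss))
        if overlap > best_overlap:  # ties keep the earlier-listed sense
            best_sense, best_overlap = sense, overlap
    return best_sense


print(disambiguate("bank", "she sat on the bank of the river", SENSES))
print(disambiguate("bank", "he put the money in the bank", SENSES))
```

A heuristic this crude fails whenever the context shares no words with the correct gloss, which is one reason open evaluation exercises for WSD programs are useful.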
Corpora
Computer-readable text is available as never before. This makes it
possible to study language in many new ways. Our research focuses on the
characterisation of bodies of text, or corpora: how homogeneous they
are and how similar to each other; their word- and word-class-frequency
distributions, and what these tell us about language structure; and the
automatic and semi-automatic acquisition of lexicons from corpora. We
have been using the 100-million-word British National Corpus for our
investigations, and are currently leading further developments
of this major national resource.
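As a rough illustration of what comparing corpora by their word-frequency distributions can look like, the Python sketch below computes relative word frequencies and then the cosine similarity between two frequency profiles. Cosine similarity is chosen here only because it is simple; it is not claimed to be the measure used in ITRI's research, and the function names and the whitespace tokenisation are assumptions.

```python
import math
from collections import Counter


def relative_frequencies(text):
    """Map each word to its share of all tokens (naive whitespace tokenisation)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two frequency profiles (0.0 if either is empty)."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


news = relative_frequencies("the minister said the budget would rise")
sport = relative_frequencies("the striker said the match would be hard")
print(round(cosine_similarity(news, sport), 3))
```

The same profiles can also be compared between two halves of a single corpus, which gives a crude indication of how homogeneous that corpus is.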
For further information, please contact Roger Evans (+44 1273 642902);
see our contact page for full contact details.
Information Extraction
This page is under construction.
Maintained by the ITRI webmaster (webmaster@itri.brighton.ac.uk).
Last updated 22 March 2000
© Information Technology Research Institute