ITRI/NLCL Student Workshop 2004
Location ITRI W107
Date Friday May 21, 2004
Organization ITRI Research Student Division (Paul Piwek) at the University of Brighton and NLCL Doctoral Programme (John Carroll) at the University of Sussex
Timing There will be short (maximum 10 minutes) and long talks (maximum 25 minutes). The aforementioned times include discussion time. Speakers are encouraged to reserve 5 minutes or more for discussion.
Schedule
9:30 Tea/Coffee
9:45 Welcome
10:00 Jon Herring, Orthography and the lexicon (long)
10:25 Thapelo Otlogetswe, Corpus Construction for Languages with Limited Written Traditions (long)
10:50 Mark McLauchlan, Semi-supervised parse selection (long)
11:15 Break with coffee/tea
11:35 Albert Gatt, Some empirical issues in the study of reference (and why a generator must face them). (long)
12:00 Marina Santini, Exploring Web genres with cluster analysis (short)
12:10 Eva Esteve Ferrer, Title tba (short)
12:20 Ling Yin, Topic Study and Answering Non-Factoid Questions (short)
12:30 Lunch (will be provided)
13:30 Tim Cumming, Using Topic-Sentiment Vectors to Interpret Sentiment "in Colour" (long)
13:55 Norton Trevisan Roman, Politeness and Bias in Automatic Dialogue Summarization (long)
14:20 Jason Teeple, Semantic content permutation and its effect on surface linguistic form in natural language generation (long)
14:45 Break with coffee/tea
15:05 Daoud Clarke, Towards a Natural Language Algebra (long)
15:30 Chi-Ho (Henry) Li, Monolingual and Bilingual Grammar Induction (short)
15:45 Xinglong Wang, Automatically Acquiring Semi-Sense-Tagged Corpora (long)
16:10 CAKES
Participants John Carroll, Paul Piwek, Ling Yin, Marina Santini, Albert Gatt, Jason Teeple, Thapelo Otlogetswe, Jon Herring, Norton Trevisan Roman, David Martul, Eva Esteve Ferrer, Daoud Clarke, Tim Cumming, Chi-Ho (Henry) Li, Mark McLauchlan, Xinglong Wang, Richard Power, Roger Evans, Kees van Deemter, Adam Kilgarriff, Lynne Cahill, David Weir, Bill Keller
Abstracts

Daoud Clarke, Towards a Natural Language Algebra

Semantic formalisms for natural language, such as Montague semantics, generalised quantifiers or Boolean semantics, are well respected by logicians; however, I will argue that they fail to capture some fundamental features of meaning in language. For example, measures of word similarity seem to indicate a less black-and-white view of meaning than the strict hypernymy relations in WordNet, yet there is currently no formalism to provide a basis for 'fuzzy' hypernymy relations, nor any way to integrate them into more traditional representations. Another problem with such formalisms is their lack of an explicit representation of semantic ambiguity.

The hope is that an algebraic approach to natural language will be more amenable to incorporating such ideas. I will show how a simple algebraic construction leads to the same entailment relations as traditional semantic formalisms, and introduce some work in progress.

Tim Cumming, Using Topic-Sentiment Vectors to Interpret Sentiment "in Colour"

My research focuses on analysing commercial, corporate and expert texts for sentiment "colour". To date, most techniques related to sentiment focus on the "monochrome" classification of positive or negative sentiment within certain boundaries. My three key research aims are: (1) Adjacency: associating common attributes of a single topic in multiple texts (e.g. a leader is visionary in text 1, far-sighted in text 2 - while there is a positive connotation to both, the attribute of "forward-thinking" provides the colour); (2) Gradation: we are interested in knowing the degree to which the leader is forward-thinking - either a little or a lot; (3) Marshalling: if another document implies the leader is arrogant, we observe diverse opinions, so we must find an easy way to report both the pattern and the diversity. My work will focus on summarising opinions from multiple texts, for the purpose of producing a single summary which distills opinion and identifies notable exceptions to the norm.

Albert Gatt, Some empirical issues in the study of reference (and why a generator must face them).

In this short presentation, I will discuss some of the problems in current work on Generation of Referring Expressions (GRE), and argue that they call for more empirical research, especially corpus-based and psycholinguistic experimentation. The issues discussed are:

a. The problem of incrementality and description complexity: Dale and Reiter's (1995) incremental algorithm predicts that informational redundancy in descriptions falls naturally out of the strategy used by speakers in content selection. This becomes problematic in the generation of complex NPs, which have a high degree of redundancy and/or logical complexity. Recent proposals dealing with this problem have maintained an intuitionistic or vague definition of "description complexity". I will discuss some problems in defining this notion, arguing that it must ultimately take into account both speaker- and hearer-oriented perspectives. While 'complexity' from the speaker's point of view is related to the difficulty of finding identifying attributes of an intended referent, from the hearer's point of view it involves the incremental interpretation of a description and the circumscription of the referential domain based on its information content. These two notions of complexity are not necessarily isomorphic. This calls for more empirical research, since very little is known about the thresholds beyond which (i) the search for an identifying description fails in production; (ii) the search for an intended referent on the basis of an identifying description fails in comprehension.

b. Domain structuring: One outcome of (a) is that the generation process could capitalise on knowledge and information shared by speaker and hearer. Most approaches to GRE have assumed either that the referential domain is made up of raw information, or that it is structured in advance by imposing a preference ordering on domain attributes. This is one of the causes of excessive complexity of descriptions, since content selection procedures are extremely rigid. I will suggest that a more realistic approach would involve a dynamic process of domain structuring. This calls for investigation of how interlocutors organise a domain on the basis of intentional and perceptual cues. This process is logically prior to content selection proper.

Jon Herring, Orthography and the lexicon

In this presentation, I will review my research into the representation of orthographic structures in the lexicon. The work took as its starting point hierarchical lexicon architectures such as PolyLex (Cahill and Gazdar 1999) whose output is the phonological form of inflected words. My hypothesis was that the corresponding orthographic forms would not need to be specified separately in this type of lexicon. This assumes that there is enough linguistic information encoded in the lexical entries to derive the spelled forms directly, and thereby reduce redundancy. Taking a relatively 'deep' orthography such as French, I will discuss how phonological and morphological information may be enough to predict the correct spelling, and show the types of mechanism that need to be implemented in the lexicon itself in order to produce the orthographic forms.

Ling Yin, Topic Study and Answering Non-Factoid Questions

Much of the current research in automatic question answering (QA) is driven by evaluation exercises such as TREC or CLEF, all of which focus on answering factoid questions (questions that can be answered by a short phrase). For factoid questions, sentences that contain the answer often obey some predictable patterns. Compared to factoid questions, non-factoid questions are much more difficult, both in judging the relevance of information and in integrating information. In this talk, we introduce our aim of answering one class of complex questions. We will briefly present some work that we have done on the study of questions (which can be seen as a kind of topic expression); identify difficulties in answering non-factoid questions; and describe how these difficulties are tackled or dodged in other similar work.

Mark McLauchlan, Semi-supervised parse selection

Unsupervised and semi-supervised approaches to parse selection are attractive since they avoid the need for expensive manually-annotated treebanks. Several approaches have been suggested, from the Expectation-Maximisation algorithm to active learning and co-training. These are all iterative approaches where increasingly accurate parse selection models are created in each learning iteration. However, such iterative procedures are usually very time-consuming. I will show that by identifying good training examples we can achieve significant improvements in parser accuracy in just a single iteration. Time permitting, I will also show how my technique can be incorporated into the co-training framework to create even more accurate parse selection models.

Thapelo Otlogetswe, Corpus Construction for Languages with Limited Written Traditions

In this talk we outline our PhD thesis. We argue that lexicography needs large corpora that cover substantial samples of each significant variety of a language, so that a lexicographer does not miss any words or patterns of use of words from whatever genre (cf. Biber 1990: 263). What such varieties are and in what proportion they have to appear in a corpus is usually not clear.

While languages like English have huge corpora that have benefited linguistic research of all sorts and the compilation of many dictionaries, many of the world’s languages do not have corpora on which to rely for dictionary construction, nor do they have models to construct and test corpora. It is the aim of the proposed research to develop a well-formed corpus model for languages with a limited written tradition [LWT languages]. Such languages present unique challenges since they are characterised by greater levels of diglossia, code switching and borrowing.

Our model will not only capture the language varieties but also provide guidance on the proportion of varieties to be included in a corpus. Central to our argument is that the capturing of different varieties in a corpus can be determined quantitatively and that quantitative methods can be used in the calculation of corpus proportions. The thesis will therefore explore quantitative approaches to judging how well different corpus collection strategies provide good coverage.

Norton Trevisan Roman, Politeness and Bias in Automatic Dialogue Summarization

Is impolite behaviour in a dialogue so important that it is worth mentioning in the dialogue summary? Apparently yes. Pilot studies have suggested that when dialogues are very impolite, the impolite behaviours are generally mentioned in the dialogue summaries. Moreover, the way they are reported is biased according to the point of view from which the dialogue is summarized. In this presentation I address the results of these pilot studies, as well as future directions for the research and, more specifically, how these results will be used in my research.

Marina Santini, Exploring Web genres with cluster analysis

In my short presentation, I will provide the first results of an experiment I am carrying out using cluster analysis and small random samples extracted from an unclassified collection of Web documents, the SPIRIT collection. The experiment is merely exploratory and will be restricted to Web documents in English.

I will use standard statistical packages to run cluster analysis (SAS, Minitab, SPSS).

For the clustering, I am using simple features that have been proved effective for genre identification before (frequencies of the most common words, POS tags, HTML tags, etc.).

The clusters will be evaluated against a human classification of the samples.

Jason Teeple, Semantic content permutation and its effect on surface linguistic form in natural language generation

This research project investigates several interdependent issues in natural language generation. Document structuring, as defined in the three-stage reference architecture (Dale and Reiter 2000), determines how units of information will be grouped and related to each other in a text. In this investigation we hold constant a number of variables at the document structuring stage, namely 1) 'deep' semantic content and 2) rhetorical relations between information units, in order to isolate ordering variation and thus understand its effects more fully. We take as input a set of information units and the rhetorical relations between them. We hold these invariant and permute the information units to determine what changes to surface linguistic forms are required to accommodate these different orderings. If the goal of generation (and language generally) is to communicate, then a sub-goal in service of this is to communicate clearly and fluently. Coherence and cohesion are two well-studied factors that contribute to fluency. Variation in four surface-level features accounts for much of the coherence and cohesion of a text: 1) referring expressions, 2) verb form, 3) explicitly or implicitly expressed discourse (rhetorical) relations, and 4) syntactic alternations in topic or focus. Thus constraints on local variation are motivated by the global factor of fluency. This research seeks to illuminate some of the dependencies necessary between stages in an NLG architecture: what contextual information is needed to make local decisions. The research project addresses the narrative domain and takes primitive events within the story as information units. We hope that the theory and techniques developed will be general enough to apply to other domains of natural language generation.

Xinglong Wang, Automatically Acquiring Semi-Sense-Tagged Corpora

In the context of Word Sense Disambiguation, the performance of supervised algorithms is strongly affected by the quantity and quality of training and testing materials. Algorithms trained on large sense-tagged corpora tend to outperform those trained on small ones. Unfortunately, large sense-annotated corpora are rare, and manually producing them is extremely expensive and time-consuming. This problem is the so-called knowledge acquisition bottleneck. In this talk, I will present a method for automatically acquiring English semi-sense-tagged corpora by processing a very large collection of Chinese texts gathered from the Internet using search engines and hidden web-sites. I will also discuss possible problems with this approach.
Last Modified on May 20 2004

© Information Technology Research Institute at the University of Brighton