| ITRI/NLCL Student Workshop 2003 | |
| Location | ITRI W107 |
| Date | Friday May 9, 2003 |
| Organization | ITRI Research Student Division (Paul Piwek) at the University of Brighton and NLCL Doctoral Programme (John Carroll) at the University of Sussex |
| Timing | There is a 35-minute slot for each talk, including discussion. We assume 5 to 15 minutes of discussion after each talk, so talks can last between 20 and 30 minutes, depending on how much time the speaker would like to use. |
| Schedule | |
| 9:30 | Tea/Coffee |
| 9:45 | Welcome |
| 10:00 | Julie Weeds, Measures of Distributional Similarity |
| 10:35 | Ivandre Paraboni, Using logical redundancy to facilitate search in the generation of referring expressions |
| 11:10 | Break |
| 11:30 | Mark McLauchlan, Bootstrapping a better parser |
| 12:05 | Yin Ling, Capturing topic for helping Information Retrieval |
| 12:40 | Lunch (will be provided) |
| 13:40 | Xinglong Wang, Automatically Acquiring Semi-Sense-Tagged Corpora |
| 14:15 | Marina Santini, Automatic Web Genre Classification |
| 14:50 | Break |
| 15:10 | Chi-Ho (Henry) Li, Automatic Grammar Induction by Distributional Clustering |
| 15:45 | Jon Herring (Title: TBA) |
| Participants | John Carroll, Paul Piwek, Donia Scott, Julie Weeds, Ivandre Paraboni, Mark McLauchlan, Yin Ling, Xinglong Wang, Marina Santini, Chi-Ho (Henry) Li, Jon Herring, Malin Bergenstrahle, Chong-Hoon Ha and Daniel Paiva. |
| Abstracts |
Julie Weeds, Measures of Distributional Similarity
Two words are said to be distributionally similar if they appear in similar contexts. A measure of distributional similarity has many potential applications of which I will present a brief survey. However, there have been many different measures proposed over the years and this work investigates whether the properties which lead to high performance in one application area necessarily lead to high performance in others.
Ivandre Paraboni, Using logical redundancy to facilitate search in the generation of referring expressions
This talk concerns the computational generation of referring expressions (GRE), which is one of the key components of any Natural Language Generation System. In much of the existing work, GRE is understood as the task of producing a sufficient set of properties such that it uniquely distinguishes the referent from other entities in a certain context. Descriptions produced in this way aim exclusively at the uniqueness of the reference, and hence convey little or no logical redundancy. When the domain is large or complex, however, it may be necessary to include logically redundant properties in order to facilitate search. For example, it may be necessary to say "the car park next to the *South* entrance of the building" even in a context in which there is only one car park near one of the entrances of the referred building, therefore making the information about the position of the intended entrance (i.e., the property 'South') unnecessary for disambiguation. Based on these insights, we propose a number of algorithms for generating descriptions in a structurally complex domain, some of which add redundant information to facilitate search, and we report the results of a psycholinguistic experiment confirming our hypotheses about the usefulness of redundant information.
Mark McLauchlan, Bootstrapping a better parser
My research builds on work by John Carroll and Ted Briscoe on an accurate, wide-coverage statistical parser. This parser has good coverage partly because it relies mostly on structural information rather than lexical features. This is sufficient to resolve many kinds of ambiguity, but other work has shown that words are necessary to disambiguate some structures such as prepositional phrases. In this talk I will describe a lexicalised parsing model that reranks the output of this non-lexicalised parser. It produces a slight increase in accuracy, which is surprising given that the training data is based on output from that same parser. The key is to screen the training data and use only the unambiguous parse constituents. This approach gives us access to much larger training sets, although at the expense of accuracy. I will discuss the merits of this approach and some possible applications.
Yin Ling, Capturing topic for helping Information Retrieval
In the current implementations of search engines for the Web or for digital libraries, word-based Information Retrieval is still the most commonly used strategy. One inherent problem with this strategy is that the semantic relationships between words (or concepts) are lost when keywords are used to represent the query and the documents to be retrieved. Another approach is semantic structure indexing and matching. Since semantic structure can encode the literal meaning of a document, this approach solves the problem mentioned above. However, to retrieve related information, users must phrase their queries at a level of detail similar to that of the document itself, which is unrealistic in some situations, especially when querying academic literature. We suggest that users sometimes prefer to query for detailed information by describing its topic.
To enable such topic queries, we intend to create a schema for formulating topics of textual units and to study automatic topic extraction methods on the basis of this schema. In this talk, we will elaborate on the origin of the idea, identify the common vocabulary used in describing topics, discuss the relationship between the semantic content and the pragmatic features of a textual unit and its topic, and propose possible methods for automatic topic extraction.
Marina Santini, Automatic Web Genre Classification
Genre may be crucial in many areas: Web Navigation, Document Management, Digital Libraries, Search Engines, etc., and can be integrated into content-based systems to make Information Retrieval more efficient. Genres are very seldom a pure instantiation of just one fixed typology. They are unstable and changeable, especially on the Web: there is a tension between continuity and change that gives rise to evolving genres and new genres. For this reason, in my project genres will be seen as bundles of properties, and not as monolithic units. An analysis of Web documents based on linguistic, structural and layout features will be attempted in order to have a deeper insight into Web texts: the Web is a huge repository of running documents and a challenging new kind of corpus. My project includes several interrelated interests. The most important are the following:
Xinglong Wang, Automatically Acquiring Semi-Sense-Tagged Corpora
In the context of Word Sense Disambiguation, the performance of supervised algorithms is strongly affected by the quantity and quality of training and testing materials. Algorithms trained on large sense-tagged corpora tend to outperform those trained on small ones. Unfortunately, large sense-annotated corpora are rare, and manually producing them is extremely expensive and time-consuming. This problem is the so-called knowledge acquisition bottleneck. In this talk, I will present a method for automatically acquiring English semi-sense-tagged corpora by processing the very large collection of Chinese texts gathered from the Internet using search engines and hidden web-sites. I will also discuss possible problems with this approach.
Chi-Ho (Henry) Li, Automatic Grammar Induction by Distributional Clustering
A complete syntactic analysis system comprises a parser and a set of grammar rules. Compared to the vast amount of research on parsing algorithms, the problem of grammar acquisition has received relatively little attention. Moreover, most of the existing grammar induction work is focused on the acquisition of grammar parameters (rule probabilities) rather than that of grammar structure (the rules themselves). In the presentation I will outline an unsupervised approach to the induction of grammar structure that makes use of 1) some clustering algorithms in machine learning, and 2) the notion of syntactic distribution in structural linguistics. Some encouraging results of my pilot experiment will also be presented to support the approach. |
|
Last Modified on May 20 2004 by
© Information Technology Research Institute at the University of Brighton |