ITRI-04-14

Ling Yin

Topic Analysis and Answering Procedural Questions

The thesis develops a notion of extended topic with the aim of facilitating the process of finding relevant information on the Web or in large digital repositories. An extended topic resembles a query to an information system. It indicates the kind of information contained in a discourse, just as a query implies the kind of information that the questioner requires. It is formulated without addressing the detailed facts, just as a questioner can formulate a query without knowing the answer. This characteristic is referred to as a kind of generality or indicativeness. With regard to this characteristic, the constituents of an extended topic are not equal. Specifically, extended topics tend to share a common structure: a specific part referring to some concrete entities or problems, and a generic part that denotes a general perspective from which the specific part is approached. There is a set of contexts in which extended topics may be explicitly phrased, including WH-questions, indicative summaries, entries in an index, and topic expressions describing the plan of an academic paper. The theory of extended topic has several applications. In particular, we observe that the specific part of an extended topic is kept in the discourse, whereas the generic part is replaced by detailed facts. For information retrieval and question answering, this indicates that word-frequency-based approaches may not be as effective in retrieving the generic part of a topic expression as in retrieving the specific part. We suggest that many concepts that are typically used in the generic part (e.g., 'history', 'procedure') are associated with a list of detailed elements (e.g., 'procedure' is associated with 'actions', 'precondition', 'causal' relationships between 'actions', etc.). To detect whether a text is about a general concept, we suggest first extracting these detailed elements based on some discourse patterns.
Our major practical goal is to automatically retrieve the generic concept. We will also study strategies for combining the ranking of a document according to the specific part and the ranking according to the generic part.
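
The idea of scoring a document separately on the specific part and the generic part, and then combining the two rankings, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the pattern list `PROCEDURE_PATTERNS`, the functions `specific_score`, `generic_score` and `combined_rank`, and the linear weighting are all hypothetical stand-ins for whatever discourse patterns and combination strategy the thesis actually studies.

```python
import re

# Hypothetical surface cues for the generic concept 'procedure'.
# Each group stands for one kind of detailed element named in the abstract.
PROCEDURE_PATTERNS = [
    re.compile(r"\b(first|then|next|finally|after that)\b", re.IGNORECASE),   # ordered actions
    re.compile(r"\b(before|once|make sure|you will need)\b", re.IGNORECASE),  # preconditions
    re.compile(r"\b(so that|in order to|because)\b", re.IGNORECASE),          # causal links
]


def specific_score(text, specific_terms):
    """Fraction of specific-part terms that occur as words in the text."""
    if not specific_terms:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = sum(term.lower() in words for term in specific_terms)
    return hits / len(specific_terms)


def generic_score(text, patterns=PROCEDURE_PATTERNS):
    """Fraction of pattern groups with at least one match in the text."""
    hits = sum(bool(pattern.search(text)) for pattern in patterns)
    return hits / len(patterns)


def combined_rank(docs, specific_terms, weight=0.5):
    """Rank documents by a weighted sum of the two scores (assumed scheme)."""
    scored = [
        (weight * specific_score(doc, specific_terms)
         + (1 - weight) * generic_score(doc), doc)
        for doc in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


if __name__ == "__main__":
    docs = [
        "The printer was invented decades ago and the cartridge market has grown.",
        "First, unplug the printer. Then open the cover so that you can reach the cartridge.",
    ]
    for doc in combined_rank(docs, ["printer", "cartridge"]):
        print(doc)
```

Both example documents contain every specific term, so a purely term-based score cannot separate them; only the pattern-based generic score distinguishes the procedural text from the historical one. This mirrors the observation above that the generic part of a topic is rarely repeated verbatim in the discourse.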