ITRI-05-02
Marina Santini
Linguistic Facets for Genre and Text Type Identification: A Description of Linguistically-Motivated Features
In this report we propose a new set of features for automatic genre and text type identification: linguistic facets. The label "linguistic facet" stresses that each feature in this set highlights a facet, i.e. an aspect of the communicative context that is reflected in the use of language. Linguistic facets subsume two sets of features: functional cues and syntactic patterns. Functional cues are not completely new. Syntactic patterns, by contrast, have never been tried before in any automatic approach. Both functional cues and syntactic patterns are linguistically-motivated features that can be interpreted functionally. The aim is to derive, from text-internal linguistic cues, the communicative context in which the text has been produced or consumed. Linguistically-motivated and functionally-interpretable features make it possible to combine qualitative analysis with quantitative findings, and to obtain a more accurate text analysis together with genre and text type identification. Preliminary results show that linguistic facets have a robust discriminating power for web genre classification.