ITRI-03-19

Adam Kilgarriff

Linguistic Search Engine

Also published in Proceedings of Corpus Linguistics, Lancaster, UK, March 2003, pp.53-58

We propose to build a linguistic search engine, similar in overall design to Google or Altavista but meeting the specifications and requirements of researchers into language.

 

Language scientists and technologists are increasingly turning to the web as a source of language data, because other resources are not large enough, because they do not contain the types of language the researcher is interested in, or simply because it is free and instantly available.  The default means of access to the web is through a search engine such as Google. While the web search engines are dazzlingly efficient pieces of technology and excellent at the task they set themselves, for the linguist they are frustrating.  The search engine results do not present enough instances (Google sets a limit of 2000) or enough context for each instance (Google generally provides a ca 10-word fragment), they are selected according to criteria which are, from a linguistic perspective, distorting, and they do not allow searches to be specified according to linguistic functions such as lemmatization and word class.

 

Each of these goals could straightforwardly be resolved, but approaches to Google to date have gone unanswered. However this suggests a better solution: rather than depend upon existing search engines, it would be far better to set up a linguistic search engine, dedicated to linguists’ interests. Then the kinds of processing and querying would be designed explicitly to meet linguists’ desiderata, without any conflict of interest or ‘poor relation’ role.  Once this is set up, large numbers of possibilities open out. All those processes of linguistic enrichment and `linguistic data mining’ which have been applied with impressive effect to smaller corpora could be applied to the web so that web searches could be specified in terms of linguistically interesting units such as lemmas, word classes, and constituents (e.g. noun phrase) rather than strings. Thesauruses and lexicons could be developed directly from the web.  The way would be open for further anatomizing of web text types and domains, both a topic of interest in itself and one where strategies would be needed so that web-based lexical resources could be developed for specific text types or domains, or so that the biases of the web could be countered to provide ‘general languages’ and ‘sublanguage’ resources from the web.  All of this can potentially be done for all of the many languages for which there is ample data on the web.

 

The web, teeming as it is with language data, of all manner of varieties and languages, in vast quantity and freely available, is potentially a fabulous linguists’ playground.  The Linguistic Search Engine will bring that dream closer to reality.