SENSEVAL-2: Hong Kong Planning Meeting Notes

Friday 6 Oct
Hong Kong Convention and Exhibition Center, at ACL.

Present: Nicoletta Calzolari (representing 'languages other than English'), Phil Edmonds, Adam Kilgarriff (chair), Martha Palmer, David Yarowsky (representing participants), Antonio Zampolli (for part of meeting)

  1. Exact dates of meeting

    It had already been agreed that the SENSEVAL-2 workshop should be held in conjunction with ACL-01 in Toulouse. ACL-01 has set aside Friday 6th and Saturday 7th for workshops (with Sun 8th being for tutorials, and the main conference 9th-11th). Patrick St-Dizier had offered us an extra day, on Thurs 5th. Julio Gonzalo had suggested that at least part of the meeting should be open only to SENSEVAL participants and organisers, to provide a slightly-less-public forum where, eg, complaints about the organisation could be aired.

    We opted for a two-day workshop, with the first day, on Thurs 5th, open to participants only, and the second day fully open (subject to this being acceptable to ACL and the senseval-discuss list).

    We noted that US participants committed to spending 4th July with their families in the US would have trouble getting there for the first day. We thought it unlikely this would affect many people, but we await responses/objections following the mailouts.

  2. Standardisation across languages

    We agreed that this would be a good thing. It would let us give a single answer when language-task co-ordinators ask "what format should I use?". It would make it easier for participants to take part in several language tasks. (David hoped to participate for all language tasks!) It would make the resources produced more attractive at the end of the day.

    Tagged corpora: Martha and Scott Cotton at UPenn will work out a standard for the tagged datasets, using Linguistic Data Consortium (LDC) best practice and XML Corpus Encoding Standard (XCES) as appropriate.

    Format for participants to return answers in: it was not necessary for participants to return fully tagged corpora. The quantities of data being passed around would be large, and this might well cause problems. The format for answers, for all languages, would be exactly as for SENSEVAL-1-English, that is, one-line-per-answer, with the line comprising a unique reference for the token being tagged, and one or more sense-identifiers, optionally associated with probabilities. Field separator is 'space', separator between a sense-identifier and its probability is ':'. Full specification, as drawn up by Joseph Rosenzweig, is already available from SENSEVAL-1 on the SENSEVAL website. (In SENSEVAL-1-Eng, the unique reference comprised two parts; the "task name" and the reference number within that task. This model will be reviewed, particularly in relation to the all-words task.)
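
    As an illustration of the answer format described above, here is a minimal Python sketch of a parser for one answer line. It assumes the two-part SENSEVAL-1-Eng reference (task name, then reference number), which is itself under review, and it assumes that sense-identifiers contain no ':' of their own. The names parse_answer_line and Answer, and the sample identifiers, are made up for illustration; Joseph Rosenzweig's specification on the SENSEVAL website remains the authority.

```python
from typing import List, NamedTuple, Optional, Tuple


class Answer(NamedTuple):
    task: str       # first part of the unique reference (assumed: task name)
    instance: str   # second part (assumed: reference number within the task)
    senses: List[Tuple[str, Optional[float]]]  # (sense-identifier, probability or None)


def parse_answer_line(line: str) -> Answer:
    """Parse one space-separated answer line into its reference and sense tags."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected a two-part reference and at least one sense: %r" % line)
    task, instance, *tagged = fields
    senses = []
    for item in tagged:
        # ':' separates a sense-identifier from its optional probability.
        sense_id, sep, prob = item.partition(":")
        senses.append((sense_id, float(prob) if sep else None))
    return Answer(task, instance, senses)


# Hypothetical line: two candidate senses with probabilities.
answer = parse_answer_line("accident-n 700001 s1:0.7 s2:0.3")
print(answer.senses)  # [('s1', 0.7), ('s2', 0.3)]
```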

    We would like as many language tasks as possible to adopt the standard formatting. Phil will communicate this to the task organisers, and will also assist the task organisers in adopting the formatting.

    Scoring: The same scoring software to be used for all language tasks. Penn will set up a web-based scorer, where participants can submit two files, typically 'their system's answers' and 'the key/correct answers' and get back a score. Joseph R's scoring software from SENSEVAL-1 to be re-used. Web page describing how to use the web scoring software, what options there are, etc, to be provided.

    We should aim to get the tagged corpora format and web-based scorer running by Christmas, as they are key for people to be able to start working on their systems.

    Options for 'truth': All language tasks should have the same possibilities for the gold standard. In addition to the senses that the dictionary identifies for the word, for English SENSEVAL-1 the three tags PROPER-NAME, TYPO and UNASSIGNABLE (P, T, U) were always available. Also, a disjunction of tags (where disjuncts can be P, T, U as well as dictionary tags) was acceptable. This model was proposed for all languages for SENSEVAL-2. Guidelines for when human taggers should specify disjunctions, or P, T and U tags, are available.

  3. Sample words

    The sample words for the English task have been identified. These will be made available under strictest confidence to other language-task organisers, to give them the option of selecting translation-equivalents for their language task.

    As for English SENSEVAL-1, the English lexical sample is a stratified random sample, comprising high, medium and low polysemy and frequency nouns, verbs and adjectives (adjusted to ensure overlap between lexical sample and all-words words).

  4. Datasets

    Given limited manual tagging resources, what should be tagged? We worked out the following scheme, which will be pursued for English and Italian, and can be presented as a model that organisers for other languages can follow if the specification fits the resources they have available (and it suits their ideas of how they wish to proceed - we are not dictating that this model has to be followed).

    For the 'all-words' task (which will be undertaken for English and Italian, though we do not expect it to be set up in many other languages) - sense-tag all predicates, nouns which are heads of noun-phrase arguments to those predicates, and adjectives modifying those nouns, in a text. [[Martha, have I got this right?]] The text(s) to be used will add up to 5,000 running words, of which about 2,000 are predicates, etc, which will be tagged.

    Lexical sample: The following formula was adopted:

    For each word (lemma) in the sample
    - if it has n senses
    - we tag 75 + 15n instances (so, eg, we have 120 instances for a 3-sense word, and 225 instances for a 10-sense word).

    These are then divided in a ratio of 2:1 between training set and test set:
    - training set: 50+10n (80 instances for a 3-sense word)
    - test set: 25+5n (40 instances for a 3-sense word).
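
    The formulae above can be written out as a short Python sketch. The function name lexical_sample_counts is hypothetical; the arithmetic is exactly the 50+10n / 25+5n split given above.

```python
def lexical_sample_counts(n_senses: int) -> dict:
    """Number of instances to tag for a lemma with n_senses senses."""
    if n_senses < 1:
        raise ValueError("a lemma must have at least one sense")
    training = 50 + 10 * n_senses   # training set: 50+10n
    test = 25 + 5 * n_senses        # test set: 25+5n
    # The total is 75+15n, split 2:1 between training and test.
    return {"total": training + test, "training": training, "test": test}


print(lexical_sample_counts(3))   # {'total': 120, 'training': 80, 'test': 40}
print(lexical_sample_counts(10))  # {'total': 225, 'training': 150, 'test': 75}
```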

    For languages other than English, this can be seen as a goal, which it may or may not be possible to realise, for example because there are not enough occurrences in the corpus.

    [Post-meeting addition by AK:
    Generally, n, the number of senses, should be counted looking at the finest-grain distinctions the dictionary offers. Also, multiword expressions containing the lemma (in any of its inflectional forms), if present in the dictionary, should be counted in n. For words which feature in many multiwords and where the dictionary lists many multiwords (eg, the WordNet entry for lemon, as discussed in the run-up to SENSEVAL-1), this will give rise to disconcertingly large values of n. Where this happens, the formulae may well need revision, perhaps along the lines of

    where n is the number of 'single-word senses' and m is the number of 'multiword senses' (eg where at least one word in addition to the nodeword must be present in order for the sense to hold.)
    ]

    The target lexical sample size is 100 (50 nouns, 25 verbs, 25 adjectives), but if the budget will not stretch that far, the numbers of lemmas in the sample should be reduced, with the formulae for numbers of instances per lemma held constant.

    David suggested that, for maybe two words, there should be far more training data (say, 500 instances) so that performance improvement with increasing training set size can be investigated. Resource permitting, this will be done for English.

  5. Treebanks

    The English and Italian all-words tasks would probably use treebank data. This raised issues of markup, as treebank markup tends to be complex. It was agreed that the treebanked version of the text and the simpler version would be provided as two separate files, one that is just text and one that also has POS tags and brackets. Participants wanting to make use of the tree structures would need to make the links between the two files.

    Treebank material will be provided only for testing. There will not be a training corpus of manually-tagged, treebanked data, beyond what Martha had already put in the public domain (as in SIGLEX-99 and ELRA2-2000 papers) and small samples to show the markup conventions, for Italian and, if the conventions have changed at all, for English.

  6. Dictionaries

    English to use WordNet 1.7, currently in preparation (and to include changes made so that it is better suited to SENSEVAL). This should be available by Christmas or shortly after -- Adam to check. Italian to use ItalWordNet (pending consensus from the National Project partners).

  7. Manual tagging methods

    There was discussion concerning whether it is appropriate to run a WSD system (or similar) over the text before it is manually tagged, and then to sort instances so that ones which the WSD system thought looked similar, were presented to the human tagger together. It seemed likely that this would speed up tagging, but it would also introduce a risk of prejudicing the human.

    Adam to do some trialling of both the speedup and the risk.

    For English, there would be two tagging teams, one working on the all-words task at UPenn, the other on the lexical sample task in the UK. To gauge consistency between the two, the UK taggers would tag the all-words instances of the lexical sample words and the US ones would tag some of the UK lexical sample data.

  8. Where does the data come from?

    It was desirable that the corpus data used for evaluation sets was not already in the public domain, and was not likely to have been used already for training WSD systems. However, it was not crucial, on a 'needles in haystacks' argument. If evaluation data is drawn from the LDC's 200M-word North American News corpus, then the odds of a WSD system gaining advantage through having used exactly the same 10,000 tokens to train on, as were then used for evaluation, are long.

    It is desirable that the text type of the evaluation data is a text type for which there is plenty of data easily obtainable. If, for example, evaluation data is taken from Le Monde, that means that teams wanting lots of data to train on can simply go out and buy CD-ROMs of Le Monde, without task co-ordinators needing to get involved. (That is the position for unsupervised training, eg, where the system trains on raw data. For supervised training, which requires sense-tagged data as input, the situation is different as the resource will never be in ample supply.)

    For English lexical-sample, Adam is considering using the web as a source of evaluation data, probably constrained to be, eg, within a certain domain area within the mozilla.com or yahoo.com directory structures. (This more experimental approach to be balanced by another subset of the data probably taken from the LDC NANews corpus, and/or the BNC. The wider the range of sources for data selection, the more likely we are to get senses that aren't covered in the inventory: there is likely to be more use of the "unassignable" tag.)

    For lexical-sample tasks, one important parameter is the quantity of context for each instance to be tagged. This was just two sentences in English SENSEVAL-1, which was not adequate to deploy some algorithms. A suitable target is, roughly, a `page', or 100 words (where no suitable textual unit is locatable). This has implications for the bulk of datasets to be distributed. If there are 100 items in the lexical sample, an average of 150 instances per lemma, and 100 words of context per instance, the dataset size is 1.5M words - for English, the order of 10MB uncompressed, or 3MB compressed.
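
    The size estimate above is a simple product, sketched here in Python. The function name dataset_words is hypothetical, and the bytes-per-word figure is only what the 10MB estimate implies if 1MB is taken as one million bytes.

```python
def dataset_words(lemmas: int, instances_per_lemma: int, context_words: int) -> int:
    """Total running words of context distributed for a lexical-sample task."""
    return lemmas * instances_per_lemma * context_words


words = dataset_words(lemmas=100, instances_per_lemma=150, context_words=100)
print(words)  # 1500000

# Bytes per word implied by the 10MB uncompressed estimate (assumes 1MB = 10**6 bytes).
print(round(10_000_000 / words, 1))  # 6.7
```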

    While this is a substantial quantity of data to download, we felt that, given a high bandwidth Internet connection (which we believe all participants would have), this was not a problem.

    [Post-meeting addition by AK:
    Following discussions related to Escudero, Marquez and Rigau's paper at EMNLP, and suggestions from Ken Church and John Carroll, we are considering a cross-domain aspect to the task, where the data is taken from two distinct domains, and the WSD system trained on the one domain is evaluated on the other, and vice versa. Contribution/responses to this idea most welcome!
    ]

  9. Extra tagging

    David proposed that a superset of the evaluation set be presented to participants for sense-tagging, without telling them which of the instances were part of the evaluation set and which were not. Thus, although only 75+15n tokens per word would have been manually tagged, we could acquire an additional set of a further, say, 200 instances per word which had not been manually disambiguated but had been disambiguated by a number of state-of-the-art WSD systems, thereby providing an interesting resource. All were agreed this was a good plan.

  10. Timetabling and data distribution

    Dates for preparation of datasets, their distribution, return of results etc were set. Phil to prepare and post a schedule with exact dates.

    Date-stamping: We decided to use an automatic date-stamping procedure to ensure that all participants got long enough, but not too long, to work with the data (even if they happened to be away from the office for part of the critical period). So there will be a six-week critical period. Within that period, each participant, for each language they are participating in:

    1. downloads the lexical-sample word list and the training data
    2. downloads the test data
    3. uploads their system results; this must be
      • not more than 7 days after downloading the test data
      • not more than 21 days after downloading the lexical sample and training data
      • by the end of the critical period
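
    The three upload constraints above amount to taking the earliest of three deadlines. A minimal Python sketch follows; the function names and the example dates are hypothetical (the real schedule is for Phil to post), and "not more than 7 days after" is read as inclusive of the seventh day.

```python
from datetime import datetime, timedelta

TEST_WINDOW = timedelta(days=7)       # after downloading the test data
TRAINING_WINDOW = timedelta(days=21)  # after downloading word list and training data


def submission_deadline(training_download: datetime,
                        test_download: datetime,
                        period_end: datetime) -> datetime:
    """Latest permitted upload time: the earliest of the three limits."""
    return min(test_download + TEST_WINDOW,
               training_download + TRAINING_WINDOW,
               period_end)


def is_on_time(upload: datetime, training_download: datetime,
               test_download: datetime, period_end: datetime) -> bool:
    """True if the upload meets all three constraints."""
    return upload <= submission_deadline(training_download, test_download, period_end)


# Hypothetical participant within a hypothetical six-week critical period.
period_end = datetime(2001, 6, 12)
training = datetime(2001, 5, 1)
test = datetime(2001, 5, 10)
print(submission_deadline(training, test, period_end))  # 2001-05-17 00:00:00
```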

    It seems appropriate that this plan and the same server are used across all language tasks. That means that Univ of Pennsylvania would be the hub for all data distribution (as well as results collecting, and scoring). This seems an efficient way to proceed. However, if organisers for other tasks would rather distribute their own data, that will be fine (please let the SENSEVAL organisers know).

    Martha (delegating again!) will set up the SENSEVAL server at UPenn to implement the downloading, time-stamping regime.

  11. Classification of systems

    In SENSEVAL-1, there was just one two-way classification of systems: supervised, and unsupervised. David pointed out that this overlooked one important distinction: whether there was any human lexicography on a word-by-word basis. Clearly, systems with such input are expected to perform better.

    It was agreed that the classification of systems would note this distinction.

  12. Martha's role

    One outcome of the meeting was that various aspects of the SENSEVAL process are to be centralised at UPenn, thereby giving Martha a key co-ordinating role. To date Adam and Phil have been 'SENSEVAL co-ordinators'. Hereafter, it's Martha, Adam and Phil.


Adam Kilgarriff
Last modified: Fri Oct 20 08:22:10 BST 2000