Present: Nicoletta Calzolari (representing 'languages other than English'), Phil Edmonds, Adam Kilgarriff (chair), Martha Palmer, David Yarowsky (representing participants), Antonio Zampolli (for part of meeting)
We opted for a two-day workshop, with the first day, on Thurs 5th, open to participants only, and the second day fully open (subject to this being acceptable to ACL and the senseval-discuss list).
We noted that US participants committed to spending 4th July with their families in the US would have trouble getting there for the first day. We thought it unlikely this would affect many people, but await responses/objections following the mailouts.
Tagged corpora: Martha and Scott Cotton at UPenn will work out a standard for the tagged datasets, using Linguistic Data Consortium (LDC) best practice and XML Corpus Encoding Standard (XCES) as appropriate.
Format for participants to return answers in: it was not necessary for participants to return fully tagged corpora. The quantities of data being passed around would be large, and this might well cause problems. The format for answers, for all languages, would be exactly as for SENSEVAL-1-English, that is, one-line-per-answer, with the line comprising a unique reference for the token being tagged, and one or more sense-identifiers, optionally associated with probabilities. Field separator is 'space', separator between a sense-identifier and its probability is ':'. Full specification, as drawn up by Joseph Rosenzweig, is already available from SENSEVAL-1 on the SENSEVAL website. (In SENSEVAL-1-Eng, the unique reference comprised two parts; the "task name" and the reference number within that task. This model will be reviewed, particularly in relation to the all-words task.)
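The answer format described above can be illustrated with a small parser. This is a minimal sketch only, not Joseph Rosenzweig's specification: the function name parse_answer_line, the ref_fields parameter and the example identifiers (shake-v, 0042, sense_a, sense_b) are all made up for illustration, and it assumes that sense-identifiers never contain ':' themselves. The two-part reference follows the SENSEVAL-1-Eng model, which is under review.

```python
def parse_answer_line(line, ref_fields=2):
    """Parse one answer line into (reference, {sense_id: probability or None}).

    ref_fields is the number of leading fields forming the unique reference
    (2 in SENSEVAL-1-Eng: task name and reference number). Hypothetical helper.
    """
    fields = line.split()  # field separator is a space
    if len(fields) <= ref_fields:
        raise ValueError("line needs a reference and at least one sense-identifier")
    reference = tuple(fields[:ref_fields])
    senses = {}
    for token in fields[ref_fields:]:
        # ':' separates a sense-identifier from its optional probability
        sense_id, sep, prob = token.partition(":")
        senses[sense_id] = float(prob) if sep else None
    return reference, senses


# Made-up example lines, with and without probabilities
print(parse_answer_line("shake-v 0042 sense_a:0.7 sense_b:0.3"))
print(parse_answer_line("shake-v 0043 sense_a"))
```

Keeping the length of the reference as a parameter means the same sketch would still apply if the all-words task ends up with a different reference scheme.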
We would like as many language tasks as possible to adopt the standard formatting. Phil will communicate this to the task organisers, and will also assist the task organisers in adopting the formatting.
Scoring: The same scoring software to be used for all language tasks. Penn will set up a web-based scorer, where participants can submit two files, typically 'their system's answers' and 'the key/correct answers' and get back a score. Joseph R's scoring software from SENSEVAL-1 to be re-used. Web page describing how to use the web scoring software, what options there are, etc, to be provided.
We should aim to get the tagged corpora format and web-based scorer running by Christmas, as they are key for people to be able to start working on their systems.
Options for 'truth': All language tasks should have the same possibilities for the gold standard. In addition to the senses that the dictionary identifies for the word, for English SENSEVAL-1 the three tags PROPER-NAME, TYPO and UNASSIGNABLE (P, T, U) were always available. Also, a disjunction of tags (where disjuncts can be P, T, U as well as dictionary tags) was acceptable. This model was proposed for all languages for SENSEVAL-2. Guidelines for when human taggers should specify disjunctions, or P, T and U tags, are available.
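To make the disjunction model concrete, here is one way an answer could be checked against a gold standard in which several tags (dictionary senses or P, T, U) are acceptable. This is a sketch under stated assumptions, not the SENSEVAL-1 scoring software: the credit rule (probability mass placed on any acceptable tag), the equal sharing of weight when no probabilities are given, and the names score_instance, sense_a and sense_b are all assumptions made for illustration.

```python
def score_instance(answer, gold_tags):
    """Return the credit for one instance (hypothetical scoring rule).

    answer    -- dict mapping each proposed tag to a probability, or to None
    gold_tags -- set of acceptable tags, i.e. the gold-standard disjunction
    """
    if not answer:
        return 0.0
    # Assumption: tags given without a probability share the weight equally.
    weights = {
        tag: (prob if prob is not None else 1.0 / len(answer))
        for tag, prob in answer.items()
    }
    # Assumption: credit is the weight placed on any tag in the disjunction.
    return sum(weight for tag, weight in weights.items() if tag in gold_tags)


# Gold standard allows either sense_a or the UNASSIGNABLE tag
print(score_instance({"sense_a": 0.7, "sense_b": 0.3}, {"sense_a", "U"}))
print(score_instance({"sense_a": None, "sense_b": None}, {"sense_b"}))
```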
As for English SENSEVAL-1, the English lexical sample is a stratified random sample, comprising high, medium and low polysemy and frequency nouns, verbs and adjectives (adjusted to ensure overlap between lexical sample and all-words words).
For the 'all-words' task (which will be undertaken for English and Italian, though we do not expect it to be set up in many other languages) - sense-tag all predicates, nouns which are heads of noun-phrase arguments to those predicates, and adjectives modifying those nouns, in a text. [[Martha, have I got this right?]] The text(s) to be used will add up to 5,000 running words, of which about 2,000 are predicates, etc, which will be tagged.
Lexical sample: The following formula was adopted:
For each word (lemma) in the sample
- if it has n senses
- we tag 75 + 15n instances (so, eg, we have 120 instances for a 3-sense word, and 225 instances for a 10-sense word).
These are then divided in a ratio of 2:1 between training set and test set:
- training set: 50+10n (80 instances for a 3-sense word)
- test set: 25+5n (40 instances for a 3-sense word).
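The formula above can be checked with a few lines of code. This is just an illustrative sketch; the function name instance_counts is made up, and the numbers come straight from the formula (50+10n for training, 25+5n for test, 75+15n in total).

```python
def instance_counts(n_senses):
    """Number of instances to tag for a lemma with n_senses senses."""
    training = 50 + 10 * n_senses
    test = 25 + 5 * n_senses
    # training + test equals 75 + 15n, a 2:1 split
    return {"total": training + test, "training": training, "test": test}


for n in (3, 10):
    print(n, instance_counts(n))
# 3  -> total 120, training 80, test 40
# 10 -> total 225, training 150, test 75
```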
For languages other than English, this can be seen as a goal, which it may or may not be possible to realise, for example because there are not enough occurrences in the corpus.
[Post-meeting addition by AK:
Generally, n, the number of senses, should be counted looking at the finest-grain distinctions the dictionary offers. Also, multiword expressions containing the lemma (in any of its inflectional forms), if present in the dictionary, should be counted in n. For words which feature in many multiwords and where the dictionary lists many multiwords (eg, the WordNet entry for lemon, as discussed in the run-up to SENSEVAL-1), this will give rise to disconcertingly large values of n. Where this happens, the formulae may well need revision, perhaps along the lines of
The target lexical sample size is 100 (50 nouns, 25 verbs, 25 adjectives), but if the budget will not stretch that far, the numbers of lemmas in the sample should be reduced, with the formulae for numbers of instances per lemma held constant.
David suggested that, for maybe two words, there should be far more training data (say, 500 instances) so that performance improvement with increasing training set size can be investigated. Resource permitting, this will be done for English.
Treebank material will be provided only for testing. There will not be a training corpus of manually-tagged, treebanked data, beyond what Martha had already put in the public domain (as in SIGLEX-99 and ELRA2-2000 papers) and small samples to show the markup conventions, for Italian and, if the conventions have changed at all, for English.
Adam to do some trialling of both the speedup and the risk.
For English, there would be two tagging teams, one working on the all-words task at UPenn, the other on the lexical sample task in the UK. To gauge consistency between the two, the UK taggers would tag the all-words instances of the lexical sample words and the US ones would tag some of the UK lexical sample data.
It is desirable that the text type of the evaluation data is a text type for which there is plenty of data easily obtainable. If, for example, evaluation data is taken from Le Monde, that means that teams wanting lots of data to train on can simply go out and buy CD-ROMs of Le Monde, without task co-ordinators needing to get involved. (That is the position for unsupervised training, eg, where the system trains on raw data. For supervised training, which requires sense-tagged data as input, the situation is different as the resource will never be in ample supply.)
For English lexical-sample, Adam is considering using the web as a source of evaluation data, probably constrained to be, eg, within a certain domain area within the mozilla.com or yahoo.com directory structures. (This more experimental approach to be balanced by another subset of the data probably taken from the LDC NANews corpus, and/or the BNC. The wider the range of sources for data selection, the more likely we are to get senses that aren't covered in the inventory: there is likely to be more use of the "unassignable" tag.)
For lexical-sample tasks, one important parameter is the quantity of context for each instance to be tagged. This was just two sentences in English SENSEVAL-1, which was not adequate to deploy some algorithms. A suitable target is, roughly, a 'page', or 100 words (where no suitable textual unit is locatable). This has implications for the bulk of datasets to be distributed. If there are 100 items in the lexical sample, an average of 150 instances per lemma, and 100 words of context per instance, the dataset size is 1.5M words - for English, of the order of 10MB uncompressed, or 3MB compressed.
While this is a substantial quantity of data to download, we felt that, given a high bandwidth Internet connection (which we believe all participants would have), this was not a problem.
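The size estimate above is simple arithmetic, sketched here so that task organisers can plug in their own figures. The function name dataset_size is made up, and the figure of 6 bytes per word (an average English word plus a space) is an assumption chosen only to show how 1.5M words comes out at the order of 10MB; it was not discussed at the meeting.

```python
def dataset_size(lemmas=100, instances_per_lemma=150, context_words=100,
                 bytes_per_word=6):
    """Return (running words, approximate uncompressed megabytes)."""
    words = lemmas * instances_per_lemma * context_words
    # bytes_per_word is an assumed average, not a measured value
    megabytes = words * bytes_per_word / 1_000_000
    return words, megabytes


words, megabytes = dataset_size()
print(words, megabytes)  # 1500000 words, 9.0 MB with the assumed 6 bytes/word
```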
[Post-meeting addition by AK:
Following discussions related to Escudero, Marquez and Rigau's paper at EMNLP, and suggestions from Ken Church and John Carroll, we are considering a cross-domain aspect to the task, where the data is taken from two distinct domains, and the WSD system trained on the one domain is evaluated on the other, and vice versa. Contributions/responses to this idea most welcome!
]
Date-stamping: We decided to use an automatic date-stamping procedure to ensure that all participants got long enough, but not too long, to work with the data (even if they happened to be away from the office for part of the critical period). So there will be a six-week critical period. Within that period, each participant, for each language they are participating in:
It seems appropriate that this plan, and the same server, are used across all language tasks. That means that Univ of Pennsylvania would be the hub for all data distribution (as well as results collecting, and scoring). This seems an efficient way to proceed. However, if organisers for other tasks would rather distribute their own data, that will be fine (please let senseval organisers know).
Martha (delegating again!) will set up the SENSEVAL server at UPenn to implement the downloading, time-stamping regime.
It was agreed that the classification of systems would note this distinction.