Scoring in SENSEVAL

1 Introduction

This document contains a set of detailed proposals for how systems should be scored in SENSEVAL. It describes a number of practical difficulties that arise, and proposes a policy for each.

This paper should be read in conjunction with "A proposal for SENSEVAL scoring scheme" by Dan Melamed and Philip Resnik (hereafter MR), which proposes, and presents the mathematics for, a probabilistic approach to scoring. It is available here.

Terminology: a case or instance refers to a corpus instance of a word to be tagged, with associated context. Word will be used to mean a lemma or dictionary headword.

2 What is the gold standard?

2.1 Disjunctions

**Changed since version of 9 July**

Human taggers will sometimes give disjunctive taggings, eg "sense 1 or sense 3". Also, human taggers frequently gave different tags in the first pass. In these cases, the data was sent to a fourth human tagger to edit out errors and wayward interpretations, and to reclassify other differences as disjunctions. Around 15% of gold-standard instances are disjunctions.

PROPOSAL: All scores are calculated in each of three ways:

Minimal
Disjunctive instances not used for scoring
  • PRO: we know that the gold standard is correct
  • CON: does not use all the data; the data we throw away may well be harder than, or have different characteristics from, the data we keep
Generous
Any of a set of disjunctive answers is counted as correct
  • PRO: simple
  • CON: tends to inflate figures
Stingy
Probability mass is split between disjunctive answers (see MR)
  • PRO: more discriminating than Generous
  • CON: makes it (virtually) impossible to score 100%; hard to interpret

Minimal to be used as the "score-of-reference", as it has the simplest interpretation.
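
The three treatments of a disjunctive gold standard can be sketched for a single instance as follows. This is an illustration only: the function name `score_instance`, the representation of the gold standard as a set of tags, and the restriction to a system that returns one tag are all assumptions made here. In particular, the Stingy branch uses an even split (1/k for k gold tags) as a simple stand-in; the scheme actually proposed is the one set out in MR.

```python
from fractions import Fraction
from typing import Optional, Set


def score_instance(gold: Set[str], answer: str, mode: str) -> Optional[Fraction]:
    """Score one single-tag answer against a (possibly disjunctive) gold tag set.

    Returns None when the instance is not used for scoring in this mode.
    """
    if not gold:
        raise ValueError("gold tag set must not be empty")
    hit = answer in gold
    if mode == "minimal":
        # Disjunctive instances are left out of the calculation altogether.
        if len(gold) > 1:
            return None
        return Fraction(int(hit))
    if mode == "generous":
        # Any one of the disjuncts counts as fully correct.
        return Fraction(int(hit))
    if mode == "stingy":
        # ASSUMPTION: the probability mass is split evenly between disjuncts.
        return Fraction(1, len(gold)) if hit else Fraction(0)
    raise ValueError(f"unknown mode: {mode!r}")
```

With a gold standard of "sense 1 or sense 3" and a system answer of sense 3, this sketch leaves the instance out under Minimal, scores 1 under Generous, and scores 1/2 under Stingy.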

2.2 Suffixes

Human taggers will sometimes give tags with suffixes, to indicate nominal or adjectival uses of senses that are not normally nominal or adjectival, or metaphor, or extended-sense, or that the word is being used with one of its dictionary meanings within a Proper noun. Taggers have used suffixes in 8% of cases.

These suffixes will be useful when we come to analyse the gold standard data. But for scoring, the only one it is viable to use is the "in proper noun" suffix, which can be treated as "either the specified sense or PROPER". All other suffixes to be ignored. One consequence is that there is not always a match between the POS of the sense, and the POS of the corpus instance: where a verbal sense is being used adjectivally (eg as a participle), the POS of the sense will be "v" but the POS of the instance will be "a".
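
A minimal sketch of this policy follows. The notation is hypothetical: this document does not define how suffixes are written, so the sketch assumes a tag of the form `sense-suffix` with `p` standing for "in proper noun". The function name `expand_gold_tag` is likewise an assumption.

```python
from typing import Set

# ASSUMPTION: hypothetical suffix code for "used within a proper noun".
PROPER_SUFFIX = "p"


def expand_gold_tag(tag: str) -> Set[str]:
    """Turn one gold tag, possibly suffixed, into the set of acceptable answers."""
    sense, separator, suffix = tag.partition("-")
    if separator and suffix == PROPER_SUFFIX:
        # "In proper noun": either the specified sense or PROPER is acceptable.
        return {sense, "PROPER"}
    # No suffix, or any other suffix: the suffix is ignored for scoring.
    return {sense}
```

Under these assumptions a proper-noun suffix turns a gold tag into a two-way disjunction, which is then handled like any other disjunction.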

2.4 Insufficient context

On occasions there was simply insufficient context for the human taggers to determine which sense applied (though it may have been possible to rule out some senses). They have identified those cases, which are reasonably distinct from cases of disjunction or other forms of uncertainty. PROPOSAL Treat in the same way as other forms of disjunction.

2.5 Senses/subsenses

Where one tagger specifies, eg, sense 3 and the other two, eg, 3.2, there is an argument for settling on 3 as a least-commitment compromise.

I argue against this approach, as it is often not lexicographically valid to treat a sense as including its subsenses, even though their meanings are close (see float, senses 13 and 13.1).

PROPOSAL Treat sense/subsense disagreements in the same way as other sorts of disagreements (except when scoring at the ord-and-main-sense levels only --see next section-- in which case they will collapse to the same main sense).

3 Senses and subsenses

HECTOR distinguishes up to four levels of meaning distinction:
  • homograph or ord level
  • main sense level: this includes distinctions between semantically close items of different word classes, eg nominal vs. verbal shake (core meaning)
  • numbered subsenses (1.1) These are often collocationally specific.
  • lettered (sub)subsenses (1a, 1.1a) These are mostly syntactic, as in, eg, the absolute use of a transitive verb.
Cutting across this hierarchy, lexically distinct phrasal verbs are always treated as distinct at the ord level, and noun compounds are given distinct lexical entries. So, eg, band saw was not in the original lexical entry for band and is therefore outside the hierarchy. (I amalgamated entries for compounds into main entries for SENSEVAL.)

The ord level may look tempting to IR people. However this level of distinction is only used for band, sack, scrap, slight (once its uses for phrasal verbs are excluded) out of the 35 test words.

PROPOSAL:
Two modes of scoring: coarse looks only at ord and main-sense levels, fine looks at all levels. For coarse, all numbered- and lettered-subsense distinctions are collapsed to their parent main sense in both the Gold Standard and the system output. (This "collapsing" in the Gold Standard will mean that the set of instances to be used for Minimal scoring may increase, as subsense-level disjunctions in the gold standard will cease being disjunctions when viewed coarsely.)
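
The collapsing step for coarse scoring can be sketched as below. The sketch assumes sense labels are written in the notation used in the list above (1, 1.1, 1a, 1.1a), so that the main sense is the leading run of digits; the names `to_coarse` and `collapse_gold` are illustrative, and ord-level labels, phrasal verbs and noun compounds are not modelled.

```python
import re
from typing import Set


def to_coarse(sense: str) -> str:
    """Collapse a numbered or lettered subsense label to its parent main sense."""
    match = re.match(r"\d+", sense)
    # Labels with no leading digits (eg PROPER) are passed through unchanged.
    return match.group(0) if match else sense


def collapse_gold(gold: Set[str]) -> Set[str]:
    """Apply the coarse view to every tag in a gold-standard tag set."""
    return {to_coarse(sense) for sense in gold}
```

A gold-standard disjunction between 3 and 3.2 collapses to the single tag 3, so the instance becomes usable for Minimal scoring in coarse mode.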

In both cases all noun compounds and phrasal verbs are treated as distinct, as in HECTOR.

MR scoring pays heed to the hierarchical structure of the entry. However, phrasal verbs and noun compounds are addressed outside the semantic hierarchy, and ord labels are rare, so it is inappropriate to treat any levels above the main-sense level as hierarchical. Lexical entries will not be treated as having any hierarchy above the main sense level.

SENSEVAL will not make any use of the lettered-vs.-numbered subsense distinction.

3.1 Syntax and word-sense interactions

As mentioned before, the 35 words can be divided into three sets:
  1. Five words where the task will include disambiguation between n, v, a, PROPER: (these are band, bitter, hurdle, sanction, shake)
  2. Words where there will be more than one task, corresponding to distinct parts of speech.
  3. The remaining words, where all instances will be of just one POS.
I do not foresee any particular difficulties arising in scoring the different tasks.

Minor word-class distinctions (eg count vs. mass nouns, trans vs. intrans verbs) play no particular role in the scoring. If HECTOR distinguishes them, they are distinct for evaluation purposes, if not, not.

3.2 "Multi-word expressions" (MWEs), eg idioms, compounds, fixed phrases, phrasal verbs etc.

PROPOSAL Whatever HECTOR treats as distinct (at a given level in the sense tree), SENSEVAL also treats as distinct at that level. This seems straightforward.

The one complication the human taggers have noted has only indirect bearing on the scoring, but I mention it here for completeness. It relates to variability in MWEs: eg, should cook in "Too many cooks!" be classified as the idiomatic sense, where the full form of the idiom is "too many cooks spoil the broth"?

3.3 Primacy of semantics over syntax

The HECTOR lexicography takes semantics as primary, with syntax taking a secondary role. Groupings are firstly on the basis of meaning. This seems broadly appropriate for WSD, which is a semantic tagging task. One outcome is that the syntactic coding occurring under the "gr" and "clues" tags in HECTOR lexical entries is not to be taken as definitive. The human taggers have noted many occasions where a corpus instance fits the meaning of a given sense but does not match the grammatical coding. In such cases their instructions have been to give precedence to the meaning. Also, the "HECTOR Lexicographical Policy and Procedures" document does not specify whether the default reading for grammatical codes is that they always apply when a word is being used in that sense, or that they are salient for the sense in some weaker way. The taggers' evidence suggests that the "always" reading would not be appropriate.

Minor word classes as specified under "gr" and "clues" should only be read as indicative, not as a necessary condition for the sense.

4 Form of results that systems return

**Changed since version of 9 July**

Systems may return single tags or multiple tags for an instance, and if multiple tags are returned, these may be either weighted or unweighted. Details of the format for returning results are available here.
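
One way a scorer might bring the three kinds of answer into a common form is sketched below. Everything in it is an assumption made for illustration: the in-memory representation (a string, a collection of strings, or a tag-to-weight mapping) is not the submission format, the name `to_distribution` is hypothetical, and treating unweighted tags as equally weighted is one possible reading rather than a stated policy.

```python
from typing import Dict, Iterable, Mapping, Union

Answer = Union[str, Iterable[str], Mapping[str, float]]


def to_distribution(answer: Answer) -> Dict[str, float]:
    """Normalise a system answer to weights over tags that sum to 1."""
    if isinstance(answer, str):
        # A single tag carries all the weight.
        return {answer: 1.0}
    if isinstance(answer, Mapping):
        # Weighted tags: rescale so that the weights sum to 1.
        total = sum(answer.values())
        if total <= 0:
            raise ValueError("weights must sum to a positive number")
        return {tag: weight / total for tag, weight in answer.items()}
    # Unweighted tags. ASSUMPTION: each distinct tag gets an equal share.
    tags = list(dict.fromkeys(answer))
    if not tags:
        raise ValueError("answer must contain at least one tag")
    return {tag: 1.0 / len(tags) for tag in tags}
```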

5 Dictionary mappings

**Changed since version of 9 July**

Many systems will be disambiguating according to the dictionary they usually use, and then mapping to HECTOR senses. Since such mappings are never entirely one-to-one, the mapping involves information loss.

It is not desirable that systems are scored more harshly because they used a different inventory, and mapped, but, given the problems of dictionary-mapping, it is hard to avoid.

Mappings from WordNet 1.5 and WordNet 1.6 to HECTOR are available.

6 How much of the task did the system do?

Some systems will attempt to disambiguate all instances of all words, others, just nouns, or just verbs, and others again, some smaller and more obscure subsets, eg, "all nouns where they appear as head of object-noun-phrase of a set of high-frequency verbs".

I do not see this as presenting any problem. Research involving fewer person-months, or less experienced researchers, is likely to cover less of the data, but such research teams are to be encouraged to participate. Percentage-correct scores based on 25% of the whole dataset can be compared with ones based on the whole dataset (though clearly, if, eg, nouns are easier than verbs and a system does only nouns, it should be compared with other systems' performance on nouns only).

Other systems may fail to attempt to disambiguate an instance, not because they could not handle that kind of case in principle, but because there was insufficient evidence in that particular case. This is a different sort of issue, relating to IR recall. I do not know if any participating systems will operate in this way.

PROPOSAL: for each word, for each system that could have attempted to disambiguate it, a "percentage attempted" figure is provided alongside the "percentage correct". (Default is 100%.)
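
The two figures could be computed along the following lines. The sketch assumes that "percentage correct" is taken over attempted instances only, which is one reading of the proposal; the function name `percentages` and the use of None to mark an unattempted instance are illustrative choices.

```python
from typing import List, Optional, Tuple


def percentages(outcomes: List[Optional[bool]]) -> Tuple[float, Optional[float]]:
    """Return (percentage attempted, percentage correct) for one word.

    Each outcome is True (correct), False (wrong) or None (not attempted).
    """
    if not outcomes:
        raise ValueError("no instances for this word")
    attempted = [outcome for outcome in outcomes if outcome is not None]
    pct_attempted = 100.0 * len(attempted) / len(outcomes)
    # ASSUMPTION: correctness is measured over attempted instances only.
    pct_correct = 100.0 * sum(attempted) / len(attempted) if attempted else None
    return pct_attempted, pct_correct
```

A system that attempts every instance reports 100% attempted, matching the default.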

7 Types of system

Systems using richer inputs are at an advantage with respect to ones requiring minimal inputs (though the latter are generally closer to being usable in NLP applications).
PROPOSAL As stated in earlier email, three categories of system are distinguished and only comparisons between systems in the same category are valid. The three categories are:
Supervised-training
Supervised-training systems require a substantial quantity (eg over 30) of sense-tagged instances of each word they are to disambiguate.
Other-training
These systems do not require over 30 tagged training instances, but do require a learning phase to be applied for each word to be disambiguated. Such systems apply only to lexical samples, and scaling up from a system which disambiguates instances of, eg, 35 words to one which disambiguates a full vocabulary of, eg, 20,000 ambiguous words has not been done and would be non-trivial.
All-words
All-words systems disambiguate all content words (or, at least, all content words of a given grammatical category) in a text.

8 Case-by-case, word-by-word and global results

All of the above addresses how we score a particular instance. For each instance there are up to four figures: generous or stingy, and coarse or fine.

The percentage correct for each word is then calculated in six ways: minimal, generous and stingy, each either coarse or fine.

There will also be four figures for percentage-applicability: an overall figure, then broken down into three classes of reason for instances not counting:

  1. gold-standard disjunction (for "minimal" score only)
  2. outside the system's specification
  3. the system could not make up its mind

The basic form of the results will be this set of 10 figures for each cell in an N-by-41 grid (where N is the number of participating systems and 41 is the number of tasks, each defined as a word and either a POS or "p", signifying "all-POSes"). Some cells will be empty because that system did not attempt that task. Each of the 41 tasks will be associated with between 47 and 431 instances.
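
As a sketch of how one cell of this grid might be held in memory, the structure below groups the six percentage-correct figures with the four applicability figures. All class, field and type names are hypothetical; only the count and meaning of the figures come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

DISJUNCTION_MODES = ("minimal", "generous", "stingy")
GRAINS = ("coarse", "fine")


@dataclass
class ResultCell:
    """The figures for one system on one task (hypothetical layout)."""

    # Six percentage-correct figures, keyed by (disjunction mode, grain).
    pct_correct: Dict[Tuple[str, str], float]
    # Four applicability figures: overall, then the three reasons for exclusion.
    pct_applicable: float
    pct_gold_disjunction: float  # counts for the "minimal" score only
    pct_outside_spec: float
    pct_undecided: float

    def figures(self) -> List[float]:
        """Return all the figures in a fixed order."""
        correct = [
            self.pct_correct[(mode, grain)]
            for mode in DISJUNCTION_MODES
            for grain in GRAINS
        ]
        return correct + [
            self.pct_applicable,
            self.pct_gold_disjunction,
            self.pct_outside_spec,
            self.pct_undecided,
        ]


# The grid maps (system, task) to a cell; None marks a task not attempted.
Grid = Dict[Tuple[str, str], Optional[ResultCell]]
```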

Global figures, for a class (eg all nouns) or for the whole set can be produced in the same way as word-specific figures. Of course, we should be extremely wary of such figures as they will usually gloss over a multitude of very different results.

Adam Kilgarriff
SENSEVAL Co-ordinator
20 July 1998