BulSemCor

The Bulgarian Sense-annotated Corpus (BulSemCor; Bulgarian: Български семантично анотиран корпус, БулСемКор) is a structured corpus of Bulgarian texts in which each lexical item is assigned a sense tag. BulSemCor was created by the Department of Computational Linguistics[1] at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences.

Structure

BulSemCor was created as part of a nationally funded project titled BulNet – A lexico-semantic network for the Bulgarian Language (2005–2010). It follows the general methodology of SemCor,[2] combined with some specific principles.[3] The corpus for annotation consists of 101,791 tokens excerpted from the Bulgarian "Brown" Corpus,[4] which is modelled on the Brown Corpus of Francis and Kučera. An important feature of BulSemCor is that the samples were selected using heuristics that provide optimal coverage of ambiguous lexis.

BulSemCor is manually sense-annotated according to the Bulgarian WordNet. Its size is comparable to that of other contemporary semantically annotated corpora. The semantic annotation consists of associating each lexical item in the corpus with exactly one synonym set (synset) in the Bulgarian WordNet, namely the one that best describes its sense in the particular context. The best match among the suggested candidates is selected according to a set of criteria, such as the other synset members, the synset gloss (explanatory definition), and the position of a given candidate in the WordNet structure.
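The one-synset-per-item scheme described above can be pictured with a small data model. This is only an illustrative sketch: the class names, field names, synset identifiers, and glosses below are hypothetical and are not taken from BulSemCor or the Bulgarian WordNet.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Synset:
    """A synonym set; all field names here are hypothetical."""
    synset_id: str             # made-up identifier, not a real WordNet ID
    members: Tuple[str, ...]   # the literals that share this sense
    gloss: str                 # explanatory definition


@dataclass(frozen=True)
class AnnotatedItem:
    """A lexical item (simple word or multiword expression) with one sense."""
    tokens: Tuple[str, ...]    # one token for a simple word, several for an MWE
    synset: Synset             # exactly one synset per item


def token_count(items: List[AnnotatedItem]) -> int:
    """Total number of corpus tokens covered by the annotated items."""
    return sum(len(item.tokens) for item in items)


# Toy entries with invented glosses, for illustration only.
book = Synset("toy-0001", ("книга",), "illustrative gloss for 'book'")
railway = Synset("toy-0002", ("железопътна линия",), "illustrative gloss for 'railway line'")

items = [
    AnnotatedItem(tokens=("книга",), synset=book),
    AnnotatedItem(tokens=("железопътна", "линия"), synset=railway),
]

print(len(items), "items covering", token_count(items), "tokens")
```

Treating a multiword expression as a single item that spans several tokens is what lets the count of items and the count of tokens differ.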

Scale

The number of annotated tokens is 99,480. This is lower than the token count of the initial corpus because some tokens are not linguistic items. Of the annotated tokens, 86,842 are simple words and 12,638 belong to 5,797 multiword expressions (MWEs).
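The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text; the variable names are illustrative, and the derived values (non-linguistic tokens, annotated units, average MWE length) are computed here rather than reported by the source.

```python
# Figures quoted in the text.
initial_tokens = 101_791    # tokens in the corpus selected for annotation
annotated_tokens = 99_480   # tokens that received a sense annotation
simple_words = 86_842       # single-token annotated items
mwe_count = 5_797           # multiword expressions
mwe_tokens = 12_638         # tokens belonging to those expressions

# Simple words plus MWE tokens should account for every annotated token.
assert simple_words + mwe_tokens == annotated_tokens

# Derived values (computed, not stated in the source).
non_linguistic = initial_tokens - annotated_tokens   # tokens left unannotated
annotated_units = simple_words + mwe_count           # items carrying a sense tag
avg_mwe_length = mwe_tokens / mwe_count              # mean tokens per MWE

print(non_linguistic, annotated_units, round(avg_mwe_length, 2))
```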

Specific features

All words in BulSemCor are assigned a sense, whereas established practice is to annotate only simple content words, or only certain content word classes (typically nouns and verbs). Since 2000, the development of language resources has broadened to include annotation of function words and multiword expressions, covering particular senses or types of words and expressions. In this respect, BulSemCor's annotation is more exhaustive and hence provides greater opportunities for linguistic observations and natural language processing (NLP) applications.

Annotated items inherit the linguistic information associated with the corresponding synset, which, along with morphological and semantic tags, may include annotation on one or more additional levels.[5]

This article is issued from Wikipedia - version of the 11/19/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.