WikiBiography Corpus

WikiBiography Click here to download WikiBiography.

WikiBiography is a corpus of about 1200 annotated biographies from the German version of Wikipedia. Fully automatic preprocessing includes the following:

sentence boundaries
part-of-speech tags
word lemmas
syntactic dependencies
anaphora resolution*
discourse connectives
classified named entities
temporal expressions

* there is only one coreference chain which links all mentions of the biographee. The annotation is done with freely available software (see references). To visualize the data and access and correct the annotation you should use MMAX2. With MMAX2 API you can access any layer of annotation from your Java programs.

Screenshots

Orange and green fonts are used for temporal expressions (e.g. “7. Oktober 1885”, “später”) and locations (e.g. “Kopenhagen”, “Dänemarks”) respectively. People other than the biographee (e.g. “Chtistian Bohr”, “Harald Bohr”) are highlighted with light-blue. Mentions of the biographee are highlighted with red (e.g. “Niels Henrik David Bohr”, “er”, “Niels Bohr”). The annotation of a selected word (e.g. “Professor”) is displayed in a separate window. The head of the word is highlighted with grey colour then and an ark from the dependent word to its head is displayed.

Code Sample

Download

Click here to download WikiBiography.

References

A CPAN Perl module is used for sentence boundaries identification.

TNT tagger is used for PoS-tagging:
Brants, T.: 2000, ‘TnT – A statistical Part-of-Speech tagger’. In: Proceedings of the 6th Conference on Applied Natural Language Processing, Seattle, Wash., 29 April – 4 May 2000. pp. 224-231.

TreeTagger is used for lemmatization:
Schmid, H.: 1997, ‘Probabilistic part-of-speech tagging using decision trees’. In: D. Jones and H. Somers (eds.): New Methods in Language Processing. London, UK: UCL Press, pp. 154-164.

WCDG parser is used for dependency parsing:
Foth, K. and W. Menzel: 2006, ‘Hybrid parsing: Using probabilistic models as predictors for a symbolic parser’. In: Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17-21 July 2006. pp. 321-327.

A list of about 300 connectives from IDS Mannheim is used to identify these connectives in our corpus.

Temporal expressions are identified with a set of templates. Named entities are classified as person, location or organization based on the information from Wikipedia.