WikiBiography Corpus

WikiBiography is a corpus of about 1200 annotated biographies from the German version of Wikipedia. The fully automatic preprocessing produces the following annotation layers:

sentence boundaries
part-of-speech tags
word lemmas
syntactic dependencies
anaphora resolution*
discourse connectives
classified named entities
temporal expressions

* There is only one coreference chain, which links all mentions of the biographee.

The annotation is done with freely available software (see References). To visualize the data and to access and correct the annotation, you should use MMAX2. With the MMAX2 API you can access any annotation layer from your Java programs.
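For readers who want to inspect the stand-off annotation outside of MMAX2, the following is a minimal sketch of how markable spans can be resolved to tokens. It does not use the MMAX2 Java API. The XML snippets, the level name `coref`, and the helper names are assumptions made for illustration; they follow the general shape of MMAX2 word and markable files, and the actual corpus files may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents; the real corpus may use other layer and attribute names.
WORDS_XML = """<words>
  <word id="word_1">Niels</word>
  <word id="word_2">Bohr</word>
  <word id="word_3">war</word>
  <word id="word_4">Physiker</word>
  <word id="word_5">.</word>
  <word id="word_6">Er</word>
  <word id="word_7">lehrte</word>
  <word id="word_8">in</word>
  <word id="word_9">Kopenhagen</word>
  <word id="word_10">.</word>
</words>"""

MARKABLES_XML = """<markables xmlns="www.eml.org/NameSpaces/coref">
  <markable id="markable_1" span="word_1..word_2" mmax_level="coref"/>
  <markable id="markable_2" span="word_6" mmax_level="coref"/>
</markables>"""


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def load_words(xml_text):
    """Map word id -> token text, keeping document order."""
    root = ET.fromstring(xml_text)
    return {w.get("id"): w.text for w in root.iter("word")}


def expand_span(span, order):
    """Expand a span such as 'word_1..word_3,word_5' into a list of word ids."""
    ids = []
    for part in span.split(","):
        if ".." in part:
            first, last = part.split("..")
            ids.extend(order[order.index(first):order.index(last) + 1])
        else:
            ids.append(part)
    return ids


def load_mentions(markables_xml, words):
    """Return the surface text of every markable in one annotation layer."""
    order = list(words)
    root = ET.fromstring(markables_xml)
    mentions = []
    for el in root.iter():
        if local_name(el.tag) == "markable":
            ids = expand_span(el.get("span"), order)
            mentions.append(" ".join(words[i] for i in ids))
    return mentions


words = load_words(WORDS_XML)
mentions = load_mentions(MARKABLES_XML, words)
print(mentions)  # ['Niels Bohr', 'Er']
```

The same span expansion would apply to any other layer, since each layer refers to the shared word ids rather than to character offsets.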

Screenshots

Orange and green fonts are used for temporal expressions (e.g. “7. Oktober 1885”, “später”) and locations (e.g. “Kopenhagen”, “Dänemarks”), respectively. People other than the biographee (e.g. “Christian Bohr”, “Harald Bohr”) are highlighted in light blue. Mentions of the biographee (e.g. “Niels Henrik David Bohr”, “er”, “Niels Bohr”) are highlighted in red. The annotation of a selected word (e.g. “Professor”) is displayed in a separate window. The head of the selected word is then highlighted in grey, and an arc from the dependent word to its head is displayed.

Code Sample

Download

Click here to download WikiBiography.

References

A CPAN Perl module is used for sentence boundary identification.

The TnT tagger is used for PoS tagging:
Brants, T.: 2000, ‘TnT – A statistical Part-of-Speech tagger’. In: Proceedings of the 6th Conference on Applied Natural Language Processing, Seattle, Wash., 29 April – 4 May 2000. pp. 224-231.

TreeTagger is used for lemmatization:
Schmid, H.: 1997, ‘Probabilistic part-of-speech tagging using decision trees’. In: D. Jones and H. Somers (eds.): New Methods in Language Processing. London, UK: UCL Press, pp. 154-164.

The WCDG parser is used for dependency parsing:
Foth, K. and W. Menzel: 2006, ‘Hybrid parsing: Using probabilistic models as predictors for a symbolic parser’. In: Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17-21 July 2006. pp. 321-327.

Discourse connectives are identified by matching the text against a list of about 300 connectives from IDS Mannheim.
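List-based connective identification can be sketched as a longest-match lookup over the tokens of a sentence. The handful of connectives below is only a stand-in for the IDS Mannheim list, which is not reproduced here, and the function name and the `max_len` parameter are hypothetical. The actual procedure used for the corpus may differ, for instance in how ambiguous words are handled.

```python
# A few German connectives as a stand-in for the real list of about 300 entries.
SAMPLE_CONNECTIVES = {"weil", "aber", "obwohl", "nachdem", "so dass"}


def find_connectives(tokens, connectives, max_len=3):
    """Return (start, end, connective) triples, preferring the longest match."""
    lowered = [t.lower() for t in tokens]
    hits = []
    i = 0
    while i < len(lowered):
        # Try the longest candidate first so that multi-word connectives win.
        for n in range(min(max_len, len(lowered) - i), 0, -1):
            candidate = " ".join(lowered[i:i + n])
            if candidate in connectives:
                hits.append((i, i + n, candidate))
                i += n
                break
        else:
            i += 1
    return hits


sentence = ["Er", "arbeitete", "viel", ",", "so", "dass", "er", "bekannt", "wurde", "."]
print(find_connectives(sentence, SAMPLE_CONNECTIVES))  # [(4, 6, 'so dass')]
```

A pure lookup of this kind marks every occurrence of a listed word, so words that are only sometimes used as connectives would need an additional check that this sketch does not attempt.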

Temporal expressions are identified with a set of templates. Named entities are classified as person, location or organization based on the information from Wikipedia.
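Template-based identification of temporal expressions can be illustrated with a few regular expressions. The three templates below are assumptions chosen to cover the examples mentioned in the Screenshots section (“7. Oktober 1885”, “später”); the template set actually used for the corpus is not documented here and is likely larger.

```python
import re

MONTHS = ("Januar|Februar|März|April|Mai|Juni|Juli|August|"
          "September|Oktober|November|Dezember")

# Hypothetical templates: full dates, bare years, and a few temporal adverbs.
TEMPLATES = [
    ("date", re.compile(rf"\b\d{{1,2}}\.\s(?:{MONTHS})\s\d{{4}}\b")),
    ("year", re.compile(r"\b(?:1[0-9]|20)\d{2}\b")),
    ("adverb", re.compile(r"\b(?:später|danach|zuvor)\b", re.IGNORECASE)),
]


def find_temporal_expressions(text):
    """Return (label, matched text) pairs without overlapping matches."""
    candidates = []
    for label, pattern in TEMPLATES:
        for m in pattern.finditer(text):
            candidates.append((m.start(), m.end(), label, m.group()))
    # Earlier matches first; at the same start position, longer matches first.
    candidates.sort(key=lambda c: (c[0], -(c[1] - c[0])))
    result, last_end = [], 0
    for start, end, label, value in candidates:
        if start >= last_end:  # skip matches nested inside an accepted one
            result.append((label, value))
            last_end = end
    return result


text = "Niels Bohr wurde am 7. Oktober 1885 in Kopenhagen geboren. Später zog er nach England."
print(find_temporal_expressions(text))
# [('date', '7. Oktober 1885'), ('adverb', 'Später')]
```

Dropping nested matches keeps the year inside a full date from being reported a second time as a separate expression.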
