NLP Group


POS Annotation for ICSI Meeting Recorder Data

Here you can find the part-of-speech (POS) annotation for the ICSI Meeting Recorder Data. Please note that the files contain only the POS information and no word information, so you need to already have the ICSI corpus to use this data. When using this data, please cite the following paper:
Margot Mieskes and Michael Strube: Part-of-Speech Tagging of Transcribed Speech. In Proceedings of the 5th Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, May 22-28, 2006 (PDF).
This paper also contains a description of the method used and detailed results.
The format in the .txt files is one segment per line.
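
As a minimal sketch of how such a file could be read, the following Python snippet assumes that each line holds the whitespace-separated POS tags of one segment. The tag separator and the sample tags are assumptions for illustration; they are not documented on this page. Empty lines are kept as empty segments so that line numbers stay aligned with the segments of the ICSI corpus.

```python
from typing import Iterable, List


def read_pos_segments(lines: Iterable[str]) -> List[List[str]]:
    """Return one list of POS tags per line (one segment per line).

    Assumption: tags within a line are separated by whitespace.
    Empty lines yield empty lists so that segment alignment is preserved.
    """
    return [line.split() for line in lines]


# Two made-up segments; the tags are illustrative only.
sample = ["PRP VBP DT NN\n", "UH\n"]
print(read_pos_segments(sample))
```

To read from disk, an open file object can be passed directly, since iterating over it yields lines.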

The Gold Standard files are: Bed016


Here you find the Gold Standard manual annotation in .txt format.

Here you find the Gold Standard manual annotation in .mmax format.

Here you find the automatic POS annotation for the Gold Standard after retraining the four taggers on the manual data in .txt format.

Here you find the automatic POS annotation for the Gold Standard after retraining the four taggers on the manual data in .mmax format.

Here you find the automatic POS annotation for the whole corpus after retraining the four taggers on the manual data in .txt format.

Here you find the automatic POS annotation for the whole corpus after retraining the four taggers on the manual data in .mmax format.

The four taggers used were the following:

TBL Tagger: Eric Brill. Some advances in transformation-based part of speech tagging. In Proceedings of the 12th National Conference on Artificial Intelligence, Seattle, Washington, 1-4 August 1994, pp. 722-727.

TnT Tagger: Thorsten Brants. TnT – A statistical part-of-speech tagger. In Proceedings of the 6th International Conference on Applied Natural Language Processing, Seattle, Washington, 29 April – 4 May 2000, pp. 224-231.

Stanford NLP Library Tagger: Kristina Toutanova and Christopher D. Manning. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, Hong Kong, 2000, pp. 63-70.

Stanford NLP Library Tagger: Kristina Toutanova, Dan Klein, Christopher D. Manning and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, Alberta, Canada, 27 May – 1 June 2003, pp. 252-259.


WikiBiography Corpus

Click here to download WikiBiography.

WikiBiography is a corpus of about 1200 annotated biographies from the German version of Wikipedia. Fully automatic preprocessing includes the following:

  • sentence boundaries
  • part-of-speech tags
  • word lemmas
  • syntactic dependencies
  • anaphora resolution*
  • discourse connectives
  • classified named entities
  • temporal expressions

* There is only one coreference chain, which links all mentions of the biographee. The annotation is done with freely available software (see references). To visualize the data and to access and correct the annotation, you should use MMAX2. With the MMAX2 API you can access any layer of annotation from your Java programs.


Orange and green fonts are used for temporal expressions (e.g. “7. Oktober 1885”, “später”) and locations (e.g. “Kopenhagen”, “Dänemarks”), respectively. People other than the biographee (e.g. “Christian Bohr”, “Harald Bohr”) are highlighted in light blue. Mentions of the biographee are highlighted in red (e.g. “Niels Henrik David Bohr”, “er”, “Niels Bohr”). The annotation of a selected word (e.g. “Professor”) is displayed in a separate window. The head of the selected word is then highlighted in grey, and an arc from the dependent word to its head is displayed.





A CPAN Perl module is used for sentence boundary identification.

The TnT tagger is used for PoS tagging:
Brants, T.: 2000, ‘TnT – A statistical Part-of-Speech tagger’. In: Proceedings of the 6th Conference on Applied Natural Language Processing, Seattle, Wash., 29 April – 4 May 2000. pp. 224-231.

TreeTagger is used for lemmatization:
Schmid, H.: 1997, ‘Probabilistic part-of-speech tagging using decision trees’. In: D. Jones and H. Somers (eds.): New Methods in Language Processing. London, UK: UCL Press, pp. 154-164.

The WCDG parser is used for dependency parsing:
Foth, K. and W. Menzel: 2006, ‘Hybrid parsing: Using probabilistic models as predictors for a symbolic parser’. In: Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, 17-21 July 2006. pp. 321-327.

A list of about 300 connectives from IDS Mannheim is used to identify discourse connectives in our corpus.

Temporal expressions are identified with a set of templates. Named entities are classified as person, location or organization based on the information from Wikipedia.
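
To illustrate what a template for temporal expressions might look like, here is a minimal Python sketch of a single pattern for full German dates such as “7. Oktober 1885”. The regular expression and the function name are hypothetical; the actual set of templates used for WikiBiography is not described on this page, and relative expressions such as “später” would need templates of their own.

```python
import re

# Hypothetical template for full German dates such as "7. Oktober 1885".
# This is an illustration, not one of the templates used for the corpus.
MONTHS = ("Januar|Februar|März|April|Mai|Juni|Juli|August|"
          "September|Oktober|November|Dezember")
DATE_TEMPLATE = re.compile(r"\b(\d{1,2})\.\s+(" + MONTHS + r")\s+(\d{4})\b")


def find_dates(text: str):
    """Return a (day, month, year) tuple for every match of the template."""
    return [(int(day), month, int(year))
            for day, month, year in DATE_TEMPLATE.findall(text)]


print(find_dates("Niels Bohr wurde am 7. Oktober 1885 in Kopenhagen geboren."))
```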



CoCoBi is a Corpus of Comparable Biographies in German and contains 400 annotated biographies of 141 famous people. Automatic annotation was done in the same way and with the same tools as in WikiBiography (see this page for more details). Biographies come from different sources, mainly from Wikipedia and the Brockhaus Lexikon. The part from the Brockhaus Lexikon is used with their generous permission.

Download

Click here to download CoCoBi.


Evaluation Metrics for End-to-End Coreference Resolution

Commonly used coreference resolution evaluation metrics can only be applied to key mentions (i.e. already annotated mentions). We propose two variants of the BCubed (Bagga and Baldwin, 1998) and CEAF (Luo, 2005) coreference resolution evaluation algorithms which can be used to evaluate coreference resolution systems dealing with system mentions (i.e. automatically determined mentions). The algorithms and the relevant analysis are described in detail in our SIGDIAL 2010 paper. Both the BCubedsys and CEAFsys Java classes are available for download, along with a couple of necessary parent classes. They should be used within the BART framework, whose repository is available here.
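
For orientation, the following Python sketch computes the plain BCubed score of Bagga and Baldwin (1998) for the key-mention setting, i.e. under the assumption that key and response contain exactly the same mentions. It is not the BCubedsys variant and not the BART Java code; the function name and the toy mentions are made up for illustration. It shows the baseline that the system-mention variants extend to the case where key and response mentions differ.

```python
from typing import Hashable, List, Set, Tuple


def b_cubed(key: List[Set[Hashable]],
            response: List[Set[Hashable]]) -> Tuple[float, float, float]:
    """Plain BCubed precision, recall and F1.

    Assumption: key and response are partitions of the same, non-empty
    set of mentions (no twinless mentions on either side).
    """
    key_of = {m: chain for chain in key for m in chain}
    resp_of = {m: chain for chain in response for m in chain}
    if not key_of or set(key_of) != set(resp_of):
        raise ValueError("key and response must cover the same mentions")

    n = len(key_of)
    # Per-mention scores: overlap of the two chains containing the mention,
    # divided by the size of the response chain (precision) or of the
    # key chain (recall), then averaged over all mentions.
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in key_of) / n
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in key_of) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


key = [{"a", "b", "c"}, {"d"}]
response = [{"a", "b"}, {"c", "d"}]
print(b_cubed(key, response))
```

In this toy example the precision is 0.75 and the recall is 2/3.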


Click here to download BCubedsys and CEAFsys.

Publications related to evaluation metrics for end-to-end coreference resolution

  • SIGDIAL’10 (PDF)
    Cai, Jie; Strube, Michael (2010).
    Evaluation metrics for end-to-end coreference resolution systems.
    In Proceedings of the SIGdial 2010 Conference: The 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Tokyo, Japan, 24-25 September 2010, pages 28-36.



Click here to download binary relations obtained from processing Wikipedia category names and the category and page network. They were generated by Benjamin Boerschinger based on the technique described in the paper below, without the use of a parser (based only on POS information).

Publications related to WikiRelations:

AAAI ’08 (PDF)
Nastase, Vivi and Strube, Michael (2008).
Decoding Wikipedia Categories for Knowledge Acquisition.
In: Proceedings of the 22nd National Conference on Artificial Intelligence, Chicago, Ill., 13-17 July 2008, pages 1219-1224.



Download WikiNet


We are developing WikiNet, a multi-language ontology built by exploiting several aspects of Wikipedia. If you want to make your own WikiNet, here are the (Perl) scripts. Click here to download the current version, built from the 20120104 version of the English Wikipedia, with added lexicalizations from the Dutch (20120119), French (20120117), German (20120115), Italian (20120126), Arabic (20120123), Bulgarian (20120129), Farsi (20120124), Japanese (20120121), Korean (20120122), Russian (20120130), Turkish (20120124) and Chinese (20120128) versions (it also contains lexicalizations in many more languages; check the language statistics file for details). It contains a direct and a reversed index (both multi-lingual), a file with relations, definitions and more. The structure is as follows:

  • direct index: ConceptName ConceptID1 ConceptID2 …
  • reversed index: ConceptID1 NEType ConceptName1 ConceptName2 …
  • relations file: ConceptID1 Relation1 ConceptID11 ConceptID12 … ConceptID1n Relation2 ConceptID21 ConceptID22 …
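
The line layouts above can be parsed along the following lines. This is a minimal Python sketch under several assumptions that the listing does not confirm: that direct-index fields are tab-separated (so concept names may contain spaces), that relation-file fields are whitespace-separated, that concept IDs are purely numeric, and that relation names are not. The function names and the sample relation names are hypothetical; the README in the download is the authoritative description.

```python
from typing import Dict, List, Tuple


def parse_direct_index_line(line: str) -> Tuple[str, List[str]]:
    """'ConceptName ConceptID1 ConceptID2 ...' -> (name, [ids]).

    Assumption: fields are separated by tabs.
    """
    name, *ids = line.rstrip("\n").split("\t")
    return name, ids


def parse_relations_line(line: str) -> Tuple[str, Dict[str, List[str]]]:
    """'ID Rel1 ID11 ID12 ... Rel2 ID21 ...' -> (source, {relation: [targets]}).

    Assumptions: whitespace-separated fields, numeric concept IDs,
    non-numeric relation names.
    """
    source, *rest = line.split()
    relations: Dict[str, List[str]] = {}
    current = None
    for token in rest:
        if token.isdigit():
            if current is None:
                raise ValueError("target ID found before any relation name")
            relations[current].append(token)
        else:
            # A non-numeric token starts the target list of a new relation.
            current = token
            relations.setdefault(current, [])
    return source, relations


# Made-up lines; the relation names are illustrative only.
print(parse_direct_index_line("Niels Bohr\t12\t345\n"))
print(parse_relations_line("12 IS_A 34 56 BORN_IN 78\n"))
```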

More details are available in a README file, the relation statistics, the language statistics (number of lexicalizations and number of entries covered for each language represented), and a paper. Additional files include in-going and out-going links between concepts, corresponding to the hyperlinks in the article bodies.

There are approximately 3 million concepts, and 38+ million relations.

We have a toolkit for visualizing and extracting information from WikiNet: WikiNetTK.

A precursor of the resource in simple text format (in English) is WikiRelations.


WikiNetTK is a tool that allows you to visualize WikiNet and embed it in your NLP applications. Below are a few screenshots from the visualization component.

Starting point — choose the concept to visualize, by inputting first the name, and then choosing from the candidates found the one you want:

Expand the relations surrounding a concept:

Visualize and browse information for a concept in text format:

Visualize the paths between concept pairs:

Dependencies and selectional preferences

Download here a description of concepts in terms of their grammatical relations to open-class words, and selectional preferences for open-class words in terms of (general) concepts.

A multi-lingual dictionary extracted from Wiktionary

Download here a multi-lingual dictionary extracted from the English dump of Wiktionary (20100403). The formatting is tab separated values (tsv) as follows:

ENTRY    ID    DIS    POS    VAR1    VAR2…

  • ENTRY and VARi (i = 1, 2, …) have the same form: "LANG":"EXPRESSION", where LANG is a language code. The difference between ENTRY and VARi is that ENTRY is built from the article title in Wiktionary, while VARi is built from the cross-language links in the article.
  • ID is the numeric ID of the article.
  • DIS is a “disambiguation” expression extracted from the article — when an expression can have multiple meanings (each corresponding to a different translation), the article groups the translations for each meaning and labels the group with this (DIS) expression.
  • POS is the part of speech of the entry.
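
A line in this format can be split into its fields as sketched below in Python. The sketch assumes that the quotation marks around LANG and EXPRESSION are plain ASCII double quotes in the file (they are stripped if present) and that the language code itself contains no colon; the class name, the function names and the sample line are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class DictEntry(NamedTuple):
    entry: Tuple[str, str]           # (language code, expression) from the article title
    article_id: int                  # numeric ID of the article
    dis: str                         # disambiguation expression
    pos: str                         # part of speech
    variants: List[Tuple[str, str]]  # translations from cross-language links


def split_lang(field: str) -> Tuple[str, str]:
    """Split a '"LANG":"EXPRESSION"' field into (LANG, EXPRESSION)."""
    lang, expression = field.split(":", 1)
    return lang.strip('"'), expression.strip('"')


def parse_entry(line: str) -> DictEntry:
    """Parse one tab-separated line: ENTRY, ID, DIS, POS, VAR1, VAR2, ..."""
    entry, article_id, dis, pos, *variants = line.rstrip("\n").split("\t")
    return DictEntry(split_lang(entry), int(article_id), dis, pos,
                     [split_lang(v) for v in variants])


# Made-up sample line, for illustration only.
sample = '"en":"dog"\t42\tanimal\tNoun\t"de":"Hund"\t"fr":"chien"\n'
print(parse_entry(sample))
```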

This dictionary contains only entries that have at least one translation. The total number of entries is 74,568, obtained by processing 1,741,886 articles. In the future we will combine this with the multi-lingual expressions extracted from Wikipedia. The direct (or reversed) index from WikiNet can also work as a parallel dictionary. Both also contain entries that have names only in English.


ISNotes Corpus

Download of ISNotes corpus

We provide a new MMAX2 annotation layer to 50 documents of the Wall Street Journal portion of the OntoNotes corpus. This layer extends the OntoNotes annotation with fine-grained information status and bridging relations. For more information, see the README file included in the download and have a look at our two papers relevant to this corpus.

Go to the ISNotes GitHub repository, or download directly from GitHub.

Publications related to the ISNotes corpus


Hou, Yufang; Markert, Katja; Strube, Michael (2013)
Global inference for bridging anaphora resolution.
In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, 9-14 June 2013.

ACL’12 (PDF)

Markert, Katja; Hou, Yufang; Strube, Michael (2012).
Collective Classification for Fine-grained Information Status.
In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Korea, 8-14 July 2012, pages 795-804.

