We are developing WikiNet, a multi-language ontology built by exploiting several aspects of Wikipedia. If you want to make your own WikiNet, here are the (Perl) scripts.
Click here to download the current version, built from the 20120104 version of the English Wikipedia, with added lexicalizations from the Dutch (20120119), French (20120117), German (20120115), Italian (20120126), Arabic (20120123), Bulgarian (20120129), Farsi (20120124), Japanese (20120121), Korean (20120122), Russian (20120130), Turkish (20120124) and Chinese (20120128) versions. It also contains lexicalizations in many more languages; check the language statistics file for details.
The download contains a direct index (index.wiki) and a reversed index (reversed_index.wiki), both multi-lingual, a file with relations (data.wiki), a file with definitions (defs.wiki) and more. The structure is as follows:
More details are available in a README file, the relation statistics, the language statistics (number of lexicalizations and number of entries covered for each language represented), and a paper. Additional files include the incoming and outgoing links between concepts, corresponding to the hyperlinks in the article bodies.
There are approximately 3 million concepts and more than 38 million relations.
We have a toolkit for visualizing and extracting information from WikiNet: WikiNetTK.
A precursor of the resource in simple text format (in English) is WikiRelations.
WikiNetTK is a tool that allows you to visualize WikiNet and embed it in your NLP applications. Below are a few screenshots from the visualization component.
Starting point: choose the concept to visualize by first entering its name and then selecting the one you want from the candidates found:
Expand the relations surrounding a concept:
Visualize and browse information for a concept in text format:
Visualize the paths between concept pairs:
Download here a description of concepts in terms of their grammatical relations to open-class words, and selectional preferences for open-class words in terms of (general) concepts.
Download here a multi-lingual dictionary extracted from the English dump of Wiktionary (20100403). The format is tab-separated values (TSV), with one entry per line, as follows:
ENTRY ID DIS POS VAR1 VAR2…
This dictionary contains only entries that have at least one translation. The total number of entries is 74,568, obtained by processing 1,741,886 articles. In the future we will combine this with the multi-lingual expressions extracted from Wikipedia.
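The tab-separated layout given above can be read with a few lines of Python. This is only a minimal sketch: the field names mirror the column labels ENTRY, ID, DIS, POS and VAR1, VAR2…, whose exact meaning is not documented on this page, and the `DictEntry` type, the `parse_line` and `read_dictionary` helpers, and the sample line are hypothetical illustrations rather than part of the distributed resource.

```python
from typing import Iterator, NamedTuple, Tuple


class DictEntry(NamedTuple):
    """One dictionary line; names follow the column labels in the format line."""
    entry: str                 # ENTRY column
    entry_id: str              # ID column (kept as text; its type is an assumption)
    dis: str                   # DIS column
    pos: str                   # POS column
    variants: Tuple[str, ...]  # VAR1, VAR2, ... (any number of trailing columns)


def parse_line(line: str) -> DictEntry:
    """Split one tab-separated line into the four fixed columns plus variants."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 tab-separated fields, got {len(fields)}")
    entry, entry_id, dis, pos, *variants = fields
    return DictEntry(entry, entry_id, dis, pos, tuple(variants))


def read_dictionary(path: str) -> Iterator[DictEntry]:
    """Yield entries from a TSV file, skipping blank lines (UTF-8 is assumed)."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


if __name__ == "__main__":
    # Made-up sample line, used only to show the shape of the parsed result.
    sample = "house\t42\t1\tnoun\tHaus\tmaison\n"
    print(parse_line(sample))
```

Keeping the trailing columns as a tuple avoids assuming how many VAR fields a line carries, since the format line leaves that open.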