Collection, organization and annotation of Hebrew corpora

Project description

Collect, maintain, organize and annotate (written) Hebrew corpora
Shlomo Yona, Shuly Wintner
Israeli Ministry of Science and Technology, as part of the Knowledge Center for Hebrew Language Telecommunication


The goal of this project is to maintain a collection of written Hebrew texts, taken mostly from newspapers, to organize them structurally using XML, and to annotate them morphologically (syntactic annotation may follow in the future). We currently have over 2,500 newspaper articles (mostly from HaAretz, Maariv and Yediot) and over 40,000 short newswire articles from Arutz 7, totalling over one million word tokens.

The texts are annotated morphologically using an automatic morphological analyzer. Two versions of the corpus exist: one in which each word is assigned all of its analyses, and another in which morphological ambiguity is resolved. We are currently working on a manually annotated subset of the corpus, whose analyses are verified. The articles are represented in XML, using dedicated schemas that we have designed.
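
As an illustration only, the relationship between the two corpus versions can be sketched in Python. The XML fragment below is hypothetical: the element and attribute names (article, sentence, token, analysis, lemma, pos, score) are assumptions made for this sketch and are not the project's actual schemas. Picking the highest-scored analysis is likewise a stand-in for whatever disambiguation method is really used.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical fragment: names and values are illustrative assumptions,
# not taken from the project's real schemas or data.
SAMPLE = """\
<article id="a1" source="example">
  <sentence id="s1">
    <token id="t1" surface="ספר">
      <analysis lemma="ספר" pos="noun" score="0.7"/>
      <analysis lemma="ספר" pos="verb" score="0.3"/>
    </token>
    <token id="t2" surface="חדש">
      <analysis lemma="חדש" pos="adjective" score="1.0"/>
    </token>
  </sentence>
</article>
"""


def ambiguous_tokens(root):
    """Return the ids of tokens that carry more than one analysis."""
    return [
        tok.get("id")
        for tok in root.iter("token")
        if len(tok.findall("analysis")) > 1
    ]


def disambiguate(root):
    """Return a copy in which each token keeps only one analysis.

    The analysis with the highest (assumed) score attribute is kept;
    the input tree is left unchanged.
    """
    resolved = copy.deepcopy(root)
    for tok in resolved.iter("token"):
        analyses = tok.findall("analysis")
        if not analyses:
            continue
        best = max(analyses, key=lambda a: float(a.get("score", "0")))
        for analysis in analyses:
            if analysis is not best:
                tok.remove(analysis)
    return resolved


root = ET.fromstring(SAMPLE)
print("ambiguous before:", ambiguous_tokens(root))
print("ambiguous after: ", ambiguous_tokens(disambiguate(root)))
```

In this sketch the all-analyses version and the disambiguated version share one structure and differ only in how many analysis elements each token holds, which keeps tools written for one version usable on the other.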




