Collection, organization and annotation of Hebrew corpora

Project description

Objective: Collect, maintain, organize and annotate (written) Hebrew corpora
Researchers: Shlomo Yona, Shuly Wintner
Status: Complete
Funding: Israeli Ministry of Science and Technology, as part of the Knowledge Center for Hebrew Language Telecommunication

Abstract

The goal of this project is to maintain a collection of written Hebrew texts, taken mostly from newspapers, to organize them structurally using XML, and to annotate them morphologically (syntactic annotation may follow in the future). We currently have over 2,500 newspaper articles (mostly from HaAretz, Maariv and Yediot) and over 40,000 short newswire articles from Arutz 7, totalling over one million word tokens.

The texts are annotated morphologically using an automatic morphological analyzer. Two versions of the corpus exist: one in which each word is assigned all of its analyses, and another in which morphological ambiguity is resolved. We are currently working on a manually annotated subset of the corpus, in which the analyses are verified. The articles are represented in XML, using dedicated schemas that we have designed.
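As a rough illustration of the difference between the two corpus versions, the sketch below parses a small XML fragment in which a token carries several analyses and reduces it to a single analysis per token. The element and attribute names (`article`, `sentence`, `token`, `analysis`, `pos`, `gloss`, `score`), the scores, and the "keep the highest score" rule are all assumptions made for this example; they are not the project's actual schemas or its disambiguation method.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: names, scores and structure are illustrative only
# and do not reproduce the project's real schemas.
AMBIGUOUS = """\
<article id="example">
  <sentence>
    <token surface="ספר">
      <analysis pos="noun" gloss="book" score="0.7"/>
      <analysis pos="verb" gloss="counted" score="0.3"/>
    </token>
  </sentence>
</article>
"""


def disambiguate(root):
    """Keep only the highest-scoring analysis of each token (in place).

    This selection rule is a stand-in for whatever procedure actually
    resolves the ambiguity; it is assumed here for demonstration.
    """
    for token in root.iter("token"):
        analyses = token.findall("analysis")
        if not analyses:
            continue  # nothing to choose from
        best = max(analyses, key=lambda a: float(a.get("score", "0")))
        for analysis in analyses:
            if analysis is not best:
                token.remove(analysis)
    return root


def analyses_per_token(root):
    """Return the number of analyses attached to each token, in order."""
    return [len(token.findall("analysis")) for token in root.iter("token")]


if __name__ == "__main__":
    root = ET.fromstring(AMBIGUOUS)
    print("before:", analyses_per_token(root))
    disambiguate(root)
    print("after: ", analyses_per_token(root))
    print(ET.tostring(root, encoding="unicode"))
```

In this sketch the ambiguous and disambiguated versions share the same structure, so a consumer that iterates over `analysis` elements can read either version without modification.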

Resources

All corpora are available from the Knowledge Center for Processing Hebrew.

Publications

Shuly Wintner and Shlomo Yona, Resources for Processing Hebrew, in Proceedings of the MT Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September 2003.

Contact

Mailing address	Shuly Wintner, Department of Computer Science, University of Haifa, 31905 Haifa, Israel
Phone	+972-4-8288180
Fax	+972-4-8249331
E-mail	`shuly@cs.haifa.ac.il`