Collection, organization and annotation of Hebrew corpora
Project description
- Objective: Collect, maintain, organize and annotate (written) Hebrew corpora
- Researchers: Shlomo Yona, Shuly Wintner
- Status: Complete
- Funding: Israeli Ministry of Science and Technology, as part of the Knowledge Center for Hebrew Language Telecommunication
Abstract
The goal of this project is to maintain a collection of written Hebrew
texts, taken mostly from newspapers, organize them structurally using
XML, and annotate them morphologically (syntactic annotation may
follow in the future). We currently have over 2,500 newspaper articles
(mostly from HaAretz, Maariv and Yediot) and over 40,000 short
newswire articles from Arutz 7, totalling over one million word
tokens.
The texts are annotated morphologically using an automatic
morphological analyzer; two versions of the corpus exist: one in which
each word is assigned all its analyses, and another in which
morphological ambiguity is resolved. We are currently working on a
manually annotated subset of the corpus in which every analysis has
been verified. The articles are represented in XML, using dedicated schemas
that we have designed.
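
The relationship between the two corpus versions can be illustrated with a minimal Python sketch. The project's actual schemas are not reproduced here, so the element and attribute names (`article`, `sentence`, `token`, `analysis`, `surface`, `lemma`, `pos`, `score`), the transliterated lemmas, and the score-based selection rule are all assumptions made for illustration. They are not the project's markup or its disambiguation method. The sketch counts word tokens and derives a one-analysis-per-token version from a version that lists every analysis.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element names, attribute names and scores are
# invented for this example and do not reflect the project's real schemas.
SAMPLE = """\
<article id="demo-1">
  <sentence>
    <token surface="ספר">
      <analysis lemma="sefer" pos="noun" score="0.6"/>
      <analysis lemma="safar" pos="verb" score="0.3"/>
      <analysis lemma="sapar" pos="noun" score="0.1"/>
    </token>
    <token surface="חדש">
      <analysis lemma="xadash" pos="adjective" score="0.8"/>
      <analysis lemma="xidesh" pos="verb" score="0.2"/>
    </token>
  </sentence>
</article>
"""


def count_tokens(root):
    """Return the number of <token> elements under root."""
    return sum(1 for _ in root.iter("token"))


def analyses_per_token(root):
    """Return, for each token in order, how many analyses it carries."""
    return [len(tok.findall("analysis")) for tok in root.iter("token")]


def disambiguate(root):
    """Keep only the highest-scoring analysis of each token (in place).

    Choosing by a numeric score is an assumption of this sketch; it stands
    in for whatever disambiguation procedure a real corpus would use.
    """
    for tok in list(root.iter("token")):
        analyses = tok.findall("analysis")
        if not analyses:
            continue
        best = max(analyses, key=lambda a: float(a.get("score", "0")))
        for a in analyses:
            if a is not best:
                tok.remove(a)
    return root


# Version 1: every token keeps all of its analyses.
full = ET.fromstring(SAMPLE)
# Version 2: ambiguity resolved, one analysis per token.
resolved = disambiguate(ET.fromstring(SAMPLE))

if __name__ == "__main__":
    print("tokens:", count_tokens(full))
    print("analyses per token (full):", analyses_per_token(full))
    print("analyses per token (resolved):", analyses_per_token(resolved))
```

Parsing the sample twice keeps the ambiguous version intact while the second copy is reduced, which mirrors keeping both versions of the corpus side by side.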
Resources
Publications
- Shuly Wintner and Shlomo Yona, Resources for Processing Hebrew, in Proceedings of the MT Summit IX Workshop on Machine Translation for Semitic Languages, New Orleans, September 2003.
Contact
Computational Linguistics Group,
http://cl.haifa.ac.il/
Department of Computer Science,
University of Haifa
Maintained by shuly@cs.haifa.ac.il, modified Sunday November 24, 2013.