Hebrew Multi-word Expressions: Definition, Processing and Acquisition
Project description
- Objective
- To define and classify multi-word expressions in Hebrew; develop a
methodology for their lexical representation; incorporate them
in an existing lexicon and a morphological processing system
based upon it; and develop techniques for automatic acquisition
of MWEs from corpora.
- Researchers
- Hassan Al-Haj, Yulia Tsvetkov, Hanna Fadida (Technion), Kayla Jacobs (Technion) and
Shuly Wintner. Joint project with Alon Itai at the Technion.
- Status
- Ongoing
- Funding
- ISF (grant 1269/07)
Abstract
Multi-word expressions (MWE) are lexical words
consisting of more than a single orthographic word. Semantically,
their meaning is non-compositional (i.e., cannot be established from
the meanings of their components); syntactically, they may function as
words or as phrases; morphologically, their behavior is often
idiosyncratic; and orthographically, they are written with intervening
spaces. MWE are often named entities.
The identification of MWE is an important task for a variety of NLP
applications, ranging from information retrieval and building
ontologies to machine translation. MWE are a challenge for
computational processing of natural languages because they combine
properties of words and phrases, and because phonological,
morphological and orthographic processes apply to them differently
than to ordinary tokens. In Hebrew, this challenge is paramount due to
the complex morphology and orthography of the language: morphological
and orthographic processes in Hebrew apply to MWE in unique ways,
complicating morphological processing and automatic extraction of
MWE.
We will develop theories and techniques for representing, analyzing
and acquiring Hebrew MWE. Specifically, we will:
- Develop an architecture for lexical specification of MWE in Hebrew,
and extend an existing lexicon of the language with capabilities to
store MWE;
- Develop techniques for morphological processing of MWE in Hebrew,
and extend an existing morphological processor (analyzer/generator)
with capabilities to process MWE;
- Develop techniques to extract MWE from monolingual and bilingual
corpora, and populate the lexicon with automatically acquired MWE;
- Evaluate the quality of the tools using state-of-the-art evaluation
measures, and investigate their applicability to other languages
with complex morphology and orthography, notably Arabic.
Resources
A small (250,000-sentence) Hebrew-English parallel corpus. An annotated list of noun-noun constructions, marked as either noun compounds or compositional.
Publications
- Yulia Tsvetkov and Shuly Wintner.
Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources.
Proceedings of the 2011 Conference on Empirical Methods in
Natural Language Processing (EMNLP 2011), pages 836-845,
Edinburgh, Scotland, July 2011.
PDF.
- Yulia Tsvetkov and Shuly Wintner.
Extraction of Multi-word Expressions from Small Parallel Corpora.
Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 1256-1264, Beijing, August 2010.
PDF.
- Hassan Al-Haj and Shuly Wintner.
Identifying Multi-word Expressions by Leveraging Morphological and Syntactic Idiosyncrasy.
Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 10-18, Beijing, August 2010.
PDF.
- Yulia Tsvetkov and Shuly Wintner.
Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content.
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), pages 3389-3392, Malta, May 2010.
PDF.
Contact
Computational Linguistics Group,
http://cl.haifa.ac.il/
Department of Computer Science,
University of Haifa
Maintained by
shuly@cs.haifa.ac.il,
modified Thursday December 29, 2011.