Hebrew Multi-word Expressions: Definition, Processing and Acquisition
Project description
- Objective
- To define and classify multi-word expressions in Hebrew; develop a
methodology for their lexical representation; incorporate them
in an existing lexicon and a morphological processing system
based upon it; and develop techniques for automatic acquisition
of MWEs from corpora.
- Researchers
Hassan Al-Haj, Yulia Tsvetkov, Hanna Fadida (Technion), Kayla Jacobs (Technion) and
Shuly Wintner. Joint project
with Alon Itai at the Technion.
- Status
- Complete
- Funding
- ISF (grant 1269/07)
Abstract
Multi-word expressions (MWE) are lexical words consisting of more than
a single orthographic word. Semantically, their meaning is
non-compositional (i.e., cannot be established from the meanings of
their components); syntactically, they may function as words or as
phrases; morphologically, their behavior is often idiosyncratic;
and orthographically, they are written with intervening
spaces. Oftentimes, MWE are named entities. The identification of
MWE is an important task for a variety of NLP applications, ranging
from information retrieval and building ontologies to machine
translation. MWE are a challenge for computational processing of
natural languages because they combine properties of words and
phrases, and because phonological, morphological and orthographic
processes apply to them differently than to ordinary tokens. In
Hebrew, this challenge is paramount due to the complex morphology and
orthography of the language: morphological and orthographic processes
in Hebrew apply to MWE in unique ways, complicating morphological
processing and automatic extraction of MWE.
We will develop
theories and techniques for representing, analyzing and acquiring
Hebrew MWE. Specifically, we will:
- Develop an architecture
for lexical specification of MWE in Hebrew, and extend an existing
lexicon of the language with capabilities to store MWE;
- Develop
techniques for morphological processing of MWE in Hebrew, and extend
an existing morphological processor (analyzer/generator) with
capabilities to process MWE;
- Develop techniques to extract MWE
from monolingual and bilingual corpora, and populate the lexicon with
automatically acquired MWE;
- Evaluate the quality of the tools
using state-of-the-art evaluation measures, and investigate their
applicability to other languages with complex morphology and
orthography, notably Arabic.
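As a rough illustration of the corpus-based extraction goal above, the following minimal sketch ranks adjacent word pairs in a monolingual corpus by pointwise mutual information (PMI), a common association-measure baseline for proposing MWE candidates. It is not the method developed in this project. The function name `pmi_bigrams`, the `min_count` threshold and the toy English corpus are all assumptions made for the example; real Hebrew text would first need tokenization and morphological analysis.

```python
import math
from collections import Counter


def pmi_bigrams(sentences, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    `sentences` is an iterable of token lists. Pairs seen fewer than
    `min_count` times are skipped, since PMI is unreliable for rare events.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first: these are the MWE candidates.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy corpus (English stand-in for tokenized text).
corpus = [
    "the prime minister met the press".split(),
    "the prime minister spoke".split(),
    "the press met the minister".split(),
]
ranked = pmi_bigrams(corpus)
for pair, score in ranked:
    print(pair, round(score, 3))
```

On this toy corpus the pair ("prime", "minister") receives the highest score, because its two words co-occur more often than their individual frequencies would predict. A high PMI only marks a candidate; whether the pair is truly non-compositional still has to be decided by other evidence.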
Resources
A small (250,000-sentence) Hebrew-English parallel corpus. An
annotated list of noun-noun constructions, marked as either noun
compounds or compositional.
Verb-complement lexicons.
Publications
-
Kayla Jacobs, Alon Itai and Shuly Wintner.
Acronyms: Identification, Expansion and Disambiguation.
Annals of Mathematics and Artificial Intelligence 88(5-6):517-532, 2020.
PDF (This is a post-peer-review, pre-copyedit version
of an article published in Annals of Mathematics and Artificial
Intelligence. The final authenticated version is available online
at Springer.)
-
Livnat Herzig Sheinfux, Tali Arad Greshler, Nurit Melnik and Shuly
Wintner. Verbal multiword expressions: Idiomaticity and
flexibility. In Yannick Parmentier and Jakub Waszczuk (eds.),
Representation and parsing of multiword expressions: Current
trends, chapter 2, pages
35-68, Berlin: Language Science Press. 2019.
PDF.
-
Hanna Fadida, Alon Itai and Shuly Wintner.
A Hebrew Verb-Complement Dictionary.
Language Resources and Evaluation 48(2):249-278, June 2014.
PDF
(The original publication is available from Springer).
-
Hassan Al-Haj, Alon Itai and Shuly Wintner.
Lexical Representation of Multiword Expressions in
Morphologically-complex Languages.
International Journal of Lexicography 27(2):130-170, June
2014.
PDF.
-
Yulia Tsvetkov and Shuly Wintner.
Identification of Multi-word Expressions by Combining Multiple
Linguistic Information Sources.
Computational Linguistics 40(2):449-468, June 2014.
PDF.
- Hanna Fadida. Automatic Extraction of Subcategorization Frames for Hebrew,
M.Sc. thesis, Technion, June 2012. PDF.
-
Yulia Tsvetkov and Shuly Wintner.
Extraction of Multi-word Expressions from Small Parallel
Corpora.
Natural Language Engineering 18(4):549-573, October 2012.
PDF (Copyright Cambridge University Press, official version here).
- Yulia Tsvetkov and Shuly Wintner.
Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources.
Proceedings of the 2011 Conference on Empirical Methods in
Natural Language Processing (EMNLP 2011), pages 836-845,
Edinburgh, Scotland, July 2011.
PDF.
-
Yulia Tsvetkov and Shuly Wintner.
Extraction of Multi-word Expressions from Small Parallel Corpora.
Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 1256-1264, Beijing, August 2010.
PDF.
-
Hassan Al-Haj and Shuly Wintner.
Identifying Multi-word Expressions by Leveraging Morphological and Syntactic Idiosyncrasy.
Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), pages 10-18, Beijing, August 2010.
PDF.
-
Yulia Tsvetkov and Shuly Wintner.
Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content.
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC-2010), pages 3389-3392, Malta, May 2010.
PDF.
Contact
Computational Linguistics Group,
http://cl.haifa.ac.il/
Department of Computer Science,
University of Haifa
Maintained by shuly@cs.haifa.ac.il, modified Monday June 08, 2020.