Supervised Learning of Roots
- To design and implement machine learning techniques for automatic
extraction of root consonants from words in Semitic languages.
- Ezra Daya and Shuly Wintner, with Dan Roth at UIUC
- The Caesarea Edmond Benjamin de Rothschild Foundation Institute for Interdisciplinary Applications of Computer Science
The major word formation mechanism of Semitic languages is
root-and-pattern: words are formed by inserting the
consonants of a root into the open slots of a
pattern. The root is a sequence of consonants (usually three, but
sometimes two, four or, in extreme cases, five or more) and the
pattern is a sequence of vowels and consonants which includes "open
slots" for the root. For example, when the Hebrew root k.t.b
is inserted into the pattern hit_a_e_, the resulting lexeme is
the verb hitkateb. As is usual in natural language
morphology, combinations of roots and patterns sometimes involve
morphological and morpho-phonological alternations. For example,
insertion of the root s.p.r into the same pattern results in
histaper rather than the expected hitsaper, because the t of the
prefix and the sibilant s that follows it swap places (metathesis).
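
The word formation process described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, the underscore marks an open slot as in the pattern shown above, and the alternation rule covers only the single case mentioned in the text (a prefix t followed by s), not the full range of alternations found in Hebrew.

```python
def interdigitate(root: str, pattern: str, slot: str = "_") -> str:
    """Fill each open slot of `pattern` with the next consonant of `root`.

    `root` is written with dots between consonants, e.g. "k.t.b".
    """
    consonants = root.split(".")
    if pattern.count(slot) != len(consonants):
        raise ValueError("number of slots and root consonants must match")
    remaining = iter(consonants)
    return "".join(next(remaining) if ch == slot else ch for ch in pattern)


def apply_metathesis(word: str, sibilants=frozenset({"s"})) -> str:
    """Swap the t of the prefix 'hit' with a following sibilant.

    Assumption: only 's' is treated as a sibilant here, because it is
    the only case the text illustrates.
    """
    if word.startswith("hit") and len(word) > 3 and word[3] in sibilants:
        return "hi" + word[3] + "t" + word[4:]
    return word


def form_word(root: str, pattern: str) -> str:
    """Insert the root into the pattern, then apply the alternation."""
    return apply_metathesis(interdigitate(root, pattern))


print(form_word("k.t.b", "hit_a_e_"))  # hitkateb
print(form_word("s.p.r", "hit_a_e_"))  # histaper
```

Root extraction is the inverse of `form_word`: given only the surface form, recover the root without knowing in advance which pattern and which alternations produced it.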
Extracting the root of a Semitic word is therefore a non-trivial task,
especially given that there are thousands of roots and hundreds of
patterns in a typical Semitic language. The objective of this project
is to explore machine learning techniques for this task. We will
investigate a variety of (supervised) learning algorithms for the
problem, starting with linear classifiers (using the SNoW
architecture) and memory-based learning (using TiMBL).
This task is especially interesting for existing machine learning
technology due to the combination of two facts: on one hand, the
number of targets is huge (approximately 2500 roots in the case of
Hebrew); on the other hand, separating the problem into a small number
of independent tasks (such as learning each consonant of the root in
isolation) misses the obvious interdependencies between the root's
letters. We will investigate several approaches to overcome these
difficulties.
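
The tension between the two facts can be made concrete with a toy sketch. It assumes one classifier per root position, each returning a score for every candidate consonant, and contrasts picking each consonant in isolation with picking the best-scoring combination that appears in a list of known roots. The scores, the two-root lexicon and the function names are all invented for illustration; this is one possible way to account for the interdependencies, not necessarily the approach taken in this project.

```python
def independent_guess(position_scores):
    """Pick the top-scoring consonant at each root position in isolation."""
    return tuple(max(scores, key=scores.get) for scores in position_scores)


def constrained_guess(position_scores, known_roots):
    """Pick the known root whose consonants have the highest joint score.

    The joint score is the product of the per-position scores, so a
    root is only as good as its weakest consonant.
    """
    def joint_score(root):
        total = 1.0
        for consonant, scores in zip(root, position_scores):
            total *= scores.get(consonant, 0.0)
        return total

    candidates = [r for r in known_roots if len(r) == len(position_scores)]
    return max(candidates, key=joint_score) if candidates else None


# Invented scores for the three root positions of the word "histaper".
scores = [
    {"s": 0.60, "h": 0.40},
    {"t": 0.55, "p": 0.45},
    {"r": 0.90, "p": 0.10},
]
# Toy lexicon with just the two roots mentioned in the text.
known_roots = {("s", "p", "r"), ("k", "t", "b")}

print(independent_guess(scores))               # ('s', 't', 'r')
print(constrained_guess(scores, known_roots))  # ('s', 'p', 'r')
```

In this toy setting the independent guess s.t.r is not in the lexicon, while the constrained guess recovers s.p.r even though p was not the top candidate for the second position.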
The result of this project will be an automatic function for
extracting the root of a given word in both Hebrew and Arabic (and, in
principle, in any Semitic language). Such a function is useful in a
variety of applications, including information retrieval and natural
language processing tasks for Semitic languages.
- Ezra Daya, Dan Roth, and Shuly Wintner.
Identifying Semitic Roots: Machine Learning with Linguistic
Constraints. Computational Linguistics 34(3):429-448.
- Ezra Daya, Dan Roth, and Shuly Wintner.
Learning to Identify Semitic Roots.
In Antal van den Bosch, Guenter Neumann, and Abdelhadi Soudi,
editors, Arabic Computational Morphology: Knowledge-based and
Empirical Methods, Text, Speech, and Language Technology. Springer.
- Ezra Daya. Learning to Identify Semitic Roots.
M.Sc. thesis, University of Haifa, November 2005.
- Ezra Daya, Dan Roth, and Shuly Wintner.
Learning Hebrew Roots: Machine Learning with Linguistic Constraints.
Proceedings of EMNLP'04, Barcelona, July 2004.
Computational Linguistics Group,
Department of Computer Science,
University of Haifa
modified Sunday November 24, 2013.