Computational Linguistics Group
Department of Computer Science
University of Haifa

Supervised Learning of Roots

Project description

To design and implement machine learning techniques for automatic extraction of root consonants from words in Semitic languages.
Ezra Daya and Shuly Wintner, with Dan Roth at UIUC
The Caesarea Edmond Benjamin de Rothschild Foundation Institute for Interdisciplinary Applications of Computer Science


The major word formation mechanism of Semitic languages is root-and-pattern: words are formed by inserting the consonants of a root into the open slots of a pattern. The root is a sequence of consonants (usually three, but sometimes two, four or, in extreme cases, five or more), and the pattern is a sequence of vowels and consonants that includes "open slots" for the root. For example, when the Hebrew root k.t.b is inserted into the pattern hit_a_e_, the resulting lexeme is the verb hitkateb. As is usual in natural language morphology, combinations of roots and patterns sometimes involve morphological and morpho-phonological alternations. For example, inserting the root s.p.r into the same pattern yields histaper rather than the expected hitsaper.
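The two examples above can be reproduced with a minimal Python sketch. It is an illustration only, not part of the project's software: the function names `interdigitate` and `apply_metathesis` are hypothetical, and the alternation rule is deliberately reduced to the single case mentioned in the text (the sequence "ts" after the prefix "hi" surfacing as "st").

```python
def interdigitate(root: str, pattern: str) -> str:
    """Fill each '_' slot of the pattern with the next consonant of the root.

    The root is written with dots between consonants, e.g. "k.t.b".
    """
    consonants = root.split(".")
    if pattern.count("_") != len(consonants):
        raise ValueError("number of slots and root consonants differ")
    remaining = iter(consonants)
    return "".join(next(remaining) if ch == "_" else ch for ch in pattern)


def apply_metathesis(form: str) -> str:
    """Simplified alternation (an assumption of this sketch):
    a form beginning with "hits" surfaces with "hist" instead.
    Real Hebrew morpho-phonology involves more cases than this one.
    """
    if form.startswith("hits"):
        return "hist" + form[4:]
    return form


print(interdigitate("k.t.b", "hit_a_e_"))                    # hitkateb
print(interdigitate("s.p.r", "hit_a_e_"))                    # hitsaper
print(apply_metathesis(interdigitate("s.p.r", "hit_a_e_")))  # histaper
```

Root extraction is the inverse of this process: given only the surface form, recover the root, without knowing in advance which pattern or alternation was involved.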

Extracting the root of a Semitic word is therefore a non-trivial task, especially given that there are thousands of roots and hundreds of patterns in a typical Semitic language. The objective of this project is to explore machine learning techniques for this task. We will investigate a variety of (supervised) learning algorithms for the problem, starting with linear classifiers (using the SNoW architecture) and memory-based learning (using TiMBL).

This task is especially challenging for existing machine learning technology because of two facts that pull in opposite directions. On the one hand, the number of targets is huge (approximately 2500 roots in the case of Hebrew), so treating each root as a separate class is difficult. On the other hand, separating the problem into a small number of independent tasks (such as learning each consonant of the root in isolation) misses the obvious interdependencies between the root's letters. We will investigate several approaches to overcome these difficulties.
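The cost of ignoring these interdependencies can be seen in a small Python sketch. Everything in it is an assumption made for illustration: the per-position scores are invented toy numbers (not experimental results), the lexicon contains only the two roots mentioned earlier, and restricting the answer to known roots is just one conceivable remedy, not necessarily one the project adopts.

```python
# Toy scores for each root position, invented purely for illustration.
position_scores = [
    {"k": 0.6, "s": 0.4},    # first root consonant
    {"p": 0.55, "t": 0.45},  # second root consonant
    {"b": 0.7, "r": 0.3},    # third root consonant
]

# Hypothetical miniature lexicon of valid roots.
known_roots = {("k", "t", "b"), ("s", "p", "r")}


def independent_guess(scores):
    """Pick the best consonant at each position, ignoring the others."""
    return tuple(max(s, key=s.get) for s in scores)


def lexicon_constrained_guess(scores, lexicon):
    """Pick the known root whose consonants have the highest joint score."""
    def joint_score(root):
        total = 1.0
        for consonant, position in zip(root, scores):
            total *= position.get(consonant, 0.0)
        return total

    candidates = [r for r in lexicon if len(r) == len(scores)]
    return max(candidates, key=joint_score)


naive = independent_guess(position_scores)
constrained = lexicon_constrained_guess(position_scores, known_roots)
print(".".join(naive), naive in known_roots)  # k.p.b False
print(".".join(constrained))                  # k.t.b
```

Each position looks reasonable in isolation, yet the combined answer k.p.b is not a root in the lexicon; scoring whole roots jointly recovers k.t.b.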

The result of this project will be an automatic function for extracting the root of a given word in both Hebrew and Arabic (and, in principle, in any Semitic language). Such a function is useful in a variety of applications, including information retrieval and natural language processing tasks for Semitic languages.



Mailing address: Shuly Wintner
Department of Computer Science
University of Haifa
31905 Haifa, Israel
Phone: +972-4-8288180
Fax: +972-4-8249331
E-mail: shuly@cs.haifa.ac.il

Computational Linguistics Group, http://cl.haifa.ac.il/
Maintained by shuly@cs.haifa.ac.il, modified Sunday November 24, 2013.