Supervised Learning of Roots

Project description

Objective: To design and implement machine learning techniques for automatic extraction of root consonants from words in Semitic languages.
Researchers: Ezra Daya and Shuly Wintner, with Dan Roth at UIUC
Status: Complete
Funding: The Caesarea Edmond Benjamin de Rothschild Foundation Institute for Interdisciplinary Applications of Computer Science

Abstract

The major word formation mechanism of Semitic languages is root-and-pattern: words are formed by inserting the consonants of a root into the open slots of a pattern. The root is a sequence of consonants (usually three, but sometimes two, four or, in extreme cases, five or more), and the pattern is a sequence of vowels and consonants that includes open slots for the root consonants. For example, when the Hebrew root k.t.b is inserted into the pattern hit_a_e_, the resulting lexeme is the verb hitkateb. As is usual in natural language morphology, combinations of roots and patterns sometimes involve morphological and morpho-phonological alternations. For example, inserting the root s.p.r into the same pattern yields histaper rather than the expected hitsaper: the t of the pattern and the s of the root exchange positions.
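The word formation process described above can be illustrated with a minimal Python sketch. It is not part of the project's software: the function names `interdigitate` and `apply_metathesis` are hypothetical, and the alternation rule is deliberately reduced to the single case mentioned in the text (a root-initial s following the hit- prefix).

```python
def interdigitate(root: str, pattern: str, slot: str = "_") -> str:
    """Fill the open slots of a pattern with the consonants of a root.

    The root is written with dots between consonants (e.g. "k.t.b"),
    and open slots in the pattern are marked with underscores.
    """
    consonants = root.split(".")
    if pattern.count(slot) != len(consonants):
        raise ValueError("pattern slots and root consonants do not match")
    remaining = iter(consonants)
    return "".join(next(remaining) if ch == slot else ch for ch in pattern)


def apply_metathesis(word: str, sibilants: tuple = ("s",)) -> str:
    """Swap the t of a hit- prefix with an immediately following sibilant.

    Assumption: only "s" is treated as a sibilant here, because that is
    the one case given in the text; a fuller treatment needs more rules.
    """
    if word.startswith("hit") and len(word) > 3 and word[3] in sibilants:
        return "hi" + word[3] + "t" + word[4:]
    return word


if __name__ == "__main__":
    for root in ("k.t.b", "s.p.r"):
        naive = interdigitate(root, "hit_a_e_")
        print(root, naive, apply_metathesis(naive))
```

Root extraction is the inverse of this process, and alternations such as the one handled by `apply_metathesis` are part of what makes the inverse non-trivial: the surface form no longer matches the pattern slot by slot.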

Extracting the root of a Semitic word is therefore a non-trivial task, especially given that there are thousands of roots and hundreds of patterns in a typical Semitic language. The objective of this project is to explore machine learning techniques for this task. We will investigate a variety of (supervised) learning algorithms for the problem, starting with linear classifiers (using the SNoW architecture) and memory-based learning (using TiMBL).

This task is especially interesting for existing machine learning technology due to the combination of two facts: on one hand, the number of targets is huge (approximately 2500 roots in the case of Hebrew); on the other hand, separating the problem into a small number of independent tasks (such as learning each consonant of the root in isolation) misses the obvious interdependencies between the root's letters. We will investigate several approaches to overcome these difficulties.
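The tension between predicting each consonant in isolation and respecting the dependencies among them can be sketched as follows. This is an illustration only, not the method used in the project: the per-position scores are made-up numbers standing in for classifier outputs, the two-entry lexicon is a toy, and the function names are hypothetical.

```python
from math import prod

# Hypothetical classifier scores for each of the three root positions.
POSITION_SCORES = [
    {"k": 0.6, "s": 0.4},
    {"p": 0.6, "t": 0.4},
    {"r": 0.7, "b": 0.3},
]

# Toy lexicon of known roots (the two roots used in the abstract).
LEXICON = [("k", "t", "b"), ("s", "p", "r")]


def independent_guess(position_scores):
    """Pick the best-scoring consonant at each position in isolation."""
    return tuple(max(scores, key=scores.get) for scores in position_scores)


def constrained_guess(position_scores, lexicon):
    """Pick the known root whose consonants jointly score highest.

    The joint score is the product of the per-position scores, which is
    one simple way to combine them; other combinations are possible.
    """
    def joint_score(root):
        return prod(s.get(c, 0.0) for s, c in zip(position_scores, root))

    candidates = [r for r in lexicon if len(r) == len(position_scores)]
    return max(candidates, key=joint_score, default=None)


if __name__ == "__main__":
    print("independent:", ".".join(independent_guess(POSITION_SCORES)))
    print("constrained:", ".".join(constrained_guess(POSITION_SCORES, LEXICON)))
```

With these toy numbers the independent predictions combine into a sequence that is not in the lexicon, while restricting the choice to known roots selects s.p.r. Scoring every root in a lexicon of roughly 2500 entries remains cheap, so this kind of constraint does not reintroduce the cost of a classifier with thousands of targets.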

The result of this project will be an automatic function for extracting the root of a given word in both Hebrew and Arabic (and, in principle, in any Semitic language). Such a function is useful in a variety of applications, including information retrieval and natural language processing tasks for Semitic languages.

Publications

Ezra Daya, Dan Roth, and Shuly Wintner. Identifying Semitic Roots: Machine Learning with Linguistic Constraints. Computational Linguistics 34(3):429-448, September 2008. PDF
Ezra Daya, Dan Roth, and Shuly Wintner. Learning to Identify Semitic Roots. In Antal van den Bosch, Guenter Neumann, and Abdelhadi Soudi, editors, Arabic Computational Morphology: Knowledge-based and Empirical Methods, Text, Speech, and Language Technology. Springer, 2007.
Ezra Daya. Learning to Identify Semitic Roots. November 2005. University of Haifa M.Sc. Thesis. [pdf]
Ezra Daya, Dan Roth and Shuly Wintner. Learning Hebrew Roots: Machine Learning with Linguistic Constraints. Proceedings of EMNLP'04, Barcelona, July 2004. PDF.

Contact

Mailing address	Shuly Wintner, Department of Computer Science, University of Haifa, 31905 Haifa, Israel
Phone	+972-4-8288180
Fax	+972-4-8249331
E-mail	`shuly@cs.haifa.ac.il`