Frequencies of letter n-grams in Hebrew and Arabic

Project description

Objective: To provide corpus-computed frequencies of letter n-grams in Hebrew and Arabic.
Researchers: Ella Rabinovich and Shuly Wintner.
Status: Complete
Funding: None

Abstract

Psycholinguistic research uses words, both actual and nonce, as linguistic cues, triggers or objects of investigation. It is often important to know how frequent such words are in texts. To facilitate the assessment of such frequencies, we provide the frequencies of letter sequences (unigrams, bigrams and trigrams) in Modern Hebrew and Modern Standard Arabic. The frequencies were computed from two corpora: for Hebrew, a 12-million-word corpus of the HaAretz daily newspaper, containing articles from 1991; and for Arabic, a 10-million-word corpus of the Omani Al-Watan newspaper (presumably from 2004).
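
To illustrate what a letter n-gram frequency is, here is a minimal Python sketch that counts the unigrams, bigrams or trigrams in a list of words. It is not the code used to produce the resource; the function names (letter_ngrams, ngram_counts, relative_frequencies) are hypothetical, and it assumes that frequencies are counted over word tokens with no special treatment of word boundaries.

```python
from collections import Counter


def letter_ngrams(word, n):
    """Return all contiguous letter sequences of length n in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_counts(words, n):
    """Count letter n-grams over an iterable of word tokens."""
    counts = Counter()
    for word in words:
        counts.update(letter_ngrams(word, n))
    return counts


def relative_frequencies(counts):
    """Convert raw counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}
```

For example, ngram_counts(["ab", "ab", "ba"], 2) counts the bigram "ab" twice and "ba" once.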

In the future, we intend to compute frequencies from more (and more diverse) corpora, and also to develop an online tool for computing the probability of given words (both actual and nonce).

Resources

In the files below, Hebrew is encoded in UTF-8 and Arabic in Windows-1256. The symbols "<" and ">" represent the beginning and the end of words, respectively.
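
The following Python sketch shows one way to handle the two encodings and the boundary symbols. It makes several assumptions: the helper names (decode_hebrew, decode_arabic, mark_boundaries, boundary_ngrams) are hypothetical, the layout of the resource files is not described here and is therefore not parsed, and each word is assumed to be wrapped in exactly one "<" and one ">" before n-grams are extracted.

```python
def decode_hebrew(raw):
    """Decode bytes from a Hebrew file, which is stated to be UTF-8."""
    return raw.decode("utf-8")


def decode_arabic(raw):
    """Decode bytes from an Arabic file, which is stated to be Windows-1256.

    "cp1256" is Python's codec name for Windows-1256.
    """
    return raw.decode("cp1256")


def mark_boundaries(word):
    """Wrap a word in "<" and ">" (assumption: one marker on each side)."""
    return "<" + word + ">"


def boundary_ngrams(word, n):
    """Return the letter n-grams of a word, including boundary symbols."""
    marked = mark_boundaries(word)
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]
```

Under these assumptions, boundary_ngrams("ab", 2) returns ['<a', 'ab', 'b>'], so word-initial and word-final letters are counted separately from word-internal ones.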

Publications

If you are using this resource, please cite the following:

Ella Rabinovich and Shuly Wintner. Frequencies of letter n-grams in Hebrew and Arabic. Unpublished manuscript, University of Haifa, August 2016. URL.

Contact

Mailing address	Shuly Wintner, Department of Computer Science, University of Haifa, 31905 Haifa, Israel
Phone	+972-4-8288180
Fax	+972-4-8249331
E-mail	`shuly@cs.haifa.ac.il`