Computational Linguistics Group
Department of Computer Science
University of Haifa

Frequencies of letter n-grams in Hebrew and Arabic

Project description

To provide corpus-computed frequencies of letter n-grams in Hebrew and Arabic.
Ella Rabinovich and Shuly Wintner.


Psycholinguistic research uses words, both actual and nonce, as linguistic cues, triggers, or objects of investigation. It is often important to know how frequent such words are in texts. To facilitate the assessment of such frequencies, we provide the frequencies of letter sequences (unigrams, bigrams, and trigrams) in Modern Hebrew and Modern Standard Arabic. The frequencies were computed from two corpora: for Hebrew, a 12-million-word corpus of the HaAretz daily newspaper, containing articles from 1991; and for Arabic, a 10-million-word corpus of the Omani Al-Watan newspaper (presumably from 2004).

In the future, we intend to compute frequencies from more, and more diverse, corpora, and also to develop an online tool for computing the probability of given words (both actual and nonce).


In the files below, Hebrew is encoded in UTF-8 and Arabic in Windows-1256. The symbols "<" and ">" represent the beginning and the end of words, respectively.
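
To illustrate the boundary convention and the two encodings, here is a minimal Python sketch of how letter n-grams with "<" and ">" markers might be counted from a list of words. It is not the procedure used to build this resource: the function names letter_ngrams and ngram_frequencies are hypothetical, the sample words are made up, and the layout of the distributed files is not assumed. Whether the boundary symbols themselves count as unigrams is also an assumption of this sketch.

```python
from collections import Counter


def letter_ngrams(word, n):
    """Return the letter n-grams of a word padded with '<' and '>'."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_frequencies(words, n):
    """Return relative frequencies of letter n-grams over a list of words."""
    counts = Counter()
    for word in words:
        counts.update(letter_ngrams(word, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


# Decoding raw bytes with the encodings named above (sample words only).
hebrew_word = "שלום".encode("utf-8").decode("utf-8")
arabic_word = "سلام".encode("cp1256").decode("cp1256")  # cp1256 = Windows-1256

hebrew_bigrams = letter_ngrams(hebrew_word, 2)
arabic_trigrams = letter_ngrams(arabic_word, 3)
bigram_freqs = ngram_frequencies([hebrew_word, hebrew_word], 2)
```

With this padding, a word of k letters yields k + 1 bigrams and k trigrams, so the first and last n-grams of each word always carry a boundary symbol.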


If you are using this resource, please cite the following:


Mailing address: Shuly Wintner
Department of Computer Science
University of Haifa
31905 Haifa, Israel.
Phone: +972-4-8288180
Fax: +972-4-8249331
E-mail: shuly@cs.haifa.ac.il

Computational Linguistics Group, http://cl.haifa.ac.il/
Department of Computer Science, University of Haifa
Maintained by shuly@cs.haifa.ac.il, modified Monday August 01, 2016.