Frequencies of letter n-grams in Hebrew and Arabic

Project description

This project provides corpus-computed frequencies of letter n-grams in Hebrew and Arabic.
Ella Rabinovich and Shuly Wintner.


Psycholinguistic research uses words, both actual and nonce, as linguistic cues, triggers, or objects of investigation. It is often important to know how frequent such words are in texts. To facilitate the assessment of such frequencies, we provide the frequencies of letter sequences (unigrams, bigrams, and trigrams) in Modern Hebrew and Modern Standard Arabic. The frequencies were computed from two corpora: for Hebrew, a 12-million-word corpus of the HaAretz daily newspaper, containing articles from 1991; and for Arabic, a 10-million-word corpus of the Omani Al-Watan newspaper (presumably from 2004).

In the future, we intend to compute frequencies from more (and more diverse) corpora, and to develop an online tool for computing the probability of given words (both actual and nonce).
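As an illustration only, one common way to estimate the probability of a word from letter n-gram counts is a bigram model, in which each letter is conditioned on the letter before it. The sketch below is not the planned tool: the function names `train_bigram_counts` and `word_probability` are hypothetical, the training words are Latin-letter stand-ins, and no smoothing is applied, so any word containing an unseen bigram gets probability zero.

```python
from collections import Counter


def train_bigram_counts(words):
    """Count letter bigrams and their left contexts over padded words.

    Each word is padded with "<" and ">" to mark its beginning and end.
    """
    bigrams = Counter()
    contexts = Counter()
    for word in words:
        padded = "<" + word + ">"
        for first, second in zip(padded, padded[1:]):
            bigrams[first + second] += 1
            contexts[first] += 1
    return bigrams, contexts


def word_probability(word, bigrams, contexts):
    """Multiply the conditional probabilities P(second | first) along the word.

    Assumption: unsmoothed maximum-likelihood estimates, so an unseen
    bigram makes the whole word's probability 0.0.
    """
    padded = "<" + word + ">"
    probability = 1.0
    for first, second in zip(padded, padded[1:]):
        if bigrams[first + second] == 0:
            return 0.0
        probability *= bigrams[first + second] / contexts[first]
    return probability


# Toy example with stand-in words; real counts would come from a corpus.
bigrams, contexts = train_bigram_counts(["ab", "ac"])
p_seen = word_probability("ab", bigrams, contexts)
p_unseen = word_probability("ad", bigrams, contexts)
```

A practical tool would likely need some form of smoothing so that nonce words with rare letter combinations still receive a non-zero probability; which scheme to use is left open here.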


In the files below, Hebrew is encoded in UTF-8 and Arabic in Windows-1256. The symbols "<" and ">" represent the beginning and the end of words, respectively.
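The following minimal sketch shows how the boundary symbols and encodings described above might be handled when working with comparable data. It does not reproduce the scripts used to build the resource, and it makes no assumption about the layout of the frequency files. The function names (`letter_ngrams`, `ngram_frequencies`, `decode_text`) are hypothetical, and `cp1256` is Python's codec name for Windows-1256.

```python
from collections import Counter


def letter_ngrams(word, n):
    """Return the letter n-grams of a word padded with "<" and ">"."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_frequencies(words, n):
    """Return relative frequencies of letter n-grams over a list of words."""
    counts = Counter()
    for word in words:
        counts.update(letter_ngrams(word, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def decode_text(raw_bytes, language):
    """Decode raw bytes using the encoding stated for each language."""
    encoding = {"hebrew": "utf-8", "arabic": "cp1256"}[language]
    return raw_bytes.decode(encoding)


# Latin letters are used as stand-ins to keep the example easy to read.
example_bigrams = letter_ngrams("ab", 2)
example_frequencies = ngram_frequencies(["ab", "ab"], 2)
```

In this sketch the boundary symbols are treated as ordinary characters, so with `n = 1` they are counted as unigrams too; whether the published unigram counts include them is not stated here.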


If you are using this resource, please cite the following:


