Computational Linguistics Group, Department of Computer Science, University of Haifa
Psycholinguistic research uses words, both actual and nonce, as linguistic cues, triggers or objects of investigation. It is often important to know how frequent such words are in texts. To facilitate the assessment of such frequencies, we provide the frequencies of letter sequences (unigrams, bigrams and trigrams) in Modern Hebrew and Modern Standard Arabic. The frequencies were computed from two corpora: for Hebrew, a 12-million-word corpus of the HaAretz daily newspaper, containing articles from 1991; and for Arabic, a 10-million-word corpus of the Omani Al-Watan newspaper (presumably from 2004).
In the future, we intend to compute frequencies from more (and more diverse) corpora, and also to develop an online tool for computing the probability of given words (both actual and nonce).
In the files below, Hebrew is encoded in UTF-8 and Arabic in Windows-1256. The symbols "<" and ">" represent the beginning and the end of words, respectively.
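As an illustration of what such letter-sequence counts look like, here is a minimal Python sketch, not the procedure actually used to produce the files. It assumes that each word is padded with "<" and ">" before counting, and that relative frequency is the count divided by the total number of n-grams of that length; the function names `ngram_counts` and `relative_frequencies` are hypothetical.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count letter n-grams over words padded with '<' and '>'.

    Assumption: boundary symbols are added to every word before counting,
    so '<' and '>' also appear in the unigram counts.
    """
    counts = Counter()
    for word in words:
        padded = "<" + word + ">"
        # Slide a window of width n across the padded word.
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def relative_frequencies(counts):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


# Toy word list standing in for a corpus (Hebrew, handled as Unicode strings).
words = ["של", "שלום"]
bigrams = ngram_counts(words, 2)
for gram, freq in sorted(relative_frequencies(bigrams).items()):
    print(gram, round(freq, 3))
```

When reading the Arabic files in Python, passing `encoding="cp1256"` to `open` selects Windows-1256; the Hebrew files can be read with `encoding="utf-8"`.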
Mailing address | Shuly Wintner, Department of Computer Science, University of Haifa, 31905 Haifa, Israel |
Phone | +972-4-8288180 |
Fax | +972-4-8249331 |
Email | shuly@cs.haifa.ac.il |
http://cl.haifa.ac.il/
shuly@cs.haifa.ac.il, modified Monday August 01, 2016.