Computational Linguistics Group
Department of Computer Science
University of Haifa

Frequencies of letter n-grams in Hebrew and Arabic

Project description

Objective
To provide corpus-computed frequencies of letter n-grams in Hebrew and Arabic.
Researchers
Ella Rabinovich and Shuly Wintner.
Status
Complete
Funding
None

Abstract

Psycholinguistic research uses words, both actual and nonce, as linguistic cues, triggers, or objects of investigation. It is often important to know how frequent such words are in texts. To facilitate the assessment of such frequencies, we provide the frequencies of letter sequences (unigrams, bigrams, and trigrams) in Modern Hebrew and Modern Standard Arabic. The frequencies were computed from two corpora: for Hebrew, a 12-million-word corpus of the HaAretz daily newspaper, containing articles from 1991; and for Arabic, a 10-million-word corpus of the Omani Al-Watan newspaper (presumably from 2004).

In the future, we intend to compute frequencies from more, and more diverse, corpora, and to develop an online tool for computing the probability of given words (both actual and nonce).

Resources

In the files below, Hebrew is encoded in UTF-8 and Arabic in Windows-1256. The symbols "<" and ">" represent the beginning and the end of words, respectively.
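
As an illustration of the boundary convention, the following Python sketch shows one way letter n-grams could be extracted and counted when each word is wrapped in "<" and ">". It is not the code used to produce the files. The function names `letter_ngrams` and `ngram_frequencies` are hypothetical, the toy tokens are Latin placeholders rather than corpus data, and padding unigrams with the boundary symbols is an assumption made only for uniformity.

```python
from collections import Counter


def letter_ngrams(word, n):
    """Return the letter n-grams of `word`, with "<" and ">" marking its edges."""
    padded = "<" + word + ">"
    # Slide a window of width n over the padded word.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_frequencies(words, n):
    """Return the relative frequency of each letter n-gram over `words`."""
    counts = Counter()
    for word in words:
        counts.update(letter_ngrams(word, n))
    total = sum(counts.values())
    # Assumption: frequencies are normalized by the total number of n-gram tokens.
    return {gram: count / total for gram, count in counts.items()}


# Toy input; a real run would iterate over the tokens of a corpus.
tokens = ["ab", "ab"]
bigram_freqs = ngram_frequencies(tokens, 2)
```

With this convention, a bigram such as `<a` counts only word-initial occurrences of the letter, which is what allows positional effects to be read off the frequency lists.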

Publications

If you are using this resource, please cite the following:

Contact

Mailing address: Shuly Wintner
Department of Computer Science
University of Haifa
31905 Haifa, Israel
Phone: +972-4-8288180
Fax: +972-4-8249331
E-mail: shuly@cs.haifa.ac.il

Computational Linguistics Group, http://cl.haifa.ac.il/
Department of Computer Science, University of Haifa
Maintained by shuly@cs.haifa.ac.il, modified Monday August 01, 2016.