Computational Linguistics Group
Department of Computer Science
University of Haifa

Non-native Language

Project description

To study the special characteristics of highly-advanced, fluent non-native speakers.
Gili Goldin (University of Haifa), Ella Rabinovich (University of Haifa and IBM Haifa Research Labs) and Shuly Wintner. Collaborating with Yulia Tsvetkov (LTI, CMU).


The task of Native Language Identification (NLI) aims at determining the native language (L1) of an author given only text in a foreign language (L2). NLI has gained much popularity recently, usually with an eye to educational applications. However, the NLI task is not limited to the language of learners; it is relevant also, perhaps even more so, in the (much more challenging) context of highly-fluent, advanced non-native speakers. While the English language dominates the internet, native English speakers are far outnumbered by speakers of English as a foreign language. Consequently, a vast amount of static and dynamic web content is continuously generated by non-native writers. Therefore, developing methodologies for identifying the native language of non-native English authors on social media outlets is an important and pertinent task.

As a first result, we present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native language. This effect is so powerful that we are able to reconstruct the phylogenetic language tree of the Indo-European language family solely from the frequencies of specific lexical items in the English of authors with various native languages.
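The idea of recovering a language tree from word frequencies can be illustrated with a small sketch. It is not the procedure used in the project: the frequency values are invented toy numbers, and the choice of Euclidean distance with average-linkage agglomerative clustering is an assumption made only for illustration. The names `FREQS`, `leaves`, `average_linkage` and `cluster` are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Toy, invented relative frequencies of three hypothetical English words,
# one vector per native language (L1). These are NOT values from the corpus.
FREQS = {
    "French":  [0.30, 0.10, 0.60],
    "Spanish": [0.32, 0.12, 0.56],
    "German":  [0.10, 0.50, 0.40],
    "Dutch":   [0.12, 0.48, 0.40],
}

def euclidean(u, v):
    """Euclidean distance between two frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def leaves(tree):
    """Return the L1 names under a (possibly nested) tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

def average_linkage(t1, t2, freqs):
    """Mean pairwise distance between the leaves of two clusters."""
    pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
    return sum(euclidean(freqs[a], freqs[b]) for a, b in pairs) / len(pairs)

def cluster(freqs):
    """Repeatedly merge the two closest clusters; return a nested tuple."""
    trees = sorted(freqs)  # start with one single-leaf cluster per L1
    while len(trees) > 1:
        a, b = min(combinations(trees, 2),
                   key=lambda p: average_linkage(p[0], p[1], freqs))
        trees = [t for t in trees if t not in (a, b)] + [(a, b)]
    return trees[0]

tree = cluster(FREQS)
print(tree)
```

With these toy numbers, the two Romance languages end up in one subtree and the two Germanic languages in the other, which mirrors, in miniature, the kind of grouping described above.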


If you are using the following resources, please cite Rabinovich et al. (2018) and/or Goldin et al. (2018).

The Reddit-L2 corpus (8GB)

Reddit-L2 corpus cleanup code

Reddit-L2 chunks as in Goldin et al. (2018). See the readme file.

Same dataset, where the data of all authors with the same L1 constitute one file. See the readme file.


If you are using these resources, please cite the following:


Ella Rabinovich, Shuly Wintner.
Computational Linguistics Group, http://cl.haifa.ac.il/
Department of Computer Science, University of Haifa
Maintained by shuly@cs.haifa.ac.il, modified Tuesday January 28, 2020.