Computational Linguistics Group
Department of Computer Science
University of Haifa
The task of Native Language Identification (NLI) aims at determining the native language (L1) of an author given only text in a foreign language (L2). NLI has gained much popularity recently, usually with an eye to educational applications. However, the NLI task is not limited to the language of learners; it is also relevant, perhaps even more so, in the (much more challenging) context of highly fluent, advanced non-native speakers. While the English language dominates the internet, native English speakers are far outnumbered by speakers of English as a foreign language. Consequently, a vast amount of static and dynamic web content is continuously generated by non-native writers. Developing methodologies for identifying the native language of non-native English authors on social media outlets is therefore an important and pertinent task.
As a first result, we present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of highly competent non-native English speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native language. This effect is so powerful that we are able to reconstruct the phylogenetic language tree of the Indo-European language family solely from the frequencies of specific lexical items in the English of authors with various native languages.
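To make the idea of recovering a tree from lexical frequencies more concrete, the following is a minimal sketch, not the procedure used in the papers. It assumes each native language is represented by a vector of relative frequencies for a few lexical items, and it groups languages by average-linkage agglomerative clustering over cosine distance; both choices are assumptions made for illustration. The labels L1_A to L1_D and all numbers are synthetic placeholders, not values from the Reddit-L2 corpus.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_linkage_tree(vectors):
    """Build a nested-tuple tree from a dict mapping label -> frequency vector.

    At each step the two clusters with the smallest average pairwise
    cosine distance between their members are merged.
    """
    # Each cluster is a pair: (subtree, list of member labels).
    clusters = [(label, [label]) for label in sorted(vectors)]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            total = sum(
                cosine_distance(vectors[a], vectors[b])
                for a in members_i
                for b in members_j
            )
            dist = total / (len(members_i) * len(members_j))
            if best is None or dist < best[0]:
                best = (dist, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [merged]
    return clusters[0][0]


# Synthetic placeholder data: relative frequencies of three hypothetical
# lexical items for four hypothetical native languages.
toy_frequencies = {
    "L1_A": [0.8, 0.1, 0.1],
    "L1_B": [0.7, 0.2, 0.1],
    "L1_C": [0.1, 0.8, 0.1],
    "L1_D": [0.3, 0.6, 0.1],
}

tree = average_linkage_tree(toy_frequencies)
print(tree)
```

In this toy setting, languages whose speakers show similar lexical preferences end up as siblings in the nested tuple. With real data, the vectors would be estimated from the English text of authors sharing a native language, and the resulting tree could then be compared against an established phylogenetic tree.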
If you are using the following resources, please cite Rabinovich et al. (2018) and/or Goldin et al. (2018).
The Reddit-L2 corpus (8GB)
Reddit-L2 corpus cleanup code
Reddit-L2 chunks as in Goldin et al. (2018). See the readme file.
Same dataset, where the data of all authors with the same L1 constitute one file. See the readme file.
firstname.lastname@example.org, modified Tuesday January 28, 2020.