Non-native Language

Project description

Objective: To study the special characteristics of highly advanced, fluent non-native speakers.
Researchers: Gili Goldin (University of Haifa), Ella Rabinovich (University of Haifa and IBM Haifa Research Labs) and Shuly Wintner. Collaborating with Yulia Tsvetkov (LTI, CMU).
Status: Ongoing
Funding: None

Abstract

The task of Native Language Identification (NLI) aims at determining the native language (L1) of an author given only text in a foreign language (L2). NLI has gained much popularity recently, usually with an eye to educational applications. However, the NLI task is not limited to the language of learners; it is also relevant, perhaps even more so, in the (much more challenging) context of highly fluent, advanced non-native speakers. While the English language dominates the internet, native English speakers are far outnumbered by speakers of English as a foreign language. Consequently, a vast amount of static and dynamic web content is continuously generated by non-native writers. Developing methodologies for identifying the native language of non-native English authors on social media outlets is therefore an important and pertinent task.

As a first result, we present a computational analysis of cognate effects on the spontaneous linguistic productions of advanced non-native speakers. Introducing a large corpus of English text written by highly competent non-native speakers, and using a set of carefully selected lexical items, we show that the lexical choices of non-natives are affected by cognates in their native language. This effect is so powerful that we are able to reconstruct the phylogenetic language tree of the Indo-European language family solely from the frequencies of specific lexical items in the English of authors with various native languages.
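To make the idea of building a language tree from lexical frequencies concrete, here is a minimal sketch. It is not the procedure used in the paper: it assumes each L1 group is represented by the relative frequencies of a few near-synonymous English words, and it applies generic average-linkage agglomerative clustering with Euclidean distance. The counts in TOY_COUNTS are invented for illustration and are not taken from the Reddit-L2 corpus; the helper names are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Invented toy counts (NOT corpus data): how often authors of each
# hypothetical L1 group chose each of two near-synonymous English words.
TOY_COUNTS = {
    "French":  {"liberty": 80, "freedom": 20},
    "Spanish": {"liberty": 75, "freedom": 25},
    "German":  {"liberty": 20, "freedom": 80},
    "Dutch":   {"liberty": 25, "freedom": 75},
}
ITEMS = ["liberty", "freedom"]


def relative_frequencies(counts, items):
    """Turn raw counts into a vector of proportions over the given items."""
    total = sum(counts[item] for item in items)
    return [counts[item] / total for item in items]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def leaves(tree):
    """List the language names under a tree node (a name or a pair of nodes)."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def average_linkage(a, b, vectors):
    """Mean pairwise distance between the leaves of two clusters."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(euclidean(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)


def cluster(vectors):
    """Repeatedly merge the two closest clusters; return a nested-tuple tree."""
    clusters = sorted(vectors)
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], vectors),
        )
        clusters = [c for c in clusters if c != a and c != b]
        clusters.append((a, b))
    return clusters[0]


vectors = {
    lang: relative_frequencies(counts, ITEMS)
    for lang, counts in TOY_COUNTS.items()
}
tree = cluster(vectors)
print(tree)
```

With these toy counts, the two groups that prefer the Romance cognate end up in one subtree and the two that prefer the Germanic cognate in the other. A real analysis would need many more lexical items and a principled choice of distance and linkage.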

Resources

If you are using the following resources, please cite Rabinovich et al. (2018) and/or Goldin et al. (2018).

The Reddit-L2 corpus (8GB)

Reddit-L2 corpus cleanup code

Reddit-L2 chunks as in Goldin et al. (2018). See the readme file.

Same dataset, where the data of all authors with the same L1 constitute one file. See the readme file.

Publications

If you are using these resources, please cite the following:

Gili Goldin. Native Language Identification with User Generated Content. MSc thesis, Department of Computer Science, University of Haifa. August 2019. PDF.
Ella Rabinovich, Yulia Tsvetkov and Shuly Wintner. Native Language Cognate Effects on Second Language Lexical Choice. Transactions of the Association for Computational Linguistics 6:329-342, 2018. PDF.
Gili Goldin, Ella Rabinovich and Shuly Wintner. Native Language Identification with User Generated Content. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), pages 3591-3601, Brussels, Belgium, November 2018. PDF.

Contact

Ella Rabinovich, Shuly Wintner.