Computational Linguistics Group
Department of Computer Science
University of Haifa
Among the many factors that shape a text, gender and other authorial traits play a major role in how we perceive its content. Many studies have shown that these traits can be identified by means of automatic classification methods. We investigate a related but different question: what happens to personality and demographic textual markers during the translation process? It is generally agreed that a good translation goes beyond conveying the original content by also preserving more subtle and implicit characteristics that stem from the author's personality, as well as era, geography, and various cultural and sociological aspects. In this work we explore whether translations preserve the stylistic characteristics of the author and, furthermore, whether the prominent signals of the source are retained in the target language.
As a first step, we focus on gender as a demographic trait. We evaluate the accuracy of automatic gender classification on original texts, on their manual translations, and on their automatic translations generated through statistical machine translation (SMT). We show that while gender has a strong signal in originals, this signal is obfuscated in both human and machine translation. Surprisingly, determining gender from manual translations is even harder than from SMT output; this may be an artifact of the translation process itself or of the human translators involved in it.
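The comparison described above (one accuracy figure per condition: originals, manual translations, SMT output) can be illustrated with a minimal sketch. This is not the classifier or feature set used in the study, which this page does not specify. The sketch assumes a simple bag-of-words Naive Bayes model, and the function names (`train_naive_bayes`, `accuracy_by_condition`) and the placeholder tokens in the usage example are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real study would use richer features.
    return text.lower().split()


def train_naive_bayes(samples):
    """Train multinomial Naive Bayes on (text, label) pairs; return a predict function."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        label_counts[label] += 1
        for tok in tokenize(text):
            token_counts[label][tok] += 1
            vocab.add(tok)
    total = sum(label_counts.values())

    def predict(text):
        best_label, best_score = None, -math.inf
        for label in sorted(label_counts):
            # Log prior plus add-one smoothed log likelihood of each known token.
            score = math.log(label_counts[label] / total)
            denom = sum(token_counts[label].values()) + len(vocab)
            for tok in tokenize(text):
                if tok in vocab:
                    score += math.log((token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return predict


def accuracy(predict, samples):
    """Fraction of (text, label) pairs whose label is predicted correctly."""
    correct = sum(1 for text, label in samples if predict(text) == label)
    return correct / len(samples)


def accuracy_by_condition(train_sets, test_sets):
    """Train and evaluate one classifier per condition (e.g. original, human, smt)."""
    return {
        cond: accuracy(train_naive_bayes(train_sets[cond]), test_sets[cond])
        for cond in train_sets
    }


if __name__ == "__main__":
    # Placeholder tokens only; these are not drawn from the Europarl data.
    train = {"original": [("alpha beta", "F"), ("gamma delta", "M")]}
    test = {"original": [("alpha", "F"), ("delta", "M")]}
    print(accuracy_by_condition(train, test))
```

In practice each condition would receive its own training and held-out test split, and the resulting accuracies would be compared across conditions to see how much of the gender signal survives translation.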
The Europarl bilingual English-French and English-German corpora, annotated with speakers' personal details: gender and age (380MB).
firstname.lastname@example.org, modified Wednesday December 21, 2016.