readme

The Text_chunks directory contains two directories: europe_data and non_europe_data. Inside each there is a directory for each country, inside each country directory there is a directory for each user, and each user directory contains chunks of 100 tokenized sentences written by that user. (A minimal loading sketch is given at the end of this readme.)

The other main directories have the same structure, but their chunks contain the relevant feature pre-processing:
- pos_chunks contain POS 3-grams
- char_ngrams_chunks contain char 3-grams
- spell_checker_chunks contain the original word and, if it was marked as a mistake, also the edit distance and the character insertions, deletions and replacements
- non_tokenized_chunks contain the detokenized text chunks
The order of the chunks is the same for each category.

Sampling:
- I used 59 users randomly selected from each language.
- For each user I used at most the median number of chunks, randomly selected. The median was calculated over all the chunks in the europe_data for the in-domain task and for the training data of the out-of-domain task; for the test data of the out-of-domain task it was calculated over all the chunks in the non_europe_data. The median was 3 chunks per user for the europe_data and 17 for the non_europe_data.

The countries used and their labels:

UK           English
US           English
NewZealand   English
Australia    English
Ireland      English
Austria      German
Germany      German
Albania      Albania
Bosnia       Bosnia
Bulgaria     Bulgaria
Croatia      Croatia
Czech        Czech
Denmark      Denmark
Estonia      Estonia
Finland      Finland
France       France
Greece       Greece
Hungary      Hungary
Iceland      Iceland
Italy        Italy
Latvia       Latvia
Lithuania    Lithuania
Netherlands  Netherlands
Norway       Norway
Poland       Poland
Portugal     Portugal
Romania      Romania
Russia       Russia
Serbia       Serbia
Slovakia     Slovakia
Slovenia     Slovenia
Spain        Spanish
Mexico       Spanish
Sweden       Sweden
Turkey       Turkey
Ukraine      Ukraine
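
Loading sketch: the code below shows one way to iterate the chunk files, assuming only the layout described above (Text_chunks/<europe_data|non_europe_data>/<country>/<user>/<chunk file>). The chunk file names and extensions are not specified in this readme, so the per-user file iteration and the UTF-8 encoding are assumptions.

from pathlib import Path

def iter_chunks(root="Text_chunks"):
    """Yield (region, country, user, chunk_path) for every chunk file under root."""
    for region_dir in Path(root).iterdir():              # europe_data / non_europe_data
        if not region_dir.is_dir():
            continue
        for country_dir in region_dir.iterdir():          # one directory per country
            if not country_dir.is_dir():
                continue
            for user_dir in country_dir.iterdir():        # one directory per user
                if not user_dir.is_dir():
                    continue
                for chunk_path in sorted(user_dir.iterdir()):  # chunk files (naming assumed)
                    if chunk_path.is_file():
                        yield region_dir.name, country_dir.name, user_dir.name, chunk_path

if __name__ == "__main__":
    for region, country, user, chunk in iter_chunks():
        text = chunk.read_text(encoding="utf-8")          # 100 tokenized sentences per chunk
        print(region, country, user, chunk.name, len(text.split()))
        break

The same loop applies unchanged to pos_chunks, char_ngrams_chunks, spell_checker_chunks and non_tokenized_chunks, since they share the directory structure and chunk order.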