Native Language Identification for Online Comments



These days, news is often sourced from social media as well as from traditional sources (e.g., print and broadcast media). Given the rising importance of social media as a news source, I wanted to create a system that could give readers more context about who is commenting on the news by identifying a poster's native language based solely on the content of their posts.

I labelled Reddit comments with their authors' native language and vectorized the text using a normalized character n-gram (1 to 4 characters) TF-IDF encoder. An XGBoost classifier fit to these features was able to pick out certain idiosyncrasies (in both grammar and content) of non-native speakers. The classifier worked quite well, achieving 51% accuracy on a per-comment basis and 90% accuracy when comments were aggregated per user.
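
The two steps on either side of the classifier can be sketched in plain Python. This is a minimal illustration under assumptions, not the project's actual code: the function names are hypothetical, the TF-IDF weighting uses the smoothed IDF that scikit-learn's TfidfVectorizer applies by default (the encoder library is not named here), and per-user aggregation is assumed to sum per-comment class probabilities, since the aggregation method is not stated. The XGBoost classifier itself is left out; `comment_probs` stands in for its per-comment output.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=4):
    """Return every character n-gram of length n_min..n_max found in text."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(comments, n_min=1, n_max=4):
    """Build one sparse, L2-normalized TF-IDF vector (a dict) per comment."""
    counts = [Counter(char_ngrams(c, n_min, n_max)) for c in comments]
    # Document frequency: in how many comments does each n-gram appear?
    doc_freq = Counter(gram for c in counts for gram in c)
    n_docs = len(comments)
    vectors = []
    for c in counts:
        # Assumed weighting: smoothed IDF, ln((1 + N) / (1 + df)) + 1
        vec = {gram: tf * (math.log((1 + n_docs) / (1 + doc_freq[gram])) + 1)
               for gram, tf in c.items()}
        # Guard against empty comments, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def aggregate_by_user(user_ids, comment_probs):
    """Sum per-comment class probabilities per user and pick the top class.

    Assumption: summing (equivalently, averaging) is the aggregation rule.
    comment_probs is a list of {label: probability} dicts, one per comment.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for uid, probs in zip(user_ids, comment_probs):
        for label, p in probs.items():
            totals[uid][label] += p
    return {uid: max(t, key=t.get) for uid, t in totals.items()}
```

Pooling evidence over many comments is one plausible reason the per-user accuracy is so much higher than the per-comment accuracy: a single short comment may contain few telltale n-grams, while a user's whole history contains many.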

Future work includes using more labelled data to separate out content from grammar, and adding more native-language classes.

Blog post to come.