Native Language Identification for Online Comments


Summary

These days, news is often sourced from social media as well as from traditional sources (e.g., print and broadcast media). Given the rising importance of social media as a news source, I wanted to create a system that could give users more context about who was opining on the news by identifying the native language of a user based solely on the content of their posts.

I labelled Reddit comments with the native language of their authors and vectorized the text using a normalized character-level TF-IDF encoder over n-grams of length 1 to 4. An XGBoost classifier fit to this data was able to pick out certain idiosyncrasies (in both grammar and content) of non-native speakers. The classifier achieved 51% accuracy on individual comments and 90% accuracy when a user's comments were aggregated.
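To make the two steps above concrete, here is a minimal, standard-library-only sketch of character (1,4) TF-IDF vectorization and per-user aggregation. It is not the project's actual code: the function names, the smoothed IDF formula, and the choice to aggregate by averaging per-comment class probabilities are all assumptions made for illustration. In practice a library encoder such as scikit-learn's `TfidfVectorizer(analyzer="char", ngram_range=(1, 4))` would likely replace the hand-written vectorizer.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=4):
    """Return every character n-gram of length n_min..n_max found in text."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def tfidf_vectors(comments):
    """Map each comment to an L2-normalized {ngram: weight} dictionary."""
    counts = [Counter(char_ngrams(c)) for c in comments]
    # Document frequency: the number of comments that contain each n-gram.
    doc_freq = Counter(g for c in counts for g in c)
    n_docs = len(comments)
    vectors = []
    for c in counts:
        # Assumed smoothed IDF variant: ln((1 + N) / (1 + df)) + 1
        vec = {
            g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1)
            for g, tf in c.items()
        }
        # L2 normalization so that comment length does not dominate.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def aggregate_by_user(user_ids, comment_probs):
    """Average per-comment class probabilities per user and pick the top class.

    Averaging is an assumed aggregation rule, not one stated in the summary.
    """
    sums = defaultdict(lambda: defaultdict(float))
    n_comments = Counter(user_ids)
    for user, probs in zip(user_ids, comment_probs):
        for lang, p in probs.items():
            sums[user][lang] += p
    return {
        user: max(totals, key=lambda lang: totals[lang] / n_comments[user])
        for user, totals in sums.items()
    }
```

Pooling evidence across a user's comments is one plausible reason the per-user accuracy is so much higher than the per-comment accuracy: a single short comment carries few cues, while many comments together let individual errors cancel out.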

Future work includes using more labelled data to separate the effect of content from that of grammar, and adding more native-language classes.

Blog post to come.