Keywords: nlp, linguistics, variation, language, text, speech
All language data, whether text, speech or sign, reflects the social identity of the user and the environment they were in when they produced that language. This systematic social variation in language has been studied in linguistics for decades, but is increasingly important as we build and deploy tools that rely on automatic analysis. Failure to account for sociolinguistic variation can reduce overall system performance or, more worryingly, result in systems that are systematically biased against certain classes of users.
The new field of computational sociolinguistics both extends traditional sociolinguistic research using computational methods and provides methods for evaluating how NLP tools handle sociolinguistic variation. The latter is especially urgent as NLP pipelines are incorporated into high-stakes decision-making, such as hiring and law enforcement.
This talk will briefly survey current work in computational sociolinguistics and introduce the field's basic concepts. I will 1) discuss which social factors affect language production, 2) discuss which aspects of language can vary and 3) provide a practical how-to on evaluating the robustness of NLP systems to sociolinguistic variation.
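
As a rough illustration of what evaluating robustness to sociolinguistic variation can look like in practice, one common starting point is to break a system's accuracy down by speaker group rather than reporting a single overall score. The sketch below is a minimal example of that idea and is not necessarily the procedure presented in the talk. The group names, labels and toy records are hypothetical placeholders for a real annotated evaluation set.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy separately for each group.

    records: iterable of (group, gold_label, predicted_label) tuples.
    The group field is an assumed annotation, e.g. a language variety.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, gold, pred in records:
        total[group] += 1
        correct[group] += int(gold == pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group):
    """Difference between the best-served and worst-served group."""
    values = list(per_group.values())
    return max(values) - min(values)


# Hypothetical toy data: (group, gold label, system prediction).
records = [
    ("variety_a", "pos", "pos"),
    ("variety_a", "neg", "neg"),
    ("variety_a", "pos", "pos"),
    ("variety_a", "neg", "pos"),
    ("variety_b", "pos", "neg"),
    ("variety_b", "neg", "neg"),
    ("variety_b", "pos", "neg"),
    ("variety_b", "neg", "pos"),
]

per_group = accuracy_by_group(records)
gap = max_accuracy_gap(per_group)

for group, acc in sorted(per_group.items()):
    print(f"{group}: accuracy = {acc:.2f}")
print(f"largest gap between groups = {gap:.2f}")
```

A large gap between groups suggests the system may be systematically less reliable for some users, even when the aggregate accuracy looks acceptable. With real data, each group needs enough examples for the per-group estimates to be meaningful.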