Keywords: Natural Language Processing, linguistic diversity, language technology
Wherever human activity involving language is digitized, whether as audio or video recordings or as text, there is a potential application for natural language processing (NLP). The field of NLP has developed a wide variety of techniques for extracting both linguistic structure and information, including world knowledge and speaker intent, from such digitized natural language data. This work focuses disproportionately on English, but the potential for applications is similar across languages.
Problematically, work in NLP frequently doesn't identify the language of study, with researchers writing as if they were working on language in general. But languages differ in their structure in ways that affect the performance of NLP systems, even those that appear to have been created without any specific linguistic knowledge. By ignoring that structure, the field risks creating solutions that are insufficiently general and thus failing to develop software that works for all languages. This failure widens the digital divide, disadvantages speakers of less powerful languages, and increases the likelihood of language loss, as the lack of NLP technology contributes to the lack of prestige for such languages.
In this talk I will illustrate how linguistic structure influences the functioning of NLP systems and highlight work that goes against the trend, creating NLP resources for diverse languages and promoting the development of crosslinguistically valid NLP.