Keywords: natural language processing, Spark, attention, recurrent networks
This presentation provides an intuitive approach to learning distributed representations of phrases from natural language datasets. Distributed representations are a natural way to encode relationships between words and phrases: they map discrete linguistic units to continuous vectors and frequently capture useful semantics of the underlying corpus, making them ubiquitous in NLP tasks. Building upon the tidy data principles formalized in Wickham (2014), Silge and Robinson (2016) have provided the foundations for modeling natural language with the tidytext package. In this talk, we describe how to build scalable pipelines within this framework to prototype text mining and neural embedding models in R, and then deploy them on Spark clusters using the sparklyr and Microsoft ML Server packages.
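A minimal sketch of the kind of pipeline described above, assuming a local Spark instance and a toy corpus (the column names and hyperparameters are illustrative, not the talk's exact pipeline): tokenize with tidytext for local prototyping, then push the data to Spark via sparklyr and fit word embeddings with Spark ML's word2vec.

```r
library(dplyr)
library(tidytext)
library(sparklyr)

# Toy corpus standing in for a real natural language dataset
docs <- tibble(id = 1:2,
               text = c("distributed representations encode semantics",
                        "tidy data principles simplify text mining"))

# Tidy tokenization for local prototyping: one row per (document, word)
tokens <- docs %>% unnest_tokens(word, text)

# Scale out: copy the data to a Spark cluster and fit embeddings there
sc <- spark_connect(master = "local")
docs_tbl <- copy_to(sc, docs, overwrite = TRUE)
embedded <- docs_tbl %>%
  ft_tokenizer(input_col = "text", output_col = "words") %>%
  ft_word2vec(input_col = "words", output_col = "embedding",
              vector_size = 50, min_count = 1)
spark_disconnect(sc)
```

The same tidy workflow prototyped locally on `tokens` carries over to the cluster, with sparklyr translating the feature-transformer steps into Spark ML pipeline stages.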
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” Journal of Open Source Software 1 (3). doi:10.21105/joss.00037.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. doi:10.18637/jss.v059.i10.