NEWSLETTER depends-on-the-definition

Hey there, you just received the monthly depends-on-the-definition newsletter for February 2020.

Post of the month: Data validation for NLP machine learning applications

An important part of running machine learning applications is making sure that data quality does not degrade while a model is in production. Sometimes downstream data processing changes, and machine learning models are very prone to failing silently when that happens. That makes data validation a crucial step in every production machine learning pipeline.

This is relatively easy for well-specified tabular data. For NLP, it’s much harder to write down assumptions about the data and enforce them. I’ll show you some approaches to validate text data in machine learning use cases.
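
To give a feel for what "writing down assumptions" about text can mean, here is a minimal sketch. It is not taken from the post: the function name `validate_texts`, the toy vocabulary, the whitespace tokenization and the thresholds are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Set


def validate_texts(
    texts: List[str],
    vocab: Set[str],
    max_tokens: int = 100,        # hypothetical upper bound on text length
    max_oov_rate: float = 0.5,    # hypothetical limit on out-of-vocabulary share
) -> Dict[str, List[int]]:
    """Return the indices of texts that violate a few simple assumptions."""
    problems: Dict[str, List[int]] = {"empty": [], "too_long": [], "high_oov": []}
    for i, text in enumerate(texts):
        tokens = text.lower().split()  # naive whitespace tokenization
        if not tokens:
            # Empty or whitespace-only input
            problems["empty"].append(i)
            continue
        if len(tokens) > max_tokens:
            problems["too_long"].append(i)
        # Share of tokens the model has never seen during training
        oov = sum(1 for t in tokens if t not in vocab)
        if oov / len(tokens) > max_oov_rate:
            problems["high_oov"].append(i)
    return problems


# Toy vocabulary standing in for the training vocabulary (an assumption)
vocab = {"the", "movie", "was", "great", "bad"}
report = validate_texts(["The movie was great", "   ", "asdf qwer zxcv"], vocab)
print(report)
```

Checks like these are cheap to run on every incoming batch, and a sudden jump in any of the counts is a hint that something in the data processing has changed.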

Paper pick

SensEmBERT is a knowledge-based approach that brings together the expressive power of language modelling and the vast amount of knowledge contained in a semantic network to produce high-quality latent semantic representations of word meanings in multiple languages.

Tips & Tricks

Label Studio: an open-source data labeling, annotation and exploration tool

https://labelstud.io/

Recommended reading

In the article "Artificial intelligence: Does another huge language model prove anything?", author Ben Dickson discusses the latest large language models, such as Google's chatbot Meena. He looks at what the point of ever larger models is and whether this is really progress. I enjoyed the article a lot.
