Hey there, you just received the monthly depends-on-the-definition newsletter for January. I trust you all had a good start to the New Year and hope it continues to unfold successfully.
This is the second post in my series about understanding text datasets. If you read my blog regularly, you have probably noticed quite a few posts about named entity recognition.
In those posts, we focused on finding named entities and explored different techniques to do so. This time we use the named entities to learn something about our dataset.
dirty_cat helps with machine learning on non-curated categorical data. It provides encoders that are robust to morphological variants, such as typos, in the category strings.
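To sketch the idea behind such encoders (this is an illustration of the underlying principle, not dirty_cat's actual implementation), one can represent each category string by its similarity to a set of reference categories, where similarity is computed over shared character n-grams. A typo like "Lodnon" then still ends up close to "London":

```python
def ngrams(s, n=3):
    """Set of character n-grams of a string, padded at the edges."""
    s = f"  {s.lower()}  "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def similarity_encode(values, vocabulary):
    """Encode each string as a vector of similarities to a reference vocabulary."""
    return [[ngram_similarity(v, ref) for ref in vocabulary] for v in values]

# 'Lodnon' gets a higher similarity to 'London' than to 'Paris',
# so the typo still lands near the correct category in feature space.
encoded = similarity_encode(["Lodnon"], ["London", "Paris"])
```

This robustness to morphological variants is exactly what makes such encoders useful on non-curated data, where the same category may appear under several slightly different spellings.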
Recommended reading
The Transformer architecture has gained increasing popularity in natural language processing. For example, it underlies Google's BERT model. At the core of this architecture is the multi-head self-attention mechanism. In this post, Keita Kurita provides a nice explanation of the Transformer: http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
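As a rough illustration of the core operation (a single head, not the full multi-head version), scaled dot-product attention computes softmax(QK^T / sqrt(d)) V; in self-attention, the queries, keys, and values all come from the same sequence. A minimal sketch in plain Python:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = len(K[0])  # key dimension, used for the sqrt(d) scaling
    out = []
    for q in Q:
        # similarity of this query to every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Self-attention: queries, keys, and values are the same sequence.
seq = [[1.0, 0.0], [0.0, 1.0]]
out = attention(seq, seq, seq)
```

Multi-head attention simply runs several such attention operations in parallel on learned projections of Q, K, and V, then concatenates the results; Kurita's post linked above walks through the details.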