[BDIL] Newsletter #19: Poisoning Classifiers, Optimizing HBase, the School of Data and more...

Remember to send any interesting links my way. You'll get credit and help us diversify the sources and areas of interest. I'm aiming for a crowd-sourced newsletter.

Nuvola, a Cloud Host for the OrientDB Graph DB SNA

Slides (26) http://www.slideshare.net/lvca/orientdb-document-or-graph-select-the-right-model

Nuvola (http://www.nuvolabase.com/site/index.html) is a hosted graph database in the cloud, so you can set it up quickly. The underlying DB is OrientDB, an open source project (http://www.orientdb.org/index.htm). The slides walk you through the data model. Several graph databases are in circulation now, and I haven't heard of a clear winner yet.

Fast Dynamic Time Warping Statistics

Paper (8 pages) http://www.cs.ucr.edu/~eamonn/SIGKDD_trillion.pdf

Thanks to Elias Ladopoulos for the link!

Dynamic Time Warping is a state-of-the-art similarity measure for time series data, underlying many time series data mining algorithms. Unfortunately, it is rather slow on large-scale data. This paper suggests four practical methods that, when combined, improve on the best reported times for analyzing huge time series datasets by several orders of magnitude. Probably important for anyone undertaking such data mining endeavors at scale.
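
For readers who haven't met DTW before, here is a minimal sketch of the textbook dynamic-programming version in Python. It is only the slow baseline that the paper sets out to speed up, not any of the paper's four methods, and the function name dtw_distance is my own.

```python
import math

def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of: advance in a only, advance in b only, advance in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

# The second series is the first one with a repeated sample, so warping aligns them exactly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The nested loop is the whole story of why plain DTW is slow: every pair of positions is visited, which gets expensive quickly on long series.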

Optimizing HBase Big Data

Blog: http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/

Blog: http://www.cloudera.com/blog/2012/07/hbase-log-splitting/

Blog: http://gbif.blogspot.com/2012/07/optimizing-writes-in-hbase.html

HBase seems to have received a bout of attention recently. These 3 posts describe several components of HBase and focus on squeezing out more performance from your cluster by configuring them.

The first blog post is about the HBase Memstore: the write cache HBase maintains so that added and updated rows are buffered and written out in sorted order to the underlying HFile, which is key to HBase's fast random-access retrieval.

The second is about preventing data loss after server crashes by means of write-ahead log splitting. It is mostly relevant to HBase 0.90 but also discusses the improvements in 0.92.x.

The final post also involves the Memstore. It walks through a case study of debugging a problem with processing Hive queries, using Ganglia to understand the root cause, and then shows how to configure HBase to handle it.
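
To make the Memstore idea from the first post concrete, here is a toy sketch in Python of a write cache that keeps rows sorted and flushes them as an immutable sorted batch. This is not HBase code or its API: the class ToyMemstore and its flush_threshold parameter are hypothetical, and real HBase decides when to flush based on memory size rather than row count.

```python
import bisect

class ToyMemstore:
    """Toy write cache: buffers rows in sorted key order, then flushes a sorted batch."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical: flush after this many distinct rows
        self.keys = []            # row keys, kept sorted
        self.values = {}          # row key -> latest value
        self.flushed_files = []   # each entry stands in for one immutable sorted file

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)  # insert while preserving sort order
        self.values[key] = value           # an update overwrites the buffered value
        if len(self.keys) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.keys:
            return
        self.flushed_files.append([(k, self.values[k]) for k in self.keys])
        self.keys = []
        self.values = {}

store = ToyMemstore(flush_threshold=3)
for key, value in [("row3", "c"), ("row1", "a"), ("row1", "a2"), ("row2", "b")]:
    store.put(key, value)
print(store.flushed_files)
```

Writes arrive in arbitrary order, but each flushed batch comes out sorted by row key, which is the property the post ties to fast lookups.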

Poisoning Attacks against Machine Learning ML

Blog: http://www.i-programmer.info/news/105-artificial-intelligence/4526-poison-attacks-against-machine-learning.html

Paper (7 pp.): http://arxiv.org/abs/1206.6389v1

Feeding an online learner crafted data so that it trains incorrectly and gradually drifts away from the truth it had learned is something a lot of people in ML anticipated conceptually. The paper goes into great detail about how to implement this optimally for SVMs, using the classic MNIST handwritten digit recognition dataset. It's interesting in that it reverse engineers the learning algorithm to find out how to maximize the induced error.
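
The general idea can be illustrated with a much simpler learner than an SVM. The sketch below is my own toy example in Python, not the paper's attack: a hypothetical one-dimensional online centroid classifier whose decision boundary is dragged by mislabeled points. It assumes class 0 sits below class 1 on the number line.

```python
class OnlineCentroidClassifier:
    """Toy 1-D learner: keeps a running mean per class, splits at the midpoint."""

    def __init__(self):
        self.sums = {0: 0.0, 1: 0.0}
        self.counts = {0: 0, 1: 0}

    def update(self, x, label):
        self.sums[label] += x
        self.counts[label] += 1

    def boundary(self):
        c0 = self.sums[0] / self.counts[0]
        c1 = self.sums[1] / self.counts[1]
        return (c0 + c1) / 2

    def predict(self, x):
        # Assumes the class 0 centroid lies below the class 1 centroid
        return int(x > self.boundary())

clf = OnlineCentroidClassifier()
for x in [0, 1, 2]:
    clf.update(x, 0)   # clean class 0 data, mean 1
for x in [8, 9, 10]:
    clf.update(x, 1)   # clean class 1 data, mean 9
before = clf.predict(6)

# Poison: points that sit near class 1 but carry the class 0 label
for _ in range(6):
    clf.update(7, 0)
after = clf.predict(6)
print(before, after)
```

Six poisoned points pull the class 0 mean from 1 to 5, which moves the boundary from 5 to 7 and flips the prediction for x = 6. The paper's contribution is choosing such points optimally for an SVM, which this sketch does not attempt.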

Tuning Extraneous Parameters for ML Algorithms ML

Blog: http://www.johnmyleswhite.com/notebook/2012/07/21/automatic-hyperparameter-tuning-methods/

Many ML algorithms, such as SVMs and neural networks, come with one or more hyperparameters that have to be tuned for a specific data problem to ensure optimal (read: acceptable) performance. John discusses the current thinking about how to tune these parameters systematically using grid search, random search or a generalized regression approach.
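
Here is a minimal sketch in Python of the first two strategies, grid search and random search. The function validation_error is a made-up stand-in for "train a model and measure its error", and the parameter names c and gamma are just illustrative; none of this comes from John's post.

```python
import itertools
import random

def validation_error(c, gamma):
    # Hypothetical stand-in for training and validating a model;
    # by construction it is minimized at c=1.0, gamma=0.1.
    return (c - 1.0) ** 2 + (gamma - 0.1) ** 2

def grid_search(c_values, gamma_values):
    """Try every combination on a fixed grid and keep the best one."""
    return min(itertools.product(c_values, gamma_values),
               key=lambda p: validation_error(*p))

def random_search(n_trials, c_range, gamma_range, seed=0):
    """Sample n_trials points uniformly from the ranges and keep the best one."""
    rng = random.Random(seed)
    candidates = [(rng.uniform(*c_range), rng.uniform(*gamma_range))
                  for _ in range(n_trials)]
    return min(candidates, key=lambda p: validation_error(*p))

print(grid_search([0.1, 1.0, 10.0], [0.01, 0.1, 1.0]))
print(random_search(20, (0.1, 10.0), (0.01, 1.0)))
```

Grid search cost grows multiplicatively with each added hyperparameter, while random search lets you fix the number of trials up front.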

Association Rule Mining with R ML

Blog: http://rdatamining.wordpress.com/2012/07/13/examples-and-resources-on-association-rule-mining-with-r/

This blog post is roughly what the R community calls a "task view" for association rule mining. It lists several resources, tutorials and R packages for building classifiers and visualizing the output of this highly interpretable classic technique.
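
If association rules are new to you, the two standard measures behind them, support and confidence, fit in a few lines. The sketch below is in Python rather than R and uses a made-up shopping basket dataset; it is not taken from the linked post or from any R package.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Made-up baskets for illustration
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

A rule like "bread implies milk" is reported with exactly these two numbers, which is a large part of why the technique is so easy to interpret.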

Combining Pig with ElasticSearch Big Data

Blog: http://hortonworks.com/blog/search-data-at-scale-in-five-minutes-with-pig-wonderdog-and-elasticsearch/

ElasticSearch is a great Lucene-based, RESTful, JSON-speaking, modern document indexer. Using WonderDog, it can become a Pig storage engine, so that you can run indexed queries from Pig instead of filtering through all rows to find the few that match an expression. This tutorial quickly walks you through the easy set-up procedure. I got hooked on ElasticSearch after a railscast on it. It's certainly an interesting project, even though some heavy users I spoke to say operations become a headache after a while.

School of Data Visualization

Site: http://www.schoolofdata.org

More Site: http://handbook.schoolofdata.org/en/latest/index.html

School of Data is an ambitious project which aims to serve as a hub for learning about data science. Having only just launched, it contains very little so far save grandiose claims. However, it has the potential to become a major resource, and it could definitely use your help, so contribute something and enjoy having your work become the basis for the education of future data scientists. The handbook is the only content right now, and it is mostly empty, but I mention it here because it contains a pretty comprehensive list of libraries for creating visualizations in various programming languages.

Adios,
Nimrod