[BDIL] Newsletter #22: Worst 8 Predictive Modeling Techniques, Processing Graphs with Akka, and more!

London was fabulous. You can tell Data Science is picking up: the 100% Design festival in London featured a massive visualization installation at the top of the Victoria & Albert Museum cupola, called PRISM (http://www.treehugger.com/urban-design/london-design-festivals-spectacular-prism-eco-view-london-using-open-data.html). In it, 7 projectors, image-mapped onto some 30 facets of a 12-foot-tall crystal-like structure, displayed visualizations built by various programmers from live data feeds all around London, such as the power use of 10 Downing St. or the number of bikes currently rented out of the public bicycle stations. It was truly impressive.

Weka as a Service ML

Site: http://www.sceowebapp.com/overview.php

Sceo is Weka-on-the-cloud, which I guess is nice if you run some heavy computation with Weka and want to keep your computer free for other stuff.

Optimizing Multiple Concurrent HBase Scans Big Data

Blog: http://dev.datasift.com/blog/optimizing-hadoop-jobs

Nice writeup about how DataSift optimized their 2 TB/day HBase cluster using sharding and smart job scheduling, so that several queries over the same key range do not each re-scan it.
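
To make the "don't re-scan the same area" idea concrete, here is a minimal shared-scan sketch in plain Python. It is not DataSift's code and does not use the HBase API; the function name shared_scan, the sorted in-memory rows, and the half-open [start, stop) key ranges are all assumptions made for illustration. The rows are read once, and each row is routed to every query whose range contains it.

```python
from collections import defaultdict

def shared_scan(rows, queries):
    """Scan sorted (key, value) rows once and route each row to every
    query whose half-open [start, stop) key range contains the key.

    rows and queries are hypothetical in-memory stand-ins for a table
    and a batch of pending scan requests.
    """
    results = defaultdict(list)
    if not queries:
        return results
    # Only the union of all requested ranges needs to be read.
    lo = min(start for start, _ in queries.values())
    hi = max(stop for _, stop in queries.values())
    for key, value in rows:
        if key < lo:
            continue
        if key >= hi:
            break  # rows are sorted, so nothing further can match
        for name, (start, stop) in queries.items():
            if start <= key < stop:
                results[name].append(value)
    return results

rows = [("a1", 1), ("a2", 2), ("b1", 3), ("b2", 4), ("c1", 5)]
queries = {"q1": ("a2", "b2"), "q2": ("b1", "c1")}
print(dict(shared_scan(rows, queries)))
```

The overlapping row "b1" is read once and delivered to both queries, which is the saving a scheduler gets by batching scans over the same region.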

Worst Predictive Modeling Techniques ML

Blog: http://www.analyticbridge.com/profiles/blogs/the-8-worst-predictive-modeling-techniques

Very nice short summary of the disadvantages or limiting assumptions of many common ML techniques, and the more advanced variations that provide workarounds to some of these problems.

The Statistics Cheat Sheet Statistics

Book (27 pp.): http://matthias.vallentin.net/probability-and-statistics-cookbook/

A concise cheat sheet covering many aspects of statistics and probability from the very basic definitions of probability spaces to properties of estimators, confidence intervals, time series methods and a really cool last page with the relations of all named distributions to each other.

Crunch - a Java MapReduce Framework Big Data

Site: https://github.com/cloudera/crunch

Like mrjob, cascading, and other such equivalents in other languages, Crunch is a Java framework for expressing MR tasks more concisely.

No Learning without Representation ML Recommended!

Paper (9 pp): http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

A nice paper providing a sweeping introduction to the generic conceptual framework of machine learning, presented mathematically from the learning theory viewpoint. Primarily concentrating on classifiers, it discusses bias and variance, the curse of dimensionality, ensemble models, etc.

How to Disambiguate Homographs? ML

Quora Discussion: http://www.quora.com/Google/How-does-Google-match-two-terms-that-are-similarly-spelled

An interesting overview of a Bayesian system for spelling correction using term co-occurrence. This is a sort of summary of the paper by Google on this subject.
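
As a rough illustration of the Bayesian idea, here is a toy sketch that scores each candidate correction by log P(candidate) plus the sum of log P(context word | candidate), with add-one smoothing. This is not Google's system: the counts, the vocabulary size, and the function names score and correct are all made up for the example, and generating the candidate list (e.g. by edit distance) is left out.

```python
import math

# Made-up toy counts, for illustration only.
UNIGRAM = {"python": 500, "pylon": 50}
COOCCUR = {  # COOCCUR[candidate][context_word] = times seen together
    "python": {"code": 300, "snake": 100},
    "pylon": {"bridge": 40, "power": 30},
}
VOCAB_SIZE = 1000  # assumed vocabulary size used for add-one smoothing

def score(candidate, context):
    """Return log P(candidate) + sum of log P(context word | candidate)."""
    total = sum(UNIGRAM.values())
    log_p = math.log(UNIGRAM[candidate] / total)
    seen = COOCCUR.get(candidate, {})
    denom = sum(seen.values()) + VOCAB_SIZE
    for word in context:
        # Add-one smoothing keeps unseen context words from zeroing the score.
        log_p += math.log((seen.get(word, 0) + 1) / denom)
    return log_p

def correct(candidates, context):
    """Pick the candidate with the highest posterior score given the context."""
    return max(candidates, key=lambda c: score(c, context))

print(correct(["python", "pylon"], ["code"]))
print(correct(["python", "pylon"], ["bridge", "power"]))
```

With these toy counts the rarer word wins only when the surrounding terms support it, which is the point of using co-occurrence rather than raw frequency alone.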

Distributed In-Memory Graph Processing with Akka Big Data SNA

Blog: http://letitcrash.com/post/30257014291/distributed-in-memory-graph-processing-with-akka

Manning just released their 'Akka in Action' book for preview (http://www.manning.com/roestenburg/). Akka is a Scala framework and paradigm for distributed computation based on efficient message passing between lightweight 'Actor' objects. The blog gives an example of using Akka to calculate the degree of every node in a graph.
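
For a feel of the actor approach to node degrees, here is a single-process Python imitation. It is not Akka and not the blog's code: NodeActor, its mailbox, and compute_degrees are hypothetical names, and the mailboxes are drained sequentially where a real actor system would process them concurrently and possibly across machines. Each edge sends one message to each of its endpoints, and every actor simply counts the messages it receives.

```python
from collections import deque

class NodeActor:
    """Stand-in for an actor: private state plus a mailbox of messages."""

    def __init__(self, name):
        self.name = name
        self.degree = 0
        self.mailbox = deque()

    def receive(self, message):
        # The only message understood here is "edge": one incident edge.
        if message == "edge":
            self.degree += 1

def compute_degrees(edges):
    """Compute the degree of every node in an undirected edge list."""
    actors = {}
    # Each undirected edge sends one "edge" message to both endpoints.
    for u, v in edges:
        for node in (u, v):
            if node not in actors:
                actors[node] = NodeActor(node)
            actors[node].mailbox.append("edge")
    # Drain the mailboxes; a real actor system would do this concurrently.
    for actor in actors.values():
        while actor.mailbox:
            actor.receive(actor.mailbox.popleft())
    return {name: actor.degree for name, actor in actors.items()}

print(compute_degrees([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]))
```

Because no actor ever touches another actor's state, the counting step needs no locks, which is what makes the pattern easy to distribute.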

Benchmarking ML algorithms ML Oldie

Site: http://mlcomp.org

Blog: http://blog.explainmydata.com/2012/06/ntrain-24853-ntest-25147-ncorrupt.html

MLComp is not getting enough attention, I think, maybe because everyone's comparing algorithms on Kaggle nowadays. Still, I think it has gathered a pretty decent database for evaluating algorithms on standard sets. The blog, somewhat unrelated, summarizes how various algorithms perform in one very artificial scenario (classifying among 10 normally distributed classes in 500 dimensions), including their training speed.
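
A small standard-library sketch of this kind of synthetic benchmark follows. Only the 10 classes and 500 dimensions come from the description above; everything else is an assumption, including the unit-variance noise, how the class means are drawn, the sample sizes, and the nearest-centroid classifier, which merely stands in for whatever algorithms the blog compares.

```python
import random
import time

def make_data(means, n_per_class, rng):
    """Draw n_per_class points around each class mean with unit-variance noise."""
    return [([rng.gauss(m, 1.0) for m in mean], label)
            for label, mean in enumerate(means)
            for _ in range(n_per_class)]

def train_centroids(data, n_classes, dim):
    """Fit a nearest-centroid model: the per-class mean of the training points."""
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for x, label in data:
        counts[label] += 1
        row = sums[label]
        for i, xi in enumerate(x):
            row[i] += xi
    return [[s / counts[k] for s in sums[k]] for k in range(n_classes)]

def predict(centroids, x):
    """Return the index of the centroid closest to x (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))

rng = random.Random(0)  # fixed seed so the run is repeatable
N_CLASSES, DIM = 10, 500
# Assumed setup: class means drawn from a standard normal in each dimension.
means = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_CLASSES)]
train = make_data(means, 20, rng)
test = make_data(means, 10, rng)

start = time.perf_counter()
centroids = train_centroids(train, N_CLASSES, DIM)
train_seconds = time.perf_counter() - start

accuracy = sum(predict(centroids, x) == y for x, y in test) / len(test)
print(f"train time: {train_seconds:.4f}s, accuracy: {accuracy:.2f}")
```

Swapping train_centroids and predict for another learner while keeping the timing wrapper gives the same accuracy-versus-training-time comparison on identical data.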

The Next Major Google Architecture? Big Data

Blog: http://highscalability.com/blog/2012/9/24/google-spanners-most-surprising-revelation-nosql-is-out-and.html

Paper: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/spanner-osdi2012.pdf

Spanner is a consistent, SQL-like distributed store designed by Google's famous scalability engineers. The blog muses about whether this paper might trigger the same interest and wide adoption by the open source community that the MapReduce and BigTable papers did; however, this is impeded by the fact that Spanner relies on unique clock hardware.

Regards,
Nimrod