[BDIL] Newsletter #4: What's in your wine, 1-2-3 ain't as easy as A-B-C, and more

Welcome to the 2nd Big Data / Machine Learning / NLP Israeli Linkshare. Thanks to everybody who contributed links! If your link isn't here, I just haven't gotten to it yet, and I'll feature it next time. Keep 'em coming!

Embedding R in a DB for analytics Big Data

Slides: http://bunsen.credativ.com/~jco/2011/plr-PostgresOpen-2011.pdf

Blog: http://www.r-bloggers.com/the-race-for-speed-at-the-data-layer/?

We see both the community and the big vendors trying to fit existing products to the influx of interest in big data. The blog gives a short summary of some recent advances in that direction. PL/R (slides) is an example I saw a while back: it gives PostgreSQL embedded R capability via PL/SQL-like stored procedures. I personally think there are better architectures for the long run, but it has its advantages - not the least of which is the ability to retrofit existing setups with fast analytics.

It ain't as easy as 1-2-3 Big Data Clojure Java

Blog: http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html

Blog: http://bit.ly/IzFnPJ (originally http://amatsukawa.posterous.com/heavy-hitter-detection, but that page no longer opens for me, so this links to the Google cache version)

Blog: http://blog.getprismatic.com/blog/2012/4/9/how-prismatic-deals-with-data-storage-and-aggregation.html

Paper: http://www.vldb.org/conf/2002/S10P03.pdf

These three blogs discuss essentially the same thing from slightly different perspectives: it's surprisingly hard to count unique occurrences in a stream if you have, say, a billion objects to count. An exact naive count could take as much as 45 GB of RAM, while probabilistic estimators can give you the count to within +/- 3% with as little as 512 bytes. The first has code in Java, the second takes a more mathematical/algorithmic standpoint, and the third gives a general overview of an architecture, with code in Clojure, to enable map-reduce on such problems. The Editor is reminded of a similar, NLP-related problem (and its solution): pruning the lowest counts probabilistically from a stream in a scalable way. That's the fourth link (paper).
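To make the idea of trading exactness for memory a bit more concrete, here is a minimal Python sketch of a HyperLogLog-style estimator. It is not taken from any of the linked posts: the class name, the choice of SHA-1 as the hash, and the default of 1024 one-byte registers are all assumptions made for illustration, and the sketch is not tuned to reproduce the memory and error figures quoted above.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative distinct-count estimator (hypothetical, not from the linked posts)."""

    def __init__(self, p=10):
        self.p = p                          # number of hash bits used as register index
        self.m = 1 << p                     # number of registers (1024 for p=10)
        self.registers = bytearray(self.m)  # one byte per register
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Derive a 64-bit hash from the item's string form (SHA-1 is an arbitrary choice).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64-p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        # Harmonic-mean style raw estimate over all registers.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        # Small-range correction: fall back to linear counting while registers are still empty.
        zeros = sum(1 for r in self.registers if r == 0)
        if estimate <= 2.5 * self.m and zeros > 0:
            estimate = self.m * math.log(self.m / zeros)
        return estimate


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100000):
        hll.add("user-%d" % i)
    print("estimated distinct items:", round(hll.count()))
```

The memory used is fixed by the number of registers, no matter how many items stream through, and adding an item that was already seen never changes the estimate. The typical relative error shrinks roughly as 1.04 / sqrt(number of registers), so more registers buy more accuracy.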

Google shows us its big data architectures Big Data

Slides: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf

Talk (1:25 hours): http://stanford-online.stanford.edu/courses/ee380/101110-ee380-300.asx

Revisiting our recurring theme of long-tail effects and large-scale architectures, these two separate talks by Jeff Dean, chief infrastructure scientist for Google (one is a link from a link I sent last week), cover what Google does (again, fan-out architectures and so on) to handle massive volume while maintaining rapid responses.

Can a snake eat an elephant? Big Data Python
Talk (49 min): http://www.kdnuggets.com/2012/04/python-for-big-data.html
An overview of Hadoop, NumPy, and SciPy and how they play together in the Python environment. It's mainly a higher-level feature review, with some nice, simple code examples that the speaker flashes through just to explain the concepts. The Editor, admittedly not a huge Hadoop+Python user himself, learned some new stuff from it.

Full bodied with a hint of Naive Bayes NLP

Blog: http://languagelog.ldc.upenn.edu/nll/?p=3887

Apparently it's common knowledge among oenophiles that wine should not taste like grapes. But did you know it's really good for wine to taste like bananas, or Filet Mignon? This analysis of wine reviews attaches sentiment to each term. Don't miss the comments (which is where I stole the tag line from).

Israeli Association of Grid Technologies Meetup Event Big Data
Thanks to Matan Zinger of 3i-Mind!
Meetup: http://www.meetup.com/IGTCloud/
This grid/cloud computing focus group apparently holds tons of Meetups. The Editor is in exile and therefore unable to attend, but he trusts Matan enough (as should you) to feel unfortunate about that.

World's first Data Science Hackathon Event
Site: http://datasciencehackathon.com/hello-world/
Next Thursday, April 19th, Data Science London is holding a data science hackathon/competition around the world (you can participate in person at several locations worldwide or online via Kaggle.com; all events start simultaneously at 1pm GMT). The Editor would've gone to the NY event if he hadn't had enough of hackathons for the time being after attending HackNY, but would love to co-operate with people from the linkshare group if any of you are interested.

Regards,
Nimrod