Data Science Linkshare
It's our 13th issue - I'm sorry for being late (I hope no one waited with bated breath). But the good news is now you only get to wait 6 days until the next issue!


Sifting through the Twitter Firehose Big Data

Datasift goes into fantastic detail about how they filter and process the entire Twitter firehose. It's too long to summarize and full of gems, including why they rewrote parts of the HTTP stack in C++, where to use Node.js, and how many cores it takes to filter 120,000 tweets per second. Plus, they slip in some startup advice, like what to take instead of money from your earliest customers.
 

Basketball deconstructed Visualization

I don't know what the lesson from this is (except, well, plot densities and not points: http://www.chrisstucchio.com/blog/2012/dont_use_scatterplots.html), but it's such an awesome culmination of the Times' great foray into data journalism and visualization that it really shouldn't be missed.
 

CalTech's "Learning From Data" ML

I linked to this in our very first issue, but now that the semester is over it's a good time to review the materials. All lectures and slides are downloadable, so you can watch them on the train/bus/Google self-driving car. Yaser goes into depth about some state-of-the-art machine learning techniques like RBFNs, kernel methods, and a lot more.
 

Introduction to Data Mining with R ML Visualization R

The webinar is worth sticking with past the first 12 minutes of general introduction (or just skip them, the joys of a recording...). After he leaves the slides, Joe goes through several examples of exploring and visualizing data, creating decision trees, and evaluating prediction performance with R, and provides references to valuable resources and books to learn more from.
 

Don't trust the Bimodal hotels! ML

This paper tackles the notoriously hard problem of detecting fake reviews in an online setting in a novel way: an SVM classifier is built using features that were determined by observing the distributions of scores over several years in hotel reviews on TripAdvisor and product reviews on Amazon. This detects the existence of fake reviews rather than pinpointing the fakes themselves; the latter is a task at which even humans barely do better than random. The interesting part of their technique is observing the distribution of rank statistics of the review scores and how it shifts between honest and dishonest reviews.
 

Low-dimensional embeddings with Python ML Python

This is a great blog in general - it covers advanced topics in a very concise, clear manner and includes code examples. This post is on using the Isomap algorithm to embed feature vectors into 2-D while maintaining their maximal separation in order to cluster (and be able to visualize the clusters).
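
For a taste of what such an embedding looks like in code, here is a minimal sketch. It is not taken from the post: it assumes scikit-learn is installed, uses a synthetic S-curve as stand-in data, and the neighbour count of 10 is an arbitrary choice.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Hypothetical stand-in data: 300 points lying on a 3-D S-shaped sheet.
points, position_on_curve = make_s_curve(n_samples=300, random_state=0)

# Isomap builds a nearest-neighbour graph and tries to preserve distances
# measured along that graph when flattening the data to 2-D.
# n_neighbors=10 is an assumption, not a recommended value.
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(points)

print(embedding.shape)
```

The resulting two columns can go straight into a scatter plot or a clustering routine of your choice.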
 

Distributed message-passing with Giraph Big Data

Jakob Homan, whom you might recall as an excellent speaker from LinkedIn introducing the Hadoop Ecosystem in a video we covered here in issue #8 (http://us4.campaign-archive1.com/?u=c21fa41588ae9b5397233b6fa&id=66608a7670), made this short Prezi (how does one measure the length of a Prezi, anyway?). It discusses Apache Giraph: its motivation, implementation, some code examples, and future plans. Unfortunately, while Jakob said the talk was recorded, it hasn't been uploaded yet; I'll update if I hear that it has been, because it looks to be a very interesting model for dealing with some of the hardest things to MapReduce efficiently.
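
To give a rough feel for the vertex-centric, message-passing style, here is a toy single-machine sketch in plain Python. It is not Giraph's API and not code from the Prezi; the function name, the graph, and the values are all made up for illustration. Each vertex keeps the largest value it has heard of and only sends messages when its value changes.

```python
from collections import defaultdict

def run_supersteps(edges, values, max_supersteps=50):
    """Toy message-passing loop: every vertex converges to the largest value
    reachable from it. Hypothetical helper, not part of any real framework."""
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = defaultdict(list)
    for vertex, value in values.items():
        for neighbour in edges.get(vertex, ()):
            inbox[neighbour].append(value)
    supersteps = 1

    # Later supersteps: a vertex reads last round's messages, updates itself,
    # and sends messages only if its value changed (otherwise it stays quiet).
    while inbox and supersteps < max_supersteps:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > values[vertex]:
                values[vertex] = best
                for neighbour in edges.get(vertex, ()):
                    outbox[neighbour].append(best)
        inbox = outbox
        supersteps += 1
    return values, supersteps

# Made-up example graph: a chain a - b - c - d with arbitrary starting values.
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start_values = {"a": 3, "b": 6, "c": 2, "d": 1}
final_values, supersteps = run_supersteps(edges, start_values)
print(final_values)
```

The loop stops once no vertex has anything left to say, which is the part that is awkward to express as a chain of MapReduce jobs.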
 

Classification cheat sheet ML Great old one

My oldie-but-goodie for this week is this one-page summary of 7 different supervised learning and clustering techniques including their common smoothing terms, capability for online learning and other properties. Something to hang on the bathroom wall (but apparently it's not appropriate for the office restrooms. Who knew?)


Jobs

Movenbank

Senior Rails Engineer               New York, NY
Details: http://www.jobscore.com/jobs/movenbank/senior-rails-engineer/aVxBE-MYqr4zwzeJe4bk1X
Movenbank are revolutionizing banking using Big Data analytics and a great mobile experience. If you're looking to start something new, this is a great team to join!


Regards,
Nimrod
Copyright © 2012 Israeli Big Data Linkshare, All rights reserved.