Big Data Linkshare
Welcome to the 10th Big Data / Machine Learning / NLP Linkshare. Sorry it's slightly late today. I could tell you I'm A/B testing the time of arrival for the e-mail, but honestly I just have a lot on my plate.
SNA stands for Social Network Analysis, btw.

Google's Concept Lexicon NLP

Blog: http://googleresearch.blogspot.com/2012/05/from-words-to-concepts-and-back.html
Google uses Wikipedia to build a dictionary of 7.5 million concepts ("Soft drink", "Football"), each linked to its Wikipedia article and paired with the various terms that refer to the same concept in different languages. They also give you popularity counts for concepts. The editor is reminded of a very similar project he undertook a few years ago, to power keyword extraction.

Modeling Politics in the Eurovision ML Recommended!

Blog: http://mewo2.github.com/nerdery/2012/05/20/ive-got-eurosong-fever-ted/
This blog piece walks through modeling the Eurovision Song Contest dataset (which records which countries voted for which songs) as a pairwise-comparison network, and then separating out the political and song-quality components using Gibbs sampling. The author also shows the underlying political social network inferred from the data, which he checks anecdotally against known facts. He even predicted that Israel would not pass the semi-finals!

Overlapping Communities in Facebook with Excel SNA

Blog: http://beamtenherrschaft.blogspot.com/2012/02/detecting-overlapping-communities-in.html
(You have to scroll down to this article once you click the link, due to a bug with Blogspot.)
This very short post briefly reviews some overlapping clustering techniques (in which every node is attached to several clusters, each with a measure of attachment). Interestingly, the analysis and visualization were done with NodeXL (http://nodexl.codeplex.com/), an open source social network analysis add-on for Microsoft Excel. Anyone who works with data in Excel should check it out.

Limitations of NoSQL and RDBMS Systems Big Data

Blog: http://dbmsmusings.blogspot.com/2012/05/if-all-these-new-dbms-technologies-are.html
While I disagree with the author's decisive claim that "no application needs 500,000 transactions/second", he does go through a deep (if somewhat dogmatic) analysis of the different assumptions behind RDBMS and NoSQL systems and the range of domains each can handle. He pitches his academic paper as a silver bullet for the slowdown that synchronization causes in distributed RDBMSs; I guess only time will tell whether that holds up.

Community Detection in R SNA R

Blog: http://sieste.wordpress.com/2012/05/18/a-minimal-network-example-in-r/
Blog: http://sieste.wordpress.com/2012/05/21/inferring-the-community-structure-of-networks/
In this two-part series the author demonstrates the mathematical tools of basic social network analysis, namely detecting clusters of related users using the Laplacian of the adjacency matrix and its eigenvectors. He proposes a nice trick - clustering the eigenvectors in order to determine a cutoff for the number of significant ones. Code is in R, but there's considerable mathematical discussion behind it, so it might be interesting even if you aren't an R relisher.
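For a feel of the Laplacian idea, here is a minimal sketch in plain Python (not the author's R code). It assumes an undirected, unweighted, connected graph and only splits it into two communities, using the sign of the eigenvector for the second-smallest Laplacian eigenvalue (the Fiedler vector). The function name fiedler_partition, the shifted power iteration used to find that eigenvector, and the toy graph are all illustrative choices of mine; the eigenvector-clustering trick from the posts is not implemented here.

```python
def fiedler_partition(n, edges, iterations=200):
    """Split an undirected, connected graph into two groups by the sign
    of an approximate Fiedler vector of its Laplacian L = D - A."""
    # Adjacency matrix A and node degrees D.
    adj = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        adj[a][b] = adj[b][a] = 1.0
    deg = [sum(row) for row in adj]

    # Graph Laplacian: degrees on the diagonal, minus the adjacency matrix.
    lap = [[(deg[i] if i == j else 0.0) - adj[i][j] for j in range(n)]
           for i in range(n)]

    # Power iteration on (shift * I - L). All Laplacian eigenvalues are at
    # most 2 * max degree, so this shift turns the smallest eigenvalues of L
    # into the largest ones of the shifted matrix.
    shift = 2.0 * max(deg)
    v = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        # Remove the constant component (the eigenvector for eigenvalue 0).
        mean = sum(v) / n
        v = [x - mean for x in v]
        v = [shift * v[i] - sum(lap[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]

    # Nodes with a negative entry go to one side, the rest to the other.
    left = {i for i in range(n) if v[i] < 0}
    right = set(range(n)) - left
    return left, right


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
left, right = fiedler_partition(6, toy_edges)
print(sorted(left), sorted(right))
```

On the toy graph the two triangles should land on opposite sides of the split. For real data you would want a proper eigensolver rather than this hand-rolled iteration.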

Visualization with Open Source Tools Visualization

Links: http://compulsivedata.com/visualization-tutorials/
This is a compendium of tutorials on how to do visualizations with R, Protovis, D3.js, Google Refine and other frameworks/tools. The tutorials are also rated (on a scale of 3-5). Most tutorials are blog posts though some are in other formats.

Performance analysis of Hadoop jobs Big Data

Blog: http://cto.vmware.com/analyzing-hadoops-internals-with-analytics/
This is an in-depth review of the possible bottlenecks in a Hadoop job, how to get the data for each stage and how to make sense of it in order to track down and kill inefficiencies. There are some cases where doubling the number of nodes just doesn't cut it, and this article can help you make sense of these cases.



Shorties

HBase 0.94 out
http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/
Mainly I/O and locking performance improvements, along with many bug fixes and enhancements.

Regards,
Nimrod
Copyright © 2012 Israeli Big Data Linkshare, All rights reserved.