[BDIL] Newsletter #9: What's wrong with p-values, how to tolerate latency, and more

Thanks to everyone who provided feedback this week! I tried to take your input into account: there are more blog pieces and fewer books in this one. Without further ado:

Significant evidence against using p-values Stats

Blog: http://www.johnmyleswhite.com/notebook/2012/05/10/criticism-1-of-nhst-good-tools-for-individual-researchers-are-not-good-tools-for-research-communities/
Blog: http://www.johnmyleswhite.com/notebook/2012/05/12/criticism-2-of-nhst-nhst-conflates-rare-events-with-evidence-against-the-null-hypothesis/
Blog: http://www.johnmyleswhite.com/notebook/2012/05/14/criticism-3-of-nhst-essential-information-is-lost-when-transforming-2d-data-into-a-1d-measure/
John Myles White (of Machine Learning for Hackers fame) has a splendid short three-post series on his blog detailing the problems with according significance to p-values. The problems with p-values are varied, and some are well known, like their overestimation of significance when dependent samples are used, but White discusses some lesser-known effects. The editor is reminded of a parable in which a missile defense system was advertised as having 95% precision after it hit three times out of five in a test. The claim was made because, if your null hypothesis is that the system has 95% precision, a one-sided test on so small a sample could not reject it at some significance level. Pearson cringes in his grave, but the fallacies of null hypothesis testing do not lie only with the marketers.
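
To make the parable concrete, here is a minimal sketch of the one-sided test it describes, assuming the five shots are independent and can be modeled as a binomial sample. The function name and the two thresholds are illustrative assumptions and do not come from White's posts.

```python
from math import comb

def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p): the one-sided p-value for 'too few hits'."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Null hypothesis: the system hits 95% of the time. Observed: 3 hits in 5 shots.
p_value = binom_lower_tail(3, 5, 0.95)
print(f"one-sided p-value: {p_value:.4f}")

# Whether the 95% claim "survives" depends on the threshold picked (illustrative values).
for alpha in (0.05, 0.01):
    verdict = "rejected" if p_value < alpha else "not rejected"
    print(f"alpha={alpha}: 95% claim {verdict}")
```

With only five shots, the same data rejects the claim at one conventional threshold and fails to reject it at a stricter one, which says more about the sample size than about the missile.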

Estimating ticket values from noisy data ML

Blog: http://seatgeek.com/blog/dev/the-math-behind-ticket-bargains
Blog: http://seatgeek.com/blog/dev/using-a-kalman-filter-to-predict-ticket-prices
SeatGeek shows us how they use maximum likelihood estimation with pairwise comparisons, plus some smart tricks on their data, to turn a sparse matrix into an estimate of the inherent value of every seat in venues with 4,000+ seats. They then explain how Kalman filters can be used in conjunction with the seat quality data to predict seat pricing. The editor found great Bon Iver tickets with SeatGeek, so they must be doing something right.
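
For readers who have not met a Kalman filter before, here is a minimal scalar version that tracks a slowly drifting price from noisy observations. It is a sketch under assumed parameters: the random-walk price model, the noise variances, and the sample prices are all invented for illustration and are not SeatGeek's model or data.

```python
def kalman_1d(observations, process_var, obs_var, init_mean, init_var):
    """Filter noisy scalar observations under a random-walk state model."""
    mean, var = init_mean, init_var
    estimates = []
    for z in observations:
        # Predict: the true price is assumed to drift, so uncertainty grows.
        var += process_var
        # Update: blend prediction and observation by their relative certainty.
        gain = var / (var + obs_var)
        mean += gain * (z - mean)
        var *= (1 - gain)
        estimates.append(mean)
    return estimates

# Hypothetical noisy listing prices for one seat section (made-up numbers).
prices = [102.0, 98.0, 105.0, 97.0, 101.0, 110.0, 108.0]
smoothed = kalman_1d(prices, process_var=1.0, obs_var=25.0,
                     init_mean=100.0, init_var=100.0)
print([round(x, 1) for x in smoothed])
```

A larger obs_var makes the filter trust each new listing less and smooth more; a larger process_var makes it follow recent prices more closely.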

The future of search NLP

Article (1.5 pp): http://turing.cs.washington.edu/papers/Nature_search_shake-up.pdf
Thanks to Omer Levy!
Oren Etzioni philosophizes on the stagnation of commercial search engines at the keyword-and-index stage, and urges researchers to fully realize the dream of natural-language, general-purpose question answering. With the example set by IBM's "Watson" winning Jeopardy, this seems closer than ever, but Oren discusses some of the challenges, such as scale and disambiguation, that still stand in the way before we can type in 'What is the answer to life, the universe, and everything?' and expect the answer from Google. Oh, wait, they already do that.

Scalable Machine Learning class ML

Class: http://alex.smola.org/teaching/berkeley2012/index.html
Alex Smola's Berkeley class on Scalable Machine Learning contains both videos and lecture notes (see the left menu bar). I have barely gone through any of it yet, but the lecture notes seem very legible and, with that title, well, there was hardly any escape from putting it up here.

Social Network Analysis for Politics SNA

Webcast (60min): http://oreillynet.com/pub/e/2107
Webcast (60min): http://oreillynet.com/pub/e/2207
The webcasts launch in some abominable O'Reilly virtual desktop tool whose interface is so terrible it is almost hilarious (and also, I couldn't get the screen-sharing parts to work). In them, though, Maksim walks through loading tweets on political terms with Python and running some basic analysis on the tweet data. In the second part, he also loads publicly available campaign finance data, working through some of the parsing steps and then doing some basic bipartite graph analysis and clustering.

Latency toleration techniques in real time services Big Data

Slideshow (83 slides): http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
Returning to the scalability issues discussed in our second and third newsletters, Jeff Dean of Google explains how small random perturbations in the latency of services can cripple the performance of a distributed pipeline. Unfortunately, I could not find a video of the talk, but the slides are pretty detailed.
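
The amplification effect is easy to see with back-of-the-envelope arithmetic. The sketch below assumes each backend call is independently "slow" with some small probability and that a request has to wait for all of its backends; the 1% figure and the fan-out sizes are illustrative assumptions, and the function name is made up.

```python
def prob_request_slow(p_slow_backend, fanout):
    """Chance that at least one of `fanout` independent backend calls is slow."""
    return 1 - (1 - p_slow_backend) ** fanout

# A rare hiccup per backend becomes the common case once requests fan out widely.
for fanout in (1, 10, 100, 1000):
    share = prob_request_slow(0.01, fanout)
    print(f"fan-out {fanout:>4}: {share:.1%} of requests hit a slow backend")
```

Under these assumptions, a hiccup that touches one backend call in a hundred ends up touching most requests once each request fans out to a hundred backends.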

Introduction to Social Network Analysis R SNA

Blog: http://econometricsense.blogspot.com/2012/04/introduction-to-social-network-analysis.html
This short blog post covers basic social graph measures like betweenness and centrality, and includes some R code to plot the graphs and calculate these measures (just calling the function to get eigenvalues, actually). It also has a short list of references for more detailed information.
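
For a feel of what an eigenvector-based centrality computes, here is a minimal sketch using plain power iteration. It is written in Python rather than the post's R, it is not the post's code, and the five-person graph is a made-up example.

```python
def eigenvector_centrality(adj, iterations=100):
    """Power iteration on an undirected graph given as {node: set_of_neighbors}."""
    score = {node: 1.0 for node in adj}
    for _ in range(iterations):
        # Each node's new score is the sum of its neighbors' current scores.
        new = {node: sum(score[nb] for nb in adj[node]) for node in adj}
        norm = sum(v * v for v in new.values()) ** 0.5
        score = {node: v / norm for node, v in new.items()}
    return score

# Hypothetical toy network: a triangle A-B-C, with D hanging off A and E off D.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}
centrality = eigenvector_centrality(graph)
for node, value in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(node, round(value, 3))
```

The idea is that a node is central when its neighbors are central, so A (connected to the tight triangle) scores highest and the dangling node E scores lowest.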

How many nodes are too many nodes? Big Data

Blog: http://perfdynamics.blogspot.com/2012/04/postgresql-scalability-analysis.html
The post discusses a concept called the "universal scalability law", which is essentially a model of the relative gain in throughput from having multiple nodes perform computations. Specifically, it postulates diminishing returns (and, after a point, decreasing total throughput) for larger clusters due to locking and data synchronization between nodes. The author discusses an experiment with PostgreSQL. I don't know if I buy into the model, but he makes some good arguments about how the analysis should be done and how to measure your optimal cluster size (which apparently also benefits from Postgres 9.2's fast locking in the case of SQL servers, if you use distributed PostgreSQL).
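
A minimal sketch of the model, assuming the usual two-coefficient form of the law: relative capacity C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1)), where alpha stands for contention and beta for the cost of keeping nodes coherent. The coefficient values below are hypothetical and are not fitted to the post's PostgreSQL measurements.

```python
from math import sqrt

def usl_relative_capacity(n, alpha, beta):
    """Modeled throughput of n nodes relative to a single node."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

def usl_peak_nodes(alpha, beta):
    """Node count at which the modeled throughput peaks (before rounding)."""
    return sqrt((1 - alpha) / beta)

alpha, beta = 0.05, 0.001  # hypothetical contention and coherency coefficients
peak = usl_peak_nodes(alpha, beta)
print(f"modeled peak at about {peak:.1f} nodes")
for n in (1, 8, 16, 32, 64, 128):
    print(n, round(usl_relative_capacity(n, alpha, beta), 2))
```

In practice the two coefficients would be estimated by fitting this curve to measured throughput at several cluster sizes; past the modeled peak, adding nodes lowers total throughput.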

Regards,
Nimrod