You're reading the Ruby/Rails performance newsletter by Speedshop.

Should you focus on p95? p99? p75? Why are there so many p's?

A percentile of a dataset is a number below which a certain percentage of that dataset's members fall. That is, for the "xth percentile", x percent of the data is less than this number.

We are all quite familiar with one common percentile: the median. The median is the same as the 50th percentile. If the median is 100, 50% of the data is below 100 (and, by corollary, 50% of the dataset's members are above that number).
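
To make the definition concrete, here's a minimal Ruby sketch using the "nearest-rank" method. That's only one of several common ways to compute a percentile, and your monitoring tool may interpolate or approximate instead. The `percentile` helper and the sample latencies are made up for illustration.

```ruby
# Nearest-rank percentile: the smallest value in the dataset such that at
# least `pct` percent of the data is less than or equal to it.
# `percentile` is a hypothetical helper, not part of Ruby's standard library.
def percentile(values, pct)
  raise ArgumentError, "values must not be empty" if values.empty?
  raise ArgumentError, "pct must be in (0, 100]" unless pct > 0 && pct <= 100

  sorted = values.sort
  rank = (pct * sorted.length / 100.0).ceil # 1-based position in sorted data
  sorted[rank - 1]
end

# Made-up response times in milliseconds.
latencies_ms = [120, 95, 110, 480, 130, 105, 2000, 115, 125, 100]

puts "p50: #{percentile(latencies_ms, 50)}ms" # p50: 115ms
puts "p90: #{percentile(latencies_ms, 90)}ms" # p90: 480ms
```

Nearest-rank always returns an actual member of the dataset, so on an even-sized dataset its p50 can differ slightly from the textbook median, which averages the two middle values.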

One reason we work with percentiles is that they are relatively unaffected by outliers, unlike averages.

Consider a dataset of [1, 2, 3, 4, 5, 6]. The median of this dataset is 3.5. Half of our data is less than this number, half of it is more. The average is also 3.5. Now, what if that dataset were [1, 2, 3, 4, 5, 600000]? Our average would be 100,002.5. Is that average representative of the dataset? The median is still 3.5.
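
You can check that arithmetic yourself with a few lines of Ruby. The `mean` and `median` helpers are written here for illustration (they're not built into Ruby), and `median` uses the textbook definition that averages the two middle values of an even-sized dataset.

```ruby
# Arithmetic mean of a list of numbers.
def mean(values)
  values.sum.to_f / values.length
end

# Textbook median: the middle value, or the average of the two middle
# values when the dataset has an even number of members.
def median(values)
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

typical      = [1, 2, 3, 4, 5, 6]
with_outlier = [1, 2, 3, 4, 5, 600_000]

puts "mean:   #{mean(typical)} -> #{mean(with_outlier)}"     # mean:   3.5 -> 100002.5
puts "median: #{median(typical)} -> #{median(with_outlier)}" # median: 3.5 -> 3.5
```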

Which percentile is the best percentile to look at when analyzing a given distribution?

I don't pay too much attention to the median/p50, because 50% of the distribution is higher than it. Half of the data could be really, really bad, but we would never know it! I only track p50 when I'm specifically optimizing the "best case scenario" or the "happy/fast path".

If you're trying to decide whether p75, p90, p95 or p99 is the right metric to track, look at traces from each threshold and ask: "What failed here? Why is this trace at this threshold, and not at another?" If the answer is interesting, then it's a good threshold.
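
If your tooling doesn't surface "the trace at p95" directly, a rough way to find one is to sort the traces by duration and pick the one at each nearest-rank position. This is only a sketch: the `trace_at_percentile` helper, the hash shape and the durations are all hypothetical, and a real APM tool will have its own way of exporting traces.

```ruby
# Returns the trace sitting at the nearest-rank position for `pct`.
# Each trace is assumed to be a hash like { id: "trace-1", duration_ms: 80 }.
def trace_at_percentile(traces, pct)
  sorted = traces.sort_by { |t| t[:duration_ms] }
  rank = (pct * sorted.length / 100.0).ceil # 1-based position
  sorted[[rank, 1].max - 1]
end

# Made-up durations standing in for exported trace data.
durations = [80, 85, 90, 92, 95, 100, 104, 110, 115, 120,
             130, 140, 150, 160, 400, 450, 520, 900, 1500, 6000]
traces = durations.each_with_index.map do |ms, i|
  { id: "trace-#{i + 1}", duration_ms: ms }
end

[75, 90, 95, 99].each do |pct|
  t = trace_at_percentile(traces, pct)
  puts "p#{pct}: #{t[:id]} took #{t[:duration_ms]}ms"
end
```

Open each of those traces and ask the questions above. The thresholds whose traces tell you something you can act on are the ones worth tracking.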

Let's consider a dataset of page load times gathered by real-user monitoring. This data is gathered by client web browsers, which report their load times back to our performance monitoring service. What percentiles in this dataset might be interesting?

It turns out that p95 and above is actually very uninteresting data in this dataset. p95 client load times are often the result of bad wi-fi or other client network issues. We can't really do too much about that. In my experience, p75 real-user-monitoring data is much more interesting. These experiences are usually caused by a bad/unhappy path on a backend response, for example.

I find that most developers over-emphasize the 95th percentile/p95 when looking at latency data. I think this tendency comes from the DevOps/SRE world, where they often focus on p95 or even p99. This is because SREs generally work with quite discrete data ("up" or "down"), unlike performance engineers, who work with highly continuous data (latency from 0 seconds to infinity). When looking at discrete distributions that describe the availability of a service, it makes sense to focus on 99th percentiles or above, because the rest of the dataset is usually not very interesting ("the service was... still up.").

It's not "more ambitious" to focus on higher percentiles like p95 or "lazy" to focus on lower ones like p75. It's a judgement as to what portion of the dataset is under your control, and what is not. When we have datasets which are generated by client browsers, we often control much less of that experience than we would like. However, on the backend/server side, we control that experience to a much greater extent, and p95s and even p99s might be excellent metrics to watch, whereas p75 might not be all that interesting.

Another approach is to take large, general metrics and slice out sub-audiences which have specific experiences. For example, your client latency data might be skewed because you have 10% of your audience halfway across the world in Indonesia. Keeping the Indonesian and US latency data in the same dataset will just obscure this distribution's underlying bimodality. Keep them separate and analyze them separately, and you'll get more useful insight.
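
Here's a small Ruby sketch of that kind of slicing, assuming each sample carries a region tag. Everything in it is hypothetical: the `percentile` and `percentile_by_region` helpers, the hash shape, and the load times, which I made up so that roughly 10% of the samples come from Indonesia.

```ruby
# Nearest-rank percentile (one of several common definitions).
def percentile(values, pct)
  sorted = values.sort
  rank = (pct * sorted.length / 100.0).ceil # 1-based position
  sorted[[rank, 1].max - 1]
end

# Groups samples by their :region tag and computes one percentile per group.
def percentile_by_region(samples, pct)
  samples
    .group_by { |s| s[:region] }
    .transform_values { |rows| percentile(rows.map { |r| r[:load_ms] }, pct) }
end

# 18 made-up US samples (825ms to 1250ms) and 2 made-up Indonesian samples.
us = (1..18).map { |i| { region: "US", load_ms: 800 + i * 25 } }
id = [3800, 4200].map { |ms| { region: "ID", load_ms: ms } }
samples = us + id

combined_p95 = percentile(samples.map { |s| s[:load_ms] }, 95)
puts "combined p95: #{combined_p95}ms"                  # combined p95: 3800ms
puts "per-region:   #{percentile_by_region(samples, 95)}"
```

In this made-up data, the combined p95 is just an Indonesian sample, so it says nothing about the slowest US experiences (1250ms at p95) and understates the Indonesian ones (4200ms). Splitting by region makes both visible.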

Data is only useful when it leads to action or helps us make decisions. One percentile does not fit all. Pick percentiles which lead to useful action.

Until next time,

Nate



 
You can share this email with this permalink: https://mailchi.mp/railsspeed/which-percentile-is-the-best-percentile?e=[UNIQID]

Copyright © 2022 Nate Berkopec, All rights reserved.

