
Measuring Latency (2015)

rdtsc

10 years old and still relevant. Gil created a wrk fork https://github.com/giltene/wrk2 to handle coordinated omission better. I used his fork for many years, but I think he stopped updating it after a while.

Good load testing tools will have modes to send in data at a fixed rate, regardless of other requests, to handle coordinated omission. k6, for instance, calls these modes "open" and "closed": https://grafana.com/docs/k6/latest/using-k6/scenarios/concep.... They mention the term "coordinated omission" on the page, but I feel like they could have given a nod to Gil for inventing the term.
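The open-model idea can be sketched in a few lines: schedule requests on a fixed clock and measure latency from the *scheduled* send time, so a slow response inflates the measured latency of the requests queued behind it instead of silently pausing the load. This is a minimal illustration, not how wrk2 or k6 are actually implemented; `handle` stands in for whatever request function you are testing.

```python
import time

def open_model_latencies(handle, rate_hz, n_requests):
    """Fire requests at a fixed rate and measure each latency from its
    scheduled start time, avoiding coordinated omission: a stalled
    request delays later sends, and that delay is counted as latency."""
    interval = 1.0 / rate_hz
    start = time.perf_counter()
    latencies = []
    for i in range(n_requests):
        scheduled = start + i * interval
        now = time.perf_counter()
        if now < scheduled:
            time.sleep(scheduled - now)  # wait for our slot
        handle()  # the request under test (hypothetical stand-in)
        # Measure from the scheduled time, not the actual send time,
        # so queueing behind a slow response is part of the latency.
        latencies.append(time.perf_counter() - scheduled)
    return latencies
```

A closed-model tool would instead compute `time.perf_counter() - now` after each call, which is exactly the measurement that hides the backlog.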

Fripplebubby

> This is partly a tooling problem. Many of the tools we use do not do a good job of capturing and representing this data. For example, the majority of latency graphs produced by Grafana, such as the one below, are basically worthless. We like to look at pretty charts, and by plotting what’s convenient we get a nice colorful graph which is quite readable. Only looking at the 95th percentile is what you do when you want to hide all the bad stuff. As Gil describes, it’s a “marketing system.” Whether it’s the CTO, potential customers, or engineers—someone’s getting duped. Furthermore, averaging percentiles is mathematically absurd. To conserve space, we often keep the summaries and throw away the data, but the “average of the 95th percentile” is a meaningless statement. You cannot average percentiles, yet note the labels in most of your Grafana charts. Unfortunately, it only gets worse from here.

I think this is getting a bit carried away. I don't have any argument against the observation that the average of a p95 is not something that mathematically makes sense, but if you actually understand what it is, it is absolutely still meaningful. With time series data there is always some time denominator, so it really means (say) "the p95 per minute, averaged over the last hour", which is or can be meaningful (and useful at a glance).
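For what it's worth, the reason the article calls this mathematically absurd is that averaging per-window percentiles can land far from the percentile of the combined data whenever the windows have different tails. A toy sketch with made-up numbers and a nearest-rank percentile (not any particular monitoring system's estimator):

```python
def pct(data, p):
    """Nearest-rank p-th percentile."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, int(round(p / 100 * len(s))) - 1))
    return s[k]

# Two one-minute windows of latencies (ms) with very different tails.
minute_a = [10] * 95 + [100] * 5    # mostly fast, small tail
minute_b = [10] * 50 + [500] * 50   # half the requests are slow

avg_of_p95 = (pct(minute_a, 95) + pct(minute_b, 95)) / 2  # 255.0
true_p95 = pct(minute_a + minute_b, 95)                   # 500
```

Here the "average p95" reports 255 ms while 5% of all requests in the hour actually took 500 ms, which is why merging the underlying histograms (as HdrHistogram does) is preferred when you need the true percentile. The grandparent's point stands, though: as long as you read the chart as "per-minute p95, averaged", it's still a useful at-a-glance signal.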

Also, the claim that "[o]nly looking at the 95th percentile is what you do when you want to hide all the bad stuff" is very context-dependent. As long as you understand what it actually means, I don't see the harm in it. The author makes the point that, because loading a single webpage results in 40 requests or so, you are much more likely to hit a p99, and so you should really care about p99 and up. More power to you - if that's contextually appropriate, then it's absolutely right - but that really only applies to a webserver serving webpage assets, which is only one kind of software you might be writing. I think it is definitely important to know, for one given "eyeball" waiting on your service to respond, what the actual flow is - whether it's just one request, or multiple concurrent requests, or some dependency graph of calls to your service all needed in sequence - but I don't really think that challenges the commonsense notion of latency, does it?
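The "40 requests per page" point is just independence arithmetic, and it's easy to check: the chance that at least one of n independent requests lands past the p-th percentile is 1 - (p/100)^n. (Real page loads aren't fully independent, so treat this as an upper-bound sketch.)

```python
def p_tail_hit(n, p):
    """Probability that at least one of n independent requests
    exceeds the p-th percentile latency."""
    return 1 - (p / 100) ** n

# With 40 requests per page load, roughly a third of page loads
# experience at least one worse-than-p99 request.
p_tail_hit(40, 99)  # ~0.33
```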

tomhow

One previous discussion at time of publication:

A summary of how not to measure latency - https://news.ycombinator.com/item?id=10732469 - Dec 2015 (3 comments)