Observability — A 3-Year Retrospective
There is an old railroad folk tale about the increasing complexity of steam locomotives that may be applicable to debugging today’s complex microservice-based architectures.
Back in the early 19th century, when simple steam locomotives were a new technology, if one wasn’t running properly, the problem could be pinpointed easily in three minutes (“leaky boiler”). Fixing that problem, however, could take three weeks or more. By the time steam locomotives were phased out in the mid-20th century and replaced by more efficient diesel and electric engines, that ratio had reversed entirely. By then, engines were complicated beasts, so it could take three weeks to diagnose the problem, which could then very likely be fixed in three minutes.
Microservices, with their many interwoven components, are pushing us into that latter-day phase, where we spend more time diagnosing a problem or performance issue than fixing it. Our current crop of application performance monitoring (APM) tools does not help, based as it is on the collection of simple metrics. These tools are insufficient for getting to the root cause of a problem, argues Honeycomb.io’s Charity Majors, in a contributed post this week on The New Stack about observability. “The kinds of systems we were building were fundamentally different than the systems those tools were developed to understand,” she writes.
The problem with metrics alone is that they get aggregated and lose granularity over time, so sysadmins cannot pinpoint the problem at hand. Dashboards are useless for the same reason, Majors argues. The bigger issue is that not all problems can be easily identified when so many components could be slowing operations. We need a new way of debugging and, in her post, she describes what data to capture and how we should work with that data. Check it out!
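To make the granularity point concrete, here is a minimal Python sketch. It is not taken from Majors' post; the event fields (service, user_id, latency_ms), the values, and the helper names are all invented for illustration. It contrasts a single pre-aggregated number with a query over the raw events behind it.

```python
from statistics import mean

# Hypothetical raw request events; field names and values are invented
# for illustration and do not come from any real system.
events = [
    {"service": "checkout", "user_id": "u1", "latency_ms": 40},
    {"service": "checkout", "user_id": "u2", "latency_ms": 45},
    {"service": "checkout", "user_id": "u3", "latency_ms": 50},
    {"service": "checkout", "user_id": "u4", "latency_ms": 2000},
    {"service": "checkout", "user_id": "u1", "latency_ms": 42},
    {"service": "checkout", "user_id": "u2", "latency_ms": 43},
]


def aggregate_mean_latency(events):
    """Collapse all events into one number, as a pre-aggregated metric would."""
    return mean(e["latency_ms"] for e in events)


def slow_requests(events, threshold_ms):
    """Query the raw events directly, keeping every field for drill-down."""
    return [e for e in events if e["latency_ms"] > threshold_ms]


avg = aggregate_mean_latency(events)
outliers = slow_requests(events, threshold_ms=1000)

# The mean says only that something is slow somewhere.
print(f"mean latency: {avg:.1f} ms")
# The raw events still say which request, and for which user.
print("slow requests:", outliers)
```

Once only the mean is stored, the fact that a single user's request accounted for the slowdown can no longer be recovered, which is the loss of granularity described above.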