One Reason Distributed Debugging Is so Difficult
We’ve heard time and time again that debugging distributed systems is extremely difficult. Last week we got to experience the pain firsthand.
The problem showed up right in the heart of our production operation, hitting the editors who move our awesome posts through the WordPress system each day. Trying to open a new post, we were instead greeted with an Error 524 message, Cloudflare’s way of saying the origin server took too long to respond. Or sometimes the posts would load OK but then fail to save, losing all of our changes. It was very frustrating and nearly brought our latest news of cloud native computing to a standstill.
Fortunately, our crack IT team swiftly diagnosed and fixed the problem (#BlessUp for a good IT team). The problem turned out not to have anything to do with WordPress itself! It actually came from the RSS feed for the site, one that picks up outside posts from our sponsors. Unlike the posts themselves, this RSS feed is not cached on Cloudflare. So our own lowly Apache Web server kept getting stuck waiting on the feed URLs, letting other requests languish in the queue. The IT team tried RSS feed URL caching on NGINX, but Apache started timing out even before the cache could be filled.
Soon, however, the issue was tracked down to a buggy RSS WordPress plugin.
“This plugin was supposed to only load the latest sponsor story, but it turned out to be interfering with the site's feed generation,” TNS Director of DevOps Vinay Shastry noted. The RSS feed fetching was quickly re-implemented without the WordPress plugin.
Diagnosing the problem proved to be tricky because the behavior was not reproducible in the dev environment, or while testing post-deployment. Perhaps an entry from the external sponsor feed had triggered the bug in the third-party plugin, Shastry surmised. We guess that perhaps it was some extended UTF character, or maybe an emoticon in a headline. Those always trip us up, somehow.
Anyway, all is well again, and our stream of awesome posts on cloud native computing continues unabated. But let’s stop and appreciate how data from a tangential third-party system can entirely trip up a core production management system. In distributed computing, the cause of a problem can sit in an entirely different place from where its symptoms appear. And that is one of many reasons it can be hard to debug a distributed system.