You're reading the Ruby/Rails performance newsletter by Speedshop.

Queue latency in background job processing is a critical metric. Here's how to track it easily with Sidekiq:

In any queueing system, the most important metric is usually how long the average unit of work spends waiting in the queue to be serviced. Time spent in the queue is wasted time, just waiting to be processed. Nobody likes waiting in line.

Time spent in the queue is also the dependent variable we use to monitor how well we've scaled a system. As we add or remove resources (Sidekiq processes in this case, but in other cases Puma or Unicorn processes), the average amount of time spent in a queue will decrease or increase accordingly.

For background jobs, we generally accept at least some amount of time spent in the queue, although the acceptable amount depends greatly on the job. Password reset emails, for example, are almost always sent via background job, but we generally want those sent out ASAP, since a user is waiting on that email (probably furiously refreshing their inbox). Other tasks, such as syncing data with an external provider, need to happen once per hour, but we don't really care whether they run within 1 minute of being enqueued or 30 minutes later.

Scout provides average queue latency on a per-job-class basis by default. This is useful, but somewhat misleading, because queue latencies depend on the queue, not on the job class. New Relic and Datadog, the two other most popular performance monitoring tools in the Rails community, do not provide queue latency information at all. However, it's actually quite easy to collect for Sidekiq, even without generating additional Redis calls.

Sidekiq has a concept of "middleware", which is a simple feature for wrapping methods around the execution of a Job, exactly like Rack middleware. Your middleware is provided with the job class, the queue, and the "job payload", which includes the arguments provided and some metadata. Critically, the job payload includes the time the job was enqueued.
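
For reference, the job payload is just a hash. It looks roughly like this (a simplified example; the job class and values are illustrative, and exact keys and formats vary by Sidekiq version and job options):

# A simplified Sidekiq job payload. "enqueued_at" is an epoch timestamp
# recorded when the job was pushed onto its queue.
{
  "class"       => "WelcomeEmailJob",
  "queue"       => "default",
  "args"        => [42],
  "retry"       => true,
  "jid"         => "b4a577edbccf1d805744efa9",
  "created_at"  => 1614179020.287,
  "enqueued_at" => 1614179020.287
}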

We can construct a simple server middleware that reports back to New Relic the difference between the time the job was enqueued and the current time - that's the queue latency for this particular job execution:

class NewRelicQueueLatencyLogger
  def call(worker, job, queue)
    # job["enqueued_at"] is an epoch timestamp; subtracting two Times yields seconds as a Float.
    queue_latency = Time.now.utc - Time.at(job["enqueued_at"], in: "+00:00")
    # Attach the latency (and the queue name) to the current New Relic transaction.
    ::NewRelic::Agent.add_custom_attributes({ queue_latency: queue_latency, queue: queue })
    yield
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add NewRelicQueueLatencyLogger
  end
end


The reported information can be used to chart queue latencies over time. I use New Relic here, but it would be trivial to adapt this to a different metrics collector, such as Datadog.

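For example, a Datadog version might look roughly like this (a sketch, assuming the dogstatsd-ruby gem; the metric name and client setup are illustrative, not from the original):

require "datadog/statsd"

class DatadogQueueLatencyLogger
  # One shared client; dogstatsd-ruby sends metrics to the local Datadog agent.
  STATSD = Datadog::Statsd.new("localhost", 8125)

  def call(worker, job, queue)
    queue_latency = Time.now.utc - Time.at(job["enqueued_at"], in: "+00:00")
    # Record the latency as a histogram, tagged with the queue name.
    STATSD.histogram("sidekiq.queue_latency", queue_latency, tags: ["queue:#{queue}"])
    yield
  end
end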
This particular approach does have one drawback: it reports queue latency from the perspective of a currently executing job. Imagine a simple case where we have two jobs enqueued. One is very fast, and it's first in line. The second job takes 5 minutes. The first job will report its queue latency as zero, and then the second job will execute and probably also report a very low number. However, at that point in time, with that second job executing, queue latency for any job enqueued at that moment would be at least 5 minutes. Also, in the case of New Relic, this queue latency is an attribute on a "transaction", and the data is not actually reported to New Relic until the job is finished executing. So there are a number of factors combining here to make this data accurate but not nearly "real-time". It's suitable for post-hoc analysis, but not for real-time autoscaling, for example.
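
If you do need a point-in-time number, Sidekiq's API can tell you how long the oldest job in a queue has been waiting, at the cost of an extra Redis call each time you check. A minimal sketch (the queue name is illustrative):

require "sidekiq/api"

# Sidekiq::Queue#latency returns the number of seconds the oldest job in the
# queue has been waiting - useful for "right now" checks, but each call hits Redis.
Sidekiq::Queue.new("default").latency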

The middleware approach's main advantage, however, is that it generates no additional Redis load (or even network calls). We're just using information we already had, so this approach is very scalable.

Track your queue latencies - it's important to know if your email queue is 10 seconds or 10 minutes deep!

-Nate

Copyright © 2021 Nate Berkopec, All rights reserved.

