You're reading the Ruby/Rails performance newsletter by Speedshop.
 
I'm working on a new product that will teach you how to scale Sidekiq to 10,000 jobs per second. It's called Sidekiq in Practice.

I did a video on Twitter about how re-using objects makes things faster.

How do we evaluate whether our Sidekiq installation is performing well or not?

This email is an excerpt from my upcoming product: Sidekiq in Practice.

Engineering is the process of designing a system that meets certain requirements, while staying within given limitations and constraints. For most background job systems, the only constraint is budget, because the cloud has made access to hardware effectively infinite.

All performance engineering "requirements" can be put into one of three categories - and we will use these three categories to understand scaling Sidekiq.


A good customer experience


This means that the operation is completed with a low latency. A customer's requirements for latency will differ depending upon the operation being completed. For example, web requests should be completed more or less as soon as possible - less than 100 milliseconds is the standard in that realm. In background jobs, latency requirements can differ greatly.

For example, a customer may expect that their data is updated once per day, meaning that the latency requirement for the job that does the data update may be very long and measured in hours. A password reset email, on the other hand, should be sent within a few seconds of receiving the request to do so. That's fast, but it's still an order of magnitude slower than a web request. All background jobs have a latency requirement - if they didn't, there would be no reason to run them! When we run a job, "should complete before the heat death of the universe" is generally not what we're thinking.

We'll dig into this more in a minute, but the latency experienced by a customer when it comes to a background job is `wait time` plus `service time`. That is, the total latency is the time spent waiting in queues to be processed, plus the time spent actually running the job.
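The arithmetic is simple enough to sketch in a few lines of Ruby. This is a hedged illustration, not Sidekiq's internals - the method name and the timestamps are invented for the example (in a real installation, Sidekiq stores an `enqueued_at` timestamp in the job payload, which you can compare against the time `perform` begins):

```ruby
# Sketch: total customer-perceived latency for a background job.
# Timestamps here are plain numbers (seconds) for illustration.
def total_latency(enqueued_at:, started_at:, finished_at:)
  wait_time    = started_at - enqueued_at   # time spent sitting in the queue
  service_time = finished_at - started_at   # time spent actually running
  wait_time + service_time
end

# A job enqueued at t=0 that starts running at t=4.5s and finishes at t=5.0s:
total_latency(enqueued_at: 0.0, started_at: 4.5, finished_at: 5.0)
# => 5.0 - the customer waited 5 seconds, even though the job itself took 0.5s
```

Notice that wait time can dominate: a fast job in a deep queue still feels slow to the customer.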


A scalable, resource-efficient system


Because Sidekiq scales horizontally, we can generally solve almost any scale problem simply by throwing more resources at it. However, budgets are limited and pockets are not infinitely deep.

We can measure the resource efficiency of a background job processing system by dividing its job throughput (jobs per second) by its resource consumption (maybe in terms of vCPUs in the job fleet, or more helpfully, dollars spent). I prefer to think about this in terms of cost per unit. We might track a metric here like "jobs per dollar": how many jobs we process per dollar of server spending.
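As a sketch, the metric is just a division - the numbers below are made up for illustration:

```ruby
# Sketch: a "jobs per dollar" efficiency metric.
def jobs_per_dollar(jobs_processed:, server_cost_dollars:)
  jobs_processed.to_f / server_cost_dollars
end

# 8,640,000 jobs in a day (100 jobs/sec) on $120/day of worker servers:
jobs_per_dollar(jobs_processed: 8_640_000, server_cost_dollars: 120)
# => 72000.0 jobs per dollar
```

Tracking this number over time tells you whether a change made the system more efficient, independent of how much total load you happen to be serving that week.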

Scalability also means a system that automatically responds to changes in load. Having to manually pause or manage queues and scale up or down worker nodes is not a scalable system, because it requires constant on-call human intervention. A scalable background job processor can scale from 0 to 1000 jobs per second without human intervention.
 

A stable system when load rapidly increases


As load increases (jobs are enqueued faster than they can be processed), the length of the queue will grow without bound. Rapid growth of a queue can effectively cause a "brownout", where the system is not technically down (it's still online and processing jobs), but the queue has become so deep that by the time jobs execute, they may be completely irrelevant!
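The arithmetic behind a brownout is worth seeing concretely. Here's a hedged sketch with invented rates - any sustained excess of enqueue rate over processing rate accumulates in the queue:

```ruby
# Sketch: queue depth when jobs arrive faster than they're processed.
# All rates are invented for illustration.
def queue_depth_after(seconds:, enqueue_rate:, process_rate:)
  excess = enqueue_rate - process_rate   # net jobs added per second
  [excess * seconds, 0].max              # depth can't go below zero
end

# Enqueuing 150 jobs/sec while only processing 100 jobs/sec:
depth = queue_depth_after(seconds: 600, enqueue_rate: 150, process_rate: 100)
# => 30000 jobs deep after just 10 minutes

# A newly enqueued job now waits behind the entire backlog:
depth / 100.0
# => 300.0 seconds of wait time before it even starts
```

The system never "goes down", but a password reset email that arrives 5 minutes late may as well not arrive at all.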

Of course, there are ways rapid increases in load can simply take down the background job service entirely, too. The most common in Sidekiq-land is to exceed your Redis database's memory limits. What happens next depends on your database's eviction policy. An eviction policy is what the database does when it runs out of memory.

The best eviction policy for Redis when working with Sidekiq is usually `noeviction`: when the memory limit is reached, Redis will reject any new data. This is a very loud failure, and will show up immediately on all of your monitoring dashboards: exceptions will be raised by clients, error rates will go through the roof. This is generally better than the alternative: eviction policies like `allkeys-lru` will silently drop data. Most Sidekiq installations cannot afford to randomly lose data, so a loud failure is better than a silent one. In recent versions of Sidekiq, using any eviction policy other than `noeviction` logs a warning on startup.
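In `redis.conf` terms, that looks something like this - the 2gb limit is an invented example, so size yours to your workload:

```conf
# redis.conf - example settings for a Redis instance dedicated to Sidekiq.
# The memory limit below is invented for illustration.
maxmemory 2gb
# Fail loudly when full instead of silently evicting Sidekiq's data:
maxmemory-policy noeviction
```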

Hope you enjoyed that excerpt - until next week,

-Nate

Copyright © 2021 Nate Berkopec, All rights reserved.

