You're reading the Ruby/Rails performance newsletter by Speedshop.

You've got 4 CPUs. How many Unicorn workers should you use?

One of the most common decisions that Ruby deployments get wrong is the number of web server processes they deploy per server. It's often too many, leading to extremely bad behavior under heavy load. Or, it's too few: that leads to high bills and low utilization. It's a tough ratio to get right, but an important one.

Let's get some definitions out of the way. I am talking about the number of worker processes per virtual server.

Worker processes are the processes spawned by the web server that handle requests. In a pre-forking design, like Puma or Unicorn, the worker count is actually 1 less than the total number of processes running. These pre-forking servers boot a "master process" first. The master process does not process requests itself; instead, it uses the fork system call to create worker (or "child") processes, which listen on the socket and actually accept requests.
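Here's a minimal sketch of that pre-forking pattern in plain Ruby (not a real server, and not how Unicorn's internals literally look): the master forks N children, and only the children would do request work.

```ruby
# A toy pre-forking "server": the master process forks WORKER_COUNT
# children, then supervises them. In a real server, each child would
# loop forever accepting connections on a shared listening socket.
WORKER_COUNT = 3

workers = WORKER_COUNT.times.map do
  fork do
    # Request-handling loop would live here. We exit immediately
    # so this sketch terminates.
    exit! 0
  end
end

# The master never serves requests; it only waits on its children.
workers.each { |pid| Process.wait(pid) }

puts "master #{Process.pid} spawned #{workers.size} workers"
```

While this is running, you'd see WORKER_COUNT + 1 processes: the workers plus the master.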

The virtual server is actually a bit difficult to define nowadays. What I'm talking about is one machine, usually virtual (only rarely physical), which has an allocation of CPU and RAM. So: a container, a dyno, a VPS, anything that you can actually run software on directly.

Heroku is the easiest example. The virtual server unit is the dyno, and we usually control the number of web processes we want per dyno using the WEB_CONCURRENCY environment variable.
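For example, a typical Unicorn config reads that environment variable to set the worker count. This is a sketch, assuming the conventional WEB_CONCURRENCY variable and a fallback default of 2:

```ruby
# config/unicorn.rb (sketch)
# WEB_CONCURRENCY is set per dyno on Heroku; fall back to 2 workers
# if it isn't set. Integer() raises loudly on a malformed value.
worker_count = Integer(ENV.fetch("WEB_CONCURRENCY", 2))
worker_processes worker_count
```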

It's also worth defining what we mean by CPU. These days, pretty much 100% of the x86-architecture CPUs you can buy on a public cloud are hyperthreaded. This means that when these companies (AWS, etc.) say they're selling you 1 CPU, what they're really selling you is 1 hyperthread on a CPU, not an entire physical core (which has 2 hyperthreads or "logical cores"). For ARM architectures, there is no hyperthreading, so 1 CPU means 1 physical core.

What's the perfect ratio of worker processes to CPU? Say it with me: it depends (yay!).

Setting this ratio is about the tradeoff between latency and utilization. A very low ratio of workers to CPUs (say, 1 worker per 4 CPUs) will have low latency, but also very low utilization (there are 3 cores sitting around doing nothing). A high ratio (say, 8 workers per 4 CPUs) will have very high utilization (CPU utilization will approach 80-90% under load at least, so you're using all the CPU you paid for) but also high latency (when utilization is that high, CPU tasks queue up waiting for free CPU time).

I think the answer I hear most often here is "well, we'll just deploy 1 to 1. 1 CPU, 1 worker process." This seems optimal enough at first glance. With 1 worker process per CPU, you'll experience minimal additional latency (each process gets its own CPU, so the only costs/latency they impose on each other will be cache/memory related).

However, especially for Unicorn users, this configuration can be less than optimal.

Unicorn uses a single-threaded design. So, when a Unicorn process is not actively using the CPU, that CPU is idle. Of course, a web request spends lots of time, often 25% or more, not actively using the CPU, instead idling while waiting on I/O calls to your database to return. So, for an application that uses Unicorn and spends on average 25% of its time waiting on I/O, CPU utilization can never exceed 75%, even under full load.

For this reason, many people using Unicorn run more than a 1 to 1 ratio. It could be something like 5 to 4, or 3 to 2. It's not an exact science. A good starting point is roughly 1 divided by the fraction of time the average request spends running CPU-bound work. So, if your app spends 25% of its time in I/O (waiting on the DB, etc.) and 75% of its time running Ruby, that's 1/0.75, or about 1.33 workers per CPU, so a ratio of something like 4 to 3 is probably fine.
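That rule of thumb is easy to turn into a quick back-of-the-envelope calculation. A sketch, with a hypothetical helper (the method name and its parameters are mine, not from any library):

```ruby
# Rough starting point: workers per CPU ≈ 1 / (CPU-bound fraction
# of the average request). Multiply by CPU count and round.
def suggested_workers(cpus:, io_fraction:)
  cpu_fraction = 1.0 - io_fraction
  (cpus / cpu_fraction).round
end

# 4 CPUs, 25% of request time spent in I/O:
# 4 / 0.75 = 5.33, so about 5 workers (a 5:4 ratio).
suggested_workers(cpus: 4, io_fraction: 0.25) # => 5
```

Treat the result as a starting point for load testing, not a final answer.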

If you set this ratio too high, however (say, 3 workers to 1 core), you'll end up with extremely high latency under full load. This is sneaky, too, because the problem won't show up until you're actually under full load. But if your Unicorn setup can ingest far more load than its CPU can handle, it's a recipe for extremely slow response times.

Puma is another story when running multithreaded. When using a multi-threaded application server like Puma, you're often running 3 to 5 threads per worker process. In this case, we expect each Puma child process to completely saturate 1 CPU under load, because the additional threads in each Puma process will also run on the same core. So, when using Puma, always run at a 1:1 ratio of CPUs to processes, and instead tune the thread count up and down until you reach 80-100% CPU utilization under full load.
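In Puma config terms, that advice might look like this sketch (Etc.nprocessors counts logical CPUs; RAILS_MAX_THREADS is the conventional env var Rails' default puma.rb uses, and the fallback of 5 is an assumption):

```ruby
# config/puma.rb (sketch)
require "etc"

# One worker process per logical CPU.
workers Etc.nprocessors

# Threads are the tuning knob: raise or lower this until CPU
# utilization under full load sits around 80-100%.
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
threads threads_count, threads_count
```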

That's the basics of tuning child process counts in Ruby - I hope you learned something! See you next week.

-Nate

Copyright © 2022 Nate Berkopec, All rights reserved.

