[003.4] Sidekiq Pro: Reliability and Pausing Queues

Exploring what can go wrong with Sidekiq when workers die unceremoniously and how Sidekiq Pro resolves the issue. Also, examining the ability to pause queues.

Sidekiq Pro: Reliability and Pausing Queues [05.13.2016]

Sidekiq offers a commercially-supported version known as Sidekiq Pro. Sidekiq Pro offers a few features that aren't provided in the open source offering. The first features we'll be looking at are Reliability and the ability to pause queues. Let's get started.

Project

Reliability

So let's talk about reliability. Sidekiq uses BRPOP to pop a job off of the queue in Redis. This is a really nice primitive to use when building a job queue, but the downside is that once it pops the item off of the queue in Redis, the item is off the queue. If Sidekiq crashes while processing that job, the job is gone forever. Let's see this in action.

I've tagged the example repo with before_episode_003.4 if you're following along. I added a 20 second sleep to the VisitorMailer to demonstrate the problem. If sidekiq fetches this job but crashes before it has finished working on it, the job will be lost forever. We'll demonstrate this:

rails c
require 'sidekiq/api'
Sidekiq::Queue.new.size
# We can see there's nothing in the queue
# We'll add our mailer job:
VisitorMailer.contact_email("Josh", "josh@dailydrip.com", "Some message").deliver_later
Sidekiq::Queue.new.size
# We see it was added

Now we aren't yet running sidekiq, so it will stay in that queue. We'll run sidekiq in a new terminal:

sidekiq

We can see it fetched the job. Now we'll kill -9 sidekiq:

# We can see sidekiq's pid from its log
kill -9 <PID>

The job never completed, and it wasn't pushed back into the queue because the worker process was ruthlessly killed. We can see that the queue is empty now:

Sidekiq::Queue.new.size

OK, so that mail's never going to be sent. Sidekiq takes quite a few pains to push unfinished jobs back onto the queue when it shuts down cleanly, but a kill -9 gives it no chance to do that. The only way to guarantee reliability is for jobs to not be removed from the queue until completed.
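
To make the failure mode concrete, here's a rough sketch of what a destructive pop looks like. It uses a plain Ruby array as a stand-in for the Redis list, so DestructiveQueue and work_and_crash are made-up names for illustration and not anything from Sidekiq itself. A raised exception stands in for the kill -9.

```ruby
# In-memory stand-in for a Redis list; not real Sidekiq or Redis code.
class DestructiveQueue
  def initialize
    @jobs = []
  end

  def push(job)
    @jobs.unshift(job) # like LPUSH: add at the head
  end

  def pop
    @jobs.pop # like (B)RPOP: remove from the tail, the list forgets the job
  end

  def size
    @jobs.size
  end
end

# Hypothetical worker that dies mid-job, standing in for our kill -9.
def work_and_crash(queue)
  job = queue.pop
  raise "worker killed while processing #{job}"
end

queue = DestructiveQueue.new
queue.push("contact_email")

begin
  work_and_crash(queue)
rescue RuntimeError
  # The worker is gone and nothing holds a reference to the job anymore.
end

puts queue.size
```

The queue is empty after the crash even though the job never finished, which is the same thing we just saw in the console.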

Sometimes, the default reliability is just fine. For businesses that need guaranteed reliability, it's a dealbreaker. For those use cases, Sidekiq Pro offers improved reliability in a couple of different ways. The original approach uses the RPOPLPUSH command in Redis, which atomically moves a job onto a second list rather than simply removing it. You get this behaviour by enabling the reliable_fetch algorithm. This is the ideal algorithm if you're running your own servers, but it should be avoided for containers or Platforms-as-a-Service.
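
Here's a small in-memory model of the RPOPLPUSH pattern, just to show the idea. It's a sketch under my own assumptions, not Sidekiq Pro's code: ReliableQueue, acknowledge and recover_orphans are hypothetical names, and two Ruby arrays stand in for the Redis lists.

```ruby
# In-memory model of the RPOPLPUSH pattern; a sketch, not Sidekiq Pro's code.
class ReliableQueue
  attr_reader :pending, :working

  def initialize
    @pending = [] # the public queue
    @working = [] # private list of jobs currently being processed
  end

  def push(job)
    @pending.unshift(job)
  end

  # Like RPOPLPUSH: move a job from pending to working in one step.
  def fetch
    job = @pending.pop
    @working.unshift(job) unless job.nil?
    job
  end

  # Called only after the job finished; like LREM on the working list.
  def acknowledge(job)
    index = @working.index(job)
    @working.delete_at(index) if index
  end

  # Called on startup: anything still in working belonged to a dead process.
  def recover_orphans
    @pending.push(@working.pop) until @working.empty?
  end
end

queue = ReliableQueue.new
queue.push("contact_email")

job = queue.fetch # the worker picks up the job...
# ...and is killed here, so acknowledge is never called.

queue.recover_orphans # a restarted worker rescues the orphaned job
puts queue.pending.inspect
```

The job is only ever in one of the two lists, so a crash between fetch and acknowledge leaves it sitting in the working list where it can be found again.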

We're deploying to Heroku, so we'll end up using the timed_fetch algorithm instead. It's less scalable and has an additional requirement that all the jobs it runs have to complete within a globally configured timeout, but it's easier to set up and can be used with containers and Platforms-as-a-Service. Life's about tradeoffs, so it's worth reading up on the two different algorithms and understanding the tradeoffs at play. We'll move on with timed_fetch.
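
To see why a timeout-based approach needs every job to finish within the timeout, here's a conceptual sketch. I'm assuming a simple model where each fetched job gets a deadline and anything past its deadline is pushed back. TimedQueue and its methods are hypothetical, and this is not how Sidekiq Pro actually implements timed_fetch internally. Time is passed in as a plain number so the example is deterministic.

```ruby
# Conceptual model of a timeout-based fetch; an assumption-laden sketch,
# not Sidekiq Pro's actual timed_fetch implementation.
class TimedQueue
  attr_reader :pending

  def initialize(timeout)
    @timeout = timeout # seconds every job must finish within
    @pending = []
    @in_flight = {}    # job => time at which it is presumed lost
  end

  def push(job)
    @pending.unshift(job)
  end

  def fetch(now)
    job = @pending.pop
    @in_flight[job] = now + @timeout unless job.nil?
    job
  end

  def acknowledge(job)
    @in_flight.delete(job)
  end

  # Re-enqueue every job whose deadline has passed without an acknowledgement.
  def requeue_expired(now)
    expired = @in_flight.select { |_job, deadline| deadline <= now }.keys
    expired.each do |job|
      @in_flight.delete(job)
      @pending.push(job)
    end
    expired
  end
end

queue = TimedQueue.new(30)
queue.push("contact_email")
queue.fetch(0) # fetched at t=0, then the worker dies

early = queue.requeue_expired(10) # too soon: the job may still be running
late = queue.requeue_expired(30)  # timeout elapsed: the job is pushed back

puts early.inspect
puts late.inspect
puts queue.pending.inspect
```

In this model, a healthy job that takes longer than the timeout looks exactly like a lost one, so it would be pushed back and run twice. That's the tradeoff behind the globally configured timeout.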

First off, I've got my project pulling in Sidekiq Pro now. Once you sign up you'll get a custom URL to fetch the gem from. Then you just put it in your Gemfile like so:

vim Gemfile
# gem 'sidekiq'
source ENV["SIDEKIQ_SOURCE_URL"] do
  gem 'sidekiq-pro'
end

I have, of course, hidden this URL since it is specific to each customer, but wanted you to see how easy it was to use. Anyway, on to the project.

bundle

After you've installed the gem, you can enable Sidekiq Pro's Reliability features in your initializer. It's easy once you understand the options:

vim config/initializers/sidekiq.rb

Reliable push will help guarantee that jobs the client pushes make it to Redis in the event of network failures.

# We don't enable reliable push in test mode to avoid swallowing errors
Sidekiq::Client.reliable_push! unless Rails.env.test?

Sidekiq.configure_server do |config|
  # uncomment ONE of these fetch algorithms:
  # uncomment this if you are on your own servers
  #config.reliable_fetch!
  # uncomment this if you are using containers (Docker, AWS ECS) or on a PaaS like Heroku
  config.timed_fetch! 30
  # One of the tradeoffs with timed fetch is that you must ensure that your jobs complete
  # within a given timeout.  It's 3600 seconds by default, but to make this a
  # bit easier to demonstrate I've set it to 30 seconds.

  config.reliable_scheduler!
end
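
If it helps to picture what reliable push is doing, here's a rough sketch of a client that holds on to jobs locally when it can't reach the store and sends them once it can. FlakyStore and BufferingClient are hypothetical classes I made up for this, with an array standing in for Redis, so treat it as an illustration of the idea and not Sidekiq Pro's implementation.

```ruby
# Sketch of a client that buffers jobs locally when the store is unreachable.
# FlakyStore and BufferingClient are hypothetical names for illustration.
class FlakyStore
  attr_reader :jobs
  attr_accessor :online

  def initialize
    @jobs = []
    @online = true
  end

  def push(job)
    raise IOError, "connection refused" unless @online
    @jobs << job
  end
end

class BufferingClient
  attr_reader :backlog

  def initialize(store)
    @store = store
    @backlog = [] # jobs waiting for the network to come back
  end

  def push(job)
    flush
    @store.push(job)
  rescue IOError
    @backlog << job # swallow the error and keep the job locally
  end

  def flush
    until @backlog.empty?
      @store.push(@backlog.first) # raises if the store is still offline
      @backlog.shift              # only drop it once the push succeeded
    end
  end
end

store = FlakyStore.new
client = BufferingClient.new(store)

store.online = false
client.push("job-1") # network down: kept in the local backlog
store.online = true
client.push("job-2") # network back: the backlog drains first, then job-2

puts store.jobs.inspect
```

Notice that the client rescues the network error instead of raising it. That kind of error swallowing is the reason we skip reliable push in test mode.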

Alright, that's a lot to take in but at the end of the day it's just a few lines of configuration to get additional reliability. Let's run the same test again:

rails c
require 'sidekiq/api'
Sidekiq::Queue.new.size
# We can see there's nothing in the queue
# We'll add our mailer job:
VisitorMailer.contact_email("Josh", "josh@dailydrip.com", "Some message").deliver_later
Sidekiq::Queue.new.size
# We see it was added

We'll start sidekiq again.

sidekiq

We can see it fetched the job. Now we'll kill -9 sidekiq:

# We can see sidekiq's pid from its log
kill -9 <PID>

So the process was killed before the job completed. Let's see if it's still in the queue:

Sidekiq::Queue.new.size
# Well, it's not in the default queue.  Let's fire up sidekiq again to see what
# happens
sidekiq

We can see that TimedFetch pushed the job back in when we restarted sidekiq, and then it ran the job. So even though the sidekiq process crashed, the job was recovered and completed after the restart.

Pausing Queues

You can also pause queues in Sidekiq Pro. There are various reasons you might want to do this. A good example is a known bug in some code: you can pause the affected queue until the bug is fixed, which reduces the number of jobs that are run with the problematic code.

To pause a queue, you can use the API:

require 'sidekiq/pro/api'
q = Sidekiq::Queue.new("default")
q.pause!
# => true
q.paused?
# => true
q.unpause!
# => true
q.paused?
# => false
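
To get a feel for what pausing means for the workers, here's a toy model where the fetch step simply skips any paused queue. PausableQueue and next_job are hypothetical and only mirror the pause!/unpause!/paused? names from the API above. This is my own sketch of the behaviour, not Sidekiq Pro's internals.

```ruby
# Toy model of pausable queues; hypothetical classes, not Sidekiq Pro's internals.
class PausableQueue
  attr_reader :name, :jobs

  def initialize(name)
    @name = name
    @jobs = []
    @paused = false
  end

  def pause!
    @paused = true
  end

  def unpause!
    @paused = false
  end

  def paused?
    @paused
  end
end

# Fetch the next job, skipping any queue that is currently paused.
def next_job(queues)
  queue = queues.find { |q| !q.paused? && !q.jobs.empty? }
  queue && queue.jobs.shift
end

default = PausableQueue.new("default")
mailers = PausableQueue.new("mailers")
default.jobs << "buggy_job"
mailers.jobs << "contact_email"

default.pause!
first = next_job([default, mailers])  # default is skipped while paused
default.unpause!
second = next_job([default, mailers]) # default is served again

puts first
puts second
```

In this model the paused queue keeps its jobs. They just wait there until the queue is unpaused.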

Summary

So that wraps it up. We covered the first couple of features in Sidekiq Pro: Reliability and Pausing Queues. See you soon!

Resources