[001.4] Error Handling

A discussion of Sidekiq's retry process, exception monitoring, and idempotence.


Error Handling [05.13.2016]

In the last episode, we built a simple worker. But what happens when our worker has an error? Let's have a look.

Project

We'll modify the existing worker to fail for the super_hard jobs. We'll also add a stub that pretends to charge someone's credit card before the job fails:

class OurWorker
  include Sidekiq::Worker

  def perform(complexity)
    case complexity
    when "super_hard"
      puts "Charging a credit card..."
      raise "Woops stuff got bad"
      puts "Really took quite a bit of effort" # never reached: the raise above aborts the job
    when "hard"
      sleep 10
      puts "That was a bit of work"
    else
      sleep 1
      puts "That wasn't a lot of effort"
    end
  end
end

Now we'll run the server in one terminal, then start irb in another and enqueue a new super_hard job:

bundle exec sidekiq -r ./worker.rb
bundle exec irb -r ./worker.rb
OurWorker.perform_async "super_hard"

In our server's log, we can see that the job failed and that Sidekiq logged the failure. The server is still ready to take more jobs, though:

OurWorker.perform_async "easy"

We can see that the server's still processing jobs.

Now a really important thing to keep in mind is that Sidekiq's going to automatically retry jobs that fail. Remember how ours charged a credit card before failing? If that code doesn't take retries into account, you're going to end up with some really unhappy customers with a ton of repeated charges on their cards. This is why it's important to write your jobs so that they can be retried an arbitrary number of times without causing any problems. This property is known as idempotence.
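Here's a minimal sketch of what an idempotent charge could look like. It isn't code from the episode: FakeGateway, ChargeJob, and the in-memory set of charged orders are all hypothetical stand-ins, and the job is a plain Ruby object so it runs without Redis. A real worker would include Sidekiq::Worker and keep the "already charged" record somewhere durable, such as a database.

```ruby
require 'set'

# Hypothetical stand-in for a payment gateway; it just records each charge.
class FakeGateway
  attr_reader :charges

  def initialize
    @charges = []
  end

  def charge(order_id, cents)
    @charges << [order_id, cents]
  end
end

# A plain Ruby object so the sketch runs without Redis. A real worker would
# include Sidekiq::Worker and keep the "already charged" record in a database.
class ChargeJob
  def initialize(gateway, charged_orders)
    @gateway = gateway
    @charged_orders = charged_orders
  end

  def perform(order_id, cents)
    # Skip the side effect if an earlier attempt already completed it.
    return if @charged_orders.include?(order_id)

    @gateway.charge(order_id, cents)
    @charged_orders << order_id
  end
end

gateway = FakeGateway.new
job = ChargeJob.new(gateway, Set.new)

# Simulate the first attempt plus two retries of the same job.
3.times { job.perform("order-1", 500) }
puts gateway.charges.length # prints 1
```

Note that the check and the record are two separate steps here, so a crash between charging and recording could still produce a second charge. In practice you'd likely lean on a database constraint or an idempotency key sent to the payment provider to close that gap.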

Adding Exception Monitoring

You're not going to be watching these logs for errors in production though! You'll want to add an exception monitoring service to your application, such as Honeybadger or Bugsnag. That way you'll have email notifications and a record of exceptions as they happen in production.

Exception monitoring is a critical step. Without it, you'll think your site is humming along while exceptions pile up and automated processes you thought were working quietly fail. This has bitten me in production, and it's really not something you want to explain to your client or your boss.
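If you're curious how such a service hooks in, Sidekiq lets you register error handlers on the server configuration, and each handler is called with the exception and a context hash when a job raises. The sketch below is an illustration rather than any service's real client: FakeNotifier is a hypothetical stand-in, and the registration is guarded so the example also runs without Sidekiq loaded. The monitoring gems typically install a hook like this for you, so check their docs before writing your own.

```ruby
# Hypothetical notifier standing in for a real monitoring client.
class FakeNotifier
  attr_reader :reports

  def initialize
    @reports = []
  end

  def notify(exception, context)
    @reports << { error: exception.message, context: context }
  end
end

NOTIFIER = FakeNotifier.new

# A proc ignores extra arguments, so it tolerates Sidekiq versions that
# pass more than the exception and the context hash.
REPORT_ERROR = proc do |exception, context|
  NOTIFIER.notify(exception, context)
end

# Register the handler only when Sidekiq is loaded, so the sketch also
# runs standalone.
if defined?(Sidekiq)
  Sidekiq.configure_server do |config|
    config.error_handlers << REPORT_ERROR
  end
end

# Simulate what Sidekiq would do when a job raises.
REPORT_ERROR.call(RuntimeError.new("Woops stuff got bad"), { job: "OurWorker" })
puts NOTIFIER.reports.length # prints 1
```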

Web UI

So at present we have some jobs happening, but we don't have any insight into the Sidekiq system as a whole. Sidekiq ships with a Web UI that you can use to view the state of your various job queues. It's just a Rack app, so it's easy to plug into whatever web framework you might be using. In our case, there's no existing web app, so we'll just set up a quick Rack app to run it standalone.

We'll add rack and sinatra to our Gemfile:

gem 'rack'
gem 'sinatra'

We'll write a quick config.ru rackup file:

require 'sidekiq'

Sidekiq.configure_client do |config|
  config.redis = { db: 1 }
end

require 'sidekiq/web'
run Sidekiq::Web

And that's it. Now we can run the dashboard with rackup from the command line:

rackup

And it's running on port 9292. We'll visit the Web UI and look through it.

OK, so we can see that there are some job retries in place. We know they'll fail, because we still have that exception in our code. Let's take out the exception and restart the worker, then we'll just wait for it to retry.

OK, so as we saw, Sidekiq retried the job and it completed successfully. This is your most typical workflow:

  • You have an exception of some sort - either because of a bug in your code or because of an external service failure.
  • The problem either goes away or you deploy new code to fix it.
  • Sidekiq retries the job, and the end user is none the wiser.
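How long do you have to fix the problem? By default Sidekiq retries a job 25 times, with a delay that grows roughly as the fourth power of the retry count, plus 15 seconds and some random jitter. The sketch below leaves the jitter out (its exact size varies between Sidekiq versions), so treat the numbers as a lower bound; base_retry_delay is a name made up for this example.

```ruby
# Base delay in seconds before each retry, ignoring Sidekiq's random jitter.
def base_retry_delay(retry_count)
  (retry_count**4) + 15
end

DEFAULT_RETRIES = 25 # Sidekiq's default retry limit

delays = (0...DEFAULT_RETRIES).map { |count| base_retry_delay(count) }
total_seconds = delays.sum

puts delays.first(5).inspect                        # prints [15, 16, 31, 96, 271]
puts format("%.1f days", total_seconds / 86_400.0)  # prints 20.4 days
```

So the first few retries happen within seconds or minutes, while the last ones are days apart, which adds up to about three weeks before Sidekiq gives up.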

What happens if you don't resolve the exception before Sidekiq gives up on retrying? Well, let's see. We'll add the exception back in, and we'll modify this worker so that it never retries:

class OurWorker
  include Sidekiq::Worker
  sidekiq_options retry: 0 # no retries: a failed job goes straight to the Dead Job Queue
  # ...

  def perform(complexity)
    case complexity
    when "super_hard"
      puts "charging a credit card"
      raise "Woops stuff got bad"
      puts "Really took quite a bit of effort"
    # ...
    end
  end
end

Now we can restart the workers and run this job again.

If you look in the Web UI now, we have a job in the Dead Job Queue. This is where jobs go to die when their retries have been exhausted. It's useful if you want to keep track of what's gone horribly wrong in your system without being resolved in time. Once you've resolved the issue plaguing these jobs, you can select them and retry them.

Summary

In today's episode, we saw how Sidekiq handles exceptions and discussed how you need to design your jobs to be idempotent in order to avoid problems when they're retried. We also discussed the importance of exception monitoring in production. Finally, we enabled the Web UI and explored how to configure how many times jobs are retried, as well as what happens to them once their retries are exhausted. I hope you enjoyed it. See you soon!

Resources