Exponential Backoff

March 13, 2015

Summary: A common failure in distributed systems is a server with a rate limit or with no limit but begins failing due to load. A standard solution is to retry after waiting a small time, increasing that time after each failure. We create a macro to handle this waiting and retrying.

A few days ago I wrote about a high-level way of handing intermittent errors, particularly in a distributed system. The way was simplistic: when you get an error, try again, up to a few errors. A slightly more nuanced approach is to back off before you try again. Each time there's an error, you wait longer, until some maximum time is reached.

The problem

Let's say you're hitting a service with a rate limit. That rate limit could be enforced or implicit1. You've got lots of computers hitting it, and it's impossible to coordinate. No matter how hard you try to keep under that rate limit (and you should try), you will eventually break the limit. Retrying immediately when the server is too busy will actually make the problem worse. You will give it yet another request to deny. At the same time, it might be hard to distinguish "I'm too busy right now" from "I'm never going to recover".

The solution

I don't know what it's really called. I call it Exponential Backoff. It's also easy to turn into a separate routine:

(defn exponential-backoff [time rate max f]
  (if (>= time max) ;; we're over budget, just call f
    (f)
    (try
      (f)
      (catch Throwable t
        (Thread/sleep time)
        (exponential-backoff f (* time rate) rate max)))))

This one has the same structure as try-n-times but will sleep before recursing. When it recurses, the time is multiplied by the rate. And when the last wait is more than the max, it will try one more time. Failures from that last try will propagate.

How to use it

Same as with try-n-times:

(exponential-backoff 1000 2 10000
  #(http/get "http://rate-limited.com/resource"
             {:socket-timeout 1000
              :conn-timeout   1000}))

This will retry after waiting 1 second (1000 ms) the first time, then double it (the 2) each time. When it waits 10 seconds, it won't retry any more.

Slightly more useful

Ok, so I don't use this exactly. What I use is slightly more complicated. I've found that I often can tell if it's a rate limiting problem if I look at the exception. So, let's pass it a predicate to check.

(defn exponential-backoff [time rate max p? f]
  (if (>= time max) ;; we're over budget, just call f
    (f)
    (try
      (f)
      (catch Throwable t
        (if (p? t)
          (do
            (Thread/sleep time)
            (exponential-backoff f (* time rate) rate max))
          (throw t))))))

This one only recurses if the predicate returns true on the exception. Let's service mentions "queue capacity" in the body of the HTTP response when it's too busy:

(exponential-backoff 1000 2 10000
  (fn [t] ;; the predicate
    (and (instance? clojure.lang.ExceptionInfo t)
         (re-find #"queue capacity" (:error (ex-data t)))))
  #(http/get "http://rate-limited.com/resource"
             {:socket-timeout 1000
              :conn-timeout   1000}))

You can be more selective about your backoff.

A Macro

Well, here's an example macro. It's got a bunch of defaults.

(defmacro try-backoff [[time rate max p?] & body]
  `(exponential-backoff (or ~time 1000) ;; defaults!
                        (or ~rate 2)
                        (or ~max 10000)
                        (or ~p? (constantly true))
                        (fn [] ~@body)))

Here's how you use it:

(try-backoff []
  (println "trying!")
  (do-some-stuff))

Also, add it to your Clojure Emacs config for better formatting, because this one wants the args on the first line:

(put-clojure-indent 'try-backoff 1)

This tells Emacs to make the second argument ((println "trying!")) one indentation in, instead of directly under the first ([]).

Warning

All of the try3 warnings apply. The stuff you're doing inside needs to be idempotent!

Conclusion

This pattern is another cool, reusable component to help build reliability into a distributed system. Small, intermittent failures are pervasive. And a common form of error is a server being too busy. Being able to handle this type of error quickly and systematically is going to make your life easier.

Though Clojure does not have specific solutions to distributed systems problems, coding them up is short and straightforward. If you're interested in learning Clojure, I suggest you check out LispCast Introduction to Clojure. It's a video course that uses animation, storytelling, and exercises to install Clojure into your brain.

Learn Functional Programming using Clojure with screencasts, visual aids, and interactive exercises
Learn more

You might also like


  1. meaning the server can only handle so many jobs at once, and the behavior is undefined at that point

Pre-conj Prep: David Pick

September 30, 2014

Talk: Building a Data Pipeline with Clojure and Kafka

Background

David Pick's talk at the conj is about Kafka, which is a distributed messaging system. It implements a publish/subscribe model and runs in a cluster. Its documentation claims that Kafka has very high throughput.

Why it matters

Braintree, where David Pick works, is a huge payment processor, now owned by PayPal. They must have very strict availability and consistency requirements in addition to the scale. The talk description hints that Clojure was helpful to their success. Some of the best talks I've seen are this kind of experience report.

About David Pick

Twitter - Github


This post is one of a series called Pre-conj Prep, which originally was published by email. It's all about getting ready for the upcoming Clojure/conj, organized by Cognitect. Conferences are ongoing conversations and explorations. Speakers discuss trends, best practices, and the future by drawing on the rich context built up in past conferences and other media.

That rich context is what Pre-conj Prep is about. I want to enhance everyone's experience at the conj by surfacing that context. With just a little homework, we can be better prepared to understand and enjoy the talks and the hallway conversations, as well as the beautiful venue and city of Washington, DC.

Clojure/conj is a conference organized and hosted by Cognitect. This information is in no way official. It is not sponsored by nor affiliated with Clojure/conj or Cognitect. It is simply me curating and organizing public information about the conference.

You might also like

Pre-Conj Interview: David Pick

October 15, 2014

Introduction

David Pick was kind enough to answer a few questions about himself and his upcoming Clojure/conj talk. You may want to read the background before you begin.

Interview

LispCast: How did you get into Clojure?

David Pick: A friend of my mine suggested I read Purely Functional Data Structures and told me the work in the book was the basis for Clojure's persistent data structures. I really enjoyed it and decided to give Clojure a try.

LC: Braintree must process a lot of very important data, since it's a payments service. What were you using before you developed the Clojure and Kafka system?

DP: I'd say for 98% of our systems we still use Ruby and Postgres, we're simply using Clojure and Kafka as a messaging platform to move data between systems.

LC: What is Kafka?

DP: Kafka is a system for doing publish subscribe messaging, that was built by LinkedIn to have extremely high throughput and replay ability of messages. It stores all messages sent through the system for a rolling window. So consumers simply say give me messages starting with offset n or whatever your most recent message is. This way if a receiving system goes down the consumer can just rewind the stream and replay messages that were missed.

LC: Could you describe the problem you were trying to solve? What were the requirements?

DP: There were actually two problems we were trying to solve. The first was to pull our search tool into a separate system that was more scalable than our current one (the current one just runs sql queries on main database). The second was to improve the accuracy and speed that our data warehouse is updated.

For the infrastructure part of the solution we ended up using Kafka, Zookeeper, and Elasticsearch which are all JVM based so that ended up heavily influencing our decision to use Clojure.

LC: You mentioned in the talk description that Clojure had certain advantages. What tradeoffs did Clojure, or your choice of technologies, bring about?

DP: The biggest tradeoff I think we made with Clojure is familiarity, I was the only one on the team who knew Clojure prior to starting this project. So that has certainly slowed us down. Other than that things have gone surprisingly well.

LC: I'm really fascinated by the engineering efforts to build very reliable and robust systems. Are you going to talk about that?

DP: Yep, that's the plan!

LC: Awesome.

Do you have any resources that would help someone new to Clojure or new to Kafka so that they can make the most of your talk? Some pre-reading or pre-watching materials?

DP: For Clojure I definitely recommend both the Clojure Koans and 4Clojure. For Kafka I'd recommend the introduction on their site.

LC: Where can people follow your adventures online?

DP: Twitter is probably the best place @davidpick.

LC: Who would win in a no-holds-barred fight to the finish, Clojure with an Arduino-controlled laser or a T-Rex?

DP: If Rich is writing the Clojure I give it a fighting chance, otherwise it's the T-Rex for sure.

LC: Awesome! Thanks for a great interview. I look forward to seeing you at the Conj.

DP: Awesome, thanks Eric!


This post is one of a series called Pre-conj Prep, which originally was published by email. It's all about getting ready for the upcoming Clojure/conj, organized by Cognitect. Conferences are ongoing conversations and explorations. Speakers discuss trends, best practices, and the future by drawing on the rich context built up in past conferences and other media.

That rich context is what Pre-conj Prep is about. I want to enhance everyone's experience at the conj by surfacing that context. With just a little homework, we can be better prepared to understand and enjoy the talks and the hallway conversations, as well as the beautiful venue and city of Washington, DC.

Clojure/conj is a conference organized and hosted by Cognitect. This information is in no way official. It is not sponsored by nor affiliated with Clojure/conj or Cognitect. It is simply me curating and organizing public information about the conference.

You might also like

Try Three Times

March 05, 2015

Summary: Distributed systems fail in indistinguishable ways. Often, retrying is a good solution to intermittent errors. We create a retry macro to handle the retries in a generic way.

Let's face it: your system is probably a distributed system. All web apps are by definition distributed. They have at least one server, probably a separate database server, and many browser clients. And now microservices are getting popular. Distributed is the current and future normal. While Clojure solves the problems of multiple cores sharing memory at the language level, distributed systems problems are left to be addressed at the application level.

The problem

One big problem that comes up all the time in distributed systems is dealing with failure. Failure happens everywhere. The problem in a distributed system is that you don't know where the failure happened. For example, let's say you make an HTTP GET request and 20 seconds later, you're still waiting for the response. Is it:

It is literally impossible to know what the problem is. And that's ok. There's a lot of machinery between one machine and the next. Even if you could diagnose the problem, are you really going to program each error case?

Metaphor

Let's say you call your friend and they don't pick up. Are they asleep? Is their phone off? Did the call not go through? The phone won't tell you. And you really want to talk to them. So what do you do? You call back. You might even call back a couple of times. If they pick up when you call back, great! If not, then you get tired and give up.

That's a common approach in distributed systems as well: retry your distributed message a few times before you give up. It's easy and fixes a surprising number of problems. What's more, there's a good solution that's simple in the Hickeyan sense.

The solution

Failure in Clojure typically means an Exception. So we'll need to catch exceptions and run code multiple times.

(defn try-n-times [f n]
  (if (zero? n)
    (f)
    (try
      (f)
      (catch Throwable _
        (try-n-times f (dec n))))))

You pass it a function and a number of times to retry it. The base case is when n is 0. In that case, it will just try it (not retry). If it's greater than 0, it will wrap the function call in a try/catch, catch everything, and recurse. If after n retries, is still throws an exception, try-n-times will fail and some other code will have to deal with it. The concern of retrying is separated from what is being retried.

How do you use it?

Wrap your distributed calls in this bad boy and you're good to go.

Instead of this:

(http/get "http://somewhat-reliable.com/resource"
          {:socket-timeout 1000
           :conn-timeout   1000})

You do this:

(try-n-times #(http/get "http://somewhat-reliable.com/resource"
                        {:socket-timeout 1000
                         :conn-timeout 1000}) 2)

Remember, n is the number of retries. So that's 1 try + 2 retries.

Macro, anyone?

Alright, yes, I made a macro for that. It does come in handy to have a macro that you can put code in instead of passing in a function.

(defmacro try3 [& body]
  `(try-n-times (fn [] ~@body) 2))

This one is used like this:

(try3
  (println "trying!")
  (do-some-stuff))

Warning

Now, a little care needs to be taken when you use this. Remember, when you get a failure, it could be a timeout. The server could be processing your request. Or it could have failed halfway through a multi-step process. What that means practically is that your distributed message has to be idempotent. HTTP GET is idempotent, so it's ok. POST generally is not, but sometimes it is. Use your judgment! Also, you should make your call timeout, to turn long waits into errors.

Conclusion

This pattern is just one piece of a larger distributed system puzzle. The network and servers are unreliable. They might work the whole time during development, but in the fullness of time, an always-on distributed system will have some kind of failure eventually. Sometimes the failures are temporary, and in those cases, a quick retry can fix it right away.

Though Clojure does not have specific solutions to distributed systems problems, coding them up is short and straightforward. If you're interested in learning Clojure, I suggest you check out LispCast Introduction to Clojure. It's a video course that uses animation, storytelling, and exercises to install Clojure into your brain.

Learn Functional Programming using Clojure with screencasts, visual aids, and interactive exercises
Learn more

You might also like