There are times when a new customer onboards and adds a few hundred business locations to our platform. That is a baptism by fire for our infrastructure. We see a variety of errors at the same time.

Some of it boils down to us attempting to make many simultaneous requests to vendor APIs. “Many” here means a request rate beyond the API’s rate limit, or beyond the load that the vendor API can handle. This is a problem we’ve had for some time now.

These failed requests are retried at a later time until we get a satisfying HTTP response. A satisfying response is a 200, or some request-specific error that we can make sense of - like an unsupported city name being passed.

Something that had been on the backburner for a while was building a throttler for outgoing requests. This way we could configure outgoing request rate limits for different sites. Our roadblocks: earlier, it was not knowing where and how to fit this into our current infrastructure without changing much; later, it was finding the time to implement it.

Over the last week, I wrote an API wrapper for the XYZ vendor API, tested it, and deployed it. Our background workers started making requests to the API, and I noticed an increased error rate. Unusual for me, because this is probably the first time I’ve been involved when a lot of locations are being onboarded at once.

I left home on Monday evening, excited and thinking…

“I’ll prototype a throttler tonight. Let me test this out sometime over the next week on staging. Could go to production in about 2 weeks (while I primarily focus on other tasks during the sprints)”

I finished it up before going to bed and noted where I had to make changes in the project to integrate it. By 11am the next morning, I had sent out a pull request for review. I was confidently chatting with a friend (Suhas Pai) who had come to visit me and Raison. I had already planned to spend most of the day with him - we were meeting in person for the first time.

Around 12 noon: “Our requests to XYZ Vendor API have high failures… Make it work somehow.”

Someone added more business locations. Redis timeouts. 504 Gateway timeouts. 500 internal server errors. All the variety. Just for this vendor.

I could sit down to retry failed jobs and stare at XYZ request stats for the next couple of days, making sure stuff happens. Or…

“Make it work somehow”
That is license to do anything to get it working.
“There are errors in production anyway. Adding a matchstick to the fire won’t make a difference. So let me test the throttler right away… in production”

I would get this throttler working. I bet on my laziness: I don’t want to stare at stats for a couple of days. I’d rather do it for a few hours to test my throttler.

So the code I wrote an hour ago, which I had estimated would reach production in 2 weeks, is now in production (and works as expected, too).

How does the throttling work?

For each site, say example.com, we maintain a configurable soft limit on the maximum number of parallel jobs for that site.

  • To track the slots available per site, we use keys in Redis.
  • Every time a new job begins execution, we check if there is a slot available for the site.
  • If yes, we go ahead and execute it. If not, we push the job to a queue to be executed later (there’s a worker-level sketch of this after the implementation below).

The most important part: set TTLs on the keys. This way a slot won’t stay occupied for long if a job takes way too long or the code crashes unexpectedly.
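Concretely, the per-site slots are just short-lived Redis keys. Here is a rough sketch of that idea using the redis gem directly - the key pattern, limit and TTL below are illustrative, not our exact configuration:

require "redis"
require "securerandom"

redis = Redis.new

site  = "example.com"
limit = 10

# Count how many slot keys currently exist for this site.
# A single SCAN page is enough because this is only a soft limit.
_cursor, keys = redis.scan("0", match: "#{site}:*", count: 50)

if keys.length < limit
  # Claim a slot: a uniquely named key that Redis expires on its own,
  # so a crashed or stuck job can't hold the slot forever.
  slot = "#{site}:#{SecureRandom.hex(5)}"
  redis.set(slot, Time.now.to_i, ex: 120) # TTL of 2 minutes
  puts "slot claimed: #{slot}"
else
  puts "no slots free, try again later"
end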

Decide on an API

  • Get a lock - set a key with a TTL
  • Remove a lock after submission - delete the key
throttler = Throttler.new
lock_id = throttler.get_lock("example.com", limit: 10, expires_in: 2.minutes)
if lock_id
  begin
    # perform the job here
  ensure
    # release the slot once the job is done
    throttler.remove_lock!("example.com", lock_id)
  end
else
  # delay the job
end
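The ensure frees the slot even when the job raises; if the process dies before it gets there, the TTL on the key cleans up for us.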

Dive right into the implementation

class Throttler

  attr_reader :config

  @@default_config = {
    expires_in: 30.seconds
  }

  def initialize(config={})
    @config = @@default_config.merge(config)
  end


  # Tries to claim a slot for the given facet (e.g. "example.com").
  # Returns the lock id on success, false when all slots are in use.
  def get_lock(facet, options={})
    facet_limit = options[:limit] || 1
    unique_key = options[:unique_key] || SecureRandom.hex(5)
    expires_in = options[:expires_in] || config[:expires_in]

    scan_cursor, keys = cache.data.scan "0", match: "#{facet}:*", count: 50
    return false unless keys.length < facet_limit

    # NOTE: This is a soft limit,
    # so we don't care if one more key gets inserted in
    # the window between the scan and the write.

    key = key_name(facet, unique_key)
    cache.write(key, Time.now.to_i, expires_in: expires_in)

    unique_key
  end


  # Releases a previously claimed slot by deleting its key.
  def remove_lock!(facet, unique_key)
    key = key_name(facet, unique_key)
    cache.delete(key)
  end


  # Returns the cache store instance being used
  def cache
    Synup.cache_store
  end


  private

  def key_name(facet, unique_key)
    "#{facet}:#{unique_key}"
  end

end
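For completeness, here is roughly how this plugs into a background job. This is a hypothetical Sidekiq worker - the class name, vendor hostname, limit and 30-second re-enqueue delay are made up for illustration; the actual integration points in our codebase differ:

require "sidekiq"

class XyzVendorJob
  include Sidekiq::Worker

  THROTTLER = Throttler.new

  def perform(location_id)
    lock_id = THROTTLER.get_lock("xyz-vendor-api.com", limit: 10, expires_in: 2.minutes)

    unless lock_id
      # No free slot for this site: push the job back to run a bit later.
      self.class.perform_in(30.seconds, location_id)
      return
    end

    begin
      sync_location(location_id) # placeholder for the actual vendor API call
    ensure
      THROTTLER.remove_lock!("xyz-vendor-api.com", lock_id)
    end
  end
end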

The famous train gif comes to mind.


The train GIF popularly used for depicting fixing errors in a hurry is from the movie “The General” (1927). The entire scene is here.