Reducing network overhead in Redis calls with Lua

For one of our applications, we cache a variety of data in Redis for use in later calculations. When profiling the application, I noticed that one of the caching steps was taking much longer that any of the others. Looking at the profiler output quickly showed why:

         3863231 function calls in 266.866 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    50992  259.846    0.005  259.846    0.005 {method 'recv' of '_socket.socket' objects}
    50992    1.293    0.000    2.491    0.000
    50994    0.802    0.000    0.802    0.000 {method 'sendall' of '_socket.socket' objects}
    50992    0.716    0.000  261.088    0.005
   836706    0.516    0.000    0.516    0.000 {isinstance}
    50992    0.450    0.000  266.530    0.005
   203968    0.392    0.000    0.717    0.000
    50992    0.249    0.000  261.709    0.005
    50990    0.210    0.000    0.210    0.000
    50992    0.199    0.000    0.199    0.000 {method 'feed' of 'hiredis.Reader' objects}
    50992    0.191    0.000    1.142    0.000
    50992    0.171    0.000    3.804    0.000
    50992    0.164    0.000    0.311    0.000
   254960    0.163    0.000    0.163    0.000 {method 'join' of 'str' objects}
    50992    0.147    0.000    0.257    0.000
   713889    0.142    0.000    0.142    0.000 {len}
    50992    0.138    0.000  261.250    0.005
   101984    0.124    0.000    0.151    0.000

Look at all that time recv’ing!

The problematic piece of application code looked something like this:

for k, v in some_things.iteritems():
    increment_by = 1
    redis_db.zincrby(ZSET, k, increment_by)

It just loops over a dictionary and for each key (k) increments its score by one within the sorted set it’s stored in. So, for dictionaries with many keys, the repeated calls to ZINCRBY really began to add up, as the profiler output indicated. Obviously a quick way to reduce the amount of time spent doing this needless work would would be to push it to the server side, and that’s where Lua comes in.

Since version 2.6.0, Redis has shipped with a built-in Lua interpreter. From what I can tell it was introduced partly as the maintainer’s way of deflecting new feature requests for obscure use cases (“Write your own Lua script to do that, I won’t add it to Redis.”) For more information, see the official documentation and a short introduction to writing Lua scripts for Redis.

Replacing the loop above with a Lua script was pretty straightforward. You can simply define the script as a string, register the script with the server, and then get to using it. As Lua script execution is blocking, it was necessary to add a way to break the items to be incremented into reasonable chunks, so as to avoid asking the server to increment hundreds of thousands of scores at once. I also tested out replacing the original loop with Redis’ pipeline functionality, but the Lua version won out in the end.

lua = """
for i, k in pairs(ARGV) do"ZINCRBY", KEYS[1], 1, k)
increment = redis_db.register_script(lua)
for chunk in generate_chunk_from_iterable(some_things.iterkeys(), CHUNK_SIZE):
    increment(keys=[ZSET], args=chunk)

The results from the profiler were pretty convincing compared to the previous method:

         1048325 function calls in 1.851 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5    0.643    0.129    0.643    0.129 {method 'recv' of '_socket.socket' objects}
      615    0.295    0.000    0.295    0.000 {method 'sendall' of '_socket.socket' objects}
        5    0.252    0.050    0.634    0.127
   101999    0.131    0.000    0.272    0.000
        1    0.125    0.125    0.125    0.125 {zlib.compress}
        1    0.112    0.112    0.112    0.112 {cPickle.dumps}
   347687    0.100    0.000    0.100    0.000 {isinstance}
   102004    0.064    0.000    0.064    0.000 {method 'join' of 'str' objects}
    35914    0.041    0.000    0.041    0.000 {method 'encode' of 'unicode' objects}
   305707    0.032    0.000    0.032    0.000 {len}
   102013    0.014    0.000    0.014    0.000
    50990    0.011    0.000    0.011    0.000 {round}
        1    0.007    0.007    1.504    1.504
        5    0.005    0.001    1.591    0.318

So remember, when inserting or updating many values, avoid making a lot of round-trips, because the network is always slower than RAM.

logstash tuning on AWS

During a recent test where I put up 125+ AWS instances to do some work I ran into an issue. All of the instances are pushing their logs via logstash-forwarder to a load balanced logstash cluster. Things were running fine but logging failed. It was nice to know that logging doesn’t bring the instances to halt, but the logs are being centrally collected for a reason.

After some digging I found that the issue was overrunning the logstash recipients memory. Basically the logs were flooding in at 4,500-5,500/s which exceeded what logstash could process. The events pilled up and boom. Not enough memory:

Error: Your application used more memory than the safety cap of 4G.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

The logstash instances are running on c3.xlarge instance types and I decided to perform some tests by throwing 250,000 events at them to see how fast they would process. Basically it came in at about 1,800/s. That number seemed low and I started playing around with the logstash settings.

Since we are using the elasticsearch_http output from logstash I experimented with the number of workers (default 1) for that output plugin. 2 was the sweet spot and I managed to increase the throughput to around 2,100/s. Not a great improvement, but I figured more should be possible.

The c3.xlarge comes with 4 cores and 7.5GB of RAM, but when I was testing the load stayed very low at around 0.5. Clearly I wasn’t getting the full value.

Logstash can also adjust the number of filter workers via the -w flag. I figured the filter might just be where things are getting stuck and so I re-ran my tests with various combinations of filter workers and elasticsearch_http workers.

I’ll skip all the details, but for the c3.xlarge instance type I ended up reaching an ingestion rate about 3,500/s or nearly double the original. That rate was achieved by using

  • filter workers = 8
  • elasticsearch_http workers = 4

Changing either of these up or down reduced the overall rate. It also pushed the load to around 3.7+.

I think a lot more experimentation could be done with different instance types and counts, but for now I’m pretty happy with the new throughput which let’s me run a lot fewer instances to get to target rate of 30,000 events/s. For now I think I have a decent formula to drive any further tests:

  • filter workers = 2 * number of cores
  • elasticsearch_http workers = filter workers / 2

There is still a concern  over load balancing the logstash service, which runs as a TCP service and connections from the forwarders persist. That means that if the right number of forwarding instances all tied to the same endpoint start pushing a lot, we might still overrun the box. There are some good ideas around putting a Redis or RabbitMQ layer in between, but that’s an experiment for another day.