During a recent test in which I spun up 125+ AWS instances to do some work, I ran into an issue. All of the instances push their logs via logstash-forwarder to a load-balanced logstash cluster. The work itself ran fine, but logging failed. It was nice to know that a logging failure doesn’t bring the instances to a halt, but the logs are being centrally collected for a reason.
After some digging I found that the logstash recipients were running out of memory. The logs were flooding in at 4,500-5,500/s, which exceeded what logstash could process. The events piled up and, boom, not enough memory:
Error: Your application used more memory than the safety cap of 4G. Specify -J-Xmx####m to increase it (#### = cap size in MB). Specify -w for full OutOfMemoryError stack trace
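The error message itself points at the fix for the immediate crash: raise the JVM heap cap. How you pass that through depends on how logstash is launched; a sketch, assuming the 1.4-era package wrapper script that reads `LS_HEAP_SIZE` (the paths here are placeholders for your own setup):

```
# Raise the heap cap from the 4G default to 8G before starting the agent.
# LS_HEAP_SIZE is honored by the stock logstash wrapper script; if you
# launch the jar directly, pass -J-Xmx8192m instead, as the error suggests.
LS_HEAP_SIZE=8g /opt/logstash/bin/logstash agent -f /etc/logstash/central.conf
```

Of course more heap only buys headroom; if events keep arriving faster than they are processed, the queue will eventually fill again, which is why the rest of this post is about throughput.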
The logstash instances run on the c3.xlarge instance type, and I decided to run some tests by throwing 250,000 events at them to see how fast they would be processed. The rate came in at about 1,800/s. That seemed low, so I started playing around with the logstash settings.
Since we are using the elasticsearch_http output from logstash, I experimented with the number of workers (default 1) for that output plugin. Two was the sweet spot, and I managed to increase the throughput to around 2,100/s. Not a great improvement, but I figured more should be possible.
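For reference, the workers knob lives on the output block in the logstash config. A sketch of the relevant section (the host name is a placeholder for our cluster endpoint):

```
# elasticsearch_http output with the worker count raised from the
# default of 1. Each worker maintains its own connection to
# elasticsearch, so this parallelizes the indexing side.
output {
  elasticsearch_http {
    host    => "es.internal.example.com"   # placeholder
    workers => 2
  }
}
```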
The c3.xlarge comes with 4 cores and 7.5GB of RAM, but when I was testing the load stayed very low at around 0.5. Clearly I wasn’t getting the full value.
Logstash can also adjust the number of filter workers via the -w flag. I figured the filter stage might be where things were getting stuck, so I re-ran my tests with various combinations of filter workers and elasticsearch_http workers.
I’ll skip all the details, but for the c3.xlarge instance type I ended up reaching an ingestion rate of about 3,500/s, or nearly double the original. That rate was achieved by using:
- filter workers = 8
- elasticsearch_http workers = 4
Changing either of these up or down reduced the overall rate. It also pushed the load to around 3.7+.
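Put together, that winning combination is one command-line flag plus the output setting. A sketch, with placeholder paths (the config file would carry `workers => 4` in its elasticsearch_http output block):

```
# -w 8 sets eight filter workers; the elasticsearch_http output in
# central.conf is configured with workers => 4.
/opt/logstash/bin/logstash agent -f /etc/logstash/central.conf -w 8
```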
I think a lot more experimentation could be done with different instance types and counts, but for now I’m pretty happy with the new throughput, which lets me run a lot fewer instances to get to the target rate of 30,000 events/s. For now I think I have a decent formula to drive any further tests:
- filter workers = 2 * number of cores
- elasticsearch_http workers = filter workers / 2
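The formula is simple enough to compute from the core count at launch time. A small sketch (the variable names are mine, not anything logstash reads):

```shell
# Derive worker counts from the number of cores, per the formula above:
# filter workers = 2 * cores, elasticsearch_http workers = half that.
cores=$(nproc)
filter_workers=$((cores * 2))
es_workers=$((filter_workers / 2))
echo "filter workers: $filter_workers, elasticsearch_http workers: $es_workers"
```

On a c3.xlarge (4 cores) this reproduces the 8/4 split found above.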
There is still a concern about load balancing the logstash service: it accepts TCP connections from the forwarders, and those connections persist. That means that if enough forwarding instances all pinned to the same endpoint start pushing a lot, we could still overrun that box. There are some good ideas around putting a Redis or RabbitMQ layer in between, but that’s an experiment for another day.
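For the record, the usual shape of that broker idea with Redis is a pair of config fragments: lightweight shippers push events onto a Redis list, and the central indexers pop from it, so bursts queue in Redis instead of overrunning an indexer. A sketch, with placeholder host and key names:

```
# Shipper side: buffer events into a Redis list.
output {
  redis {
    host      => "redis.internal.example.com"   # placeholder
    data_type => "list"
    key       => "logstash"
  }
}

# Indexer side: drain the same list at whatever rate the filters
# and elasticsearch_http output can sustain.
input {
  redis {
    host      => "redis.internal.example.com"   # placeholder
    data_type => "list"
    key       => "logstash"
  }
}
```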