augmenting the elasticsearch docker container

We are running lots of things in docker. Elasticsearch is one of those things. It’s a very nice way to go, especially since there is an official elasticsearch docker image available.

Since we are running in AWS, we need the elasticsearch-cloud-aws plugin to allow the nodes in the cluster to find each other.

To pull things together, we build a custom docker image based on the official one and simply install the needed plugin. This gives us everything we need to run.
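
For context, the discovery configuration this plugin makes possible looks roughly like the following in elasticsearch.yml (the region and security group values here are placeholders, not our actual settings):

cloud.aws.region: us-east-1
discovery.type: ec2
discovery.ec2.groups: elasticsearch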

However, to make it all happen there are some caveats.

The official image uses the /data directory for logs, data and plugins. The image also exposes /data as a VOLUME. This makes it possible to point the container at a location on the host to keep the heavy write operations for logging and, of course, the data itself out of the container. It also allows for upgrades, etc., by simply pointing a new container at the existing data location.
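
In practice that looks something like the following, where the host path is just an example of wherever we decide to keep the data:

docker run -d --name elasticsearch \
    -p 9200:9200 -p 9300:9300 \
    -v /srv/elasticsearch:/data \
    dockerfile/elasticsearch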

There is a downside to this. The image also places the plugins under /data/plugins, and so when the container starts and sets up the volume, the plugins “vanish”. It’s also worth noting that our custom Dockerfile, which extends the official one, seemed to work just fine with this command:

RUN /elasticsearch/bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.1

No errors are generated by this; however, the plugin does NOT persist into /data/plugins! This seems a bit odd, but in the end the /data location would end up being replaced by the VOLUME regardless.

To work around this, our custom Dockerfile creates /elasticsearch/plugins, modifies the elasticsearch config, and then installs the plugin:

FROM dockerfile/elasticsearch
MAINTAINER Matthias Johnson <mjohnson@catalystsecure.com>

# move the ES plugins away from the /data volume where they won't survive ...
RUN mkdir /elasticsearch/plugins
RUN sed -i 's@plugins:\s*/data/plugins@plugins: /elasticsearch/plugins@' /elasticsearch/config/elasticsearch.yml
# install the AWS plugin
RUN /elasticsearch/bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.1

# Expose ports.
#   - 9200: HTTP
#   - 9300: transport
EXPOSE 9200
EXPOSE 9300

# start the service
ENTRYPOINT [ "/elasticsearch/bin/elasticsearch" ]

Now we can use the resulting image to spin up a container running elasticsearch without having to install the plugin into the /data location before starting the container.
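
Putting it together, building and running the custom image looks something like this (the image tag and host path are just examples):

docker build -t catalyst/elasticsearch-aws .
docker run -d -p 9200:9200 -p 9300:9300 -v /srv/elasticsearch:/data catalyst/elasticsearch-aws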

This approach should also work nicely for other plugins we may need in the future.

@matthias

elasticsearch, redis, rabbitmq … so many choices

The other day I had a conversation with a developer here at Catalyst. We are using Elasticsearch, Redis, and RabbitMQ for various things, and he was wondering “which one should I choose?”. These tools offer different features and play to different strengths, and it’s not always very obvious when to use each one. After I responded to my colleague’s email, I thought it might be worth writing up here.

To begin, both Elasticsearch and Redis are tools that are loosely lumped together as NoSQL systems. RabbitMQ on the other hand is a queuing system. The key uses for each are:

  • Elasticsearch is great for storing “documents”, which might just be logs. It offers a powerful search API to find things
  • Redis is a key/value cache or store. It’s very good at storing things that feel very much like data structures you’d find in programming languages. It is very much focused on speed and performance
  • RabbitMQ allows you to queue things and process items in a very efficient manner. It offers a nice and consistent abstraction over other ways of working through items, such as looping over log content, traversing the file system, or building another system on a traditional RDBMS

Next I’ll offer some observations of these tools and possible use cases.

Elasticsearch

As I mentioned, Elasticsearch is a document store with excellent searchability. A very common use is for aggregating logs via logstash. In that case you can think of each log event or line as a “document”. While that is a very common use of Elasticsearch these days, the use can go much further. Many companies are using it to index various content structures to make them searchable. For example we are using it to search for content we extract from files.

Elasticsearch stores its content as JSON. This makes it possible to leverage the structure. For example, fields can be stored, searched, and retrieved. This feels a little like a select column from table; statement, though the comparison loses value quickly.
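
As a rough sketch with the official Python client (the index, document, and field names here are made up for illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# index a "document" -- here a hypothetical extracted file
es.index(index="documents", doc_type="file", id=1,
         body={"name": "report.pdf", "content": "quarterly results ..."})

# search on a field and pull back only the fields we care about,
# a bit like selecting columns
results = es.search(index="documents",
                    body={"query": {"match": {"content": "quarterly"}},
                          "_source": ["name"]})
for hit in results["hits"]["hits"]:
    print(hit["_source"]["name"])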

In general I think of it as a place to persist data for the long term. Elasticsearch also makes operational tasks, including replication, pretty easy, which reinforces the persistence impression.

If you need full text search and/or want to store things for a long time, Elasticsearch is a good choice.

Redis

I think of Redis as being very much focused on speed. It’s a place to store data that is needed in a lot of places and needed fast. Storing session data, which will be used by every service behind a load balancer, is a good example.
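
A quick sketch with the Python redis client (the key name and expiry are made up):

import redis

r = redis.StrictRedis(host="localhost", port=6379)

# store session data with a 30 minute expiry
r.hmset("session:abc123", {"user_id": 42, "role": "reviewer"})
r.expire("session:abc123", 1800)

# any service behind the load balancer can now grab it quickly
session = r.hgetall("session:abc123")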

Another example might be to aggregate and update performance data quickly. The Sensu monitoring framework does just that.

In general Redis is a great choice where you need to get specific values or datasets as you would with variables in a programming language. While there are persistence options, I tend to think of Redis primarily as a tool to speed things up.

In a nutshell, I would use Redis for fast access to specific data in a cache sort of way.

RabbitMQ

RabbitMQ is a queuing service where you put things to be handed to other systems. It allows different systems to communicate with each other without having to build that communication layer.

In our case we frequently need to do things with files. So a message is placed in the queue pointing to that file. Another system then subscribes to the queue and, when a file shows up, takes the appropriate action. This could also be a log event or anything else that would warrant an action to be taken somewhere else.

While I’m generally a big fan of RESTful architectures, I’m willing to compromise when it comes to queuing. With a proper RabbitMQ client we get nice things such as the assignment of an item in the queue to a specific client, and if that client fails, RabbitMQ will make the item available to another client. This avoids having to code that logic into the clients. Even in cases where a log is parsed and triggers events into the queuing system, there is less work to deal with a failure, since for the most part no replaying of events has to happen.
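
A rough sketch of that pattern with the pika client (the queue name, file path, and process_file helper are all made up; recent versions of pika use this calling style):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="files", durable=True)

# producer: drop a message in the queue pointing at the file
channel.basic_publish(exchange="", routing_key="files",
                      body="/mnt/incoming/report.pdf")

# consumer: only acknowledge once the file has been handled, so RabbitMQ
# can hand the item to another client if this one fails
def handle(ch, method, properties, body):
    process_file(body)  # placeholder for whatever the appropriate action is
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="files", on_message_callback=handle)
channel.start_consuming()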

RabbitMQ is great if you have a workflow that is distributed.

General thoughts

The great thing about these tools is that they abstract common things we need. That avoids having to build them into different parts of the stack over and over (we have built many versions of queuing, for example). These tools are also intended to scale horizontally, which allows for growth as utilization increases. With many homegrown tools there will always be a limit at the biggest box you can buy. On the flip side, it’s also possible to run them in a VM or a container to minimize the footprint and isolate the service.

From an operations perspective I also like the fact that all three are easy to set up and maintain. I don’t mean to say that running any service is a trivial task, but having performed installs of Oracle I certainly appreciate the much more streamlined management of these tools. The defaults that come out of the box with Elasticsearch, Redis, and RabbitMQ are solid, but there are many adjustments that can be made to meet the specific use case.

That brings me back to “which one should I use?”

Really, it depends on the use case. It’s likely possible to bend each system for most use cases. In the end I hope that some of these musings will help make the choices that make the most sense.

Cheers,

@matthias