logstash tuning on AWS

During a recent test where I put up 125+ AWS instances to do some work, I ran into an issue. All of the instances are pushing their logs via logstash-forwarder to a load-balanced logstash cluster. Things were running fine, but then logging failed. It was nice to know that a logging failure doesn’t bring the instances to a halt, but the logs are being centrally collected for a reason.

After some digging I found that the issue was overrunning the memory of the logstash recipients. Basically, the logs were flooding in at 4,500-5,500/s, which exceeded what logstash could process. The events piled up and, boom, not enough memory:

Error: Your application used more memory than the safety cap of 4G.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

The logstash instances are running on c3.xlarge instance types, and I decided to perform some tests by throwing 250,000 events at them to see how fast they would be processed. Basically, the rate came in at about 1,800/s. That number seemed low, so I started playing around with the logstash settings.

Since we are using the elasticsearch_http output from logstash, I experimented with the number of workers (default 1) for that output plugin. Two workers turned out to be the sweet spot, and I managed to increase the throughput to around 2,100/s. Not a great improvement, but I figured more should be possible.

The c3.xlarge comes with 4 cores and 7.5GB of RAM, yet while I was testing the load stayed very low at around 0.5. Clearly I wasn’t getting the full value of the instance.

Logstash can also adjust the number of filter workers via the -w flag. I figured the filter might just be where things are getting stuck and so I re-ran my tests with various combinations of filter workers and elasticsearch_http workers.

I’ll skip all the details, but for the c3.xlarge instance type I ended up reaching an ingestion rate of about 3,500/s, or nearly double the original. That rate was achieved by using

  • filter workers = 8
  • elasticsearch_http workers = 4

Changing either of these up or down reduced the overall rate. It also pushed the load to around 3.7+.
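
For reference, here is a minimal sketch of where those two knobs live, assuming a logstash 1.x-style setup; the host, paths, and values are placeholders rather than our actual configuration:

# elasticsearch_http output workers are set in the logstash config
output {
  elasticsearch_http {
    host    => "es.internal.example"   # placeholder
    workers => 4
  }
}

# filter workers are set on the command line via -w, e.g.:
# /opt/logstash/bin/logstash agent -f /etc/logstash/conf.d -w 8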

I think a lot more experimentation could be done with different instance types and counts, but for now I’m pretty happy with the new throughput, which lets me run a lot fewer instances to get to the target rate of 30,000 events/s. For now I think I have a decent formula to drive any further tests:

  • filter workers = 2 * number of cores
  • elasticsearch_http workers = filter workers / 2

There is still a concern over load balancing the logstash service, which runs as a TCP service with persistent connections from the forwarders. That means that if enough forwarding instances tied to the same endpoint start pushing a lot, we might still overrun the box. There are some good ideas around putting a Redis or RabbitMQ layer in between, but that’s an experiment for another day.

\@matthias


why you should embrace a rabbitmq client library

Recently I had to re-run a lot of documents into one of our applications. The app lives on AWS and ingesting content involves the use of a RabbitMQ queue.

I’ve often used amqp-tools/rabbitmq-c for quick ad-hoc work in the past, so I wrote a very terse bash script to feed the list of documents to the queue. That script worked just fine, but I was in a hurry and added quite a few queue clients to get the work done more quickly.
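
The script was roughly of the following shape (a reconstructed sketch, not the original; the URL, routing key, and file name are placeholders). The important part is that amqp-publish opens and tears down a fresh connection for every single message:

#!/bin/bash
# feed each document to the queue, one amqp-publish call per line
while read -r doc; do
  amqp-publish --url="amqp://user:pass@rabbit.internal.example/vhost" \
               --routing-key="ingest" --body="$doc"
done < documents.txt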

I stalled out in terms of rate and when I looked a bit more closely I found that my bash script wasn’t able to keep the queue fed sufficiently and my clients were going idle.

I also have some Ruby code using the bunny library and decided to re-write my feed script using that.

The results were startling.

Pushing 100,000 messages to the queue using the bash approach took about 28 minutes.

The Ruby version, using a RabbitMQ library with a persistent connection, did the same work in 35 seconds!

During a later run I pushed 1 million messages to RabbitMQ from a single client using the Ruby code. That run took 6.5 minutes, for an effective rate of roughly 2,500 messages per second. The server is running on an r3.large, and with that push plus all the clients reading from it, the load only rose to around 1.5. That is a stark contrast to the bash version of the script, during which I would see the load rise to 4+.

I didn’t take the time to dig deeply into whether this was due to process spawning in the bash script or the overhead of connection setup/teardown with RabbitMQ. Given the load impact on the RabbitMQ server from the bash script (which ran on a different system), I’m confident that it’s not process spawning, but rather a lot of extra burden on RabbitMQ from dealing with all those connection requests.

In the end it just speaks to the practicality of using a client library the right way when things are going too slowly interacting with RabbitMQ.

\@matthias

augmenting the elasticsearch docker container

We are running lots of things in docker. Elasticsearch is one of those things. It’s a very nice way to go especially since there is an official elasticsearch docker image available.

Since we are running in AWS we need the elasticsearch-cloud-aws plugin to allow for the nodes in the cluster to find each other.

To pull things together we are building a custom docker image based on the official one that simply installs the needed plugin. This gives us everything we need to run.

However, to make it all happen there are some caveats.

The official image uses the /data directory for logs, data and plugins. The image also exposes /data as a VOLUME. This makes it possible to point the container at a location on the host to keep the heavy write operations for logging and, of course, the data itself out of the container. It also allows for upgrades etc, by simply pointing a container at the data location.

There is a downside to this. The image also places the plugins under /data/plugins, so when the container starts and sets up the volume, the plugins “vanish”. It’s also worth noting that our custom Dockerfile, which extends the official one, seemed to work just fine with this command:

RUN /elasticsearch/bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.1

There are no errors generated by this; however, the plugin does NOT persist into /data/plugins! This seems a bit odd, but in the end the /data location would end up being replaced by the VOLUME regardless.

To work around this our custom Dockerfile creates /elasticsearch/plugins, modifies the config for elasticsearch and then installs the plugin:

FROM dockerfile/elasticsearch
MAINTAINER Matthias Johnson <mjohnson@catalystsecure.com>

# move the ES plugins away from the /data volume where it won't survive ...
RUN mkdir /elasticsearch/plugins
RUN sed -i 's@plugins:\s*/data/plugins@plugins: /elasticsearch/plugins@' /elasticsearch/config/elasticsearch.yml
# install the AWS plugin
RUN /elasticsearch/bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.1

# Expose ports.
#   - 9200: HTTP
#   - 9300: transport
EXPOSE 9200
EXPOSE 9300

# start the service
ENTRYPOINT [ "/elasticsearch/bin/elasticsearch" ]

Now, we can use the resulting image to spin up the container to run elasticsearch without having to perform the plugin install to the /data location before starting the container.
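
For example, starting it looks roughly like this; the image name and host path are placeholders for our actual setup:

# run the custom image, keeping the data and logs on the host via the /data volume
docker run -d --name elasticsearch \
  -p 9200:9200 -p 9300:9300 \
  -v /mnt/elasticsearch:/data \
  our-registry/elasticsearch-aws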

This approach should also work nicely for other plugins we may need in the future.

\@matthias

elasticsearch, redis, rabbitmq … so many choices

The other day I had a conversation with a developer here at Catalyst. We are using Elasticsearch, Redis, and RabbitMQ for various things, and he was wondering “which one should I choose?”. These tools offer different features and play to different strengths, and it’s not always obvious when to use each one. After I responded to my colleague’s email, I thought it might be worth writing it up here.

To begin, both Elasticsearch and Redis are tools that are loosely lumped together as NoSQL systems. RabbitMQ, on the other hand, is a queuing system. The key uses for each are:

  • Elasticsearch is great for storing “documents”, which might just be logs. It offers a powerful search API to find things
  • Redis is a key/value cache or store. It’s very good at storing things that feel very much like data structures you’d find in programming languages. It is very much focused on speed and performance
  • RabbitMQ allows you to queue things and process items in a very efficient manner. It offers a nice and consistent abstraction to other ways of working through things such as looping over log content, traversing the file system or creating another system in a traditional RDBMS

Next I’ll offer some observations of these tools and possible use cases.

Elasticsearch

As I mentioned, Elasticsearch is a document store with excellent searchability. A very common use is for aggregating logs via logstash. In that case you can think of each log event or line as a “document”. While that is a very common use of Elasticsearch these days, the use can go much further. Many companies are using it to index various content structures to make them searchable. For example we are using it to search for content we extract from files.

Elasticsearch stores its content as JSON. This makes it possible to leverage the structure: fields can be stored, searched, and retrieved. This feels a little like a select column from table; statement, though the comparison loses value quickly.

In general I think of it as a place to persist data for the long term. Elasticsearch also makes operational tasks pretty easy, which includes replication and that reinforces the persistence impression.

If you need full text search and/or want to store things for a long time, Elasticsearch is a good choice.

Redis

I think of Redis as being very much focused on speed. It’s a place to store data that is needed in a lot of places and needed fast. Storing session data that every service behind a load balancer will use is a good example.
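
As a small illustration of that session use case, the kind of interaction I have in mind looks like this; the key name and payload are made up:

# cache a session with a one-hour TTL, then read it back from any service
redis-cli SETEX session:3f9a2c 3600 '{"user_id": 42, "role": "admin"}'
redis-cli GET session:3f9a2c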

Another example might be to aggregate and update performance data quickly. The Sensu monitoring framework does just that.

In general Redis is a great choice where you need to get specific values or datasets as you would with variables in a programming language. While there are persistence options, I tend to think of Redis primarily as a tool to speed things up.

In a nutshell, I would use Redis for fast access to specific data in a cache sort of way.

RabbitMQ

RabbitMQ is a queuing service where you put things to be handed to other systems. It allows different systems to communicate with each other without having to build that communication layer.

In our case we frequently need to do things with files, so a message is placed in the queue pointing to the file. Another system then subscribes to the queue and, when a file shows up, takes the appropriate action. This could also be a log event or anything else that warrants an action being taken somewhere else.
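
To make that concrete, with the amqp-tools mentioned in the previous post the pattern looks roughly like this; the URL, queue name, file path, and handler script are all placeholders:

# producer: drop a pointer to a file on the queue
amqp-publish --url="$AMQP_URL" --routing-key="files.process" \
             --body="/mnt/incoming/doc-123.pdf"

# consumer: run a handler for each message (the body arrives on stdin)
amqp-consume --url="$AMQP_URL" --queue="files.process" -- /usr/local/bin/process-file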

While I’m generally a big fan of RESTful architectures, I’m willing to compromise when it comes to queuing. With a proper RabbitMQ client we get nice things such as the assignment of an item in the queue to a specific client; if that client fails, RabbitMQ will make the item available to another client. This avoids having to code that logic into the clients. Even in cases where a parsed log triggers events into the queuing system, there is less work to deal with a failure, since there is usually no replaying of events that has to happen.

RabbitMQ is great if you have a workflow that is distributed.

General thoughts

The great thing about these tools is that they abstract common things we need. That avoids having to build them into different parts of the stack over and over (we have built many versions of queuing, for example). These tools are also intended to scale horizontally, which allows for growth as utilization increases; with many homegrown tools there will always be the limit of the biggest box you can buy. On the flip side, it’s also possible to run them in a VM or a container to minimize the footprint and isolate the service.

From an operations perspective I also like the fact that all three are easy to set up and maintain. I don’t mean to say that running any service is a trivial task, but having performed installs of Oracle I certainly appreciate the much more streamlined management of these tools. The defaults that come out of the box with Elasticsearch, Redis, and RabbitMQ are solid, but there are many adjustments that can be made to meet the specific use case.

That brings me back to “which one should I use?”

Really, it depends on the use case. It’s likely possible to bend each system for most use cases. In the end I hope that some of these musings will help make the choices that make the most sense.

Cheers,

\@matthias

replacing 500 error in nginx auth_request

One of the great things about nginx is the auth_request module. It allows you to make a call to another URL to authenticate or authorize a user. For my current work that is perfect, since virtually everything follows a RESTful model.

Unfortunately, there is one problem. If the auth_request fails, the server responds with an HTTP status of 500. That normally is a bad thing since it indicates a much more severe problem than a failed authentication or authorization.

The logs indicate

auth request unexpected status: 400 while sending to client

and nginx then proceeds to return a 500 to the client.

Nginx offers some ways to trap certain proxy errors via fastcgi_intercept_errors and uwsgi_intercept_errors, as described in this post. The suggested proxy_intercept_errors off; doesn’t seem to do the trick either.

I managed to come up with a way that returns a 401 by using the following in the location block that performs the auth_request:

auth_request /auth;
error_page 500 =401 /error/401;

This captures the 500 returned and changes it to a 401. Then I added another location block for 401:

location /error/401 {
   return 401;
}

Now I get a 401 instead of the 500.
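
Pieced together, the relevant parts of the config look roughly like this; the upstream names and the protected path are placeholders rather than my actual setup:

location /api/ {
    auth_request /auth;
    # map the 500 from a failed auth subrequest to a 401
    error_page 500 =401 /error/401;
    proxy_pass http://app_backend;
}

location = /auth {
    internal;
    # subrequest that performs the actual authentication/authorization
    proxy_pass http://auth_backend/check;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
}

location /error/401 {
    return 401;
}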

Much better.

On a side note it seems that someone else is also thinking about this.

\@matthias

3 things I spent too much time on in cloudformation

Cloudformation is very powerful. It’s very cool to be able to spin up an entire environment in one step. The servers get spun up with the right bits installed, networking is configured with security restrictions in place and load balancing all work.

With that power comes some pain. Anyone who’s worked with large cloudformation templates will know what I’m referring to. In my case it’s well over a thousand lines of JSON goodness. That can make things more difficult to troubleshoot.

Here are some lessons I’ve learned and, for my taste, spent too much time on.

Access to S3 bucket

When working with Cloudformation and S3 you get two choices to control access to S3. The first is the AWS::S3::BucketPolicy and the other is an AWS::IAM::Policy. Either will serve you well depending on the specific use case. A good explanation can be found in IAM policies and Bucket Policies and ACLs! Oh My! (Controlling Access to S3 Resources).

Where you’ll run into issues is when you’re using both. It took me the better part of a day trying to get an AWS::IAM::Policy to work. Everything sure looked great. Then I finally realized that there was also an AWS::S3::BucketPolicy in place.

In that case (as the Oh My link points out), the one with least privilege wins!

Once I removed the extra AWS::S3::BucketPolicy everything worked perfectly.
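
For illustration, an AWS::IAM::Policy granting access to a bucket looks roughly like this; the logical names, actions, and bucket reference are placeholders, not our actual template:

"S3AccessPolicy": {
  "Type": "AWS::IAM::Policy",
  "Properties": {
    "PolicyName": "s3-bucket-access",
    "Roles": [ { "Ref": "InstanceRole" } ],
    "PolicyDocument": {
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [ "s3:GetObject", "s3:PutObject" ],
          "Resource": { "Fn::Join": [ "", [ "arn:aws:s3:::", { "Ref": "TheBucket" }, "/*" ] ] }
        }
      ]
    }
  }
}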

Naming the load balancer

In Cloudformation you can configure load balancers in two ways. The first kind will be accessible via the Internet at large, while the second will be internal to a VPC. This is configured by setting the "Scheme" : "internal" for the AWS::ElasticLoadBalancing::LoadBalancer.

Now you can also add a AWS::Route53::RecordSetGroup to give that load balancer a more attractive name than the automatically generated AWS internal DNS name.

For the non-internal load balancer this can be done by pointing the AliasTarget at the CanonicalHostedZoneName, and things will work like this:

"AliasTarget": {
  "HostedZoneId": {
     "Fn::GetAtt": ["PublicLoadBalancer", "CanonicalHostedZoneNameID"]
  },
  "DNSName": {
    "Fn::GetAtt": ["PublicLoadBalancer", "CanonicalHostedZoneName"]
  }
}

However, this does not work for the internal type of load balancer.

In that case you need to use the DNSName:

"AliasTarget": {
    "HostedZoneId": {
    "Fn::GetAtt": ["InternalLoadBalancer", "CanonicalHostedZoneNameID"]
  },
  "DNSName": {
    "Fn::GetAtt": ["InternalLoadBalancer", "DNSName"]
  }
}
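
For context, those AliasTarget snippets live inside an AWS::Route53::RecordSetGroup that looks roughly like this; the zone and record names are placeholders:

"InternalDnsRecord": {
  "Type": "AWS::Route53::RecordSetGroup",
  "Properties": {
    "HostedZoneName": "internal.example.com.",
    "RecordSets": [
      {
        "Name": "service.internal.example.com.",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": { "Fn::GetAtt": ["InternalLoadBalancer", "CanonicalHostedZoneNameID"] },
          "DNSName": { "Fn::GetAtt": ["InternalLoadBalancer", "DNSName"] }
        }
      }
    ]
  }
}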

(Template) size matters

As I mentioned earlier, templates can get big and unwieldy. We have some ansible playbooks we started using to deploy stacks and updates to stacks. Then we started getting errors about the template being too large. Turns out I’m not the only one having an issue with the max size of an uploaded template being 51200 bytes.

Cloudformation can deal with much larger templates, but they have to come from S3. To make this work the awscli is very helpful.

Now for the large templates I use the following commands instead of the ansible playbook:

# first copy the template to S3
aws s3 cp template.json s3://<bucket>/templates/template.json
# validate the template
aws cloudformation validate-template --template-url \
    "https://s3.amazonaws.com/<bucket>/templates/template.json"
# then apply it if there was no error in validation
aws cloudformation update-stack --stack-name "thestack" --template-url \
    "https://s3.amazonaws.com/<bucket>/templates/template.json" \
    --parameters <parameters> --capabilities CAPABILITY_IAM 

Don’t forget the --capabilities CAPABILITY_IAM or the update will fail.

Overall I’m still quite fond of AWS. It’s empowering for development. None the less the Cloudformation templates do leave me feeling brutalized at times.

Hope this saves someone some time.

Cheers,

\@matthias

updating the AMIs to a new version

We’ve been enjoying the use of AWS CloudFormation. While the templates can be a bit of a bear, the end result is always consistent. (That said, I think that Terraform has some real promise).

One thing we do is to lock our templates to specific AMIs, like this:

    "AWSRegion2UbuntuAMI" : {
      "us-east-1" :      { "id" : "ami-7fe7fe16" },
      "us-west-1" :      { "id" : "ami-584d751d" },
      "us-west-2" :      { "id" : "ami-ecc9a3dc" },
      "eu-west-1" :      { "id" : "ami-aa56a1dd" },
      "sa-east-1"      : { "id" : "ami-d55bfbc8" },
      "ap-southeast-1" : { "id" : "ami-bc7325ee" },
      "ap-southeast-2" : { "id" : "ami-e577e9df" },
      "ap-northeast-1" : { "id" : "ami-f72e45f6" }
    }

That’s great, because we always get the exact same build based on that image and we don’t introduce unexpected changes. Those of you who know your AMI IDs very well will realize that these are actually for an older version of Ubuntu.

Sometimes, however, it makes sense to bring the AMIs up to a new version and that means having to find all of the new AMI IDs.

Here is a potential approach using the awscli. I’m going to assume you either have it installed already or run on one of the platforms where the installation instructions work. (Side note: if you are on an Ubuntu box I recommend installing the version via pip, since it works as advertised, while the version in the Ubuntu repo has some odd issues.)

Using the awscli it’s possible to list the images. Since I’m interested in Ubuntu images I search for Canonical’s ID or 099720109477 and also apply some filters to show me only the 64 bit machines with an ebs root device:

aws ec2 describe-images  --owners 099720109477 \
  --filters Name=architecture,Values=x86_64 \
            Name=root-device-type,Values=ebs

That produces a very long dump of JSON (which I truncated):

{
    "Images": [
        {
            "VirtualizationType": "paravirtual", 
            "Name": "ubuntu/images-testing/ebs-ssd/ubuntu-trusty-daily-amd64-server-20141007", 
            "Hypervisor": "xen", 
            "ImageId": "ami-001fad68", 
            "RootDeviceType": "ebs", 
            "State": "available", 
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1", 
                    "Ebs": {
                        "DeleteOnTermination": true, 
                        "SnapshotId": "snap-bde4611a", 
                        "VolumeSize": 8, 
                        "VolumeType": "gp2", 
                        "Encrypted": false
                    }
                }, 
                {
                    "DeviceName": "/dev/sdb", 
                    "VirtualName": "ephemeral0"
                }
            ], 
            "Architecture": "x86_64", 
            "ImageLocation": "099720109477/ubuntu/images-testing/ebs-ssd/ubuntu-trusty-daily-amd64-server-20141007", 
            "KernelId": "aki-919dcaf8", 
            "OwnerId": "099720109477", 
            "RootDeviceName": "/dev/sda1", 
            "Public": true, 
            "ImageType": "machine"
        }, 
......
        {
            "VirtualizationType": "hvm", 
            "Name": "ubuntu/images/hvm/ubuntu-quantal-12.10-amd64-server-20140302", 
            "Hypervisor": "xen", 
            "ImageId": "ami-ff4e4396", 
            "State": "available", 
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1", 
                    "Ebs": {
                        "DeleteOnTermination": true, 
                        "SnapshotId": "snap-8dbadf4a", 
                        "VolumeSize": 8, 
                        "VolumeType": "standard", 
                        "Encrypted": false
                    }
                }, 
                {
                    "DeviceName": "/dev/sdb", 
                    "VirtualName": "ephemeral0"
                }, 
                {
                    "DeviceName": "/dev/sdc", 
                    "VirtualName": "ephemeral1"
                }
            ], 
            "Architecture": "x86_64", 
            "ImageLocation": "099720109477/ubuntu/images/hvm/ubuntu-quantal-12.10-amd64-server-20140302", 
            "RootDeviceType": "ebs", 
            "OwnerId": "099720109477", 
            "RootDeviceName": "/dev/sda1", 
            "Public": true, 
            "ImageType": "machine"
        }
    ]
}

That output is pretty thorough and good for digging through things, but for my purposes it’s too much and lists lots of things I don’t need.

To drill in on the salient input a little more I use the excellent jq command-line JSON processor and pull out the things I want and also grep for the specific release:

aws ec2 describe-images  --owners 099720109477 \
  --filters Name=architecture,Values=x86_64 \
            Name=root-device-type,Values=ebs \
| jq -r '.Images[] | .Name + " " + .ImageId' \
| grep 'trusty-14.04'

The result is something I can understand a little better:

ubuntu/images/ebs-io1/ubuntu-trusty-14.04-amd64-server-20140829 ami-00389d68
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140926 ami-0070c468
ubuntu/images/ebs/ubuntu-trusty-14.04-amd64-server-20140416.1 ami-018c9568
...
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140923 ami-80fb51e8
ubuntu/images/ebs-io1/ubuntu-trusty-14.04-amd64-server-20140927 ami-84aa1cec
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140607.1 ami-864d84ee
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140724 ami-8827efe0
ubuntu/images/hvm/ubuntu-trusty-14.04-amd64-server-20140923 ami-8afb51e2
ubuntu/images/ebs/ubuntu-trusty-14.04-amd64-server-20140927 ami-8caa1ce4
ubuntu/images/hvm-io1/ubuntu-trusty-14.04-amd64-server-20140923 ami-8efb51e6
ubuntu/images/ebs-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-98aa1cf0
ubuntu/images/hvm/ubuntu-trusty-14.04-amd64-server-20140927 ami-9aaa1cf2
ubuntu/images/hvm-io1/ubuntu-trusty-14.04-amd64-server-20140927 ami-9caa1cf4
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-9eaa1cf6
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140816 ami-a0ff23c8
ubuntu/images/hvm-io1/ubuntu-trusty-14.04-amd64-server-20140607.1 ami-a28346ca
ubuntu/images/ebs/ubuntu-trusty-14.04-amd64-server-20140724 ami-a427efcc
...
ubuntu/images/ebs/ubuntu-trusty-14.04-amd64-server-20140813 ami-fc4d9f94
ubuntu/images/hvm-io1/ubuntu-trusty-14.04-amd64-server-20140924 ami-fe338696

After a little more investigation I see that the latest version can be identified based on the datastamp, in this case 20140927. I’ve seen some other ways things are named, but in this case the datastamp works well enough and I can look for ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 in each region for the AMI IDs.

for x in us-east-1 us-west-2 us-west-1 eu-west-1 ap-southeast-1 ap-southeast-2 ap-northeast-1 sa-east-1; do
    echo -n "$x "
    aws --region ${x} ec2 describe-images  --owners 099720109477 --filters Name=architecture,Values=x86_64 \
      Name=root-device-type,Values=ebs \
      Name=name,Values='ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927' \
    | jq -r '.Images[] | .Name + " " + .ImageId'
    done

The result is a nice tidy list with the AMI ID for each region:

us-east-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-9eaa1cf6
us-west-2 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-3d50120d
us-west-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-076e6542
eu-west-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-f0b11187
ap-southeast-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-d6e7c084
ap-southeast-2 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-1711732d
ap-northeast-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-e74b60e6
sa-east-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-69d26774

Now, to make this pastable into the CloudFormation template I run that output through some more shell processing:

cut -f1,3 -d' ' | sed 's/^\(.*\) \(.*\)$/"\1": { "id": "\2" },/'

and end up with

"us-east-1": { "id": "ami-9eaa1cf6" },
"us-west-2": { "id": "ami-3d50120d" },
"us-west-1": { "id": "ami-076e6542" },
"eu-west-1": { "id": "ami-f0b11187" },
"ap-southeast-1": { "id": "ami-d6e7c084" },
"ap-southeast-2": { "id": "ami-1711732d" },
"ap-northeast-1": { "id": "ami-e74b60e6" },
"sa-east-1": { "id": "ami-69d26774" },

I can now paste that into the template and remove the final comma.

Voilà, the new stack will now run with the latest AMIs and can be subjected to testing.

\@matthias

docker data only containers for vendor code

Docker has this idea of “data only containers”. The concept is generally pretty simple. You have a container which contains the data and exposes it via the --volume CLI option or the VOLUME directive in the Dockerfile. Another container can then access that volume via the --volumes-from command line switch.

One of the interesting things is that the data only container does not need to be running. It can be started and just exit. As long as the container was created, then the volume will be available. Although if you clean up your “exited” containers, you will likely also delete the data only container since it has exited.

In most of the descriptions online there are lots of examples of exposing data sets this way, for example a /var/lib/mysql directory. That can be used to get a consistent data set to run tests against or to make setup easier. There are also other examples of where this makes life easier.

As we started playing with docker, one thing was noticeable: images can get big, and they can take a bit of time to ship around. With some consideration around the structure of the build process, this can be mitigated thanks to Docker’s use of caching during builds.

None the less, this brought us to another potential use of the data only container: vendor code.

For example, we have a tool we use which is about 130MB in size. So far we’ve been baking it into the application container. However, since this code rarely changes, it’s a great candidate to be split into its own container.

We’ve been experimenting with the idea: we created an image just for that code and then link to the volume from the application.

So far it’s working rather well. Here is an example.

Say we have our code in a vendor/ directory and need to access it from the application under /opt/vendor.

The docker build starts with the busybox base image and simply copies the vendor code to the image under /opt/vendor.
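
A minimal sketch of such a Dockerfile, assuming the vendor code sits next to it in vendor/ (the paths are illustrative):

FROM busybox
# bake the vendor code into the image before declaring the volume,
# so the content ends up inside the exposed volume
COPY vendor/ /opt/vendor
VOLUME /opt/vendor
# a trivial command so the container can start, print, and exit
CMD echo "Data only container for access to vendor code"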

Starting the container is easy:

# you don't need the --volume if your Dockerfile exposes the volume
docker run -d --volume /opt/vendor --name vendor_name vendor-image

In our case the CMD or run command is simple:

echo "Data only container for access to vendor code"

That means that the container starts, prints that line and exits. That’s good enough to access the data.

Now the application container is started with

docker run -d --volumes-from vendor_name application-image

The --volumes-from will grab the volumes from the “vendor_name” container and bring them into the application container at the same mount point.

We now have the entire vendor code base available in the application container without having to bake it into that image.

One thing of note is that there are two choices for starting the data only container. If you start with an empty container based on the scratch image, the run command will likely fail unless your vendor code lets you run something easily. In our case the code base requires a lot of supporting pieces and the container simply can’t even do an echo successfully. That’s the first choice: put a CMD in your Dockerfile and know that it will always fail on start, but start the container none the less. (Hint: if you just leave the CMD off, it will not start.)

The other choice is to start with a slightly bigger base image such as busybox. That’s what we have done and the 2.5MB extra seems worth it to avoid the failure.

\@matthias

query consolidation in elasticsearch

In my last post on a simple way to improve elasticsearch queries I promised a follow up for another way to optimize queries.

This approach didn’t come with the same order-of-magnitude improvement as the previous post, but it still offers some benefits.

Once again, I was working on improving my rough first shot of working code. In this case the app I was working on was displaying the search results I mentioned last time, but it also was pulling various facets for display as well.

By the time everything was rendered I had issued somewhere between 12 and 15 calls or queries. Some of these were necessary during authentication or to handle capturing data necessary for the actual query. However there was a clear opportunity for improvement.

My focus was on a couple of sets of queries in particular. The first was a call to capture statistics for a field, which would then be used to set up the actual facet calls. (Side note: yep, the facets are going away and are being replaced by aggregations. I’ll likely share some notes on this when I’m done making that change.)

{
    "facets": {
       "date": {
          "statistical": {
             "field": "date"
          }
       }
    }
}

My code has a few of those calls for various numeric fields such as date, size etc.

The other set of queries to focus on was the retrieval for the actual facets.

{
    "facets": {
       "tags": {
          "terms": {
             "field": "tags",
             "size": 10
          }
       }
    }
}

Now, the first set of stats-related facets is actually used to dynamically create the buckets for some of the actual facet calls. That still lets me combine the first group into one call and the second group into another.

So, I basically end up with two calls to elasticsearch. The first to grab the statistics facets and the second for the facets that are actually used in the application for display.

None the less, rather than issuing a call for each one independently, we can combine them, like this:

{
    "facets": {
       "date": {
          "statistical": {
             "field": "date"
          }
       },
       "size": {
          "statistical": {
             "field": "size"
          }
       }
    }
}

and then one more call which also includes the actual query:

{
   "query": {
      "query_string": {
         "default_field": "body",
         "query": "test"
      }
   },
   "fields": [
      "title"
   ],
   "facets": {
      "tags": {
         "terms": {
            "field": "tags",
            "size": 10
         }
      },
      "folder": {
         "terms": {
            "field": "folder",
            "size": 10
         }
      }
   }
}

You’ll notice that I’m also just returning the field I need for display as described in the last post.

While this approach doesn’t really reduce the amount of work Elasticsearch has to perform, it reduces the number of individual calls that need to be made. That means most of the improvement comes from fewer calls and network roundtrips. The latter will likely have a bigger impact if the calls are made sequentially rather than asynchronously. Regardless, it does offer some improvement in my experience so far.

\@matthias

 

simple way to improve elasticsearch queries

We use ElasticSearch for some things. I personally have been enjoying working with it as part of a new tool we are building. I’ve learned a couple of things from a querying perspective.

First, I could say a lot about how impressed I am with ElasticSearch from an operations perspective. Out of the box it runs extremely well, but I’ll save that for another post. Here I’ll talk about some rather simple ideas to improve the querying of ElasticSearch.

When developing I often start very basic. It could even be described as simplistic. The first shot is generally not very efficient, but it helps to quickly determine if an idea is workable. This is what I’ve recently done with some code querying ElasticSearch.

The first simple performance improvement was around generating a display of the search results. To get things going quickly, I issued the query and grabbed the results. By default Elasticsearch returns the entire document in the _source field. The simple query might look like this:

{
  "query": {
    "match_all": {}
  }
}

The returned results then include the _source field and might look like this

{
  "_index": "test",
  "_type": "doc",
  "_id": "20140806",
  "_score": 1,
  "_source": {
    "title": "some title",
    "body": "the quick brown fox jumps over the lazy dog"
  }
}

My code would then go through the array and grab the title field from the _source for display in the result list. That worked OK, but seemed slow. (Full disclosure: my documents were quite a bit bigger than the simple example above.)

Now, since I didn’t really need the entire document just to display the title, the obvious choice was to get only the necessary data. Elasticsearch makes this easy via the fields parameter:

{
  "query": {
    "match_all": {}
  },
  "fields": [
    "title"
  ]
}
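
If you want to try this against a node directly, the call looks roughly like this with curl; the URL is a local placeholder and the index/type match the example documents here:

curl -s -XPOST 'http://localhost:9200/test/doc/_search' -d '{
  "query": { "match_all": {} },
  "fields": [ "title" ]
}'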

That will return something like the following in the hits array:

{
  "_index": "test",
  "_type": "doc",
  "_id": "20140806",
  "_score": 1,
  "fields": {
    "title": [ "some title" ]
  }
}

That lets me skip the retrieval of potentially large chunks of data. The results were quite impressive in my use case: the run time of the queries and the display of results dropped by an order of magnitude. Again, this is likely due to the much larger documents I was actually working with. None the less, it is a good example of only retrieving the necessary data rather than issuing what amounts to a SELECT * in SQL terms.

The other performance improvement was around consolidating queries, but I’ll save that for a future post.

\@matthias