Beware cURL Plus Files

Sometimes a queue worker needs to be no more glamorous than a shell script. If your queues are HTTP and so are the other services, it’s easy to reach for the shell and the venerable cURL. cURL is the UNIX operator’s default user agent: if something doesn’t work in cURL, there’s a good chance it won’t work in other situations either.

We have a queue worker that interacts with several web services. It follows this rough outline…

1. Check for work
2. Get configs
3. Coordinate, communicate
4. Do work

Repeat forever, which could be “a mighty long time,” as Prince once told us.

The last step is the most interesting, but a little more background…

It is not hard to envision some generic functions in the shell; a generic logger function handles logging, for example. Here are some near-real-world snippets of code.

1. RSTOPWATCHBEGIN=$(date +"%s.%N")
2. curl -ski -H "x-auth-user: ${RUSER}" -H "x-auth-expiry: ${AEXPIRY}" "${THIS_RURL}${BUCKETNAME}/${USAFE}" -XPUT --upload-file "${SOURCE}" > $RRESULTS 2>$SPECIAL_ERR_LOG
3. RSTOPWATCHEND=$(date +"%s.%N")

You can see from this example that the time to interact with this service is the difference between RSTOPWATCHEND (line 3) and RSTOPWATCHBEGIN (line 1). Because these timestamps are more granular than a second, you will need to do floating-point math, commonly in awk or bc (or hope your shell supports it; most do not). Passing the result to the logger function records it for later evaluation.
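
As a minimal sketch of that arithmetic (the logger function name below is hypothetical, standing in for the generic logger mentioned above):

# bc handles the sub-second floating-point math the shell cannot
RELAPSED=$(echo "${RSTOPWATCHEND} - ${RSTOPWATCHBEGIN}" | bc)
# pass the measurement to the generic logger function (hypothetical name)
log_msg "PUT took ${RELAPSED} seconds"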

cURL is a rock star. In this worker script, the work of getting configs and communicating over HTTP routinely completes in hundredths of a second. The way the script is set up, that includes the time to invoke cURL.

Here is the output of some of those results…

5320103 GETs
0.016 seconds per transaction

But when that interaction involves grabbing a file that is available locally through an NFS mount, the results go south quickly.

Here are those results…

961375 total seconds
507016 PUTs
1.896 seconds per transaction

What can it be? Clearly, it should not be cURL; too many other services are interacted with over HTTP with expected results. It must be the web service, which is simply slower and more complicated than the other HTTP services.

Here is where the story could have ended.

For a number of reasons, we had other code running against this same service. One worker was using a Mojolicious library. Its average interaction time with the same service doing the same work was 0.5 seconds. That is not insignificant when you do an operation 3 million times a day, but that worker was PUTing files already in memory, so it is not quite the same comparison.

A different worker was built using Python and the Requests library for HTTP. This code also had a much smaller transaction time with the web service.

Here are those results…

21180 total seconds
127479 PUTs
0.166 seconds per transaction

The timing calls are isolated to the same transaction. The files are still fetched over NFS. The service is still authenticated. The service is still using SSL. Finally, the most important thing is that the Python code was running on the same machines as the scripts using cURL. We can comfortably use the phrase, “holding all other variables equal…”

What can account for the 1.6 second difference?

Now it is hard to ignore cURL. We suspect there is more overhead than we anticipated in spawning a cURL child process and pulling the file into the PUT. Other factors may include slower authentication responses or less efficient SSL libraries.

If you love and use cURL, you may want to dig into the logs and check your performance. It might be worth using a different tool for the heavy lifting.

augmenting the elasticsearch docker container

We are running lots of things in Docker. Elasticsearch is one of those things. It’s a very nice way to go, especially since there is an official Elasticsearch Docker image available.

Since we are running in AWS we need the elasticsearch-cloud-aws plugin to allow for the nodes in the cluster to find each other.

To pull things together, we are building a custom Docker image based on the official one and simply installing the needed plugin. This gives us everything we need to run.

However, to make it all happen there are some caveats.

The official image uses the /data directory for logs, data and plugins. The image also exposes /data as a VOLUME. This makes it possible to point the container at a location on the host to keep the heavy write operations for logging and, of course, the data itself out of the container. It also allows for upgrades etc, by simply pointing a container at the data location.

There is a downside to this. The image also places the plugins under /data/plugins, so when the container starts and mounts the volume, the plugins “vanish”. It’s also worth noting that our custom Dockerfile, which extends the official one, seemed to work just fine with this command:

RUN /elasticsearch/bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.1

No errors are generated by this; however, the plugin does NOT persist into /data/plugins! This seems a bit odd, but in the end the /data location would be replaced by the VOLUME regardless.

To work around this, our custom Dockerfile creates /elasticsearch/plugins, modifies the Elasticsearch config, and then installs the plugin:

FROM dockerfile/elasticsearch
MAINTAINER Matthias Johnson <mjohnson@catalystsecure.com>

# move the ES plugins away from the /data volume where it won't survive ...
RUN mkdir /elasticsearch/plugins
RUN sed -i 's@plugins:\s*/data/plugins@plugins: /elasticsearch/plugins@' /elasticsearch/config/elasticsearch.yml
# install the AWS plugin
RUN /elasticsearch/bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.1

# Expose ports.
#   - 9200: HTTP
#   - 9300: transport
EXPOSE 9200
EXPOSE 9300

# start the service
ENTRYPOINT [ "/elasticsearch/bin/elasticsearch" ]

Now we can use the resulting image to spin up a container running Elasticsearch without having to install the plugin into the /data location before starting the container.
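
For example, a container built from this image might be started with the host providing the /data volume (the image name and host path here are placeholders):

# keep logs and data on the host by mounting it over the /data volume
docker run -d --name es -p 9200:9200 -p 9300:9300 \
  -v /srv/elasticsearch/data:/data \
  our-registry/elasticsearch-aws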

This approach should also work nicely for other plugins we may need in the future.

@matthias

elasticsearch, redis, rabbitmq … so many choices

The other day I had a conversation with a developer here at Catalyst. We are using Elasticsearch, Redis, and RabbitMQ for various things and he was wondering, “which one should I choose?” These tools offer different features and play to different strengths, and it’s not always obvious when to use each one. After I responded to my colleague’s email, I thought it might be worth writing it up here.

To begin, both Elasticsearch and Redis are tools that are loosely lumped together as NoSQL systems. RabbitMQ, on the other hand, is a queuing system. The key uses for each are:

  • Elasticsearch is great for storing “documents”, which might just be logs. It offers a powerful search API to find things
  • Redis is a key/value cache or store. It’s very good at storing things that feel very much like data structures you’d find in programming languages. It is very much focused on speed and performance
  • RabbitMQ allows you to queue things and process items in a very efficient manner. It offers a nice and consistent abstraction to other ways of working through things such as looping over log content, traversing the file system or creating another system in a traditional RDBMS

Next I’ll offer some observations of these tools and possible use cases.

Elasticsearch

As I mentioned, Elasticsearch is a document store with excellent searchability. A very common use is for aggregating logs via logstash. In that case you can think of each log event or line as a “document”. While that is a very common use of Elasticsearch these days, the use can go much further. Many companies are using it to index various content structures to make them searchable. For example we are using it to search for content we extract from files.

Elasticsearch stores its content as JSON. This makes it possible to leverage the structure. For example, fields can be stored, searched, and retrieved individually. This feels a little like a SELECT column FROM table; statement, though the comparison loses value quickly.

In general I think of it as a place to persist data for the long term. Elasticsearch also makes operational tasks, including replication, pretty easy, and that reinforces the persistence impression.

If you need full text search and/or want to store things for a long time, Elasticsearch is a good choice.

Redis

I think of Redis as being very much focused on speed. It’s a place to store data that is needed in a lot of places and needed fast. Storing session data that every service behind a load balancer can use is a good example.
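
As a small illustration of that session-cache pattern (the key name and TTL are made up for the example), it only takes a couple of commands:

# store a session blob with a one hour expiry
redis-cli SET session:3f2a1c '{"user":"jdoe","role":"admin"}' EX 3600
# any service behind the load balancer can fetch it by key
redis-cli GET session:3f2a1c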

Another example might be to aggregate and update performance data quickly. The Sensu monitoring framework does just that.

In general Redis is a great choice where you need to get specific values or datasets as you would with variables in a programming language. While there are persistence options, I tend to think of Redis primarily as a tool to speed things up.

In a nutshell, I would use Redis for fast access to specific data in a cache sort of way.

RabbitMQ

RabbitMQ is a queuing service where you put things to be handed to other systems. It allows different systems to communicate with each other without having to build that communication layer.

In our case we frequently need to do things with files, so a message pointing to a file is placed in the queue. Another system subscribes to the queue and, when a message shows up, takes the appropriate action. This could also be a log event or anything else that warrants an action being taken somewhere else.
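
A rough sketch of that pattern from the shell, using the rabbitmqadmin tool that ships with the management plugin (the queue name and payload are just examples; real workers would use a proper client library):

# declare a durable queue for file work
rabbitmqadmin declare queue name=file-work durable=true
# producer: publish a message pointing at the file to process
rabbitmqadmin publish exchange=amq.default routing_key=file-work payload='/mnt/share/incoming/doc123.pdf'
# quick look at what a consumer would receive
rabbitmqadmin get queue=file-work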

While I’m generally a big fan of RESTful architectures, I’m willing to compromise when it comes to queuing. With a proper RabbitMQ client we get nice things such as the assignment of a queue item to a specific client; if that client fails, RabbitMQ makes the item available to another client. This avoids having to code that logic into the clients. Even in cases where a parsed log triggers events into the queuing system, there is less work to deal with a failure, since there is mostly no need to replay events.

RabbitMQ is great if you have a workflow that is distributed.

General thoughts

The great thing about these tools is that they abstract common things we need. That avoids having to build them into different parts of the stack over and over (we have built many versions of queuing, for example). These tools are also intended to scale horizontally, which allows for growth as utilization increases. With many homegrown tools there will always be the limit of the biggest box you can buy. On the flip side, it’s also possible to run each of them in a VM or a container to minimize the footprint and isolate the service.

From an operations perspective I also like the fact that all three are easy to set up and maintain. I don’t mean to say that running any service is a trivial task, but having performed installs of Oracle I certainly appreciate the much more streamlined management of these tools. The defaults that come out of the box with Elasticsearch, Redis, and RabbitMQ are solid, but there are many adjustments that can be made to meet the specific use case.

That brings me back to “which one should I use?”

Really, it depends on the use case. It’s likely possible to bend each system to most use cases. In the end I hope these musings will help in making the choice that makes the most sense.

Cheers,

@matthias

replacing 500 error in nginx auth_request

One of the great things about nginx is the auth_request module. It allows you to make a call to another URL to authenticate or authorize a user. For my current work that is perfect, since virtually everything follows a RESTful model.

Unfortunately, there is one problem. If the auth_request fails, the server responds with an HTTP status of 500. That normally is a bad thing since it indicates a much more severe problem than a failed authentication or authorization.

The logs indicate that

auth request unexpected status: 400 while sending to client

and nginx then proceeds to return a 500 to the client.

Nginx offers some ways to trap certain proxy errors via fastcgi_intercept_errors and uwsgi_intercept_errors, as described in this post. The suggested proxy_intercept_errors off; doesn’t seem to do the trick either.

I managed to come up with a way that returns a 401 by using the following in the location block that performs the auth_request:

auth_request /auth;
error_page 500 =401 /error/401;

This captures the 500 returned and changes it to a 401. Then I added another location block for 401:

location /error/401 {
   return 401;
}

Now I get a 401 instead of the 500.
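
A quick way to verify the behavior from the command line (the URL is a placeholder):

# an unauthenticated request should now come back as 401, not 500
curl -i https://example.com/protected/resource
# HTTP/1.1 401 Unauthorized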

Much better.

On a side note it seems that someone else is also thinking about this.

@matthias

3 things I spent too much time on in cloudformation

CloudFormation is very powerful. It’s very cool to be able to spin up an entire environment in one step. The servers get spun up with the right bits installed, networking is configured with security restrictions in place, and load balancing just works.

With that power comes some pain. Anyone who’s worked with large CloudFormation templates will know what I’m referring to. In my case it’s well over a thousand lines of JSON goodness. That can make things more difficult to troubleshoot.

Here are some lessons I’ve learned and, for my taste, spent too much time on.

Access to S3 bucket

When working with CloudFormation and S3 you get two choices for controlling access to S3: AWS::S3::BucketPolicy and AWS::IAM::Policy. Either will serve you well depending on the specific use case. A good explanation can be found in IAM policies and Bucket Policies and ACLs! Oh My! (Controlling Access to S3 Resources).

Where you’ll run into issues is when you’re using both. It took me the better part of a day trying to get an AWS::IAM::Policy to work. Everything sure looked great. Then I finally realized that there was also an AWS::S3::BucketPolicy in place.

In that case (as the Oh My link points out), the one with least privilege wins!

Once I removed the extra AWS::S3::BucketPolicy everything worked perfectly.

Naming the load balancer

In CloudFormation you can configure load balancers in two ways. The first kind is accessible from the Internet at large, while the second is internal to a VPC. The latter is configured by setting "Scheme" : "internal" on the AWS::ElasticLoadBalancing::LoadBalancer.

Now you can also add an AWS::Route53::RecordSetGroup to give that load balancer a more attractive name than the automatically generated AWS internal DNS name.

For the non-internal load balancer this can be done by pointing the AliasTarget at the CanonicalHostedZoneName, like this:

"AliasTarget": {
  "HostedZoneId": {
     "Fn::GetAtt": ["PublicLoadBalancer", "CanonicalHostedZoneNameID"]
  },
  "DNSName": {
    "Fn::GetAtt": ["PublicLoadBalancer", "CanonicalHostedZoneName"]
  }
}

However, this does not work for the internal type of load balancer.

In that case you need to use the DNSName:

"AliasTarget": {
  "HostedZoneId": {
     "Fn::GetAtt": ["InternalLoadBalancer", "CanonicalHostedZoneNameID"]
  },
  "DNSName": {
    "Fn::GetAtt": ["InternalLoadBalancer", "DNSName"]
  }
}

(Template) size matters

As I mentioned earlier, templates can get big and unwieldy. We have some Ansible playbooks we started using to deploy stacks and updates to stacks. Then we started getting errors about the template being too large. Turns out I’m not the only one having an issue with the max size of an uploaded template being 51200 bytes.
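
A quick sanity check before deploying (the template file name is an example); the limit applies to the size of the template body passed in the API call:

# templates passed directly in the API call are limited to 51200 bytes
wc -c < template.json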

CloudFormation can deal with much larger templates, but they have to come from S3. To make this work, the awscli is very helpful.

Now for the large templates I use the following commands instead of the ansible playbook:

# first copy the template to S3
aws s3 cp template.json s3://<bucket>/templates/template.json
# validate the template
aws cloudformation validate-template --template-url \
    "https://s3.amazonaws.com/<bucket>/templates/template.json"
# then apply it if there was no error in validation
aws cloudformation update-stack --stack-name "thestack" --template-url \
    "https://s3.amazonaws.com/<bucket>/templates/template.json" \
    --parameters <parameters> --capabilities CAPABILITY_IAM 

Don’t forget the --capabilities CAPABILITY_IAM or the update will fail.

Overall I’m still quite fond of AWS. It’s empowering for development. None the less, the CloudFormation templates do leave me feeling brutalized at times.

Hope this saves someone some time.

Cheers,

@matthias

updating the AMIs to a new version

We’ve been enjoying the use of AWS CloudFormation. While the templates can be a bit of a bear, the end result is always consistent. (That said, I think that Terraform has some real promise).

One thing we do is to lock our templates to specific AMIs, like this:

    "AWSRegion2UbuntuAMI" : {
      "us-east-1" :      { "id" : "ami-7fe7fe16" },
      "us-west-1" :      { "id" : "ami-584d751d" },
      "us-west-2" :      { "id" : "ami-ecc9a3dc" },
      "eu-west-1" :      { "id" : "ami-aa56a1dd" },
      "sa-east-1"      : { "id" : "ami-d55bfbc8" },
      "ap-southeast-1" : { "id" : "ami-bc7325ee" },
      "ap-southeast-2" : { "id" : "ami-e577e9df" },
      "ap-northeast-1" : { "id" : "ami-f72e45f6" }
    }

That’s great, because we always get the exact same build based on that image and we don’t introduce unexpected changes. For those of you who know their AMI IDs very well, you will realize that this is actually for an older version of Ubuntu.

Sometimes, however, it makes sense to bring the AMIs up to a new version and that means having to find all of the new AMI IDs.

Here is a potential approach using the awscli. I’m going to assume you either have it installed already or run on one of the platforms where the installation instructions work. (Side note: if you are on an Ubuntu box, I recommend installing the version via pip since it works as advertised, while the version in the Ubuntu repo has some odd issues).
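
For reference, the pip-based install mentioned above is just:

# install the AWS command line tools from PyPI rather than the distro package
sudo pip install awscli
aws --version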

Using the awscli it’s possible to list the images. Since I’m interested in Ubuntu images, I search for Canonical’s owner ID, 099720109477, and also apply some filters to show only the 64-bit images with an EBS root device:

aws ec2 describe-images  --owners 099720109477 \
  --filters Name=architecture,Values=x86_64 \
            Name=root-device-type,Values=ebs

That produces a very long dump of JSON (which I truncated):

{
    "Images": [
        {
            "VirtualizationType": "paravirtual", 
            "Name": "ubuntu/images-testing/ebs-ssd/ubuntu-trusty-daily-amd64-server-20141007", 
            "Hypervisor": "xen", 
            "ImageId": "ami-001fad68", 
            "RootDeviceType": "ebs", 
            "State": "available", 
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1", 
                    "Ebs": {
                        "DeleteOnTermination": true, 
                        "SnapshotId": "snap-bde4611a", 
                        "VolumeSize": 8, 
                        "VolumeType": "gp2", 
                        "Encrypted": false
                    }
                }, 
                {
                    "DeviceName": "/dev/sdb", 
                    "VirtualName": "ephemeral0"
                }
            ], 
            "Architecture": "x86_64", 
            "ImageLocation": "099720109477/ubuntu/images-testing/ebs-ssd/ubuntu-trusty-daily-amd64-server-20141007", 
            "KernelId": "aki-919dcaf8", 
            "OwnerId": "099720109477", 
            "RootDeviceName": "/dev/sda1", 
            "Public": true, 
            "ImageType": "machine"
        }, 
......
        {
            "VirtualizationType": "hvm", 
            "Name": "ubuntu/images/hvm/ubuntu-quantal-12.10-amd64-server-20140302", 
            "Hypervisor": "xen", 
            "ImageId": "ami-ff4e4396", 
            "State": "available", 
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/sda1", 
                    "Ebs": {
                        "DeleteOnTermination": true, 
                        "SnapshotId": "snap-8dbadf4a", 
                        "VolumeSize": 8, 
                        "VolumeType": "standard", 
                        "Encrypted": false
                    }
                }, 
                {
                    "DeviceName": "/dev/sdb", 
                    "VirtualName": "ephemeral0"
                }, 
                {
                    "DeviceName": "/dev/sdc", 
                    "VirtualName": "ephemeral1"
                }
            ], 
            "Architecture": "x86_64", 
            "ImageLocation": "099720109477/ubuntu/images/hvm/ubuntu-quantal-12.10-amd64-server-20140302", 
            "RootDeviceType": "ebs", 
            "OwnerId": "099720109477", 
            "RootDeviceName": "/dev/sda1", 
            "Public": true, 
            "ImageType": "machine"
        }
    ]
}

That output is pretty thorough and good for digging through things, but for my purposes it’s too much and lists lots of things I don’t need.

To drill in on the salient information a little more, I use the excellent jq command-line JSON processor to pull out the things I want and also grep for the specific release:

aws ec2 describe-images  --owners 099720109477 \
  --filters Name=architecture,Values=x86_64 \
            Name=root-device-type,Values=ebs \
| jq -r '.Images[] | .Name + " " + .ImageId' \
| grep 'trusty-14.04'

The result is something I can understand a little better:

ubuntu/images/ebs-io1/ubuntu-trusty-14.04-amd64-server-20140829 ami-00389d68
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140926 ami-0070c468
ubuntu/images/ebs/ubuntu-trusty-14.04-amd64-server-20140416.1 ami-018c9568
...
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140923 ami-80fb51e8
ubuntu/images/ebs-io1/ubuntu-trusty-14.04-amd64-server-20140927 ami-84aa1cec
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140607.1 ami-864d84ee
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140724 ami-8827efe0
ubuntu/images/hvm/ubuntu-trusty-14.04-amd64-server-20140923 ami-8afb51e2
ubuntu/images/ebs/ubuntu-trusty-14.04-amd64-server-20140927 ami-8caa1ce4
ubuntu/images/hvm-io1/ubuntu-trusty-14.04-amd64-server-20140923 ami-8efb51e6
ubuntu/images/ebs-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-98aa1cf0
ubuntu/images/hvm/ubuntu-trusty-14.04-amd64-server-20140927 ami-9aaa1cf2
ubuntu/images/hvm-io1/ubuntu-trusty-14.04-amd64-server-20140927 ami-9caa1cf4
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-9eaa1cf6
ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140816 ami-a0ff23c8
ubuntu/images/hvm-io1/ubuntu-trusty-14.04-amd64-server-20140607.1 ami-a28346ca
ubuntu/images/ebs/ubuntu-trusty-14.04-amd64-server-20140724 ami-a427efcc
...
ubuntu/images/ebs/ubuntu-trusty-14.04-amd64-server-20140813 ami-fc4d9f94
ubuntu/images/hvm-io1/ubuntu-trusty-14.04-amd64-server-20140924 ami-fe338696

After a little more investigation I see that the latest version can be identified based on the datestamp, in this case 20140927. I’ve seen some other ways things are named, but in this case the datestamp works well enough and I can look for ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 in each region for the AMI IDs.

for x in us-east-1 us-west-2 us-west-1 eu-west-1 ap-southeast-1 ap-southeast-2 ap-northeast-1 sa-east-1; do
    echo -n "$x "
    aws --region ${x} ec2 describe-images  --owners 099720109477 --filters Name=architecture,Values=x86_64 \
      Name=root-device-type,Values=ebs \
      Name=name,Values='ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927' \
    | jq -r '.Images[] | .Name + " " + .ImageId'
    done

The result is a nice tidy list with the AMI ID for each region:

us-east-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-9eaa1cf6
us-west-2 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-3d50120d
us-west-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-076e6542
eu-west-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-f0b11187
ap-southeast-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-d6e7c084
ap-southeast-2 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-1711732d
ap-northeast-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-e74b60e6
sa-east-1 ubuntu/images/hvm-ssd/ubuntu-trusty-14.04-amd64-server-20140927 ami-69d26774

Now, to make this pastable into the CloudFormation template I run that output through some more shell processing:

cut -f1,3 -d' ' | sed 's/^\(.*\) \(.*\)$/"\1": { "id": "\2" },/'

and end up with

"us-east-1": { "id": "ami-9eaa1cf6" },
"us-west-2": { "id": "ami-3d50120d" },
"us-west-1": { "id": "ami-076e6542" },
"eu-west-1": { "id": "ami-f0b11187" },
"ap-southeast-1": { "id": "ami-d6e7c084" },
"ap-southeast-2": { "id": "ami-1711732d" },
"ap-northeast-1": { "id": "ami-e74b60e6" },
"sa-east-1": { "id": "ami-69d26774" },

I can now paste that into the template and remove the final comma.

Voilà, the new stack will now run with the latest AMIs and can be subjected to testing.

@matthias

docker data only containers for vendor code

Docker has this idea of “data only containers”. The concept is generally pretty simple. You have a container which contains the data and exposes it via the --volume CLI option or the VOLUME directive in the Dockerfile. Another container can then access that volume via the --volumes-from command line switch.

One of the interesting things is that the data only container does not need to be running. It can be started and just exit; as long as the container was created, the volume will be available. Be aware, though, that if you clean up your “exited” containers, you will likely also delete the data only container, since it has exited.

Most of the descriptions online give examples of exposing data sets this way, for example a /var/lib/mysql directory. That can be used to get a consistent data set to run tests against or to make setup easier. There are other examples of where this makes life easier as well.

As we started playing with Docker, one thing was noticeable: images can get big, so they can take a bit of time to ship around. With some consideration of the structure of the build process, this can be mitigated thanks to Docker’s build caching.

None the less, this brought us to another potential use of the data only container: vendor code.

For example, we have a tool we use which is about 130MB in size. So far we’ve been baking it into the application container. However, since this code rarely changes, it’s a great candidate to be split into its own container.

We’ve been experimenting with the idea and created an image just for that code, then linked to the volume from the application.

So far it’s working rather well. Here is an example.

Say we have our code in a vendor/ directory and need to access it from the application under /opt/vendor.

The Docker build starts with the busybox base image and simply copies the vendor code into the image under /opt/vendor.
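
Here is a minimal sketch of that build, assuming the layout described above (the file and image names match the examples used below):

# write a tiny Dockerfile for the data only image
cat > Dockerfile <<'EOF'
FROM busybox
# copy the vendor code into the image at the path the application expects
COPY vendor/ /opt/vendor/
# expose it so other containers can use --volumes-from
VOLUME /opt/vendor
# the container only needs to run once and exit; the volume remains usable
CMD ["echo", "Data only container for access to vendor code"]
EOF
# build the image
docker build -t vendor-image .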

Starting the container is easy:

# you don't need the --volume flag if your Dockerfile exposes the volume
docker run -d --volume /opt/vendor --name vendor_name vendor-image

In our case the CMD or run command is simple:

echo "Data only container for access to vendor code"

That means that the container starts, prints that line and exits. That’s good enough to access the data.

Now the application container is started with

docker run -d --volumes-from vendor_name application-image

The --volumes-from will grab the volumes from the “vendor_name” container and bring them into the application container at the same mount point.

We now have the entire vendor code base available in the application container without having to bake it into that image.

One thing of note is that there are two choices in starting the data only container. If you start with an empty container based on the scratch image, the run command will likely fail unless your vendor code lets you run something easily. In our case the code base requires a lot of supporting pieces and the container can’t even do an echo successfully. That’s the first choice: put a CMD in your Dockerfile and know that it will always fail on start, but the container gets created none the less. (Hint: if you just leave the CMD off, it will not start).

The other choice is to start with a slightly bigger base image such as busybox. That’s what we have done and the 2.5MB extra seems worth it to avoid the failure.

@matthias

query consolidation in elasticsearch

In my last post on a simple way to improve elasticsearch queries I promised a follow-up on another way to optimize queries.

This approach didn’t deliver the order-of-magnitude improvement from the previous post, but it still offers some benefits.

Once again, I was working on improving my rough first shot at working code. In this case the app was displaying the search results I mentioned last time, but it was also pulling various facets for display.

By the time everything was rendered I had issued somewhere between 12 and 15 calls or queries. Some of these were necessary for authentication or for capturing data needed for the actual query. However, there was a clear opportunity for improvement.

My focus was on a couple of sets of queries in particular. The first was a call to capture statistics for a field, which would then be used to set up the actual facet calls. (Side note: Yep, the facets are going away and are being replaced by aggregations. I’ll likely share some notes on this when I’m done making that change).

{
    "facets": {
       "date": {
          "statistical": {
             "field": "date"
          }
       }
    }
}

My code has a few of those calls for various numeric fields such as date, size etc.

The other set of queries to focus on was the retrieval for the actual facets.

{
    "facets": {
       "tags": {
          "terms": {
             "field": "tags",
             "size": 10
          }
       }
    }
}

Now, the first set of stats-related facets is actually used to dynamically create the buckets for some of the actual facet calls. That still lets me combine the first group into one call and the second group into another.

So I basically end up with two calls to Elasticsearch: the first to grab the statistics facets and the second for the facets that are actually displayed in the application.

None the less, rather than issuing a call for each one independently, we can combine them, like this:

{
    "facets": {
       "date": {
          "statistical": {
             "field": "date"
          }
       },
       "size": {
          "statistical": {
             "field": "size"
          }
       }
    }
}

and then one more call which also includes the actual query:

{
   "query": {
      "query_string": {
         "default_field": "body",
         "query": "test"
      }
   },
   "fields": [
      "title"
   ],
   "facets": {
      "tags": {
         "terms": {
            "field": "tags",
            "size": 10
         }
      },
      "folder": {
         "terms": {
            "field": "folder",
            "size": 10
         }
      }
   }
}
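
Either combined body goes out as a single search request, for example (the host and index name are placeholders):

# one HTTP round trip instead of one per facet
curl -s -XPOST 'http://localhost:9200/myindex/_search' -d @combined_facets.json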

You’ll notice that I’m also just returning the field I need for display as described in the last post.

While this approach doesn’t really reduce the amount of work Elasticsearch has to perform, it reduces the number of individual calls that need to be made. That means most of the improvement is in the number of calls and network round trips that need to take place. The latter will likely have a bigger impact if the calls are made sequentially rather than asynchronously. Regardless, it does offer some improvement in my experience so far.

@matthias


Deploying Python Applications and their Virtual Environments

Introduction

As noted in a past article, we leverage virtualenv and pip to isolate and manage some of our Python applications. A natural next question is “How can a Python virtual environment and related application be deployed to a production server?” This article provides a conceptual overview of one way such deployments can be handled.

The Server Environment and Conventions

First, let’s discuss some assumptions about the server environment. In this article, a deployment server, development server(s), and production server(s) are all discussed. It can be assumed that all these servers are running the same operating system (in this case, RHEL 6). This provides a luxury which allows for transporting virtual environments from one host to another with no ill effects and no requirement to build new virtual environments for each host.

Additionally, there are some directory conventions used which help ensure consistency from host to host. The virtual environment is located at a standard path such as /opt/companyname/. The code for each Python application is then located in a directory inside the virtual environment root. This makes for a set of paths like so:

Example directories:

/opt/company/myapp/   # the virtual env root

/opt/company/myapp/myapp/              # the application root
/opt/company/myapp/myapp/lib/          # the application library
/opt/company/myapp/myapp/bin/appd.py   # the application

The Build

The building of the python application is a two step process. First the virtual environment is created or updated. Next, the desired version of the application is exported from the repository. This work all takes place on the deployment server.

Steps to build the virtual env and application:

# Go to the standard app location
cd /opt/company/

# Create the virtual env if needed
virtualenv ./myapp

# Export the desired copy of the app inside the virtual env root
svn export $repouri /opt/company/myapp/myapp/

# Activate the virtualenv
cd /opt/company/myapp/ && source ./bin/activate

# Install the requirements
cd /opt/company/myapp/myapp/
pip install -r ./requirements.txt

Here’s an example script which would handle such a build:

* build-myapp.sh

The Deploy

Once the virtualenv and application are built, the deployment can be handled with some rsync and scripting work. This same model can be used to deploy to development servers or production servers, maintaining consistency across any environment. It can also be used to deploy your application to a new server. While a bit of a simplification, the deployment can be envisioned as a simple for-loop around rsync.

Example deployment loop:

for host in $devservers; do
    rsync -avz --delete-after /opt/company/myapp/ $host:/opt/company/myapp/
done

Here’s an example script which would handle such a deployment:

* deploy-myapp.sh
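
A rough sketch of what such a deploy script might look like under the conventions above (the host list, paths, and restart command are illustrative assumptions):

#!/bin/bash
# deploy-myapp.sh (sketch): push the built virtualenv and app to each host
set -e

DEVSERVERS="dev1.example.com dev2.example.com"   # illustrative host list
APPDIR="/opt/company/myapp"

for host in $DEVSERVERS; do
    echo "Deploying to ${host}..."
    rsync -avz --delete-after "${APPDIR}/" "${host}:${APPDIR}/"
    # restart the application; the service name is a placeholder
    ssh "$host" "sudo service myappd restart"
done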

Closing

This describes one of many ways Python applications and their virtual environments can be deployed to remote hosts. It is a fairly simple matter to assemble these techniques into shell scripts for semi-automated build and deployment. Such scripts can then be enhanced with preferred conventions as well as more intelligent handling of application restarts, rollbacks, configuration management, and other improvements particular to the application.

simple way to improve elasticsearch queries

We use Elasticsearch for some things. I personally have been enjoying working with it as part of a new tool we are building. I’ve learned a couple of things from a querying perspective.

First, I could say a lot about how impressed I am with Elasticsearch from an operations perspective. Out of the box it runs extremely well, but I’ll save that for another post. Here I’ll talk about some rather simple ideas to improve the querying of Elasticsearch.

When developing I often start very basic. It could even be described as simplistic. The first shot is generally not very efficient, but it helps to quickly determine whether an idea is workable. This is what I’ve recently done with some code querying Elasticsearch.

The first simple performance improvement was around generating a display of the search results. To get things going quickly, I issued the query and grabbed the results. By default Elasticsearch returns the entire document in the _source field. The simple query might look like this:

{
  "query": {
    "match_all": {}
  }
}

The returned results then include the _source field and might look like this

{
  "_index": "test",
  "_type": "doc",
  "_id": "20140806",
  "_score": 1,
  "_source": {
    "title": "some title",
    "body": "the quick brown fox jumps over the lazy dog"
  }
}

My code would then go through the array and grab the title field from the _source for display in the result list. That worked OK, but seemed slow. (Full disclosure: my documents were quite a bit bigger than the simple example above.)

Now, since I didn’t really need the entire document just to display the title, the obvious choice was to get only the necessary data. Elasticsearch makes this easy via the fields parameter:

{
  "query": {
    "match_all": {}
  },
  "fields": [
    "title"
  ]
}

That will return something like the following in the hits array:

{
  "_index": "test",
  "_type": "doc",
  "_id": "20140806",
  "_score": 1,
  "fields": {
    "title": [ "some title" ]
  }
}

That lets me skip the retrieval of potentially large chunks of data. The results were quite impressive in my use case. The run time of the queries and display of results dropped by an order of magnitude. Again, this is likely due to the much larger documents I was actually working with. None the less it is a good example of only retrieving the necessary data rather than issuing what amounts to a SELECT * in SQL terms.

The other performance improvement was around consolidating queries, but I’ll save that for a future post.

@matthias