query consolidation in elasticsearch

In my last post on a simple way to improve elasticsearch queries I promised a follow-up on another way to optimize queries.

This approach doesn’t deliver the order-of-magnitude improvement from the previous post, but it still offers some benefits.

Once again, I was working on improving my rough first shot of working code. In this case the app I was working on was displaying the search results I mentioned last time, but it also was pulling various facets for display as well.

By the time everything was rendered I had issued somewhere between 12 and 15 calls or queries. Some of these were necessary, for authentication or to capture data needed for the actual query, but there was a clear opportunity for improvement.

My focus was on a couple of sets of queries in particular. The first was a call to capture statistics for a field, which would then be used to set up the actual facet calls. (Side note: yep, facets are going away and are being replaced by aggregations. I’ll likely share some notes on this when I’m done making that change.)

{
    "facets": {
       "date": {
          "statistical": {
             "field": "date"
          }
       }
    }
}

My code has a few of those calls for various numeric fields such as date, size, etc.

The other set of queries to focus on was the retrieval for the actual facets.

{
    "facets": {
       "tags": {
          "terms": {
             "field": "tags",
             "size": 10
          }
       }
    }
}

Now the first set of stats-related facets is actually used to dynamically create the buckets for some of the actual facet calls. That still lets me combine the first group into one call and the second group into another.
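
To make "dynamically create the buckets" concrete, here is a small sketch of one way to do it: take the min and max that a statistical facet returns for a numeric field and split that span into evenly sized ranges for a range facet. The helper name and the bucket count are my own illustration, not from the actual application code.

```python
def buckets_from_stats(min_val, max_val, count=4):
    """Split [min_val, max_val] into `count` even ranges for a range facet."""
    step = (max_val - min_val) / count
    ranges = []
    for i in range(count):
        ranges.append({"from": min_val + i * step,
                       "to": min_val + (i + 1) * step})
    return ranges

# e.g. sizes between 0 and 1000 bytes give four 250-byte buckets
size_buckets = buckets_from_stats(0, 1000)
```

The resulting list of from/to pairs can be dropped straight into a range facet definition in the second call.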

So, I basically end up with two calls to Elasticsearch: the first grabs the statistics facets, and the second retrieves the facets that are actually used in the application for display.

Nonetheless, rather than issuing a call for each one independently, we can combine them, like this:

{
    "facets": {
       "date": {
          "statistical": {
             "field": "date"
          }
       },
       "size": {
          "statistical": {
             "field": "size"
          }
       }
    }
}
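
Since the per-field statistical facets all have the same shape, the combined body can be generated from a list of field names instead of hand-written per field. This is a sketch; the function name and field list are my own, but the output matches the request body above.

```python
# Build one statistical-facets request for several numeric fields,
# instead of issuing a separate call per field.
NUMERIC_FIELDS = ["date", "size"]

def stats_facet_body(fields):
    """Combine one statistical facet per field into a single request body."""
    return {"facets": {f: {"statistical": {"field": f}} for f in fields}}

body = stats_facet_body(NUMERIC_FIELDS)
```

Adding another numeric field later then means extending the list rather than writing another query.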

and then one more call which also includes the actual query:

{
   "query": {
      "query_string": {
         "default_field": "body",
         "query": "test"
      }
   },
   "fields": [
      "title"
   ],
   "facets": {
      "tags": {
         "terms": {
            "field": "tags",
            "size": 10
         }
      },
      "folder": {
         "terms": {
            "field": "folder",
            "size": 10
         }
      }
   }
}

You’ll notice that I’m also returning only the field I need for display, as described in the last post.
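
Everything the page needs then comes out of that single response: the titles for the result list and the counts for every facet. The sketch below parses an abbreviated sample response; the shape follows the pre-aggregations facet API, but the document and the counts are made up for illustration.

```python
# Abbreviated sample of the single combined response (shape based on the
# pre-aggregations facet API; the values here are invented).
response = {
    "hits": {"hits": [
        {"_id": "20140806", "fields": {"title": ["some title"]}},
    ]},
    "facets": {
        "tags": {"terms": [{"term": "work", "count": 7}]},
        "folder": {"terms": [{"term": "inbox", "count": 3}]},
    },
}

# Titles for the result list come straight from the fields section...
titles = [hit["fields"]["title"][0] for hit in response["hits"]["hits"]]

# ...and every facet arrives in the same round trip.
tag_counts = {t["term"]: t["count"]
              for t in response["facets"]["tags"]["terms"]}
```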

While this approach doesn’t really reduce the amount of work Elasticsearch has to perform, it reduces the number of individual calls that need to be made. Most of the improvement therefore comes from fewer calls and fewer network round trips. The latter will likely have a bigger impact if the calls are made sequentially rather than asynchronously. Regardless, it has offered a measurable improvement in my experience so far.

@matthias


simple way to improve elasticsearch queries

We use Elasticsearch for some things. I personally have been enjoying working with it as part of a new tool we are building, and I’ve learned a couple of things from a querying perspective.

First, I could say a lot about how impressed I am with Elasticsearch from an operations perspective. Out of the box it runs extremely well, but I’ll save that for another post. Here I’ll talk about some rather simple ideas to improve the querying of Elasticsearch.

When developing I often start very basic. It could even be described as simplistic. The first shot is generally not very efficient, but it helps to quickly determine if an idea is workable. This is what I’ve recently done with some code querying Elasticsearch.

The first simple performance improvement was around generating a display of the search results. To get things going quickly, I issued the query and grabbed the results. By default Elasticsearch returns the entire document in the _source field. The simple query might look like this:

{
  "query": {
    "match_all": {}
  }
}

The returned results then include the _source field and might look like this:

{
  "_index": "test",
  "_type": "doc",
  "_id": "20140806",
  "_score": 1,
  "_source": {
    "title": "some title",
    "body": "the quick brown fox jumps over the lazy dog"
  }
}

My code would then go through the array and grab the title field from the _source for display in the result list. That worked OK, but seemed slow. (Full disclosure: my documents were quite a bit bigger than the simple example above.)

Now since I didn’t really need the entire document just to display the title, the obvious choice is to retrieve only the necessary data. Elasticsearch makes this easy via the fields parameter:

{
  "query": {
    "match_all": {}
  },
  "fields": [
    "title"
  ]
}

That will return something like the following in the hits array:

{
  "_index": "test",
  "_type": "doc",
  "_id": "20140806",
  "_score": 1,
  "fields": {
    "title": [ "some title" ]
  }
}
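
One small wrinkle worth noting: with the fields parameter each requested field comes back as a list, whereas with _source you get the raw document value. A quick sketch of reading the title in both cases (the hit dicts below mirror the two responses shown above):

```python
# With `fields`, each requested field comes back as a list, so take
# the first element; with `_source`, it is the raw document value.
hit_with_fields = {"_id": "20140806",
                   "fields": {"title": ["some title"]}}
hit_with_source = {"_id": "20140806",
                   "_source": {"title": "some title",
                               "body": "the quick brown fox jumps over the lazy dog"}}

title_from_fields = hit_with_fields["fields"]["title"][0]
title_from_source = hit_with_source["_source"]["title"]
```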

That lets me skip retrieving potentially large chunks of data. The results were quite impressive in my use case: the run time of the queries and the display of results dropped by an order of magnitude. Again, this is likely due to the much larger documents I was actually working with. Nonetheless, it is a good example of retrieving only the necessary data rather than issuing what amounts to a SELECT * in SQL terms.

The other performance improvement was around consolidating queries, but I’ll save that for a future post.

@matthias