So Many Operations

Often, I will reduce a complex problem into a set of abstract computer “ops”. These ops aren’t meant to be an exact description of how the computer or the network would carry out the task, but a logical abstraction. Let me walk through a short example.

How does a file get to repono (storage)?

Ignoring all of the steps that got us to the point of putting the file in Repono, here’s how I would think of the operational cost of Repono (as of June 2016):

  • pfiles calls stork, stork pushes a job into ins-queue for every file, operational cost is 1*n
  • ins-queue records the job in a queue table, operational cost is 1*n
  • r2 asks for work from ins-queue and marks it WIP, operational cost is 1*n
  • r2 commits the file to repono, operational cost is 1*n
  • repono writes to disk, cost is 2*n for replica
  • r2 logs the outcome to logserver, operational cost is 1*n
  • r2 asks to remove the record from ins-queue, operational cost is 1*n

As such, if we have 1,000,000 documents, in this overview, we have 9,000,000 operations.

But there is much more in the details…

What’s going on in ins-queue?

ins-queue was developed in house. That could make it a tool well tuned to our needs, or just another tool whose every operational aspect we have to keep up with; your perspective will likely determine how you see this sort of software. Regardless, here are some of the current stats of ins-queue (June 2016):

 +----+------------+-----------+
 | id |       tool |       cnt |
 +----+------------+-----------+
 | 44 |    roz_kvp |     62649 |
 | 43 |    roz_mdb |     62839 |
 | 1  |        roz |     62931 |
 | 57 | roz_merged |     79696 |
 | 5  |     grimes |    364235 |
 | 6  |     pdfopt |   3825390 |
 | 3  |   smithers |  10607081 |
 | 4  |    s3queue | 667402231 |
 +----+------------+-----------+

Yes, that is 667M documents that have flowed through ins-queue (s3queue) and (mostly) on to Repono. That also means we have done nearly 11M data overlays (smithers) and have optimized 3.8M PDFs (pdfopt). But there’s a little more…how do we know how many records? Every time we take a record out of s3queue, we ask to increment this counter. That’s one additional operation. If we were at 9n before, we’re now at 10n. But the chained events that update the counters also update another logging effort to keep track of activity in deletion mode. Our new count is 11n. So for the 667M documents listed here, we’ve taken 7.3B (yes, Billion) operations to store those documents in Repono (this doesn’t count optimizing, scanning, inventory, and even a full drill down on all of the logging). If we went to a granular level, I suspect that the act of storing a document with logging is probably 20n.

What’s going on in logserver?

We have a REST-based generic logging tool called logserver, or Ticket Tool. It is very simple. It was discussed previously and the source is here.

As adoption has increased, so has our desire to create more detailed logs. This platform is one of the busiest in our operations (although that is a crowded field of busy platforms). It can fill quickly, and the long-term value of the information decays pretty quickly. So, to save storage, we have cron’d some administration to take the older files and zip them up in place. A custom error handler will find the compressed file if you have a link to the original file. That was going pretty well until we looked at our replication hosts.

Yesterday, it went into alarm.  So much data had accumulated.

banshee

Our monitoring has thresholds on operational limits.

The comfortable and generic syncing of all data was never updated to reflect the broader adoption of the logging service. As such, we would end up with both the original file and the zipped file. After removing the original files, we recovered over 60% of this resource.

All of this is happening behind the scenes and adds to the operational costs, choices, and scale of working in a distributed platform.

Creating custom rules for ESLint

Creating custom rules for ESLint is one of the more attractive ways of building continuity tests. This allows you to set up organization- or project-specific rules that are unique to your code.

Continuity Tests

The idea of testing is generally split between unit tests and integration tests, where unit tests exercise specific functions or modules and integration tests are higher-level, more abstract tests, often done via QA testers in a mostly manual process.

The third and often overlooked set of testing is continuity testing, also known as static code analysis.

ESLint

ESLint is a powerful linting utility for JavaScript and is highly configurable and extendable. Through the use of plugins you can write your own custom set of rules to use in your project.

Creating a custom rule set involves creating a new eslint-plugin-{name} project. There is a Yeoman generator that gets you a quick scaffold to get started.

We’re going to set up a new project from scratch.

$ mkdir eslint-plugin-sounds && cd eslint-plugin-sounds
$ npm init

Create your project and accept all the defaults for now.

$ npm install eslint --save-dev

We install eslint as a dev dependency; since the plugin itself is called by ESLint, we do not need to package ESLint with our plugin.

Create the following directories

$ mkdir lib
$ mkdir lib/rules
$ mkdir tests
$ mkdir tests/lib
$ mkdir tests/lib/rules

I’m not going to cover the rest of a normal github/npm package with readme.md. If you are not familiar with creating projects and publishing to npm I suggest you read GitHub Getting Started and NPM Publishing Packages.

Let’s create our first rule

$ cd lib/rules
$ touch first-rule.js

"use strict";
module.exports = function(context) {
    return {
        // Rule handlers, keyed by AST node type
    };
};

module.exports.schema = [];

The module exports a function that returns an object whose properties are handlers keyed by AST node type, e.g. Identifier, CallExpression, MemberExpression.

First we need to know what part of the static tree we are going to lint.

Head over to ASTExplorer.net, put in your code snippet, and see the tree.

[Screenshot: the example snippet rendered as a tree in AST Explorer]

We want to write a rule that checks this method call to make sure arguments are being passed and that the first argument is not null.

We see from the AST we are going to use the CallExpression node.

In that tree we have a MemberExpression, with the object (sounds) and property (get), inside the callee property. sounds.get() is the method that would be called.

[Screenshots: the CallExpression node and its callee MemberExpression in AST Explorer]

return {
    "CallExpression": function(node) {
        // node = {CallExpression}
        var callee = node.callee;
        // callee.object.name === 'sounds'
        // callee.property.name === 'get'
    }
}

So now that we know which method we want to test, we need to check against the object and property names.

if(callee.object.name === "sounds" &&
    callee.property.name === "get") {
  // Now we only match the methods that we are looking for
}

Lastly, we want to check whether that method is being called without arguments or with a null first argument.

[Screenshots: the arguments array of the CallExpression in AST Explorer]

if (!node.arguments[0] || node.arguments[0].value === null) {
    //Report the error
}

Let’s put it all together.

"use strict";

module.exports = function(context) {
    return {
        "CallExpression": function(node) {
            var callee = node.callee;
            // Skip plain function calls like foo(), which have no callee.object
            if (callee.type === "MemberExpression" &&
                callee.object.name === "sounds" &&
                callee.property.name === "get" &&
                (!node.arguments[0] || node.arguments[0].value === null)) {
                    context.report(node, "Method sounds.get() called without arguments or with a null first argument");
            }
        }
    };
};

module.exports.schema = [];

There’s your first rule. It matches the method signature sounds.get() and reports when the call has no arguments or a null first argument.

To use your new rule, you need to include your plugin in your node_modules, and in your ESLint config you include your plugin and rule.

{
  "plugins": [
    "sounds"
  ],
  "rules": {
    "sounds/first-rule": 1
  }
}

Notice that in the plugins section we only put our short name (sounds), not the full module name (eslint-plugin-sounds), because ESLint resolves the plugin name to the eslint-plugin-{name} package.

From here you should create a test to try out possible valid and invalid code snippets to make sure that everything is still valid. Check out this guide on how to write a test for your new rule.

There are dozens upon dozens of plugins out there with more complex rules; see eslint-plugin-react for some really good ones.

Today we learned how to create your first ESLint plugin and custom rule, and implement it in your lint config file.

The Not So Small Stuff

Part of our application retrieves files from an internal storage system, named Repono, that behaves similarly to AWS S3. The system has four heads and is fronted by a network load balancer. Clients of Repono retrieve authentication tickets from our vending system; Repono then does an authorization lookup and, if everything is OK, the resource (aka “the file”) is handed out over HTTPS. Normally, the performance of this system is really fantastic and looks like this chart:

Response Time

Response time as recorded by our internal monitoring system

About a month ago, the performance of this system really changed. We went from routine performance under 1 second to sporadic responses clocking in over 10 seconds.

Uh Oh!

A different monitoring system using ELK measuring the response time of our middleware to Repono

This problem did not reveal itself easily. The major client of Repono was recently upgraded to a newer version of Microsoft .NET. One step we took was to revert the change and downgrade the version of .NET. For around 20 to 30 minutes, the performance was looking good, and then it reverted to the lumpy 10 second performance.

What else could have changed? This was not easy to see. The HTTPS traffic did not allow easy inspection, and alternative testing with our go-to baseline client, cURL, could not reproduce the results. The problem was clearly affecting production, and we couldn’t easily swap out our middleware tools in .NET for a set of shell-based testing scripts using cURL.

We needed to see inside the HTTP and TCP conversation of the .NET HTTP client, but the normal amount of debugging and logging did not record the details.

Thanks to the big ol’ Internet, we stumbled on this helpful article from Mike Hadlow on how to get the level of information we wanted out of .NET and its HTTP client. The only problem is that this logging doesn’t include time stamps, and there was too much traffic to keep up with and untangle from the audit logs. By adjusting the load balancer, we could force most of the traffic through patched software with the new logging. This greatly reduced the volume of data, and we tried to watch in real time. This is what we saw:

 1 System.Net Verbose: 0 : [5156] ConnectStream#43140910::Close()
 2 System.Net Verbose: 0 : [5156] Exiting ConnectStream#43140910::Close()
 3 System.Net Verbose: 0 : [5156] Exiting MyWebClient#1109663::DownloadData() -> Byte[]#50053605
 4 System.Net Verbose: 0 : [5156] MyWebClientWithHead#30034512::DownloadData(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 5 System.Net Verbose: 0 : [5156] MyWebClientWithHead#30034512::DownloadData(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 6 System.Net Verbose: 0 : [5156] WebRequest::Create(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED)
 7 System.Net Verbose: 0 : [5156] HttpWebRequest#45035759::HttpWebRequest(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 8 System.Net Verbose: 0 : [5156] Exiting HttpWebRequest#45035759::HttpWebRequest()
 9 System.Net Verbose: 0 : [5156] Exiting WebRequest::Create() -> HttpWebRequest#45035759
 10 System.Net Verbose: 0 : [5156] HttpWebRequest#45035759::GetResponse()
 11 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ServicePoint#17057466
 12 System.Net Information: 0 : [5156] Associating Connection#31630514 with HttpWebRequest#45035759
 13 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#66385004
 14 System.Net Information: 0 : [5156] HttpWebRequest#45035759 - Request: HEAD /v1/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED HTTP/1.1
 
 15 System.Net Information: 0 : [5156] ConnectStream#66385004 - Sending headers
 16 {
 17 x-auth-user: REDACTED
 18 x-auth-expiry: 2015-08-16T01:22:43Z
 19 x-auth-ticket: REDACTED
 20 Accept: text/xml
 21 Content-Type: text/xml
 22 User-Agent: REDACTED/version-2.4501/GetReponoAuthURL/gaggleid:REDACTED
 23 Host: REDACTED
 24 }.
 25 System.Net Error: 0 : [5156] Exception in the HttpWebRequest#45035759:: - The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
 26 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ServicePoint#17057466
 27 System.Net Information: 0 : [5156] Associating Connection#14267127 with HttpWebRequest#45035759
 28 System.Net Information: 0 : [5156] Connection#14267127 - Created connection from REDACTED:56478 to REDACTED:443.
 29 System.Net Information: 0 : [5156] TlsStream#34867337::.ctor(host=REDACTED, #certs=0)
 30 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#62287651
 31 System.Net Information: 0 : [5156] HttpWebRequest#45035759 - Request: HEAD /v1/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED HTTP/1.1
 32 System.Net Information: 0 : [5156] SecureChannel#50704654::.ctor(hostname=REDACTED, #clientCertificates=0, encryptionPolicy=RequireEncryption)
 33 System.Net Information: 0 : [5156] SecureChannel#50704654 - Left with 0 client certificates to choose from.
 34 System.Net Information: 0 : [5156] Using the cached credential handle.
 35 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = (null), targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 36 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffer length=0, Out-Buffer length=159, returned code=ContinueNeeded).
 37 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 38 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 39 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 40 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 41 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 42 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 43 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 44 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=134, returned code=ContinueNeeded).
 45 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 46 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 47 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 48 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=OK).
 49 System.Net Information: 0 : [5156] Remote certificate: [Version]
 50 V3
 51 REDACTED
 
 52 [Signature Algorithm]
 53 sha256RSA(1.2.840.113549.1.1.11)
 
 54 [Public Key]
 55 Algorithm: RSA
 56 Length: 2048
 57 Key Blob: REDACTED
 58 System.Net Information: 0 : [5156] ProcessAuthentication(Protocol=Tls, Cipher=Aes128 128 bit strength, Hash=Sha1 160 bit strength, Key Exchange=44550 256 bit strength).
 
 59 System.Net Information: 0 : [5156] ConnectStream#62287651 - Sending headers
 60 {
 61 x-auth-user: REDACTED
 62 x-auth-expiry: 2015-08-16T01:22:43Z
 63 x-auth-salt: REDACTED
 64 x-auth-ticket: REDACTED
 65 x-auth-dev-ticket: REDACTED
 66 Accept: text/xml
 67 Content-Type: text/xml
 68 User-Agent: REDACTED/version-2.4501/GetReponoAuthURL/gaggleid:REDACTED
 69 Host: REDACTED
 70 }.
 71 System.Net Information: 0 : [5156] Connection#14267127 - Received status line: Version=1.1, StatusCode=200, StatusDescription=OK.
 72 System.Net Information: 0 : [5156] Connection#14267127 - Received headers
 73 {
 74 X-Timestamp: 1413849343.99091
 75 X-Preauth-Uri: /repono/3dfc19b30213a7fd7297e1fb32815b95d15e7187a91e13c84b1c423c/REDACTED?token=PREA_49cd6e580ce3460948610e05e1dba031adf5bc19b098a2a98226cf4a
 76 X-Trans-Id: tx72179a9abf4a4d7e8cc65-0055cfe147
 77 Accept-Ranges: bytes
 78 Content-Length: 5378129
 79 Content-Type: application/pdf
 80 Date: Sun, 16 Aug 2015 01:03:03 GMT
 81 ETag: 40fb56bde55a37911d253debfa002005
 82 Last-Modified: Mon, 20 Oct 2014 23:55:44 GMT
 83 }.
 84 System.Net Information: 0 : [5156] ConnectStream#10452251::ConnectStream(Buffered 0 bytes.)
 85 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#10452251
 86 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with HttpWebResponse#24816868
 87 System.Net Verbose: 0 : [5156] Exiting HttpWebRequest#45035759::GetResponse() -> HttpWebResponse#24816868
 88 System.Net Verbose: 0 : [5156] HttpWebResponse#24816868::GetResponseStream()
 89 System.Net Information: 0 : [5156] ContentLength=5378129
 90 System.Net Verbose: 0 : [5156] Exiting HttpWebResponse#24816868::GetResponseStream() -> ConnectStream#10452251
 91 System.Net Verbose: 0 : [5156] ConnectStream#10452251::Read()
 92 System.Net Verbose: 0 : [5156] Exiting ConnectStream#10452251::Read() -> 0#0
 93 System.Net Verbose: 0 : [5156] ConnectStream#10452251::Close()

If you see it right away, contact me; I would like to hire you on to the team. For those of you who don’t immediately see it, tune your eyes to line 25 in the output.

Here’s what is going on: the .NET client expects to reuse the TCP socket, and when it doesn’t get an answer, it waits for 10 seconds and then establishes a new HTTPS session. Once we had this isolated, we patched the .NET code to close the connection after retrieving the file and tear down the TCP session. There could be a lot of discussion about what sessions and HTTP mean at the TCP level, but we could not adjust the Repono side and we could adjust our .NET side, so that is the approach we used.

This correction is very important, but it doesn’t easily answer what changed. Through a lot more work in the troubleshooting domain, we noticed that the load balancer had changed behavior, specifically around long-running sessions. Patching the load balancer and our .NET code has put us in a better place.

Similarity Search and Hashing for Text Documents

Introduction

This is a high level overview of similarity hashing for text, locality sensitive hashing (LSH) in particular, and connections to application domains like approximate nearest neighbor (ANN) search. This writeup is the result of a literature search and part of a broader project to identify an implementation pattern for similarity search in large scale document collections (100M+).

The Continuum of Similarity

At Catalyst, we encounter ideas which involve the identification and measurement of document similarity. Consider a "more like this" feature, ranking documents by similarity to other documents, clustering documents, tagging near duplicates, email threading, and so on.

In the abstract, these are all issues of similarity, and similarity can be scored on a continuum between 0 and 1. Mildly similar documents might score a 0.5. Exact duplicates score at 1.0.

Is there a general implementation pattern that can be brought to bear?

Can we derive a similarity scoring engine and apply it to multiple current and future features?

How do our answers change as we face larger scale collections of 100M or 500M documents?

These are the drivers behind this research.

Traditional Hashing

In software and systems, hash functions have gained popularity as a means to deterministically map input data to a predefined range of "buckets". Here’s a very basic hash function, which will hash positive integers into 10 different buckets:

# What is the leftover amount when x is divided by 10?
h(x) = x mod 10
# Here are some examples:
 h(10) =  10 / 10 =  1 with 0 leftover = bucket 0
 h(18) =  18 / 10 =  1 with 8 leftover = bucket 8
 h(34) =  34 / 10 =  3 with 4 leftover = bucket 4
h(237) = 237 / 10 = 23 with 7 leftover = bucket 7

When faced with storing 10 million files, one might wish to spread them evenly across a small range of 1,000 directories (aka buckets). A sequence of hashing and heuristics could be used to assign each of the files to one of the directories.
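For illustration, here is a minimal sketch of that directory-bucketing idea (Python; the hash choice and the 1,000-directory range are just the example's assumptions):

import hashlib

NUM_DIRS = 1000  # the small range of directories (buckets) from the example

def directory_for(filename):
    """Deterministically map a file name to one of NUM_DIRS directories."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_DIRS
    return "{:03d}".format(bucket)

# The same name always lands in the same directory, and names spread
# roughly evenly across the 1,000 buckets.
print(directory_for("report-2016-06.pdf"))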

At the other extreme, when faced with identifying exact duplicates of data, one might define a much larger range, greater even than the number of items to be hashed. If two items hash to the same bucket, then you can be relatively assured that they are identical. This can be used in various ways to filter duplicate items. As a side note, the extension of this idea into probabilistic data structures is where the popular Bloom Filter is situated.

The main intuition behind this common kind of hashing is that it ensures any given set of inputs will generate hashes which are randomly and uniformly distributed across the defined range. Even the slightest change in input will generate a vastly different hash or signature.

In cryptographic hashing and signatures, the intuition is similar. Hashing of the binary contents of a file, using MD5 or SHA class hashes will generate a fixed length signature for that file. Even a single character change within a text file would then result in a dramatically different document signature.

Similarity Hashing

Similarity hashing arises from a contrasting idea to traditional hashing. The intuition for similarity hashing is to generate hashes or signatures which preserve the similarity and relationships between two items. Rather than a random and uniform distribution, the aim of similarity hashing is to cluster like items together.

Two documents which contain very similar content should result in very similar signatures when passed through a similarity hashing system. Similar content leads to similar hashes. Locality sensitive hashing (LSH) is a formal name for such a system, and a broad academic topic addressing related concerns.

An Example

Let’s walk through an example of one kind of similarity hashing. This example should help explore the concept. It will also begin to show that LSH is not a singular hash function but an assembly of many techniques involving the generation and comparison of document signatures.

First, imagine two sentences:

Sentence A: "The dog is brown"
Sentence B: "The dog is happy"

Then, imagine representing those two sentences in a matrix like so:

words: the | dog | is | brown | happy | sunshine
    A:  1  |  1  | 1  |   1   |   0   |    0
    B:  1  |  1  | 1  |   0   |   1   |    0

For each sentence and each column, a 1 indicates the word was present and a 0 indicates the word was not present. Note that the word "sunshine" is added to the matrix, and since neither sentence contains that word, both have a 0 in that column. It is added to the example to help illustrate how such columns are dropped later as part of the similarity measurement.

How similar are the two sentences?

How can the similarity be scored?

This can be formulated as a set problem, and the two ordered sets can be compared:

Sentence A = [1, 1, 1, 1, 0, 0]
Sentence B = [1, 1, 1, 0, 1, 0]

Both sentences have a bit set in the first 3 columns, and 5 columns have a bit set in at least one of the sets. So we’ll give these two sentences a similarity score of 3/5, or 0.6. The formal name for this is Jaccard Similarity.
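That calculation is small enough to sketch directly (Python, using the example vectors above):

def jaccard(a, b):
    """Jaccard similarity of two equal-length bit vectors."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return intersection / float(union) if union else 1.0

sentence_a = [1, 1, 1, 1, 0, 0]
sentence_b = [1, 1, 1, 0, 1, 0]
print(jaccard(sentence_a, sentence_b))  # 0.6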

Walking through these steps, we have hashed two sentences and then compared their signatures to generate a similarity score.

On the Path to MinHash

Let’s move closer to a real world scenario, where it’s not just two sentences, but a collection of many documents. Envision a matrix of all the terms and documents in the collection.

 words: ace | cat | bag | brown | happy | water | ...
 doc 1:  1  |  1  |  1  |   1   |   0   |   0   | ...
 doc 2:  1  |  0  |  0  |   1   |   0   |   0   | ...
 doc 3:  1  |  0  |  1  |   0   |   0   |   0   | ...
 doc 4:  0  |  0  |  0  |   0   |   0   |   0   | ...
 doc 5:  1  |  1  |  1  |   1   |   0   |   0   | ...
 doc n...

Such a matrix would quickly get very large and it would also tend to be very sparse, meaning it would contain lots of zeros. When dealing with very large scale collections, holding such large matrices in memory and performing comparison calculations becomes a difficult problem.

Here, the various families of LSH techniques come into play. One of the earliest techniques to be used was MinHash [2]. MinHash happens to begin with a characteristic matrix much like the one in our example. Then, using some clever arrangements of hashing and probability, very compact document signatures can be generated. For example, imagine each document in a collection being represented by a sequence of only 1,024 bits or numbers. Formally, this is often called dimensionality reduction.
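To make that concrete, here is a toy MinHash sketch (Python). It is a simplified illustration, not the exact construction from [2]; real implementations choose the hash family and signature length more carefully.

import random

SIG_LEN = 128  # number of hash functions per signature (the text's example uses 1,024)

def minhash_signature(shingles, seed=42):
    """For each of SIG_LEN random hash functions, keep the minimum
    hash value seen over the document's shingle set."""
    prime = (1 << 61) - 1
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(SIG_LEN)]
    # Note: Python's built-in hash() is fine for a demo; real systems use stable hash functions.
    return [min((a * hash(s) + b) % prime for s in shingles) for a, b in params]

def estimated_similarity(sig1, sig2):
    """The fraction of matching positions approximates the Jaccard similarity."""
    return sum(1 for x, y in zip(sig1, sig2) if x == y) / float(len(sig1))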

Pausing here, imagine now a collection of 1 million document signatures. In the worst case, you might have to compare every doc signature to every other signature. Comparing all possible pairs would require roughly 1M^2 or 1 trillion comparisons. If each comparison takes 1 millisecond, we’d be done in only 31.6 years! This is the comparison problem.

Continuing on with MinHash, another clever round of hashing and algorithms allows for a much smaller pool of "candidate signatures" to be identified for comparison. With a smaller set of signatures, the comparison problem becomes more reasonable.

Note however, that with each round of dimensionality reduction and hashing, small bits of risk for false positives and/or false negatives creep into our results. This is the trade-off for using probabilistic techniques. By extension, this is why leveraging LSH for nearest neighbor searches leads to the name of Approximate Nearest Neighbors (ANN) and why various other uses result in approximate answers and not exact answers.

A substantial amount of detail in the MinHash algorithms has been skipped in this description, partially to make it accessible and partially because better references exist than I could provide here. If you’d like more detail, the early sections of Chapter 3 in the free book Mining of Massive Datasets are highly recommended.

To wrap up and generalize, several stages are starting to be exposed:

  1. How should a collection of documents be represented?
  2. How can that representation be made smaller and more compact?
  3. How can the comparison problem be improved by identifying smaller pools of candidate pairs?
  4. How can the remaining comparisons be performed quickly?

Let’s move on and extrapolate from this example to the larger field of LSH.

Broad View of LSH

By some measures, locality sensitive hashing has been around almost 20 years, beginning with Broder’s paper in 1997 [1]. Despite that, the field is still pretty young. LSH is found in a diverse set of academic research areas including digital signal processing, information retrieval, statistics, machine learning, and data mining. As such, searches of the literature reveal a fluid definition of the term LSH and related items.

In this section, I describe a common set of stages, found across the families of LSH implementations. Assuming a collection of documents, the rough sections are:

    (A) selecting document characteristics
    (B) dimensionality reduction and representation
    (C) distance measures
    (D) identifying candidate pairs
    (E) performing comparisons, ranking, clustering, etc.

Now, let’s look at each section in more detail. If you’re less interested in the details, this is a good place to jump ahead to the next section, "Expansion of the Field".

(A) selecting document characteristics

Selecting document characteristics might also be noted as "feature selection" in the lexicon of machine learning. There are numerous ways to select and represent document characteristics. This selection can change the definition of what it means to be "similar". It can also add or remove technical constraints to the system.

One common approach to selecting document characteristics can be to generate shingles from the characters or even bytes of a document. As an example, the word "foxtrot" could be 3-shingled as [‘fox’, ‘oxt’, ‘xtr’, ‘tro’, ‘rot’]. Another approach might be to tokenize text and select unigrams, bigrams, or other n-grams. A more advanced approach might be to identify the words in a document along with their term frequencies or even their tf-idf scores.
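Shingling itself is only a few lines; here is a sketch of the character 3-shingle example (Python):

def char_shingles(text, k=3):
    """Return the set of k-character shingles for a piece of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(sorted(char_shingles("foxtrot")))
# ['fox', 'oxt', 'rot', 'tro', 'xtr']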

When shingling or tokenization is used, the characteristics tend to be more representative of the lexical nature and basic structure of the text. Did these three characters exist in the document? Did these four words appear together, in order, in the document? This lends itself well to a type of similarity that is more literal. This is seen in the historic use of MinHash for de-duplication of web collections, or in application toward the identification of plagiarism, where common sentences might be copied and re-arranged.

Moving to term frequencies and especially tf-idf begins to create a more semantic representation of text. Later, this can lead to a similarity score which is more about document meaning and concepts. It can also build on the rich research around text statistics in collections.

Feature engineering is a particularly broad and also important part of the process. Some of these techniques can also be important in non-LSH domains and even paired with more traditional hashes and set operations [5, 7]. However, let’s continue on with our overview of the LSH scheme.

(B) dimensionality reduction and representation

Remember now that these document characteristics can grow in size very quickly. This can lead to the commonly described "curse of dimensionality". Enter the tools of lossy compression and dimensionality reduction.

Remember the matrix of words and documents in our early examples. Each list of numbers for a given document can be thought of as a vector or signature. They could be as simple as a bit vector like [1, 1, 0, …] or more complex vectors of real numbers or strings of characters. In dealing with vectors or other representations, measuring distance becomes an important tool. If two document vectors are close in distance, it can mean the documents are similar or have similar content.

There are numerous techniques to reduce the dimensionality of document vectors or more broadly reduce the amount of data. They include more basic techniques like sampling and quantization. They also extend into more complex techniques like MinHash [2], Simhash [10], and Random Projection / Random Indexing. These techniques can be surprisingly good at shrinking the data while preserving the distance or other relationships between documents.

(C) distance measures for different representations

As noted in the previous section, distance measures become important in this world of alternate representations of documents. There are Euclidean distance measures, cosine distance, and so on. On a basic two dimensional plot, this can be as simple as measuring the distance between two points. It can get more complex when working with vectors in higher dimensions. Alternatively, when working with strings or even just bits, other distance measures can be leveraged. Examples there include Levenshtein distance or Hamming distance.

Pausing a moment, let me say that while this section is in the middle, it can be really useful as a starting point. Hamming distance is conceptually simple, and measures the difference between strings of 1s and 0s. Here’s an example:

bit vector 1: 001001
bit vector 2: 001011

These two vectors have a Hamming distance of 1, because there is only one position in which the two differ.
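In code, that comparison reduces to an XOR and a bit count (Python):

def hamming(a, b):
    """Hamming distance between two bit vectors stored as integers."""
    return bin(a ^ b).count("1")

v1 = 0b001001
v2 = 0b001011
print(hamming(v1, v2))  # 1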

Additionally, representing anything as strings of bits allows for the use of bit oriented operations. Bits are easy to understand in the abstract, easy to work with, and there exist many very fast operations for their manipulation and comparison. So, it may be motivating to try and focus on the families of LSH techniques surrounding bit vectors. As an example, see the papers from Chappell et al. [3] on efficient top-k retrieval for similarity signatures of bit vectors.

(D) identifying candidate pairs

Now, no matter what characteristics are used and no matter how small each document signature becomes, there is still the matter of making numerous comparisons between signatures. So, various technical and statistical techniques have arisen for identifying candidate pairs. Finding these candidate pairs, which have a high likelihood of being similar, is very important. This can dramatically reduce the number of comparison operations that are required and make the whole idea of a similarity search feasible.
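One common way to find candidate pairs from MinHash signatures is "banding": split each signature into bands, hash each band, and treat any two documents that share a bucket in at least one band as candidates. A minimal sketch (Python), assuming signatures like those in the MinHash example above:

from collections import defaultdict

def candidate_pairs(signatures, bands=16):
    """signatures: dict mapping doc_id -> signature (a list of numbers).
    Returns the set of document pairs that collide in at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        ordered = sorted(docs)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs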

(E) performing comparisons, ranking, clustering, etc.

In some situations, you will want to make full comparisons of the candidate pairs, to confirm their similarity. Some candidates may be thrown out, or the candidates may be ranked by measure of similarity. Depending on the service, documents could be included, excluded, ranked, clustered, and so on.

Expansion of the Field

By now, heads can be swimming (mine is). The field expands quickly, with each section of the LSH sequence being very deep.

LSH begins as a straightforward intuition of similarity hashes and comparisons. However, simple techniques fail at large scales (100M+ docs). Tackling these scale hurdles with approximation and clever algorithms then gives rise to families of LSH techniques around Hamming distance, families around Euclidean distance, and so on.

In practice, the simple intuition of similarity hashing becomes a sequence of many tedious steps. It is not insurmountable though, and these techniques are already well established in industry and even available in open source packages.

Closing

Thank you for taking the time to read this writeup. While this document represents many hours of my own work, I recognize the extant and parallel work of others on this topic at Catalyst. Discussion, questions, and feedback on any details large or small are warmly welcomed.

Thanks to JBull for feedback on drafts of this document.

References

Here, you’ll find some of the references which informed this document, and other interesting resources.

[1] A. Broder. On the resemblance and containment of documents. Sequences 1997. Link
This is the original paper on LSH.

[2] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. "Min-wise independent permutations". ACM Symposium on Theory of Computing, pp. 327–336, 1998. Link
This is the original MinHash paper.

[3] T. Chappell, S. Geva, A. Nguyen, G. Zuccon. Efficient top-k retrieval with signatures. ADCS 2013. (r)
T. Chappell, S. Geva, G. Zuccon. Approximate Nearest-Neighbour Search with Inverted Signature Slice Lists. ECIR 2015. (r)
These two papers, essentially the same, describe accessible and interesting techniques for very fast comparisons of bit vectors. They address the comparison problem with a mix of inverted indexes and probability. Email me for copies.

[4] S. Geva, C. De Vries. TopSig: Topology Preserving Document Signatures. CIKM 2011. (r) Link
This paper was published prior to the Chappell papers [3], and addresses the generation of document signatures, extending random projection by introducing tf-idf scores. It’s a bit vague, but there’s a C code implementation.

[5] Redacted – internal source

[6] J. Leskovec, A. Rajaraman, and JD Ullman. 2014. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA. (pr) Link
This is a pretty good book from a Stanford trio, covering a broad array of large scale data mining topics. It is available for free online and also used in a data mining class at Coursera.

[7] C. Manning, P Raghavan, H Schütze. Introduction to Information Retrieval. 2008 Cambridge University Press. Link Link
This is a good, free book on Information Retrieval from Stanford.

[8] M. Slaney, M. Casey. Locality-Sensitive Hashing for Finding Nearest Neighbors. IEEE Signal Processing Magazine, March 2008.
This is a well respected description of one kind of Locality Sensitive Hashing.

[9] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014 (pr) Link
This is a very lengthy and thorough survey of Locality Sensitive Hashing and similarity search. It holds particular interest because it was completed so recently (2014).

[10] C. Sadowski, G. Levin. SimHash: Hash-based Similarity Detection.
G. Manku, A. Jain, A. Das Sarma. Detecting Near-Duplicates for Web Crawling. WWW 2007. Link
These are two of the papers on SimHash from people with Google ties.

Speeding Up Large File Transfer Over HTTP

The HTTP protocol is all around us. Can we work within HTTP/1.1 to speed up large file delivery?

What is going on in HTTP when a page is requested?

Fig. 1: Simplified TCP + HTTP overview of a typical request to view a homepage.

Could this be sped up?  In HTTP/1.1 we are given two options, pipelining and byte serving.  It is worth understanding each.  In Fig. 1, the operations are presented serially.  In reality that is mostly the way they are carried out.  If the server supports pipelining, requests can be sent without waiting for a response within the same TCP session.

Fig. 2: Pipelining in HTTP/1.1

In Fig. 2, the same requests have been made in rapid succession.  But as defined, the pipelining feature of HTTP/1.1 will return the resources in the order of the request.  This is in essence a FIFO system and can suffer from head-of-line issues.  If the request for logo.png in Fig. 2 results in a 200 MB file, that resource will be delivered before other resources can continue.  Asynchronous delivery and other improvements are scheduled for HTTP/2.0, but they are not part of HTTP/1.1 and with Google’s withdrawal of SPDY, there aren’t a lot of improvements within pipelining that are available in browsers.

Byte Serving is another feature of HTTP/1.1.  Some content delivered over HTTP can be read progressively (like HTML) while other content needs to be delivered completely before your computer can do anything with it (like a Microsoft Word *.doc file).  PDF files fit into the former category.  A PDF viewer that connects with your browser knows about Byte Serving.  That is how it appears that PDFs stream for some combinations of browsers and PDF viewers.  This is accomplished by the server supporting Ranges and the client making use of Range requests.

Fig. 3: A simplified version of Byte Serving through the use of Ranges in HTTP/1.1
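As a quick illustration of the mechanics, here is what a Range request looks like from a client (Python with the requests library; the URL is a placeholder):

import requests

url = "http://example.com/big-file.pdf"  # placeholder

head = requests.head(url)
print(head.headers.get("Accept-Ranges"))   # 'bytes' when the server supports ranges
print(head.headers.get("Content-Length"))  # total size of the resource

# Ask for only the first 64 KB of the file.
resp = requests.get(url, headers={"Range": "bytes=0-65535"})
print(resp.status_code)   # 206 Partial Content when the range is honored
print(len(resp.content))  # 65536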

If the resource can be used in a progressive manner, then byte serving can get chunks of a file and make them usable to the consumer.  Byte Serving can be combined with pipelining, but for reasons already discussed, there are only marginal gains through this approach.

Fig. 4: Combining Byte Serving and pipelining is possible but doesn’t make material gains in performance.

Let’s go back and look at that large file in a simple delivery.

Fig. 5: A simple delivery of a large file.

If you need the entire file and can’t do anything progressively, you will end up waiting for the entire payload to complete your request.  Pipelining won’t help much nor will Byte Serving since you still need the whole file to finish.  What if you could make multiple parallel requests of the server asking for portions of the file?  We call this approach PBS or Parallel Byte Serving.

Fig. 7: Agent 1

Fig. 8: Agent 2

The file

http://REDACTED/public/p/byte.serve.bin

The meta

HTTP/1.1 200 OK
Date: Mon, 01 Jun 2015 20:04:15 GMT
Server: Apache/2.2.3 (CentOS)
Last-Modified: Fri, 29 May 2015 14:12:18 GMT
ETag: 19e197-181a5ceb-517391052f480
Accept-Ranges: bytes
Content-Length: 404380907
Expires: Wed, 15 Apr 2020 20:00:00 GMT
Cache-Control: public
Connection: close
Content-Type: application/unknown

Time to retrieve using a simple download.

time curl -s http://REDACTED/public/p/byte.serve.bin -XGET --output full.pdf

real 22m2.913s
user 0m3.413s
sys 0m12.991s

By making use of the HTTP/1.1 HEAD call, we know the file is 404380907 bytes.  Now it’s simply a matter of configuring four distinct agents with their own TCP + HTTP session to the server to read four different ranges of the same file.  Here’s an example of one agent.

curl -s -H "Range: bytes=0-101095226" http://REDACTED/public/p/byte.serve.bin -XGET --output bin.part1

Three more agents are started up in rapid succession with differences in the Range Request Header and different output files.  This was combined into a  simple script.

time ./multi.sh
STILL WORKING
(Repeat)
DONE

real 7m2.332s
user 0m3.722s
sys 0m11.659s

From 22 minutes to seven minutes.  That is promising.  But this is a very naive setup.  It assumes there is no authentication, TLS, or other expensive operations within the HTTP call. To be useful, PBS would need to be tested against our own production equipment.  Inside of our platform, we’ve done a lot to optimize intra-machine communication, so delivering a large file faster would have to make big strides for us to want to change some of our software.
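The test above used a small shell script driving four curl agents. As an illustration of the same idea, here is a minimal Python sketch (the URL, part count, and lack of error handling are simplifying assumptions):

from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://REDACTED/public/p/byte.serve.bin"  # placeholder, as above
PARTS = 4

def fetch_part(part, start, end):
    """Fetch one byte range and write it to its own part file."""
    resp = requests.get(URL, headers={"Range": "bytes={}-{}".format(start, end)})
    with open("bin.part{}".format(part), "wb") as f:
        f.write(resp.content)

size = int(requests.head(URL).headers["Content-Length"])
chunk = size // PARTS
ranges = [(i, i * chunk, size - 1 if i == PARTS - 1 else (i + 1) * chunk - 1)
          for i in range(PARTS)]

with ThreadPoolExecutor(max_workers=PARTS) as pool:
    list(pool.map(lambda args: fetch_part(*args), ranges))

# Reassemble afterwards, e.g.: cat bin.part0 bin.part1 bin.part2 bin.part3 > whole.bin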

To test a production resource, the scripts repeatedly requested a 40 MB file, both as a single download and as four separate agents each asking for 25% of the file.  The PBS approach is faster, but not fast enough to take on the headaches of reassembling the parts and other changes in our agents.  Perhaps if files were much larger, like 100 MB or more, PBS would showcase its advantages more clearly.

Fig. 9: Comparing PBS and regular HTTP delivery of a 40 MB file

The graph shows our average simple delivery was .9964 seconds while PBS was .9381 seconds.  Where are the enormous gains of the large file delivery outlined above?  Well, this service is multiplexed behind a load balancer and handles authentication, TLS, and other pieces of code.  The overhead for each agent session eats away at the gains of Parallel Byte Serving for smaller files.

The Mathematics Behind Communication Explorer Arrows

Creating lines to represent one way communication.

D3 can draw a line between nodes using its force-directed layout. Let’s call the two endpoints (x1, y1) and (x2, y2). Now we need to determine the point (Px, Py) at which to plot the arrow. We can identify the coordinates using trigonometry.

Note: boundarySpace is the amount of additional space we want between the edge of the target circle and the end of the line, which is set to 8px (large enough to accommodate a marker).

[Diagram: one-way link line ending at (Px, Py), just outside the target circle]

renderLinks: function(lines, settings){
    var that = this;
    lines.attr("x1", function(d) { return d.source.x.toFixed(2); })
        .attr("y1", function(d) { return d.source.y.toFixed(2); })
        .attr("x2", function(d) { return that.getLinkEndPoint(d, settings)[0].toFixed(2); })
        .attr("y2", function(d) { return that.getLinkEndPoint(d, settings)[1].toFixed(2); });
},
getLinkEndPoint: function(d, settings){
    var theta = Math.atan2(d.target.y - d.source.y, d.target.x - d.source.x),
    Py = parseInt(d.target.y, 10) - (d.target.radius + settings.boundarySpace) * Math.sin(theta),
    Px = parseInt(d.target.x, 10) - (d.target.radius + settings.boundarySpace) * Math.cos(theta);
    return [Px, Py];
},

Creating arcs to represent two way communication.
D3 can also draw an arc between the center points of two nodes. Let’s call these points (x1, y1) and (x2, y2). Since D3 cannot draw an arrow on an arc for any possible target circle radius, we need to plot a line at the point (P1x, P1y) where the arc intersects the target circle and put the arrow on that line instead of the arc. The point (P1x, P1y) can be determined using the native SVG function getPointAtLength.

Next, we get another point along the arc at a very small distance away (PSx, PSy). Now that two points are known, the slope can be calculated and the line extended to point (P2x, P2y).

[Diagram: arc between two nodes, with the slope line and arrow drawn at the target circle]

plotSlopeLinesAndArrows: function(paths, settings){
    var that = this;
    paths.each(function(l, i) {
        var pathLength = this.getTotalLength() - (parseInt(l.target.radius, 10) + settings.boundarySpace);
        if(pathLength > 0){
            var p1 = this.getPointAtLength(pathLength),
            pS = this.getPointAtLength(pathLength - 5),
            p2 = that.getSlopeLineEndPoint(p1, pS),
            strokeW = that.calculateLinkWidth(l.quantity),
            path = "M" + p2.x.toFixed(2) + "," + p2.y.toFixed(2) + " L" + p1.x.toFixed(2) + "," + p1.y.toFixed(2);
            $("#slope" + i).attr("d", path).attr("marker-end", function(d) {
                var marker = null;
                if( l.source.state === "hover" || l.source.state === "clicked" )    {
                    marker = that.getMarker("A", strokeW, true);
                }else{
                    marker = that.getMarker("A", strokeW);
                }
                return "url(#" + marker + ")";
            });
            that.renderPhantomLink(l, i, p1);
         }
    });
},

getSlopeLineEndPoint: function(p1, p2){
     var slopeLineLength = 90,
         slope = (p1.y - p2.y) / (p1.x - p2.x),
         theta = Math.atan2(slope, 1),
         x2 = 0,
         y2 = 0;
     if(p1.x - p2.x > 0){
         y2 = p1.y - (slopeLineLength * Math.sin(theta));
         x2 = p1.x - (slopeLineLength * Math.cos(theta));
     }else{
         y2 = p1.y + (slopeLineLength * Math.sin(theta));
         x2 = p1.x + (slopeLineLength * Math.cos(theta));
     }
     return {'x': x2, 'y': y2};
},

Debug Mode:
All of this can be seen in action by looking at the GIF below. The red lines are the slope lines, which are drawn at the edge of the target circle when there is two way communication. A marker is placed on them to give the desired effect.

[Animated GIF: debug mode, with the red slope lines visible at the edges of the target circles]

Final Appearance:
By changing the CSS to hide the slope lines and the original arcs, the desired appearance can be created.

[Animated GIF: final appearance, with the slope lines and original arcs hidden]

Taking a closer look, we can see that we have achieved a clean look between the end of the arcs, their marker, and the target circle for all possible radii and stroke sizes.

[Screenshot: close-up of arcs, markers, and target circles]

Progressive Shrinking of Text

We needed a way to shrink the text of a document if we found it to be too large for our limits on indexing. To keep the methods consistent, we developed a quick web service in PHP called Shrinkray. Shrinkray receives a large text document and takes steps to make the text smaller by progressively removing information from the document.  This script is a one page web service.  It requires no additional code.

Actions for Shrinkray

  •  ax=lim – reports on the current limit being enforced by Shrinkray
  •  ax=sr – the main shrink ray method

Variables for Shrinkray

  • srpath = the path to a file you want to GET with shrinkray (the web server needs the RIGHT permissions to this path)

Typical Usage Patterns

curl --silent http://REDACTED/shrinkray/v1/?ax=lim

This will return the limit being enforced by Shrinkray.

curl --silent -v "http://REDACTED/shrinkray/v1/?ax=sr&srpath=/path/to/file"

This will have Shrinkray fetch the document specified by ‘srpath’ and shrink it.

Returns

Response code + payload. If the action is ‘lim’ the payload is the string representing the size limit. If the method is ‘sr’ and the status code is less than 399, the payload will be the shrunk document.

Response Codes for Shrinkray

  • HTTP/1.1 200 OK
  • HTTP/1.1 230 Shrunk via tags
  • HTTP/1.1 235 Shrunk via duplication
  • HTTP/1.1 239 Shrunk via punctuation
  • HTTP/1.1 240 Shrunk via number
  • HTTP/1.1 245 Shrunk via lowercase
  • HTTP/1.1 250 Shrunk via header
  • HTTP/1.1 400 Bad Request
  • HTTP/1.1 410 Gone
  • HTTP/1.1 412 Precondition Failed
  • HTTP/1.1 413 Request Entity Too Large
  • HTTP/1.1 500 Internal Server Error

Response Codes and Events

  • 200 when you ask for the limit over GET or POST
  • 230 when the HTML tags have been removed
  • 235 when the duplicate strings have been removed
  • 239 when punctuation (defined by unicode character class \p{Punctuation}) has been removed
  • 240 when numbers (defined by the unicode character class \p{Number}) have been removed
  • 245 when the words have been lower cased and the duplicates have been removed
  • NOTE: 230-245 are incremental, they include the previous steps
  • 250 when the text could not be shrunk and a chunk of the document smaller than the ax=lim limit has been returned to you in the payload
  • 400 when you use the ax=sr and do not provide srpath (GET)
  • 410 when you specify the srpath and the file does not exist
  • 412 when you provide a document that is smaller than ax=lim and no action is taken
  •  413 when all the shrinking processes have failed and nothing is done to your document
  • 500 when something unexpected happens

Using the Response Code

If your response is within 230 to 250, your document has been shrunk and the smaller document is in the payload. It is up to you to persist this document.
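For completeness, here is a minimal client sketch that follows this contract (Python; the host and file paths are placeholders):

import requests

BASE = "http://REDACTED/shrinkray/v1/"  # placeholder host

resp = requests.get(BASE, params={"ax": "sr", "srpath": "/path/to/file"})

if 230 <= resp.status_code <= 250:
    # The payload is the shrunk (or, for 250, truncated) document; persisting it is our job.
    with open("/path/to/file.shrunk", "wb") as f:
        f.write(resp.content)
elif resp.status_code == 412:
    print("Document already under the limit; nothing to do.")
else:
    print("Shrinkray returned", resp.status_code)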

You can download a redacted version of the PHP script here

Reducing network overhead in Redis calls with Lua

For one of our applications, we cache a variety of data in Redis for use in later calculations. When profiling the application, I noticed that one of the caching steps was taking much longer than any of the others. Looking at the profiler output quickly showed why:

         3863231 function calls in 266.866 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    50992  259.846    0.005  259.846    0.005 {method 'recv' of '_socket.socket' objects}
    50992    1.293    0.000    2.491    0.000 connection.py:593(pack_command)
    50994    0.802    0.000    0.802    0.000 {method 'sendall' of '_socket.socket' objects}
    50992    0.716    0.000  261.088    0.005 connection.py:315(read_response)
   836706    0.516    0.000    0.516    0.000 {isinstance}
    50992    0.450    0.000  266.530    0.005 client.py:558(execute_command)
   203968    0.392    0.000    0.717    0.000 connection.py:577(encode)
    50992    0.249    0.000  261.709    0.005 client.py:575(parse_response)
    50990    0.210    0.000    0.210    0.000 client.py:231(float_or_none)
    50992    0.199    0.000    0.199    0.000 {method 'feed' of 'hiredis.Reader' objects}
    50992    0.191    0.000    1.142    0.000 connection.py:529(send_packed_command)
    50992    0.171    0.000    3.804    0.000 connection.py:554(send_command)
    50992    0.164    0.000    0.311    0.000 connection.py:885(release)
   254960    0.163    0.000    0.163    0.000 {method 'join' of 'str' objects}
    50992    0.147    0.000    0.257    0.000 connection.py:868(get_connection)
   713889    0.142    0.000    0.142    0.000 {len}
    50992    0.138    0.000  261.250    0.005 connection.py:566(read_response)
   101984    0.124    0.000    0.151    0.000 connection.py:858(_checkpid)

Look at all that time recv’ing!

The problematic piece of application code looked something like this:

for k, v in some_things.iteritems():
    increment_by = 1
    redis_db.zincrby(ZSET, k, increment_by)

It just loops over a dictionary and for each key (k) increments its score by one within the sorted set it’s stored in. So, for dictionaries with many keys, the repeated calls to ZINCRBY really began to add up, as the profiler output indicated. Obviously, a quick way to reduce the amount of time spent doing this needless work would be to push it to the server side, and that’s where Lua comes in.

Since version 2.6.0, Redis has shipped with a built-in Lua interpreter. From what I can tell it was introduced partly as the maintainer’s way of deflecting new feature requests for obscure use cases (“Write your own Lua script to do that, I won’t add it to Redis.”) For more information, see the official documentation and a short introduction to writing Lua scripts for Redis.

Replacing the loop above with a Lua script was pretty straightforward. You can simply define the script as a string, register the script with the server, and then get to using it. As Lua script execution is blocking, it was necessary to add a way to break the items to be incremented into reasonable chunks, so as to avoid asking the server to increment hundreds of thousands of scores at once. I also tested out replacing the original loop with Redis’ pipeline functionality, but the Lua version won out in the end.

lua = """
for i, k in pairs(ARGV) do
  redis.call("ZINCRBY", KEYS[1], 1, k)
end"""
increment = redis_db.register_script(lua)
for chunk in generate_chunk_from_iterable(some_things.iterkeys(), CHUNK_SIZE):
    increment(keys=[ZSET], args=chunk)
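The chunking helper is not shown above; a minimal version might look like this (an assumption about its behavior, not the original code):

from itertools import islice

def generate_chunk_from_iterable(iterable, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        yield chunk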

The results from the profiler were pretty convincing compared to the previous method:

         1048325 function calls in 1.851 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5    0.643    0.129    0.643    0.129 {method 'recv' of '_socket.socket' objects}
      615    0.295    0.000    0.295    0.000 {method 'sendall' of '_socket.socket' objects}
        5    0.252    0.050    0.634    0.127 connection.py:593(pack_command)
   101999    0.131    0.000    0.272    0.000 connection.py:577(encode)
        1    0.125    0.125    0.125    0.125 {zlib.compress}
        1    0.112    0.112    0.112    0.112 {cPickle.dumps}
   347687    0.100    0.000    0.100    0.000 {isinstance}
   102004    0.064    0.000    0.064    0.000 {method 'join' of 'str' objects}
    35914    0.041    0.000    0.041    0.000 {method 'encode' of 'unicode' objects}
   305707    0.032    0.000    0.032    0.000 {len}
   102013    0.014    0.000    0.014    0.000 _compat.py:21()
    50990    0.011    0.000    0.011    0.000 {round}
        1    0.007    0.007    1.504    1.504 client.py:2636(__call__)
        5    0.005    0.001    1.591    0.318 client.py:558(execute_command)

So remember, when inserting or updating many values, avoid making a lot of round-trips, because the network is always slower than RAM.

logstash tuning on AWS

During a recent test where I put up 125+ AWS instances to do some work, I ran into an issue. All of the instances push their logs via logstash-forwarder to a load-balanced logstash cluster. Things were running fine, but then logging failed. It was nice to know that a logging failure doesn’t bring the instances to a halt, but the logs are being centrally collected for a reason.

After some digging I found that the issue was overrunning the memory of the logstash recipients. Basically, the logs were flooding in at 4,500-5,500/s, which exceeded what logstash could process. The events piled up and, boom, not enough memory:

Error: Your application used more memory than the safety cap of 4G.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

The logstash instances are running on c3.xlarge instance types and I decided to perform some tests by throwing 250,000 events at them to see how fast they would process. Basically it came in at about 1,800/s. That number seemed low and I started playing around with the logstash settings.

Since we are using the elasticsearch_http output from logstash I experimented with the number of workers (default 1) for that output plugin. 2 was the sweet spot and I managed to increase the throughput to around 2,100/s. Not a great improvement, but I figured more should be possible.

The c3.xlarge comes with 4 cores and 7.5GB of RAM, but when I was testing the load stayed very low at around 0.5. Clearly I wasn’t getting the full value.

Logstash can also adjust the number of filter workers via the -w flag. I figured the filter might just be where things are getting stuck and so I re-ran my tests with various combinations of filter workers and elasticsearch_http workers.

I’ll skip all the details, but for the c3.xlarge instance type I ended up reaching an ingestion rate of about 3,500/s, or nearly double the original. That rate was achieved by using:

  • filter workers = 8
  • elasticsearch_http workers = 4

Changing either of these up or down reduced the overall rate. It also pushed the load to around 3.7+.

I think a lot more experimentation could be done with different instance types and counts, but for now I’m pretty happy with the new throughput, which lets me run a lot fewer instances to get to the target rate of 30,000 events/s. For now I think I have a decent formula to drive any further tests:

  • filter workers = 2 * number of cores
  • elasticsearch_http workers = filter workers / 2

There is still a concern over load balancing the logstash service, which runs as a TCP service with persistent connections from the forwarders. That means that if the right number of forwarding instances all tied to the same endpoint start pushing a lot, we might still overrun the box. There are some good ideas around putting a Redis or RabbitMQ layer in between, but that’s an experiment for another day.

@matthias

why you should embrace a rabbitmq client library

Recently I had to re-run a lot of documents into one of our applications. The app lives on AWS and ingesting content involves the use of a RabbitMQ queue.

I’ve often used the amqp-tools/rabbitmq-c for quick ad-hoc work in the past, and so I wrote a very terse bash script to feed the list of documents to the queue. That script worked just fine, but I was in a hurry and I added quite a few queue clients to get the work done more quickly.

I stalled out in terms of rate and when I looked a bit more closely I found that my bash script wasn’t able to keep the queue fed sufficiently and my clients were going idle.

I also have some Ruby code using the bunny library and decided to re-write my feed script using that.

The results were startling.

Pushing 100,000 messages to the queue using the bash approach took about 28 minutes.

The Ruby version, using a RabbitMQ library with a persistent connection, did the same work in 35 seconds!

During a later run I pushed 1 million messages to RabbitMQ from a single client using the Ruby code.  That run took 6.5 minutes, for an effective rate of 2,500 messages per second.  The server is running on an r3.large, and with that push and all the clients reading from it, the load rose to only around 1.5. That is also a stark contrast to the bash version of the script, during which I would see the load rise to 4+.

I didn’t take the time to dig deeply into whether this was due to process spawning in the bash script or to overhead in connection setup/teardown with RabbitMQ. Given the load impact on the RabbitMQ server of the bash script (which ran on a different system), I’m confident that it’s not process spawning, but instead a lot of extra burden on RabbitMQ to deal with all those connection requests.
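The feed script here was Ruby using the bunny gem. As an illustration of the same pattern in Python (with the pika library; the queue name and input file are hypothetical), the key point is simply that one connection and channel are opened and reused for every message:

import pika

# Open the connection and channel once, then reuse them for every publish.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="documents", durable=True)

with open("document_list.txt") as f:
    for line in f:
        channel.basic_publish(exchange="", routing_key="documents", body=line.strip())

connection.close()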

In the end it just speaks to the practicality of using a client library the right way when things are going too slow interacting with RabbitMQ.

@matthias