
Creating custom rules for ESLint

Creating custom rules for ESLint is one of the more attractive ways of building continuity tests. It allows you to set up organization- or project-specific rules that are unique to your code.

Continuity Tests

The idea of testing is generally split between unit tests and integration tests, where unit tests exercise specific functions or modules and integration tests are higher-level, more abstract tests, often done by QA testers in a mostly manual process.

The third and often overlooked type of testing is continuity testing, also known as static code analysis.

ESLint

ESLint is a powerful linting utility for JavaScript that is highly configurable and extendable. Through the use of plugins you can write your own custom set of rules to use in your project.

Creating a custom rule set involves creating a new eslint-plugin-{name} project. There is a Yeoman generator that gets you a quick scaffold to get started.

We're going to set up a new project from scratch.

$ mkdir eslint-plugin-sounds && cd eslint-plugin-sounds
$ npm init

Create your project and accept all the defaults for now.

$ npm install eslint --save-dev

We install ESLint as a dev dependency; since the plugin itself is called by ESLint, we do not need to package ESLint with our plugin.

Create the following directories

$ mkdir lib
$ mkdir lib/rules
$ mkdir tests
$ mkdir tests/lib
$ mkdir tests/lib/rules

I’m not going to cover the rest of a normal GitHub/npm package setup (README.md and so on). If you are not familiar with creating projects and publishing to npm, I suggest you read GitHub's Getting Started and npm's Publishing Packages guides.

Let’s create our first rule

$ cd lib/rules
$ touch first-rule.js

"use strict";
module.exports = function(context) {
    return {
        // Rule methods - AST Node Type
    }
}
module.schema = [];

The module returns a simple object with methods as properties. Each property name is an AST node type, e.g. Identifier, CallExpression, MemberExpression.

First we need to know what part of the syntax tree we are going to lint.

Head over to ASTExplorer.net, put in your code snippet, and see the tree.

We want to write a rule that checks calls to this method to make sure an argument is passed and that the first argument is not null.

We see from the AST we are going to use the CallExpression node.

In that tree we have a MemberExpression, with the object (sounds) and property (get) inside the callee property. sounds.get() is the method being called.


return {
    "CallExpression": function(node) {
        // node = {CallExpression}
        var callee = node.callee;
        // callee.object.name === 'sounds'
        // callee.property.name === 'get'
    }
}

So now that we know we want to test this method, we need to check against the object and property names.

if(callee.object.name === "sounds" &&
    callee.property.name === "get") {
  // Now we only match the methods that we are looking for
}

Lastly we want to check to see if that method is being called without arguments or a null first argument.


if (!node.arguments[0] || node.arguments[0].value === null) {
    //Report the error
}

Let's put it all together.

"use strict";

module.exports = function(context) {
    return {
        "CallExpression": function(node) {
            var callee = node.callee;
            if(callee.object.name === "sounds" &&
                callee.property.name === "get" &&
                node.arguments[0] &&
                node.arguments[0].value ==== null) {
                    context.report(node, "Method sounds.get() called without argument or first argument is null");
            }
        }
    };
};

module.schema = [];

There’s your first rule. It matches the method signature sounds.get() and checks to see if the first argument is defined and not null.

To use your new rule, install the plugin into your project's node_modules, then add the plugin and rule to your ESLint config.

{
  "plugins": [
    "sounds"
  ],
  "rules": {
    "sounds/first-rule": 1
  }
}

Notice that in the plugins section we only put our short name (sounds), not the full module name (eslint-plugin-sounds), as ESLint automatically resolves eslint-plugin-{name} from it.

From here you should create tests that try out possible valid and invalid code snippets to make sure the rule behaves as expected. Check out this guide on how to write a test for your new rule.
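For illustration, a minimal tests/lib/rules/first-rule.js using ESLint's RuleTester might look something like this sketch. It assumes the rule and error message shown above; adjust the valid and invalid snippets to your own cases.

"use strict";

var rule = require("../../../lib/rules/first-rule");
var RuleTester = require("eslint").RuleTester;

var ruleTester = new RuleTester();

ruleTester.run("first-rule", rule, {
    valid: [
        // Calls that should not be flagged
        "sounds.get('click')",
        "other.get()"
    ],
    invalid: [
        {
            code: "sounds.get()",
            errors: [{ message: "Method sounds.get() called without argument or first argument is null" }]
        },
        {
            code: "sounds.get(null)",
            errors: 1
        }
    ]
});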

There are dozens upon dozens of plugins out there with more complex examples; see eslint-plugin-react for some really good ones.

Today you learned how to create your first ESLint plugin and custom rule and how to enable it in your lint config file.


Similarity Search and Hashing for Text Documents

Introduction

This is a high level overview of similarity hashing for text, locality sensitive hashing (LSH) in particular, and connections to application domains like approximate nearest neighbor (ANN) search. This writeup is the result of a literature search and part of a broader project to identify an implementation pattern for similarity search in large scale document collections (100M+).

The Continuum of Similarity

At Catalyst, we encounter ideas which involve the identification and measurement of document similarity. Consider a "more like this" feature, ranking documents by similarity to other documents, clustering documents, tagging near duplicates, email threading, and so on.

In the abstract, these are all issues of similarity, and similarity can be scored on a continuum between 0 and 1. Mildly similar documents might score a 0.5. Exact duplicates score at 1.0.

Is there a general implementation pattern that can be brought to bear?

Can we derive a similarity scoring engine and apply it to multiple current and future features?

How do our answers change as we face larger scale collections of 100M or 500M documents?

These are the drivers behind this research.

Traditional Hashing

In software and systems, hash functions have gained popularity as a means to deterministically map input data to a predefined range of "buckets". Here’s a very basic hash function, which will hash positive integers into 10 different buckets:

# What is the leftover amount when x is divided by 10?
h(x) = x mod 10
# Here are some examples:
 h(10) =  10 / 10 =  1 with 0 leftover = bucket 0
 h(18) =  18 / 10 =  1 with 8 leftover = bucket 8
 h(34) =  34 / 10 =  3 with 4 leftover = bucket 4
h(237) = 237 / 10 = 23 with 7 leftover = bucket 7

When faced with storing 10 million files, one might wish to spread them evenly across a small range of 1,000 directories (aka buckets). A sequence of hashing and heuristics could be used to assign each of the files to one of the directories.
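As a rough sketch of that idea in Python (the choice of MD5 and the zero-padded directory names are just illustrative assumptions):

import hashlib

def directory_for(filename, num_buckets=1000):
    """Deterministically map a filename to one of num_buckets directory names."""
    digest = hashlib.md5(filename.encode("utf8")).hexdigest()
    return "%03d" % (int(digest, 16) % num_buckets)

# The same filename always lands in the same directory, and a large set of
# filenames spreads roughly evenly across the 1,000 buckets.
print(directory_for("report-draft-v2.docx"))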

At the other extreme, when faced with identifying exact duplicates of data, one might define a much larger range, greater even than the number of items to be hashed. If two items hash to the same bucket, then you can be relatively assured that they are identical. This can be used in various ways to filter duplicate items. As a side note, the extension of this idea into probabilistic data structures is where the popular Bloom Filter is situated.

The main intuition behind this common kind of hashing is that it ensures any given set of inputs will generate hashes which are randomly and uniformly distributed across the defined range. Even the slightest change in input will generate a vastly different hash or signature.

In cryptographic hashing and signatures, the intuition is similar. Hashing of the binary contents of a file, using MD5 or SHA class hashes will generate a fixed length signature for that file. Even a single character change within a text file would then result in a dramatically different document signature.

Similarity Hashing

Similarity hashing arises from a contrasting idea to traditional hashing. The intuition for similarity hashing is to generate hashes or signatures which preserve the similarity and relationships between two items. Rather than a random and uniform distribution, the aim of similarity hashing is to cluster like items together.

Two documents which contain very similar content should result in very similar signatures when passed through a similarity hashing system. Similar content leads to similar hashes. Locality sensitive hashing (LSH) is a formal name for such a system, and a broad academic topic addressing related concerns.

An Example

Let’s walk through an example of one kind of similarity hashing. This example should help explore the concept. It will also begin to show that LSH is not a singular hash function but an assembly of many techniques involving the generation and comparison of document signatures.

First, imagine two sentences:

Sentence A: "The dog is brown"
Sentence B: "The dog is happy"

Then, imagine representing those two sentences in a matrix like so:

words: the | dog | is | brown | happy | sunshine
    A:  1  |  1  | 1  |   1   |   0   |    0
    B:  1  |  1  | 1  |   0   |   1   |    0

For each sentence and each column, a 1 indicates the word was present and a 0 indicates the word was not present. Note that the word "sunshine" is added to the matrix, and since neither sentence contains that word, both have a 0 in that column. It is added to the example to help illustrate how such columns are dropped later as part of the similarity measurement.

How similar are the two sentences?

How can the similarity be scored?

This can be formulated as a set problem, and the two ordered sets can be compared:

Sentence A = [1, 1, 1, 1, 0, 0]
Sentence B = [1, 1, 1, 0, 1, 0]

Both sentences have a bit in the first 3 columns (the intersection), and 5 columns have a bit in at least one of the sets (the union). So we’ll give these two sentences a similarity score of 3/5, or 0.6. The formal name for this is Jaccard Similarity.

Walking through these steps, we have hashed two sentences and then compared their signatures to generate a similarity score.
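In code, the same calculation is just a set intersection over a set union. Here is a minimal Python sketch; note that a word like "sunshine", present in neither sentence, simply never enters either set.

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / float(len(a | b))

sentence_a = "the dog is brown".split()
sentence_b = "the dog is happy".split()

print(jaccard(sentence_a, sentence_b))  # 3 shared words / 5 total words = 0.6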

On the Path to MinHash

Let’s move closer to a real world scenario, where it’s not just two sentences, but a collection of many documents. Envision a matrix of all the terms and documents in the collection.

 words: ace | cat | bag | brown | happy | water | ...
 doc 1:  1  |  1  |  1  |   1   |   0   |   0   | ...
 doc 2:  1  |  0  |  0  |   1   |   0   |   0   | ...
 doc 3:  1  |  0  |  1  |   0   |   0   |   0   | ...
 doc 4:  0  |  0  |  0  |   0   |   0   |   0   | ...
 doc 5:  1  |  1  |  1  |   1   |   0   |   0   | ...
 doc n...

Such a matrix would quickly get very large and it would also tend to be very sparse, meaning it would contain lots of zeros. When dealing with very large scale collections, holding such large matrices in memory and performing comparison calculations becomes a difficult problem.

Here, the various families of LSH techniques come into play. One of the earliest techniques to be used was MinHash [2]. MinHash happens to begin with a characteristic matrix much like the one in our example. Then, using some clever arrangements of hashing and probability, very compact document signatures can be generated. For example, imagine each document in a collection being represented by a signature of only 1,024 bits or numbers. Formally, this is often called dimensionality reduction.
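To make the idea concrete, here is a rough Python sketch of the core MinHash mechanics, using salted hashes in place of the random permutations described in the original paper [2]; the number of hash functions and the tokenization are assumptions.

import hashlib

def minhash_signature(tokens, num_hashes=128):
    """For each of num_hashes salted hash functions, keep the minimum
    hash value seen over all tokens in the document."""
    signature = []
    for i in range(num_hashes):
        salt = ("salt-%d-" % i).encode("utf8")
        signature.append(min(
            int(hashlib.md5(salt + token.encode("utf8")).hexdigest(), 16)
            for token in tokens))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of positions where two signatures agree approximates
    the Jaccard similarity of the underlying token sets."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / float(len(sig_a))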

Pausing here, imagine now a collection of 1 million document signatures. In the worst case, you might have to compare every doc signature to every other signature. Comparing all possible pairs would require roughly 1M^2 or 1 trillion comparisons. If each comparison takes 1 millisecond, we’d be done in only 31.6 years! This is the comparison problem.

Continuing on with MinHash, another clever round of hashing and algorithms allows for a much smaller pool of "candidate signatures" to be identified for comparison. With a smaller set of signatures, the comparison problem becomes more reasonable.

Note however, that with each round of dimensionality reduction and hashing, small bits of risk for false positives and/or false negatives creep in to our results. This is the trade off for using probabilistic techniques. By extension, this is why leveraging LSH for nearest neighbor searches leads to the name of Approximate Nearest Neighbors (ANN) and why various other uses result in approximate answers and not exact answers.

A substantial amount of detail in the MinHash algorithms has been skipped in this description, partially to make it accessible and partially because there are better references than I need to provide here. If you’d like more detail, the early sections of Chapter 3 in the free book Mining of Massive Datasets is highly recommended.

To wrap up and generalize, several stages are starting to be exposed:

  1. How should a collection of documents be represented?
  2. How can that representation be made smaller and more compact?
  3. How can the comparison problem be improved by identifying smaller pools of candidate pairs?
  4. How can the remaining comparisons be performed quickly?

Let’s move on and extrapolate from this example to the larger field of LSH.

Broad View of LSH

By some measures, locality sensitive hashing has been around almost 20 years, beginning with Broder’s paper in 1997 [1]. Despite that, the field is still pretty young. LSH is found in a diverse set of academic research areas including digital signal processing, information retrieval, statistics, machine learning, and data mining. As such, searches of the literature reveal a fluid definition of the term LSH and related items.

In this section, I describe a common set of stages, found across the families of LSH implementations. Assuming a collection of documents, the rough sections are:

    (A) selecting document characteristics
    (B) dimensionality reduction and representation
    (C) distance measures
    (D) identifying candidate pairs
    (E) performing comparisons, ranking, clustering, etc.

Now, let’s look at each section in more detail. If you’re less interested in the details, this is a good place to jump ahead to the section "Expansion of the Field".

(A) selecting document characteristics

Selecting document characteristics might also be noted as "feature selection" in the lexicon of machine learning. There are numerous ways to select and represent document characteristics. This selection can change the definition of what it means to be "similar". It can also add or remove technical constraints to the system.

One common approach to selecting document characteristics can be to generate shingles from the characters or even bytes of a document. As an example, the word "foxtrot" could be 3-shingled as [‘fox’, ‘oxt’, ‘xtr’, ‘tro’, ‘rot’]. Another approach might be to tokenize text and select unigrams, bigrams, or other n-grams. A more advanced approach might be to identify the words in a document along with their term frequencies or even their tf-idf scores.
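A quick Python sketch of character shingling, reproducing the "foxtrot" example above (tokenizing into word n-grams would look much the same):

def char_shingles(text, k=3):
    """All contiguous k-character shingles of a string."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

print(char_shingles("foxtrot"))  # ['fox', 'oxt', 'xtr', 'tro', 'rot']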

When shingling or tokenization is used, the characteristics tend to be more representative of the lexical nature and basic structure of the text. Did these three characters exist in the document? Did these four words appear together, in order, in the document? This lends itself well to a type of similarity that is more literal. This is seen in the historic use of MinHash for de-duplication of web collections, or in its application to identifying plagiarism, where common sentences might be copied and re-arranged.

Moving to term frequencies and especially tf-idf begins to create a more semantic representation of text. Later, this can lead to a similarity score which is more about document meaning and concepts. It can also build on the rich research around text statistics in collections.

Feature engineering is a particularly broad and also important part of the process. Some of these techniques can also be important in non-LSH domains and even paired with more traditional hashes and set operations [5, 7]. However, let’s continue on with our overview of the LSH scheme.

(B) dimensionality reduction and representation

Remember now that these document characteristics can grow in size very quickly. This leads to an instance of the commonly described "curse of dimensionality". Enter the tools of lossy compression and dimensionality reduction.

Remember the matrix of words and documents in our earlier example. Each list of numbers for a given document can be thought of as a vector or signature. They could be as simple as a bit vector like [1, 1, 0, …] or more complex vectors of real numbers or strings of characters. In dealing with vectors or other representations, measuring distance becomes an important tool. If two document vectors are close in distance, it can mean the documents are similar or have similar content.

There are numerous techniques to reduce the dimensionality of document vectors or more broadly reduce the amount of data. They include more basic techniques like sampling and quantization. They also extend in to more complex techniques like MinHash [2], Simhash [10], and Random Projection / Random Indexing. These techniques can be surprisingly good at shrinking the data while preserving the distance or other relationships between documents.

(C) distance measures for different representations

As noted in the previous section, distance measures become important in this world of alternate representations of documents. There are Euclidean distance, cosine distance, and so on. On a basic two-dimensional plot, this can be as simple as measuring the distance between two points. It gets more complex when working with vectors in higher dimensions. Alternatively, when working with strings or even just bits, other distance measures can be leveraged. Examples there include Levenshtein distance and Hamming distance.

Pausing a moment, let me say that while this section is in the middle, it can be really useful as a starting point. Hamming distance is conceptually simple, and measures the difference between strings of 1s and 0s. Here’s an example:

bit vector 1: 001001
bit vector 2: 001011

These two vectors have a Hamming distance of 1, because there is only one position in which the two differ.

Additionally, representing anything as strings of bits allows for the use of bit-oriented operations. Bits are easy to understand in the abstract, easy to work with, and there exist many very fast operations for their manipulation and comparison. So, it may be motivating to try and focus on the families of LSH techniques surrounding bit vectors. As an example, see the papers from Chappell et al. [3] on efficient top-k retrieval for similarity signatures of bit vectors.
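As a small Python sketch of why bit vectors are so convenient: XOR the two vectors and count the 1 bits that remain.

def hamming_distance(a, b):
    """Hamming distance between two equal-length bit vectors stored as integers:
    XOR leaves a 1 wherever the vectors differ, then count those 1s."""
    return bin(a ^ b).count("1")

bit_vector_1 = 0b001001
bit_vector_2 = 0b001011
print(hamming_distance(bit_vector_1, bit_vector_2))  # 1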

(D) identifying candidate pairs

Now, no matter what characteristics are used and no matter how small each document signature becomes, there is still the matter of making numerous comparisons between signatures. So, various technical and statistical techniques have arisen for identifying candidate pairs. Finding these candidate pairs, which have a high likelihood of being similar, is very important. This can dramatically reduce the number of comparison operations required and make the whole idea of a similarity search feasible.

(E) performing comparisons, ranking, clustering, etc.

In some situations, you will want to make full comparisons of the candidate pairs, to confirm their similarity. Some candidates may be thrown out, or the candidates may be ranked by measure of similarity. Depending on the service, documents could be included, excluded, ranked, clustered, and so on.

Expansion of the Field

By now, your head may be swimming (mine is). The field expands quickly, with each stage of the LSH sequence being very deep.

LSH begins as a straightforward intuition of similarity hashes and comparisons. However, simple techniques fail at large scales (100M+ docs). Tackling these scale hurdles with approximation and clever algorithms then gives rise to families of LSH techniques around Hamming distance, families around Euclidean distance, and so on.

In practice, the simple intuition of similarity hashing becomes a sequence of many tedious steps. It is not insurmountable though, and these techniques are already well established in industry and even available in open source packages.

Closing

Thank you for taking the time to read this writeup. While this document represents many hours of my own work, I recognize the extant and parallel work of others on this topic at Catalyst. Discussion, questions, and feedback on any details large or small are warmly welcomed.

Thanks to JBull for feedback on drafts of this document.

References

Here, you’ll find some of the references which informed this document, and other interesting resources.

[1] A. Broder. On the resemblance and containment of documents. Sequences 1997. Link
This is the original paper on LSH.

[2] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. "Min-wise independent permutations". ACM Symposium on Theory of Computing, pp. 327–336, 1998. Link
This is the original MinHash paper.

[3] T. Chappell, S. Geva, A. Nguyen, G. Zuccon. Efficient top-k retrieval with signatures. ADCS 2013. (r)
T. Chappell, S. Geva, G. Zuccon. Approximate Nearest-Neighbour Search with Inverted Signature Slice Lists. ECIR 2015. (r)
These two papers, essentially the same, describe accessible and interesting techniques for very fast comparisons of bit vectors. They address the comparison problem with a mix of inverted indexes and probability. Email me for copies.

[4] S. Geva, C. De Vries. TopSig: Topology Preserving Document Signatures. CIKM 2011. (r) Link
This paper was published prior to the Chappell papers [3], and addresses the generation of document signatures, extending random projection by introducing tf-idf scores. It’s a bit vague, but there’s a C code implementation.

[5] Redacted – internal source

[6] J. Leskovec, A. Rajaraman, and JD Ullman. 2014. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA. (pr) Link
This is a pretty good book from a Stanford trio, covering a broad array of large scale data mining topics. It is available for free online and also used in a data mining class at Coursera.

[7] C. Manning, P Raghavan, H Schütze. Introduction to Information Retrieval. 2008 Cambridge University Press. Link Link
This is a good, free book on Information Retrieval from Stanford.

[8] M. Slaney, M. Casey. Locality-Sensitive Hashing for Finding Nearest Neighbors. IEEE Signal Processing Magazine, March 2008.
This is a well respected description of one kind of Locality Sensitive Hashing.

[9] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014 (pr) Link
This is a very lengthy and thorough survey of Locality Sensitive Hashing and similarity search. It holds particular interest because it was completed so recently (2014).

[10] C. Sadowski, G. Levin. SimHash: Hash-based Similarity Detection.
G. Manku, A. Jain, A. Das Sarma. Detecting Near-Duplicates for Web Crawling. WWW 2007. Link
These are two of the papers on SimHash from people with Google ties.

Speeding Up Large File Transfer Over HTTP

The HTTP protocol is all around us. Can we work within HTTP/1.1 to speed up large file delivery?

What is going on in HTTP when a page is requested?

Fig. 1: Simplified TCP + HTTP overview of a typical request to view a homepage.

Could this be sped up?  In HTTP/1.1 we are given two options, pipelining and byte serving.  It is worth understanding each.  In Fig. 1, the operations are presented serially.  In reality that is mostly the way they are carried out.  If the server supports pipelining, requests can be sent without waiting for a response within the same TCP session.

Fig. 2: Pipelining in HTTP/1.1

In Fig. 2, the same requests have been made in rapid succession. But as defined, the pipelining feature of HTTP/1.1 will return the resources in the order of the request. This is in essence a FIFO system and can suffer from head-of-line blocking. If the request for logo.png in Fig. 2 results in a 200 MB file, that resource will be delivered before other resources can continue. Asynchronous delivery and other improvements are scheduled for HTTP/2.0, but they are not part of HTTP/1.1, and with Google’s withdrawal of SPDY there aren’t a lot of improvements within pipelining available in browsers.

Byte Serving is another feature of HTTP/1.1.  Some content delivered over HTTP can be read progressively (like HTML) while other content needs to be delivered completely before your computer can do anything with it (like a Microsoft Word *.doc file).  PDF files fit into the former category.  A PDF viewer that connects with your browser knows about Byte Serving.  That is how it appears that PDFs stream for some combinations of browsers and PDF viewers.  This is accomplished by the server supporting Ranges and the client making use of Range requests.
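As a rough sketch of what the exchange looks like on the wire (the path and lengths here mirror the example file used later in this post), the client asks for a byte range and the server answers with a 206 Partial Content:

GET /public/p/byte.serve.bin HTTP/1.1
Host: REDACTED
Range: bytes=0-1048575

HTTP/1.1 206 Partial Content
Content-Range: bytes 0-1048575/404380907
Accept-Ranges: bytes
Content-Length: 1048576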

Fig. 3: A simplified version of Byte Serving through the use of Ranges in HTTP/1.1

If the resource can be used in a progressive manner, then byte serving can get chunks of a file and make them usable to the consumer.  Byte Serving can be combined with pipelining, but for reasons already discussed, there are only marginal gains through this approach.

Fig. 4: Combining Byte Serving and pipelining is possible but doesn’t make material gains in performance.

Let’s go back and look at that large file in a simple delivery.

Fig. 5: A simple delivery of a large file.

If you need the entire file and can’t do anything progressively, you will end up waiting for the entire payload to complete your request.  Pipelining won’t help much nor will Byte Serving since you still need the whole file to finish.  What if you could make multiple parallel requests of the server asking for portions of the file?  We call this approach PBS or Parallel Byte Serving.

Fig. 7: Agent 1

Fig. 8: Agent 2

The file

http://REDACTED/public/p/byte.serve.bin

The meta

HTTP/1.1 200 OK
Date: Mon, 01 Jun 2015 20:04:15 GMT
Server: Apache/2.2.3 (CentOS)
Last-Modified: Fri, 29 May 2015 14:12:18 GMT
ETag: 19e197-181a5ceb-517391052f480
Accept-Ranges: bytes
Content-Length: 404380907
Expires: Wed, 15 Apr 2020 20:00:00 GMT
Cache-Control: public
Connection: close
Content-Type: application/unknown

Time to retrieve using a simple download.

time curl -s http://REDACTED/public/p/byte.serve.bin -XGET --output full.pdf

real 22m2.913s
user 0m3.413s
sys 0m12.991s

By making use of the HTTP/1.1 HEAD call, we know the file is 404380907 bytes.  Now it’s simply a matter of configuring four distinct agents with their own TCP + HTTP session to the server to read four different ranges of the same file.  Here’s an example of one agent.

curl -s -H "Range: bytes=0-101095226" http://REDACTED/public/p/byte.serve.bin -XGET --output bin.part1

Three more agents are started up in rapid succession with differences in the Range request header and different output files. This was combined into a simple script along the lines of the sketch below.
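A rough sketch of what multi.sh might look like, with the range boundaries derived from the 404,380,907-byte length above and the progress output omitted:

#!/bin/bash
# Fetch four byte ranges of the same file in parallel, then reassemble them.
URL="http://REDACTED/public/p/byte.serve.bin"

curl -s -H "Range: bytes=0-101095226"         "$URL" --output bin.part1 &
curl -s -H "Range: bytes=101095227-202190453" "$URL" --output bin.part2 &
curl -s -H "Range: bytes=202190454-303285680" "$URL" --output bin.part3 &
curl -s -H "Range: bytes=303285681-404380906" "$URL" --output bin.part4 &

# Wait for all four agents to finish, then stitch the parts back together.
wait
cat bin.part1 bin.part2 bin.part3 bin.part4 > byte.serve.bin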

time ./multi.sh
STILL WORKING
(Repeat)
DONE

real 7m2.332s
user 0m3.722s
sys 0m11.659s

From 22 minutes to seven minutes. That is promising. But this is a very naive setup. It assumes there is no authentication, TLS, or other expensive operations within the HTTP call. To be useful, PBS would need to be tested against our own production equipment. Inside our platform, we’ve done a lot to optimize intra-machine communication, so delivering a large file faster would have to make big strides for us to want to change some of our software.

To test a production resource, the scripts repeatedly requested a 40 MB file, both as a single file and as four separate agents each asking for 25% of the file. The PBS approach is faster, but not fast enough to take on the headaches of reassembling the parts and other changes in our agents. Perhaps if files were much larger, like 100 MB or more, PBS would showcase its advantages more clearly.

Fig. 9: Comparing PBS and regular HTTP delivery of a 40 MB file

The graph shows our average simple delivery was 0.9964 seconds while PBS was 0.9381 seconds. Where are the enormous gains of the large file delivery outlined above? Well, this service is multiplexed behind a load balancer and handles authentication, TLS, and other pieces of code. The overhead for each agent session eats away at the gains of Parallel Byte Serving for smaller files.

The Mathematics Behind Communication Explorer Arrows

Creating lines to represent one way communication.

D3 can draw a line between nodes using its force-directed layout. Let’s call these two points (x1, y1) and (x2, y2). Now we need to determine the point (Xp, Yp) to plot the arrow at. We can identify the coordinates using trigonometry.

Note: boundarySpace is the amount of additional space we want between the edge of the target circle and the end of the line, which is set to 8px (large enough to accommodate a marker).
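Spelled out, the trigonometry the code below implements is:

theta = atan2(y2 - y1, x2 - x1)
Xp = x2 - (radius + boundarySpace) * cos(theta)
Yp = y2 - (radius + boundarySpace) * sin(theta)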

[Diagram: lineDiagram]

renderLinks: function(lines, settings){
    var that = this;
    lines.attr("x1", function(d) { return d.source.x.toFixed(2); })
        .attr("y1", function(d) { return d.source.y.toFixed(2); })
        .attr("x2", function(d) { return that.getLinkEndPoint(d, settings)[0].toFixed(2); })
        .attr("y2", function(d) { return that.getLinkEndPoint(d, settings)[1].toFixed(2); });
},
getLinkEndPoint: function(d, settings){
    var theta = Math.atan2(d.target.y - d.source.y, d.target.x - d.source.x),
    Py = parseInt(d.target.y, 10) - (d.target.radius + settings.boundarySpace) * Math.sin(theta),
    Px = parseInt(d.target.x, 10) - (d.target.radius + settings.boundarySpace) * Math.cos(theta);
    return [Px, Py]; // x first, then y, to match the x2/y2 attributes above
},

Creating arcs to represent two way communication.
D3 can also draw an arc between the center points of two nodes. Let's call these points (x1, y1) and (x2, y2). Since D3 cannot draw an arrow on an arc for any possible target circle radius, we need to plot a line at the point (P1x, P1y) where the arc intersects the target circle and put the arrow on that line instead of the arc. The point (P1x, P1y) can be determined using the native SVG function getPointAtLength.

Next, we get another point along the arc at a very small distance away (PSx, PSy). Now that two points are known, the slope can be calculated and the line extended to point (P2x, P2y).

[Diagram: arcDiagram]

plotSlopeLinesAndArrows: function(paths, settings){
    var that = this;
    paths.each(function(l, i) {
        var pathLength = this.getTotalLength() - (parseInt(l.target.radius, 10) + settings.boundarySpace);
        if(pathLength > 0){
            var p1 = this.getPointAtLength(pathLength),
            pS = this.getPointAtLength(pathLength - 5),
            p2 = that.getSlopeLineEndPoint(p1, pS),
            strokeW = that.calculateLinkWidth(l.quantity),
            path = "M" + endPoint.x.toFixed(2) + "," + endPoint.y.toFixed(2) + " L" + point1.x.toFixed(2) +","+ point1.y.toFixed(2);
            $("#slope" + i).attr("d", path).attr("marker-end", function(d) {
                var marker = null;
                if( l.source.state === "hover" || l.source.state === "clicked" )    {
                    marker = that.getMarker("A", strokeW, true);
                }else{
                    marker = that.getMarker("A", strokeW);
                }
                return "url(#" + marker + ")";
            });
            that.renderPhantomLink(l, i, p1);
         }
    });
},

getSlopeLineEndPoint: function(p1, p2){
     var slopeLineLength = 90,
         slope = (p1.y - p2.y) / (p1.x - p2.x),
         theta = Math.atan2(slope, 1),
         x2 = 0,
         y2 = 0;
     if(p1.x - p2.x > 0){
         y2 = p1.y - (slopeLineLength * Math.sin(theta));
         x2 = p1.x - (slopeLineLength * Math.cos(theta));
     }else{
         y2 = p1.y + (slopeLineLength * Math.sin(theta));
         x2 = p1.x + (slopeLineLength * Math.cos(theta));
     }
     return {'x': x2, 'y': y2};
},

Debug Mode:
All of this can be seen in action by looking at the GIF below. The red lines are the slope lines, which are drawn at the edge of the target circle when there is two way communication. A marker is placed on them to give the desired effect.

[Animated GIF: commExplorerDev]

Final Appearance:
By changing the CSS to hide the slope lines and the original arcs, the desired appearance can be created.

[Image: commExplorerNormal]

Taking a closer look, we can see that we have achieved a clean look between the end of the arcs, their markers, and the target circle for all possible radii and stroke sizes.

[Image: commExplorerNormalPic]

Reducing network overhead in Redis calls with Lua

For one of our applications, we cache a variety of data in Redis for use in later calculations. When profiling the application, I noticed that one of the caching steps was taking much longer than any of the others. Looking at the profiler output quickly showed why:

         3863231 function calls in 266.866 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    50992  259.846    0.005  259.846    0.005 {method 'recv' of '_socket.socket' objects}
    50992    1.293    0.000    2.491    0.000 connection.py:593(pack_command)
    50994    0.802    0.000    0.802    0.000 {method 'sendall' of '_socket.socket' objects}
    50992    0.716    0.000  261.088    0.005 connection.py:315(read_response)
   836706    0.516    0.000    0.516    0.000 {isinstance}
    50992    0.450    0.000  266.530    0.005 client.py:558(execute_command)
   203968    0.392    0.000    0.717    0.000 connection.py:577(encode)
    50992    0.249    0.000  261.709    0.005 client.py:575(parse_response)
    50990    0.210    0.000    0.210    0.000 client.py:231(float_or_none)
    50992    0.199    0.000    0.199    0.000 {method 'feed' of 'hiredis.Reader' objects}
    50992    0.191    0.000    1.142    0.000 connection.py:529(send_packed_command)
    50992    0.171    0.000    3.804    0.000 connection.py:554(send_command)
    50992    0.164    0.000    0.311    0.000 connection.py:885(release)
   254960    0.163    0.000    0.163    0.000 {method 'join' of 'str' objects}
    50992    0.147    0.000    0.257    0.000 connection.py:868(get_connection)
   713889    0.142    0.000    0.142    0.000 {len}
    50992    0.138    0.000  261.250    0.005 connection.py:566(read_response)
   101984    0.124    0.000    0.151    0.000 connection.py:858(_checkpid)

Look at all that time recv’ing!

The problematic piece of application code looked something like this:

for k, v in some_things.iteritems():
    increment_by = 1
    redis_db.zincrby(ZSET, k, increment_by)

It just loops over a dictionary and, for each key (k), increments its score by one within the sorted set it’s stored in. So, for dictionaries with many keys, the repeated calls to ZINCRBY really began to add up, as the profiler output indicated. Obviously a quick way to reduce the amount of time spent doing this needless work would be to push it to the server side, and that’s where Lua comes in.

Since version 2.6.0, Redis has shipped with a built-in Lua interpreter. From what I can tell it was introduced partly as the maintainer’s way of deflecting new feature requests for obscure use cases (“Write your own Lua script to do that, I won’t add it to Redis.”) For more information, see the official documentation and a short introduction to writing Lua scripts for Redis.

Replacing the loop above with a Lua script was pretty straightforward. You can simply define the script as a string, register the script with the server, and then get to using it. As Lua script execution is blocking, it was necessary to add a way to break the items to be incremented into reasonable chunks, so as to avoid asking the server to increment hundreds of thousands of scores at once. I also tested out replacing the original loop with Redis’ pipeline functionality, but the Lua version won out in the end.

lua = """
for i, k in pairs(ARGV) do
  redis.call("ZINCRBY", KEYS[1], 1, k)
end"""
increment = redis_db.register_script(lua)
for chunk in generate_chunk_from_iterable(some_things.iterkeys(), CHUNK_SIZE):
    increment(keys=[ZSET], args=chunk)
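The chunking helper isn't shown above; one minimal way generate_chunk_from_iterable could be written is sketched below, with the chunk size left to whatever keeps a single script invocation reasonably small.

from itertools import islice

def generate_chunk_from_iterable(iterable, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        yield chunk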

The results from the profiler were pretty convincing compared to the previous method:

         1048325 function calls in 1.851 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5    0.643    0.129    0.643    0.129 {method 'recv' of '_socket.socket' objects}
      615    0.295    0.000    0.295    0.000 {method 'sendall' of '_socket.socket' objects}
        5    0.252    0.050    0.634    0.127 connection.py:593(pack_command)
   101999    0.131    0.000    0.272    0.000 connection.py:577(encode)
        1    0.125    0.125    0.125    0.125 {zlib.compress}
        1    0.112    0.112    0.112    0.112 {cPickle.dumps}
   347687    0.100    0.000    0.100    0.000 {isinstance}
   102004    0.064    0.000    0.064    0.000 {method 'join' of 'str' objects}
    35914    0.041    0.000    0.041    0.000 {method 'encode' of 'unicode' objects}
   305707    0.032    0.000    0.032    0.000 {len}
   102013    0.014    0.000    0.014    0.000 _compat.py:21()
    50990    0.011    0.000    0.011    0.000 {round}
        1    0.007    0.007    1.504    1.504 client.py:2636(__call__)
        5    0.005    0.001    1.591    0.318 client.py:558(execute_command)

So remember, when inserting or updating many values, avoid making a lot of round-trips, because the network is always slower than RAM.