Similarity Search and Hashing for Text Documents


This is a high level overview of similarity hashing for text, locality sensitive hashing (LSH) in particular, and connections to application domains like approximate nearest neighbor (ANN) search. This writeup is the result of a literature search and part of a broader project to identify an implementation pattern for similarity search in large scale document collections (100M+).

The Continuum of Similarity

At Catalyst, we encounter ideas which involve the identification and measurement of document similarity. Consider a "more like this" feature, ranking documents by similarity to other documents, clustering documents, tagging near duplicates, email threading, and so on.

In the abstract, these are all issues of similarity, and similarity can be scored on a continuum between 0 and 1. Mildly similar documents might score a 0.5. Exact duplicates score at 1.0.

Is there a general implementation pattern that can be brought to bear?

Can we derive a similarity scoring engine and apply it to multiple current and future features?

How do our answers change as we face larger scale collections of 100M or 500M documents?

These are the drivers behind this research.

Traditional Hashing

In software and systems, hash functions have gained popularity as a means to deterministically map input data to a predefined range of "buckets". Here’s a very basic hash function, which hashes positive integers into 10 different buckets:

# What is the leftover amount when x is divided by 10?
h(x) = x mod 10
# Here are some examples:
 h(10) =  10 / 10 =  1 with 0 leftover = bucket 0
 h(18) =  18 / 10 =  1 with 8 leftover = bucket 8
 h(34) =  34 / 10 =  3 with 4 leftover = bucket 4
h(237) = 237 / 10 = 23 with 7 leftover = bucket 7
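This toy hash is a single line of real code. A quick Python sketch, mirroring the arithmetic above:

```python
def h(x):
    """Toy hash function: map a positive integer to one of 10 buckets."""
    return x % 10

# The examples above, reproduced:
for x in (10, 18, 34, 237):
    print("h(%d) = bucket %d" % (x, h(x)))
```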

When faced with storing 10 million files, one might wish to spread them evenly across a small range of 1,000 directories (aka buckets). A sequence of hashing and heuristics could be used to assign each of the files to one of the directories.

At the other extreme, when faced with identifying exact duplicates of data, one might define a much larger range, greater even than the number of items to be hashed. If two items hash to the same bucket, then you can be relatively assured that they are identical. This can be used in various ways to filter duplicate items. As a side note, the extension of this idea into probabilistic data structures is where the popular Bloom filter is situated.

The main intuition behind this common kind of hashing is that any given set of inputs will generate hashes which are randomly and uniformly distributed across the defined range. Even the slightest change in input will generate a vastly different hash or signature.

In cryptographic hashing and signatures, the intuition is similar. Hashing of the binary contents of a file, using MD5 or SHA class hashes will generate a fixed length signature for that file. Even a single character change within a text file would then result in a dramatically different document signature.

Similarity Hashing

Similarity hashing arises from a contrasting idea to traditional hashing. The intuition for similarity hashing is to generate hashes or signatures which preserve the similarity and relationships between two items. Rather than a random and uniform distribution, the aim of similarity hashing is to cluster like items together.

Two documents which contain very similar content should result in very similar signatures when passed through a similarity hashing system. Similar content leads to similar hashes. Locality sensitive hashing (LSH) is a formal name for such a system, and a broad academic topic addressing related concerns.

An Example

Let’s walk through an example of one kind of similarity hashing. This example should help explore the concept. It will also begin to show that LSH is not a singular hash function but an assembly of many techniques involving the generation and comparison of document signatures.

First, imagine two sentences:

Sentence A: "The dog is brown"
Sentence B: "The dog is happy"

Then, imagine representing those two sentences in a matrix like so:

words: the | dog | is | brown | happy | sunshine
    A:  1  |  1  | 1  |   1   |   0   |    0
    B:  1  |  1  | 1  |   0   |   1   |    0

For each sentence and each column, a 1 indicates a word was present and a 0 indicates the word was not present. Note that the word "sunshine" is added to the matrix, and since neither sentence contains that word, both have a 0 in that column. It is added to the example to help illustrate how such empty columns are dropped later as part of the similarity measurement.

How similar are the two sentences?

How can the similarity be scored?

This can be formulated as a set problem, and the two ordered sets can be compared:

Sentence A = [1, 1, 1, 1, 0, 0]
Sentence B = [1, 1, 1, 0, 1, 0]

Both sentences have a 1 in the first three columns, and five columns have a 1 in at least one of the two sets. So we’ll give these two sentences a similarity score of 3/5, or 0.6. The formal name for this measure is Jaccard similarity.

Walking through these steps, we have hashed two sentences and then compared their signatures to generate a similarity score.
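The same comparison is easy to express in code. A minimal sketch, using the two ordered sets above:

```python
def jaccard(a, b):
    """Jaccard similarity of two equal-length bit vectors:
    columns where both have a 1, divided by columns where
    at least one has a 1 (all-zero columns are ignored)."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either

sentence_a = [1, 1, 1, 1, 0, 0]
sentence_b = [1, 1, 1, 0, 1, 0]
print(jaccard(sentence_a, sentence_b))  # → 0.6
```

Note that the "sunshine" column, with a 0 in both vectors, never counts toward either total, which is why such columns drop out of the score.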

On the Path to MinHash

Let’s move closer to a real world scenario, where it’s not just two sentences, but a collection of many documents. Envision a matrix of all the terms and documents in the collection.

 words: ace | cat | bag | brown | happy | water | ...
 doc 1:  1  |  1  |  1  |   1   |   0   |   0   | ...
 doc 2:  1  |  0  |  0  |   1   |   0   |   0   | ...
 doc 3:  1  |  0  |  1  |   0   |   0   |   0   | ...
 doc 4:  0  |  0  |  0  |   0   |   0   |   0   | ...
 doc 5:  1  |  1  |  1  |   1   |   0   |   0   | ...
 doc n...

Such a matrix would quickly get very large and it would also tend to be very sparse, meaning it would contain lots of zeros. When dealing with very large scale collections, holding such large matrices in memory and performing comparison calculations becomes a difficult problem.

Here, the various families of LSH techniques come into play. One of the earliest techniques to be used was MinHash [2]. MinHash happens to begin with a characteristic matrix much like the one in our example. Then, using some clever arrangements of hashing and probability, very compact document signatures can be generated. For example, imagine each document in a collection being represented by a signature of only 1024 bits or a short sequence of numbers. Formally, this is often called dimensionality reduction.

Pausing here, imagine now a collection of 1 million document signatures. In the worst case, you might have to compare every doc signature to every other signature. Comparing all possible pairs would require roughly 1M^2 or 1 trillion comparisons. If each comparison takes 1 millisecond, we’d be done in only 31.6 years! This is the comparison problem.

Continuing on with MinHash, another clever round of hashing and algorithms allows for a much smaller pool of "candidate signatures" to be identified for comparison. With a smaller set of signatures, the comparison problem becomes more reasonable.

Note, however, that with each round of dimensionality reduction and hashing, small risks of false positives and/or false negatives creep into our results. This is the trade-off for using probabilistic techniques. By extension, this is why leveraging LSH for nearest neighbor search leads to the name Approximate Nearest Neighbors (ANN), and why various other uses result in approximate answers rather than exact ones.

A substantial amount of detail in the MinHash algorithms has been skipped in this description, partially to keep it accessible and partially because better references exist than I could provide here. If you’d like more detail, the early sections of Chapter 3 of the free book Mining of Massive Datasets [6] are highly recommended.
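To make the signature idea concrete, here is a toy sketch (not the full algorithm from [2]). It stands in for a family of independent hash functions by XOR-ing Python’s built-in hash with random 64-bit masks; all names and parameters here are illustrative:

```python
import random

def minhash_signature(words, num_hashes=128, seed=42):
    """For each simulated hash function, keep the minimum hash value
    seen across the document's words. Similar word sets tend to share
    minimum values, so the fraction of matching signature positions
    approximates the Jaccard similarity of the word sets."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash(w) ^ mask for w in words) for mask in masks]

def estimated_similarity(sig_a, sig_b):
    """Fraction of signature positions on which two signatures agree."""
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)

doc1 = set("the quick brown dog runs in the park".split())
doc2 = set("the quick brown dog sleeps in the park".split())
sig1 = minhash_signature(doc1)
sig2 = minhash_signature(doc2)
# True Jaccard here is 6/8 = 0.75; the estimate should land near it
print(estimated_similarity(sig1, sig2))
```

Whatever the document size, each signature is a fixed 128 numbers, which is the dimensionality reduction at work.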

To wrap up and generalize, several stages are starting to be exposed:

  1. How should a collection of documents be represented?
  2. How can that representation be made smaller and more compact?
  3. How can the comparison problem be improved by identifying smaller pools of candidate pairs?
  4. How can the remaining comparisons be performed quickly?

Let’s move on and extrapolate from this example to the larger field of LSH.

Broad View of LSH

By some measures, locality sensitive hashing has been around for almost 20 years, beginning with Broder’s paper in 1997 [1]. Despite that, the field is still fairly young. LSH appears in a diverse set of academic research areas, including digital signal processing, information retrieval, statistics, machine learning, and data mining. As such, searches of the literature reveal a fluid definition of the term LSH and related terms.

In this section, I describe a common set of stages, found across the families of LSH implementations. Assuming a collection of documents, the rough sections are:

    1. selecting document characteristics
    2. dimensionality reduction and representation
    3. distance measures
    4. identifying candidate pairs
    5. performing comparisons, ranking, clustering, etc.

Now, let’s look at each stage in more detail. If you’re less interested in the details, this is a good place to jump ahead to the section "Expansion of the Field".

(A) selecting document characteristics

Selecting document characteristics might also be noted as "feature selection" in the lexicon of machine learning. There are numerous ways to select and represent document characteristics. This selection can change the definition of what it means to be "similar". It can also add or remove technical constraints to the system.

One common approach to selecting document characteristics can be to generate shingles from the characters or even bytes of a document. As an example, the word "foxtrot" could be 3-shingled as [‘fox’, ‘oxt’, ‘xtr’, ‘tro’, ‘rot’]. Another approach might be to tokenize text and select unigrams, bigrams, or other n-grams. A more advanced approach might be to identify the words in a document along with their term frequencies or even their tf-idf scores.
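To illustrate the first two of those approaches, here are minimal sketches of character shingling and word n-grams (function names are mine):

```python
def char_shingles(text, k=3):
    """All overlapping k-character shingles of a string."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def word_ngrams(text, n=2):
    """All overlapping n-word grams of a whitespace-tokenized string."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_shingles("foxtrot"))         # → ['fox', 'oxt', 'xtr', 'tro', 'rot']
print(word_ngrams("the dog is brown"))  # → ['the dog', 'dog is', 'is brown']
```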

When shingling or tokenization is used, the characteristics tend to be more representative of the lexical nature and basic structure of the text. Did these three characters exist in the document? Did these four words appear together, in order, in the document? This lends itself well to a type of similarity that is more literal. This is seen in the historic use of MinHash for de-duplication of web collections, or in its application to the identification of plagiarism, where common sentences might be copied and re-arranged.

Moving to term frequencies and especially tf-idf begins to create a more semantic representation of text. Later, this can lead to a similarity score which is more about document meaning and concepts. It can also build on the rich research around text statistics in collections.

Feature engineering is a particularly broad and also important part of the process. Some of these techniques can also be important in non-LSH domains and even paired with more traditional hashes and set operations [5, 7]. However, let’s continue on with our overview of the LSH scheme.

(B) dimensionality reduction and representation

Remember now that these document characteristics can grow in size very quickly. This leads to an instance of the commonly described "curse of dimensionality". Enter the tools of lossy compression and dimensionality reduction.

Recall the matrix of words and documents from the earlier examples. Each list of numbers for a given document can be thought of as a vector or signature. They could be as simple as a bit vector like [1, 1, 0, …] or more complex vectors of real numbers or strings of characters. In dealing with vectors or other representations, measuring distance becomes an important tool. If two document vectors are close in distance, it can mean the documents are similar or have similar content.

There are numerous techniques to reduce the dimensionality of document vectors or more broadly reduce the amount of data. They include more basic techniques like sampling and quantization. They also extend in to more complex techniques like MinHash [2], Simhash [10], and Random Projection / Random Indexing. These techniques can be surprisingly good at shrinking the data while preserving the distance or other relationships between documents.
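As a small illustration of one such technique, here is a sketch of random projection to sign bits, the idea behind SimHash-style signatures [10]. The dimensions, seed, and names are illustrative:

```python
import random

def projection_signature(vector, num_bits=16, seed=7):
    """Reduce a vector to num_bits sign bits. Each bit records which
    side of a random hyperplane the vector falls on; vectors that are
    close in angle tend to fall on the same side of most hyperplanes,
    so similar documents get signatures agreeing in most positions."""
    rng = random.Random(seed)
    bits = []
    for _ in range(num_bits):
        plane = [rng.gauss(0, 1) for _ in vector]
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append(1 if dot >= 0 else 0)
    return bits

doc_a = [1, 1, 1, 1, 0, 0]  # "The dog is brown"
doc_b = [1, 1, 1, 0, 1, 0]  # "The dog is happy"
sig_a = projection_signature(doc_a)
sig_b = projection_signature(doc_b)
matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
print("%d of %d bits match" % (matches, len(sig_a)))
```

Here, six-dimensional vectors shrink to 16 bits each; in practice the input vectors have many thousands of dimensions and the savings are dramatic.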

(C) distance measures for different representations

As noted in the previous section, distance measures become important in this world of alternate representations of documents. There are Euclidean distance measures, cosine distance, and so on. On a basic two-dimensional plot, this can be as simple as measuring the distance between two points. It gets more complex when working with vectors in higher dimensions. Alternatively, when working with strings or even just bits, other distance measures can be leveraged. Examples include Levenshtein distance and Hamming distance.

Pausing a moment: while this section sits in the middle, it can be a really useful starting point. Hamming distance is conceptually simple, measuring the number of positions at which two strings of 1s and 0s differ. Here’s an example:

bit vector 1: 001001
bit vector 2: 001011

These two vectors have a Hamming distance of 1, because there is only one position in which the two differ.
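A sketch of this measure, both on bit strings and on integers via XOR and popcount:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def hamming_distance_int(x, y):
    """Same measure on integers: XOR leaves a 1 wherever the
    operands differ, then we count the 1 bits (popcount)."""
    return bin(x ^ y).count("1")

print(hamming_distance("001001", "001011"))      # → 1
print(hamming_distance_int(0b001001, 0b001011))  # → 1
```

The integer form hints at why bit vectors are so attractive: XOR and popcount are extremely cheap operations on modern hardware.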

Additionally, representing anything as strings of bits allows for the use of bit-oriented operations. Bits are easy to understand in the abstract, easy to work with, and there exist many very fast operations for their manipulation and comparison. So, it may be attractive to focus on the families of LSH techniques surrounding bit vectors. As an example, see the papers from Chappell et al. [3] on efficient top-k retrieval for similarity signatures of bit vectors.

(D) identifying candidate pairs

Now, no matter what characteristics are used and no matter how small each document signature becomes, there is still the matter of making numerous comparisons between signatures. So, various technical and statistical techniques have arisen for identifying candidate pairs. Finding these candidate pairs, which have a high likelihood of being similar, is very important: it can dramatically reduce the number of comparison operations required and make the whole idea of similarity search feasible.
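One widely described way to do this is the "banding" technique covered in Mining of Massive Datasets [6]: split each signature into bands, hash each band into buckets, and treat any two documents that collide in at least one band as a candidate pair. A minimal sketch, assuming short integer signatures (the band/row counts and document IDs are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, bands=4, rows=2):
    """signatures: dict of doc_id -> list of ints, len == bands * rows.
    Documents that agree on every row of at least one band share a
    bucket and become candidates; everything else is never compared."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

sigs = {
    "doc1": [1, 2, 3, 4, 5, 6, 7, 8],
    "doc2": [1, 2, 3, 4, 9, 9, 9, 9],  # matches doc1 in the first two bands
    "doc3": [7, 7, 7, 7, 7, 7, 7, 7],  # matches no band of the others
}
print(candidate_pairs(sigs))  # → {('doc1', 'doc2')}
```

Only the pairs returned here need a full similarity comparison, which is what tames the trillion-comparison problem described earlier.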

(E) performing comparisons, ranking, clustering, etc.

In some situations, you will want to make full comparisons of the candidate pairs, to confirm their similarity. Some candidates may be thrown out, or the candidates may be ranked by measure of similarity. Depending on the service, documents could be included, excluded, ranked, clustered, and so on.

Expansion of the Field

By now, heads may be swimming (mine is). The field expands quickly, with each stage of the LSH sequence being very deep.

LSH begins as a straightforward intuition about similarity hashes and comparisons. However, simple techniques fail at large scales (100M+ docs). Tackling these scale hurdles with approximation and clever algorithms then gives rise to families of LSH techniques around Hamming distance, families around Euclidean distance, and so on.

In practice, the simple intuition of similarity hashing becomes a careful sequencing of many detailed steps. It is not insurmountable, though, and these techniques are already well established in industry and even available in open source packages.


Thank you for taking the time to read this writeup. While this document represents many hours of my own work, I recognize the extant and parallel work of others on this topic at Catalyst. Discussion, questions, and feedback on any details large or small are warmly welcomed.

Thanks to JBull for feedback on drafts of this document.


References

Here, you’ll find some of the references which informed this document, and other interesting resources.

[1] A. Broder. On the resemblance and containment of documents. Sequences 1997. Link
This is the original paper on LSH.

[2] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. "Min-wise independent permutations". ACM Symposium on Theory of Computing, pp. 327–336, 1998. Link
This is the original MinHash paper.

[3] T. Chappell, S. Geva, A. Nguyen, G. Zuccon. Efficient top-k retrieval with signatures. ADCS 2013. (r)
T. Chappell, S. Geva, G. Zuccon. Approximate Nearest-Neighbour Search with Inverted Signature Slice Lists. ECIR 2015. (r)
These two papers, essentially the same, describe accessible and interesting techniques for very fast comparisons of bit vectors. They address the comparison problem with a mix of inverted indexes and probability. Email me for copies.

[4] S. Geva, C. De Vries. TopSig: Topology Preserving Document Signatures. CIKM 2011. (r) Link
This paper was published prior to the Chappell papers [3], and addresses the generation of document signatures, extending random projection by introducing tf-idf scores. It’s a bit vague, but there’s a C code implementation.

[5] Redacted – internal source

[6] J. Leskovec, A. Rajaraman, and J. Ullman. 2014. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA. (pr) Link
This is a pretty good book from a Stanford trio, covering a broad array of large scale data mining topics. It is available for free online and also used in a data mining class at Coursera.

[7] C. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. Link
This is a good, free book on Information Retrieval from Stanford.

[8] M. Slaney, M. Casey. Locality-Sensitive Hashing for Finding Nearest Neighbors. IEEE Signal Processing Magazine, March 2008.
This is a well respected description of one kind of Locality Sensitive Hashing.

[9] J. Wang, H. T. Shen, J. Song, and J. Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014 (pr) Link
This is a very lengthy and thorough survey of Locality Sensitive Hashing and similarity search. It holds particular interest because it was completed so recently (2014).

[10] C. Sadowski, G. Levin. SimHash: Hash-based Similarity Detection.
G. Manku, A. Jain, A. Das Sarma. Detecting Near-Duplicates for Web Crawling. WWW 2007. Link
These are two of the papers on SimHash from people with Google ties.

Deploying Python Applications and their Virtual Environments


As noted in a past article, we leverage virtualenv and pip to isolate and manage some of our python applications. A natural next question is “How can a python virtual environment and related application be deployed to a production server?”. This article provides a conceptual overview of one way such deployments can be handled.

The Server Environment and Conventions

First, let’s discuss some assumptions about the server environment. In this article, a deployment server, development server(s), and production server(s) are all discussed. It can be assumed that all of these servers run the same operating system (in this case, RHEL 6). This affords a luxury: virtual environments can be transported from one host to another with no ill effects and no requirement to build new virtual environments for each host.

Additionally, there are some directory conventions used which help assure consistency from host to host. The virtual environment is located in a standard path such as /opt/companyname/. The code for each python application is then located in a directory inside the virtual environment root. This makes for a set of paths like so:

Example directories:

/opt/company/myapp/            # the virtual env root
/opt/company/myapp/myapp/      # the application root
/opt/company/myapp/myapp/lib/  # the application library
/opt/company/myapp/myapp/bin/  # the application

The Build

Building the python application is a two-step process. First, the virtual environment is created or updated. Next, the desired version of the application is exported from the repository. This work all takes place on the deployment server.

Steps to build the virtual env and application:

# Go to the standard app location
cd /opt/company/

# Create the virtual env if needed
virtualenv ./myapp

# Export the desired copy of the app inside the virtual env root
svn export $repouri /opt/company/myapp/myapp/

# Activate the virtualenv
cd /opt/company/myapp/ && source ./bin/activate

# Install the requirements
cd /opt/company/myapp/myapp/
pip install -r ./requirements.txt


The Deploy

Once the virtualenv and application are built, the deployment can be handled with some rsync and scripting work. This same model can be used to deploy to development servers or production servers, maintaining consistency across any environment. It can also be used to deploy your application to a new server. While a bit of a simplification, the deployment can be envisioned as a simple for-loop around rsync.

Example deployment loop:

for host in $devservers; do
    rsync -avz --delete-after /opt/company/myapp/ $host:/opt/company/myapp/
done



This describes one of many ways python applications and their virtual environments can be deployed to remote hosts. It is a fairly simple matter to assemble these techniques into shell scripts for semi-automated build and deployment. Such scripts can then be enhanced with preferred conventions, as well as more intelligent handling of application restarts, rollbacks, configuration management, and other improvements particular to the application.

Leveraging Python Virtual Environments and pip


Python virtual environments are a common means by which python applications can be isolated from one another and also segregated from the global directories and packages of an operating system. They can provide a pragmatic compromise between the flexibility needs of software development and the stability standards of server management. While new techniques like containers and Docker may point to the future of application deployment, virtual environments remain a solid choice for local Python development as well as application management across a set of Linux servers.

Installing virtualenv and pip

To take full advantage of python virtual environments, both virtualenv and pip should be installed. The virtualenv package provides environment management and isolation, while pip provides python package installation and management within the virtual environment. These tools are not always available by default, though. To bootstrap, it is often easiest to install virtualenv, which bundles pip.

Installation of virtualenv varies across operating systems. Here, I will focus on Red Hat Enterprise Linux / CentOS because of their common usage for our server installations.

For both RHEL 5 and RHEL 6, the EPEL repository is a great resource for obtaining virtualenv and related python packages. Installing this repository can be as simple as installing a single rpm manually, or having it added to the server configuration using your preferred management system such as puppet or chef. Details of this installation are beyond the scope of this article. Readers may also wish to instead investigate and use the newer RedHat Software Collections.

Once the EPEL repository is installed, installing virtualenv (and pip) is just a simple rpm installation. For RHEL 6, python 2.6 is the default python installation. For RHEL 5, you must install python2.6 from EPEL in addition to the virtualenv package. Note that this means paying close attention to 2.4 vs 2.6 usage on RHEL 5 systems.

Installing virtualenv (and pip) on RHEL 6:

# Requires EPEL repo
[root@host]$ yum install python-virtualenv

Installing virtualenv (and pip) on RHEL 5:

# Requires EPEL repo
# Requires python2.6 from EPEL
[root@host]$ yum install python26
[root@host]$ yum install python26-virtualenv

Creating and Activating a Virtual Environment

With the necessary prerequisites in place, you can now create new virtual environments. The examples here illustrate usage on a Linux server. However, it is not necessary to have root privileges to use virtual environments or pip. You can use them for local exploration and development, on automated testing hosts, and of course on development and production servers.

Creating a new virtual environment:

# Choose a common location for your applications / virtual envs
[host]# cd /opt/company/

# Create a new virtual environment
[host]# virtualenv ./myapp/

Using the virtual environment requires that it be “activated”. This is a simple one step command.

Activate the virtual environment:

[root@host]# source ./myapp/bin/activate

Using a Virtual Environment Together with an Application

How you choose to use a virtual environment for an application can vary dramatically per situation and personal preferences. Once a virtual environment is activated, you can navigate the server filesystem and use any python code you wish. Conceptually, the virtualenv can be wholly separate from an application. In practice, we often tie a single application to a single virtualenv. Below are just two examples to illustrate differences.

A virtualenv and a local working copy of code:

# Here's a virtual env and activation
source ~/myvirts/myapp/bin/activate

# And here is a working copy of trunk from myapp
cd ~/mycode/svn/myapp-trunk/

An application inside a virtual env directory:

# A virtual env root and an application root inside
/opt/company/myapp/            # the virtual env root
/opt/company/myapp/myapp/      # the application root
/opt/company/myapp/myapp/lib/  # the application library
/opt/company/myapp/myapp/bin/  # the application

# So you can enter the virtual env and then run the application
cd /opt/company/myapp/
source ./bin/activate
cd ./myapp/

While not a standard convention, placing the application inside the virtualenv directory like this allows for synchronizing a whole virtualenv and application from server to server in a single isolated directory. You may prefer to decouple the directories for valid reasons, such as easier management in diverse server environments. As noted initially, conventions for joining a virtualenv and an app vary widely.

Installing Packages with pip

The pip application allows you to install and manage python packages. While it can be used globally, we focus here on usage within a virtual environment. Packages can be installed from several sources, the most common being the Python Package Index (PyPI).

Example installation of a common package named “requests”:

# List the current packages
(myapp)[host]# pip freeze

# Install a new package from pypi
(myapp)[host]# pip install requests
Downloading/unpacking requests
Downloading requests-2.3.0.tar.gz (429kB): 429kB downloaded
Running egg_info for package requests
Installing collected packages: requests
Running install for requests
Successfully installed requests
Cleaning up...

For a given application, you’ll want to keep a list of all the required external packages. The pip application provides a convention for this. You can place all the package names (and versions if desired) inside a file named “requirements.txt”. This file can then be stored in the root directory of your application, and managed like any other file in your version control system.

Example requirements.txt file

(myapp)[host]$cat requirements.txt

Install requirements from file:

# Make sure the virtualenv is already activated!
(myapp)[host]$pip install -r ./requirements.txt

Installing Packages from a Repository

The last item we’ll cover is the installation of packages from a repository. For a variety of reasons, the packages or versions you need to install with pip may not be available in pypi. A handy workaround can be to install a package straight from a repository. This may require some special setup effort on behalf of the package owner, but those details are beyond the scope of this article. This technique can be especially handy for installing internal packages straight from your company repository.

There are many possible variations to this installation method, which depend on the type of repository and other attributes. I’ll show two common examples.

Requirements entry which will install the langid package from the GitHub master branch:

(myapp)[host]$cat requirements.txt
-e git+

Requirements entry which will install a specific tagged version of an internal package (helper-lib) from a local svn repository:

(myapp)[host]$cat requirements.txt
-e svn+ssh://


The above examples illustrate some of the techniques used at Catalyst to isolate and manage python applications and their requirements. The flexibility of virtualenv and pip can allow you to use your own conventions for your own environment. Overall, we’ve found pip and virtualenv to be very useful for the management and deployment of python applications in our environment.