So Many Operations

Often, I will reduce a complex problem into a set of abstract computer “ops”. These ops aren’t meant to be an exact description of how the computer or the network would carry out the task, but a logical abstraction. Let me walk through a short example.

How does a file get to repono (storage)?

Ignoring all of the steps that got us to the point of putting the file in Repono, here’s how I would think of the operational cost of Repono (as of June 2016)

  • pfiles calls stork, stork pushes a job into ins-queue for every file, operational cost is 1*n
  • ins-queue records the job in a queue table, operational cost is 1*n
  • r2 asks for work from ins-queue and marks it WIP, operational cost is 1*n
  • r2 commits the file to repono, operational cost is 1*n
  • repono writes to disk, cost is 2*n for replica
  • r2 logs the outcome to logserver, operational cost is 1*n
  • r2 asks to remove the record from ins-queue, operational cost is 1*n

As such, if we have 1,000,000 documents, in this overview, we have 9,000,000 operations.

But there is much more in the details…

What’s going on in ins-queue?

ins-queue was developed in house. That could make it well tuned for our needs or another tool where we have to keep up with all aspects of it operationally. Your perspective will likely determine how you see this sort of software. Regardless, here’s some of the current stats of ins-queue (June 2016)

 +----+------------+-----------+
 | id |       tool |       cnt |
 +----+------------+-----------+
 | 44 |    roz_kvp |     62649 |
 | 43 |    roz_mdb |     62839 |
 | 1  |        roz |     62931 |
 | 57 | roz_merged |     79696 |
 | 5  |     grimes |    364235 |
 | 6  |     pdfopt |   3825390 |
 | 3  |   smithers |  10607081 |
 | 4  |    s3queue | 667402231 |
 +----+------------+-----------+

Yes, that is 667M documents that have flowed through ins-queue (s3queue) and (mostly) on to Repono. That also means we have done nearly 11M data overlays (smithers) and have optimized 3.8M PDFs (pdfopt). But there’s a little more…how do we know how many records? Every time we take a record out of s3queue, we ask to increment this counter. That’s one additional operation. If we were at 9n before, we’re now at 10n. But the chained events that update the counters also update another logging effort to keep track of activity in deletion mode. Our new count is 11n. So for the 667M documents listed here, we’ve taken 7.3B (yes, Billion) operations to store those documents in Repono (this doesn’t count optimizing, scanning, inventory, and even a full drill down on all of the logging). If we went to a granular level, I suspect that the act of storing a document with logging is probably 20n.

What’s going on in logserver?

We have a REST based generic logging tool called logserver or Ticket Tool.   It is very simple.  It was discussed previously and the source is here.

As adoption has increased, so has our desire to create more detailed logs.  This platform is one of the busiest in our operations (although that is a crowded field of busy platforms).   It can fill quickly and the long term value of the information decays pretty quickly.  So, to save storage, we have cron’d some administration to take the older files and zip them up in place.   A custom error handler will find the compressed file if you have a link to the original file.  That’s going pretty well until we look at our replication hosts.

Yesterday, it went into alarm.  So much data had accumulated.

banshee

Our monitoring has thresholds on operational limits.

The comfortable and generic syncing of all data was never updated to reflect the reaction to the broader adoption of the logging service.  As such, we would have the original file and the zipped file.  After removing the original files, we recovered over 60% of this resource.

All of this is happening behind the scenes and adds to the operational costs, choices, and scale of working in a distributed platform.

The Not So Small Stuff

Part of our application retrieves files from an internal storage system that behaves similarly to AWS S3. The system has four heads and is fronted by a network load balancer. Clients of the storage system, named Repono, retrieve authentication tickets from our vending system.  Repono then does an authorization look up and if everything is OK, the resource (aka “the file”) is handed out over HTTPS. Normally, our performance of this system is really fantastic and looks like this chart:

Response Time

Response time as recorded by our internal monitoring system

About a month ago, the performance of this system really changed. We went from routine performance under 1 second to sporadic responses clocking in over 10 seconds.

Uh Oh!

A different monitoring system using ELK measuring the response time of our middleware to Repono

This problem did not reveal itself easily. The major client of Repono was recently upgraded to a newer version of Microsoft .NET. One step we took was to revert the change and downgrade the version of .NET. For around 20 to 30 minutes, the performance was looking good, and then it reverted to the lumpy 10 second performance.

What else could have changed? This was not easy to see. The HTTPS traffic did not allow easy inspection and alternative testing with our go to baseline client, cURL could not reproduce the results. The problem was clearly affecting production and we couldn’t easily change out our middleware tools in .NET for a set of shell based testing scripts using cURL.

We needed to see inside the HTTP and TCP conversation of the .NET HTTP client, but the normal amount of debugging and logging did not record the details.

Thanks to the big’ol Internet, we stumbled on this helpful article from Mike Hadlow to get to the level of information we wanted out of .NET and the HTTP client.The only problem is that this logging doesn’t include time stamps and there was too much traffic to keep up and detangle out of the audit logs. Adjusting the load balancer, we could force most of the traffic through patched software with new logging. This greatly reduced the volume of data and we tried to watch in real time. This is what we saw

 1 System.Net Verbose: 0 : [5156] ConnectStream#43140910::Close()
 2 System.Net Verbose: 0 : [5156] Exiting ConnectStream#43140910::Close()
 3 System.Net Verbose: 0 : [5156] Exiting MyWebClient#1109663::DownloadData() -> Byte[]#50053605
 4 System.Net Verbose: 0 : [5156] MyWebClientWithHead#30034512::DownloadData(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 5 System.Net Verbose: 0 : [5156] MyWebClientWithHead#30034512::DownloadData(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 6 System.Net Verbose: 0 : [5156] WebRequest::Create(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED)
 7 System.Net Verbose: 0 : [5156] HttpWebRequest#45035759::HttpWebRequest(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 8 System.Net Verbose: 0 : [5156] Exiting HttpWebRequest#45035759::HttpWebRequest()
 9 System.Net Verbose: 0 : [5156] Exiting WebRequest::Create() -> HttpWebRequest#45035759
 10 System.Net Verbose: 0 : [5156] HttpWebRequest#45035759::GetResponse()
 11 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ServicePoint#17057466
 12 System.Net Information: 0 : [5156] Associating Connection#31630514 with HttpWebRequest#45035759
 13 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#66385004
 14 System.Net Information: 0 : [5156] HttpWebRequest#45035759 - Request: HEAD /v1/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED HTTP/1.1
 
 15 System.Net Information: 0 : [5156] ConnectStream#66385004 - Sending headers
 16 {
 17 x-auth-user: REDACTED
 18 x-auth-expiry: 2015-08-16T01:22:43Z
 19 x-auth-ticket: REDACTED
 20 Accept: text/xml
 21 Content-Type: text/xml
 22 User-Agent: REDACTED/version-2.4501/GetReponoAuthURL/gaggleid:REDACTED
 23 Host: REDACTED
 24 }.
 25 System.Net Error: 0 : [5156] Exception in the HttpWebRequest#45035759:: - The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
 26 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ServicePoint#17057466
 27 System.Net Information: 0 : [5156] Associating Connection#14267127 with HttpWebRequest#45035759
 28 System.Net Information: 0 : [5156] Connection#14267127 - Created connection from REDACTED:56478 to REDACTED:443.
 29 System.Net Information: 0 : [5156] TlsStream#34867337::.ctor(host=REDACTED, #certs=0)
 30 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#62287651
 31 System.Net Information: 0 : [5156] HttpWebRequest#45035759 - Request: HEAD /v1/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED HTTP/1.1
 32 System.Net Information: 0 : [5156] SecureChannel#50704654::.ctor(hostname=REDACTED, #clientCertificates=0, encryptionPolicy=RequireEncryption)
 33 System.Net Information: 0 : [5156] SecureChannel#50704654 - Left with 0 client certificates to choose from.
 34 System.Net Information: 0 : [5156] Using the cached credential handle.
 35 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = (null), targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 36 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffer length=0, Out-Buffer length=159, returned code=ContinueNeeded).
 37 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 38 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 39 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 40 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 41 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 42 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 43 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 44 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=134, returned code=ContinueNeeded).
 45 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 46 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 47 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 48 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=OK).
 49 System.Net Information: 0 : [5156] Remote certificate: [Version]
 50 V3
 51 REDACTED
 
 52 [Signature Algorithm]
 53 sha256RSA(1.2.840.113549.1.1.11)
 
 54 [Public Key]
 55 Algorithm: RSA
 56 Length: 2048
 57 Key Blob: REDACTED
 58 System.Net Information: 0 : [5156] ProcessAuthentication(Protocol=Tls, Cipher=Aes128 128 bit strength, Hash=Sha1 160 bit strength, Key Exchange=44550 256 bit strength).
 
 59 System.Net Information: 0 : [5156] ConnectStream#62287651 - Sending headers
 60 {
 61 x-auth-user: REDACTED
 62 x-auth-expiry: 2015-08-16T01:22:43Z
 63 x-auth-salt: REDACTED
 64 x-auth-ticket: REDACTED
 65 x-auth-dev-ticket: REDACTED
 66 Accept: text/xml
 67 Content-Type: text/xml
 68 User-Agent: REDACTED/version-2.4501/GetReponoAuthURL/gaggleid:REDACTED
 69 Host: REDACTED
 70 }.
 71 System.Net Information: 0 : [5156] Connection#14267127 - Received status line: Version=1.1, StatusCode=200, StatusDescription=OK.
 72 System.Net Information: 0 : [5156] Connection#14267127 - Received headers
 73 {
 74 X-Timestamp: 1413849343.99091
 75 X-Preauth-Uri: /repono/3dfc19b30213a7fd7297e1fb32815b95d15e7187a91e13c84b1c423c/REDACTED?token=PREA_49cd6e580ce3460948610e05e1dba031adf5bc19b098a2a98226cf4a
 76 X-Trans-Id: tx72179a9abf4a4d7e8cc65-0055cfe147
 77 Accept-Ranges: bytes
 78 Content-Length: 5378129
 79 Content-Type: application/pdf
 80 Date: Sun, 16 Aug 2015 01:03:03 GMT
 81 ETag: 40fb56bde55a37911d253debfa002005
 82 Last-Modified: Mon, 20 Oct 2014 23:55:44 GMT
 83 }.
 84 System.Net Information: 0 : [5156] ConnectStream#10452251::ConnectStream(Buffered 0 bytes.)
 85 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#10452251
 86 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with HttpWebResponse#24816868
 87 System.Net Verbose: 0 : [5156] Exiting HttpWebRequest#45035759::GetResponse() -> HttpWebResponse#24816868
 88 System.Net Verbose: 0 : [5156] HttpWebResponse#24816868::GetResponseStream()
 89 System.Net Information: 0 : [5156] ContentLength=5378129
 90 System.Net Verbose: 0 : [5156] Exiting HttpWebResponse#24816868::GetResponseStream() -> ConnectStream#10452251
 91 System.Net Verbose: 0 : [5156] ConnectStream#10452251::Read()
 92 System.Net Verbose: 0 : [5156] Exiting ConnectStream#10452251::Read() -> 0#0
 93 System.Net Verbose: 0 : [5156] ConnectStream#10452251::Close()

If you see it right away, contact me, I would like to hire you on to the team. For those of you that don’t immediately see it, tune your eyes to line 25 in the output.

Here’s what is going on. The .NET client is expecting to reuse the TCP socket and when it doesn’t get an answer, it’s waiting for 10 seconds and then establishes a new HTTPS session. Once we had this isolated, we patched the .NET code to close the connection after retrieving the file and tear down the TCP session. There could be a lot of discussion about what sessions and HTTP mean at the TCP level, but we could not adjust the Repono side and we could adjust our .NET side, so that is the approach we used.

This correction is very important, but it doesn’t easily answer what changed? Through a lot more work in the troubleshooting domain, we noticed that the load balancer had changed behavior specifically around long running sessions.  Patching the load balancer and our .NET code has put us in a better place.

Speeding Up Large File Transfer Over HTTP

The HTTP protocol is all around us. Can we work within HTTP/1.1 to speed up large file delivery?

What is going on in HTTP when a page is requested?

TCP + HTTP overview of a typical request to view a homepage.

Fig. 1: Simplified TCP + HTTP overview of a typical request to view a homepage.

Could this be sped up?  In HTTP/1.1 we are given two options, pipelining and byte serving.  It is worth understanding each.  In Fig. 1, the operations are presented serially.  In reality that is mostly the way they are carried out.  If the server supports pipelining, requests can be sent without waiting for a response within the same TCP session.

Fig. 2: Pipelining in HTTP/1.1

Fig. 2: Pipelining in HTTP/1.1

In Fig. 2, the same requests have been made in rapid succession.  But as defined, the pipelining feature of HTTP/1.1 will return the resources in the order of the request.  This is in essence a FIFO system and can suffer from head-of-line issues.  If the request for logo.png in Fig. 2 results in a 200 MB file, that resource will be delivered before other resources can continue.  Asynchronous delivery and other improvements are scheduled for HTTP/2.0, but they are not part of HTTP/1.1 and with Google’s withdrawal of SPDY, there aren’t a lot of improvements within pipelining that are available in browsers.

Byte Serving is another feature of HTTP/1.1.  Some content delivered over HTTP can be read progressively (like HTML) while other content needs to be delivered completely before your computer can do anything with it (like a Microsoft Word *.doc file).  PDF files fit into the former category.  A PDF viewer that connects with your browser knows about Byte Serving.  That is how it appears that PDFs stream for some combinations of browsers and PDF viewers.  This is accomplished by the server supporting Ranges and the client making use of Range requests.

Fig. 3: Byte Serving through the use of Ranges in HTTP/1.1

Fig. 3: A simplified version of Byte Serving through the use of Ranges in HTTP/1.1

If the resource can be used in a progressive manner, then byte serving can get chunks of a file and make them usable to the consumer.  Byte Serving can be combined with pipelining, but for reasons already discussed, there are only marginal gains through this approach.

Fig. 4: Combining Byte Serving and pipelining is possible but doesn't make material gains in performance.

Fig. 4: Combining Byte Serving and pipelining is possible but doesn’t make material gains in performance.

Let’s go back and look at that large file in a simple delivery.

Fig. 5 - A simple delivery of a large file.

Fig. 5 : A simple delivery of a large file.

If you need the entire file and can’t do anything progressively, you will end up waiting for the entire payload to complete your request.  Pipelining won’t help much nor will Byte Serving since you still need the whole file to finish.  What if you could make multiple parallel requests of the server asking for portions of the file?  We call this approach PBS or Parallel Byte Serving.

Fig. 7: Agent 1

Fig. 7: Agent 1

Fig. 8: Agent 2

Fig. 8: Agent 2

The file

http://REDACTED/public/p/byte.serve.bin

The meta

HTTP/1.1 200 OK
Date: Mon, 01 Jun 2015 20:04:15 GMT
Server: Apache/2.2.3 (CentOS)
Last-Modified: Fri, 29 May 2015 14:12:18 GMT
ETag: 19e197-181a5ceb-517391052f480
Accept-Ranges: bytes
Content-Length: 404380907
Expires: Wed, 15 Apr 2020 20:00:00 GMT
Cache-Control: public
Connection: close
Content-Type: application/unknown

Time to retrieve using a simple download.

time curl -s http://REDACTED/public/p/byte.serve.bin -XGET -ouptut full.pdf

real 22m2.913s
user 0m3.413s
sys 0m12.991s

By making use of the HTTP/1.1 HEAD call, we know the file is 404380907 bytes.  Now it’s simply a matter of configuring four distinct agents with their own TCP + HTTP session to the server to read four different ranges of the same file.  Here’s an example of one agent.

curl -s -H Range: bytes=0-101095226 http://REDACTED/public/p/byte.serve.bin -XGET –output bin.part1

Three more agents are started up in rapid succession with differences in the Range Request Header and different output files.  This was combined into a  simple script.

time ./multi.sh
STILL WORKING
(Repeat)
DONE

real 7m2.332s
user 0m3.722s
sys 0m11.659s

From 22 minutes to seven minutes.  That is promising.  But this is a very naive setup.  It assumes there is no authentication, TLS, or other expensive operations within the HTTP call. To be useful, PBS would need to be tested against our own production equipment.  Inside of our platform, we’ve done a lot to optimize intra machine communication so delivering a large file faster would have to make big strides for us to want to change some of our software.

To test production resource, the scripts repeatedly requested a 40MB file – both as a single file and as four separate agents asking for 25% of the file.  The PBS approach is faster, but not fast enough to take on the headaches of reassembling the parts and other changes in our agents.  Perhaps if files were much larger, like 100MB or more PBS would showcase it’s advantages more clearly.

Fig. 9: Comparing PBS and regular HTTP delivery of a 40 MB file

Fig. 9: Comparing PBS and regular HTTP delivery of a 40 MB file

The graph shows our average simple delivery was .9964 seconds while PBS was .9381 seconds.  Where are the enormous gains of the large file delivery outlined above?  Well this service is multiplexed with a load balancer, handles authentication, TLS, and other pieces of code.  The overhead for each agent session eats away at the gains of Parallel Byte Serving for smaller files.

Progressive Shrinking of Text

We needed a way to shrink the text of a document if we found it to be too large for our limits on indexing. To keep the methods consistent, we developed a quick web service in PHP called Shrinkray. Shrinkray receives a large text document and takes steps to make the text smaller by progressively removing information from the document.  This script is a one page web service.  It requires no additional code.

Actions for Shrinkray

  •  ax=lim – reports on the current limit being enforced by Shrinkray
  •  ax=sr – the main shrink ray method

Variables for Shrinkray

  • srpath = the path to a file you want to GET with shrinkray (the web server needs the RIGHT permissions to this path)

Typical Usage Patterns

curl --silent http://REDACTED/shrinkray/v1/?ax=lim

This will return the limit being enforced by Shrinkray.

curl --silent -v http://REDACTED/shrinkray/v1/?ax=sr&srpath=/path/to/file

This will have Shrinkray fetch the document specified by ‘srpath’ and shrink it.

Returns

Response code + payload. If the action is ‘lim’ the payload is the string representing the size limit. If the method is ‘sr’ and the status code is less than 399, the payload will be the shrunk document.

Response Codes for Shrinkray

  • HTTP/1.1 200 OK
  • HTTP/1.1 230 Shrunk via tags
  • HTTP/1.1 235 Shurnk via duplication
  • HTTP/1.1 239 Shrunk via punctuation
  • HTTP/1.1 240 Shrunk via number
  • HTTP/1.1 245 Shrunk via lowercase
  • HTTP/1.1 250 Shrunk via header
  • HTTP/1.1 400 Bad Request
  • HTTP/1.1 410 Gone
  • HTTP/1.1 412 Precondition Failed
  • HTTP/1.1 413 Request Entity Too Large
  • HTTP/1.1 500 Internal Server Error

Response Codes and Events

  • 200 when you ask for the limit over GET or POST
  • 230 when the HTML tags have been removed
  • 235 when the duplicate strings have been removed
  • 239 when punctuation (defined by unicode character class \p{Punctuation}) has been removed
  • 240 when numbers (defined by the unicode character class \p{Number}) have been removed
  • 245 when the words have been lower cased and the duplicates have been removed
  • NOTE: 230-245 are incremental, they include the previous steps
  • 250 when the text could not be shrunk and a chunk of the document lower than ax=lim has been returned to you in the payload
  • 400 when you use the ax=sr and do not provide srpath (GET)
  • 410 when you specify the srpath and the file does not exist
  • 412 when you provide a document that is smaller than ax=lim and no action is taken
  •  413 when all the shrinking processes have failed and nothing is done to your document
  • 500 when something unexpected happens

Using the Response Code

If your response is within 230 to 250, your document has been shrunk and the smaller document is in the payload. It is up to you to persist this document.

You can download a redacted version of the PHP script here

Beware cURL Plus Files

Sometimes a queue worker needs to be no more glamorous than a shell script. If your queues are HTTP and so are the other services, it’s easy to reach for the shell and the venerable cURL. cURL is the UNIX operators default user agent. if it doesn’t work in cURL, there’s a good chance it won’t work in other situations.

We have a queue worker that interacts with several web services. It follows this rough outline…

1. Check for work
2. Get configs
3. Coordinate, communicate
4. Do work

Repeat forever which could be “a mighty long time” as Prince once told us.

The last step is the most interesting, but a little more background…

It is not hard to envision some generic functions in the shell. A generic logger function handles logging. Here are some near real world snippets of code.

1. RSTOPWATCHBEGIN=$(date +”%s.%N”)
2. curl -ski -H “x-auth-user: ${RUSER}” -H “x-auth-expiry: ${AEXPIRY}” “${THIS_RURL}${BUCKETNAME}/${USAFE}” -XPUT –upload-file “${SOURCE}” > $RRESULTS 2>$SPECIAL_ERR_LOG
3. RSTOPWATCHEND=$(date +”%s.%N”)

You can see from this example that the time to interact with this service is the difference between RSTOPWATCHEND (line 3) and RSTOPWATCHBEGIN (line 1). Because these are more granular than “second” you will need to do floating point math commonly in awk or bc (or hope your shell supports it, most do not). Passing it to the logger function records it for evaluation later.

cURL is a rock star. In this worker script, when doing the work of getting configs and communicating over HTTP, routinely, the work completes in hundredths of seconds. The way the script is set up, that includes the time to invoke cURL.

Here is the output of some of those results…

5320103 GETs
0.016 seconds per transaction

But when that interaction involves grabbing a file that is available locally through an NFS mount, the results go south quickly.

Here are those results…

961375 total seconds
507016 PUTs
1.896 seconds per transaction

What can it be? Clearly, it should not be cURL, too many other services are being interacted with over HTTP with expected results. It must be the web service. It is just slower and more complicated than the other HTTP services.

Here is where the story could have ended.

For a number of reasons, we had other code running against this same service. One was using a mojolicious library. The average interaction time with the same service doing the same work was 0.5 seconds. That is not insignificant when you do an operation 3 million times a day. But this worker was PUTing files already in memory. So it is not quite the same.

A different worker was built using Python and the Request library for HTTP. This code had a much smaller transaction time with the web service too.

Here are those results…

21180 total seconds
127479 PUTs
0.166 seconds per transaction

The timing calls are isolated to the same transaction. The files are still fetched over NFS. The service is still authenticated. The service is still using SSL. Finally, the most important thing is that the Python code was running on the same machines as the scripts using cURL. We can comfortably use the phrase, “holding all other variables equal…”

What can account for the 1.6 second difference?

Now it is hard to ignore cURL. We suspect that there is more overhead than we anticipate for cURL to spawn a child process and pull that file into the PUT. Other influencers may include slower authentication responses or less efficient SSL libraries.

If you love and use cURL, you may want to dig into the logs and check,your performance. It might me worth using a different tool for the heavy lifting.

Development and Engineering

As a smaller company, we all have jobs with many different hats.  Over the last couple of years, I’ve been able to finally understand at least two of the mindsets within our development team.  These two mindsets are almost like different hats.  I call them development and engineering and here’s how I see the two mindsets coming together during this project.

We wanted to move from a complicated scale out process to something simpler.  Currently we have a language identification tool supplied as a feature of a bigger data manager.  The language identification is in essence a library that is not separable from the data manager.  Thus, getting the language of a document requires interaction with the data manager.  If we wanted to make this simpler, we have to ask a question: Is it possible to identify language outside of the data manager?  Of course this answer is “surely, it must be possible,” but as a small business, we can’t just know it’s possible out in the world, it has to be possible for our skill sets and has to be something we can afford to spend the time on to make it work.

This first task we like to call feasibility.  It combines a little bit of engineering and a little bit of development to search for answers.  The engineering mindset has to look for something that would work here — with our platform and our constraints.  The developer mindset has to know the code can be manipulated for our needs.  This may be through simple process control or through modification to tailor towards the way things work now.  The results of this search is often captured in a document format based on SCQA that we call the SCPA.  Our difference is that we fell the question is often self evident and we present a proposal to the situation and complication along with alternatives if we can find them.

If we move forward with the SCPA, the developer’s next job is to get something working.  This often acts like a proof of concept.  Our newer developers always feel very proud when they complete this stage.  This is often what the idealized version of development looks like in our dreams.

If the results of getting something working confirm the assertions in the SCPA, we’ll take the next step which makes the developer return to their engineering mindset.  They have to start thinking about operational issues.  How fast is it?  How big is it?  What’s likely to break?  How will we know if it is working correctly?  We answer many of the questions through functional testing and unit testing, but it’s hard to answer some of the other questions without running the software in the real world or something that gets really close to approximating the real world (this is way harder than it sounds).

While working on the language identifier, I was happy to see it working.  I worked with the Product Manager to complete acceptance testing — in this case, did it give us the right answers?  We set up a list of over 20K documents that we knew the language for and began the testing.  The tests seem to be taking a long time, but I had plenty of other stuff to do so it was ok to let them run.  I bundled up the results and passed it to the Product Manager so the analysis could begin, but I had to go back and address the speed issues.  Again, the engineering and development mindset have to work in concert.  I collaborated with my team and collected a lot of good advice including doing the smallest things possible to get acceptable speed without having to drift towards thread management and spawning.  We all agreed that loading the language model for every document was a lot of overhead and if we could load it once and run many documents through the same process we should see an appreciable improvement.  If this wasn’t going to be enough, then go back and revisit threading or multiprocessing options.

The original code was invoked by letting the OS pass the files into one at a time.  This of course means reloading the very large model each time the script was invoked.

$ for D in ../corpus/docs/9/*body; do argot.py -d $D -o ~/out; done

real	44m12.313s
user	40m35.763s
sys	3m21.840s

The revision handles the OS work within the code and loads the model only once.  The results were astonishing.

$ argot.py -p ../corpus/docs/9/ -o ~/out

real 0m10.558s
user 0m10.272s
sys 0m0.265s

This is the same set of documents

$ ls -1 ../corpus/docs/9/*body | wc -l
    2095

We went from 44m12s to 10s. These details are critical to our operations and they combine development and engineering mindset to get the best results.

Milestones

This week, we crossed an interesting milestone in operations – the creation of our 500,000 ticket.  A long time ago, we ran things by email.  Moving to a ticket based system for tracking work was not trivial.  What was a ticket?  An email seemed to be easy to understand – whatever was in the email was in the email.  But what should a ticket be?  We decided to go with the loose idea that a ticket tracked a unit of work.  The definition wasn’t made more specific than that.  After our people got used to using tickets, we started hooking up our software to the ticketing system.  We integrated monitoring first and then came status updates from various software jobs.  We stumbled here a bit because the ticket was not the same as logging from a process.  If we treated the ticket as a log, we could have tickets with 20K entries in them.  That wasn’t making the tickets more useful, just more noisy.  So we came up with a different idea – the ticket tool.  The ticket tool is a very simple PHP application that accepts a ticket number, a task, and a note.  It appends to a text file.  It was written a long time ago, so it does things we probably wouldn’t do now like it returns status codes in the HTML body instead of using status codes in the HTTP header.  It’s also old enough to have been started in CVS.  (Redacted source at the end of this post).

With the invention of Ticket Tool, the view of the ticket changed subtly.  Instead of being the place to track the details of a unit of work, it became the hub to find all of the details.  The secret was simply recording URL links to the ticket tool inside the ticket.  Now it’s not uncommon for our tickets to have five to ten different tools recording details in ticket tool and posting links in our ticket.

Capturing events and details.

We have integration with tickets baked in everywhere in operations.  We have hooks for mail, bash, python, Windows, and probably everything else too.  We use the API from the ticketing system, but we also have written our own that does more things than the original API.  We have a system that extracts the records from the ticket database and converts them to XML to be loaded into a full text system that gives us powerful searching of the ticket history.  Our use of tickets will likely continue to grow.

Here’s the monthly count of tickets created since we started.

Ticket-count-monthly

Our ticketing system is provided by UserScape’s HelpSpot.  We’ve had great success with Ian and his team.

Ticket Tool Source

Does anyone like queues?

We have to deal with lots of documents. When customers give us one document, we may end up creating two or three different versions. That is a quick multiplier, one million documents can become three million documents easily. Since we own our colocation hardware, dynamic scaling like AWS is only partly real to us. We have to build a useful capacity and then decide if that level services the business well enough. Build it too big and it a lot of CapEx is wasted. Build it too small and it won’t survive the demands. The right size for us is often not the high watermark of demand, it’s somewhere beneath that.

What happens when demand exceeds the capacity? A familiar queue steps in place to buffer the work. As soon as you put in a queue, you confront issues of how the queue should run: FIFO (first in, first out), priorities, equality, or any other ideas. With many customers consuming the same infrastructure, there’s not a single answer that works well for queues. FIFO breaks down if Customer A has four million docs and and Customer B has only 100 documents. Customer B has to wait behind Customer A until their job completes.

One of our home grown queueing systems is backed with MySQL. Jobs come into a pretty simple table. Our first approach was to add an ordering table. We can manipulate the ordering table and combine it with the jobs table to sort the work. This would look something like:

SELECT a.qid FROM view_queue a, queue_job_size b WHERE a.jobid=b.jobid and a.block = 0 order by b.cnt asc LIMIT 1

At this point, we had a decent working countermeasure with the queue_job_size table helping out with ordering. Customer B wouldn’t be living in the shadow of Customer A.

Then the circumstances changed again. At one point we had too many small jobs and we weren’t getting around to the large jobs. Our system favored anything that looked like Customer B and Customer A was at the bottom of the list. Ugh.

We added an override table. Through a web interface, a technician could signal to the queue that one job they knew about needed more attention. The signal was captured in the queue_job_priority table

+--------+----------+------------+--------+
| jobid  | priority | ts_update  | active |
+--------+----------+------------+--------+
|  12345 |     1000 | 1394030636 |      0 |
| 473435 |    10000 | 1400627124 |      0 |
| 477280 |    10000 | 1401408608 |      0 |
| 482175 |      500 | 1403140692 |      0 |
| 484328 |      500 | 1403140692 |      0 |
| 484466 |      500 | 1403140692 |      0 |
| 485264 |    10000 | 1403192993 |      0 |
+--------+----------+------------+--------+

Now we could alter the ordering table and take into account special circumstances. Updating the table was accomplished through cron.

update queue_job_size a, queue_job_priority b set a.cnt=b.priority where a.jobid=b.jobid and b.active=1;

This countermeasure was released and we felt pretty good that we could work on the small jobs, get around to the big jobs, and then pay attention to special exceptions.

Except that it didn’t work well when people were sleeping. Customers can load data into our platform without any one being around to help, and they do just that. If a customer loaded a large job and no one was around to prioritize it, we ran into the same problems as before.

We devised a slightly different strategy. On even hours of the day, we manipulate the ordering table for the smallest jobs. On odd hours of the day, we manipulate the ordering table to work on the oldest jobs.

Here’s the ordering table on an even hour:

+--------+------+------------+
| jobid  | cnt  | ts_update  |
+--------+------+------------+
| 499925 |  278 | 1406227561 |
| 499913 |  413 | 1406227561 |
| 499915 |  434 | 1406227561 |
| 499939 |  450 | 1406227561 |
| 499973 |  660 | 1406227561 |
| 499923 |  677 | 1406227561 |
| 499927 |  848 | 1406227561 |
| 499933 |  878 | 1406227561 |
| 499931 | 1023 | 1406227561 |
| 499910 | 1153 | 1406227561 |
| 497980 |  100 | 1406215802 |
| 498048 |  100 | 1406216187 |
+--------+------+------------+

Here’s the same ordering table on an odd hour, with the oldest work prioritized

+--------+-------+------------+
| jobid  | cnt   | ts_update  |
+--------+-------+------------+
| 498048 | 10000 | 1406228521 |
| 498106 | 10005 | 1406228521 |
| 498113 | 10010 | 1406228521 |
| 498154 | 10015 | 1406228521 |
| 498228 | 10020 | 1406228521 |
| 498237 | 10025 | 1406228521 |
| 498293 | 10030 | 1406228521 |
| 498339 | 10035 | 1406228521 |
| 498346 | 10040 | 1406228521 |
| 498349 | 10045 | 1406228521 |
| 497980 | 10000 | 1406215802 |
+--------+-------+------------+

The ordering table still respects the exception driven priorities.

Will queues ever please anyone? I suspect not. This current system is an attempt to thread the needle and please most people.