The Not So Small Stuff

Part of our application retrieves files from an internal storage system, named Repono, that behaves similarly to AWS S3. The system has four heads and is fronted by a network load balancer. Clients of the storage system retrieve authentication tickets from our vending system; Repono then does an authorization lookup and, if everything is OK, the resource (a.k.a. “the file”) is handed out over HTTPS. Normally, the performance of this system is fantastic and looks like this chart:

Response time as recorded by our internal monitoring system

About a month ago, the performance of this system really changed. We went from routine responses under 1 second to sporadic responses clocking in at over 10 seconds.

Uh Oh!

A different monitoring system using ELK measuring the response time of our middleware to Repono

This problem did not reveal itself easily. The major client of Repono had recently been upgraded to a newer version of Microsoft .NET, so one step we took was to revert that change and downgrade .NET. For around 20 to 30 minutes the performance looked good, and then it returned to the lumpy 10-second behavior.

What else could have changed? This was not easy to see. The HTTPS traffic did not allow easy inspection, and alternative testing with our go-to baseline client, cURL, could not reproduce the results. The problem was clearly affecting production, and we couldn’t easily swap out our middleware tools in .NET for a set of shell-based testing scripts using cURL.

We needed to see inside the HTTP and TCP conversation of the .NET HTTP client, but the normal amount of debugging and logging did not record the details.

Thanks to the big ol’ Internet, we stumbled on this helpful article from Mike Hadlow that showed how to get the level of information we wanted out of .NET and its HTTP client. The only problem is that this logging doesn’t include timestamps, and there was too much traffic to keep up with and untangle from the audit logs. By adjusting the load balancer, we could force most of the traffic through the patched software with the new logging. This greatly reduced the volume of data, and we tried to watch in real time.
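This verbose System.Net output is switched on through the application’s configuration file. A minimal sketch of the kind of system.diagnostics section that enables it (the listener name and log file path here are illustrative, not our production values):

<system.diagnostics>
  <sources>
    <source name="System.Net" tracemode="includehex" maxdatasize="1024">
      <listeners>
        <add name="System.Net"/>
      </listeners>
    </source>
  </sources>
  <switches>
    <add name="System.Net" value="Verbose"/>
  </switches>
  <sharedListeners>
    <!-- write the trace to a local file; autoflush keeps it current -->
    <add name="System.Net"
         type="System.Diagnostics.TextWriterTraceListener"
         initializeData="network.log"/>
  </sharedListeners>
  <trace autoflush="true"/>
</system.diagnostics>

This is what we saw: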

 1 System.Net Verbose: 0 : [5156] ConnectStream#43140910::Close()
 2 System.Net Verbose: 0 : [5156] Exiting ConnectStream#43140910::Close()
 3 System.Net Verbose: 0 : [5156] Exiting MyWebClient#1109663::DownloadData() -> Byte[]#50053605
 4 System.Net Verbose: 0 : [5156] MyWebClientWithHead#30034512::DownloadData(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 5 System.Net Verbose: 0 : [5156] MyWebClientWithHead#30034512::DownloadData(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 6 System.Net Verbose: 0 : [5156] WebRequest::Create(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED)
 7 System.Net Verbose: 0 : [5156] HttpWebRequest#45035759::HttpWebRequest(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 8 System.Net Verbose: 0 : [5156] Exiting HttpWebRequest#45035759::HttpWebRequest()
 9 System.Net Verbose: 0 : [5156] Exiting WebRequest::Create() -> HttpWebRequest#45035759
 10 System.Net Verbose: 0 : [5156] HttpWebRequest#45035759::GetResponse()
 11 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ServicePoint#17057466
 12 System.Net Information: 0 : [5156] Associating Connection#31630514 with HttpWebRequest#45035759
 13 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#66385004
 14 System.Net Information: 0 : [5156] HttpWebRequest#45035759 - Request: HEAD /v1/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED HTTP/1.1
 
 15 System.Net Information: 0 : [5156] ConnectStream#66385004 - Sending headers
 16 {
 17 x-auth-user: REDACTED
 18 x-auth-expiry: 2015-08-16T01:22:43Z
 19 x-auth-ticket: REDACTED
 20 Accept: text/xml
 21 Content-Type: text/xml
 22 User-Agent: REDACTED/version-2.4501/GetReponoAuthURL/gaggleid:REDACTED
 23 Host: REDACTED
 24 }.
 25 System.Net Error: 0 : [5156] Exception in the HttpWebRequest#45035759:: - The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
 26 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ServicePoint#17057466
 27 System.Net Information: 0 : [5156] Associating Connection#14267127 with HttpWebRequest#45035759
 28 System.Net Information: 0 : [5156] Connection#14267127 - Created connection from REDACTED:56478 to REDACTED:443.
 29 System.Net Information: 0 : [5156] TlsStream#34867337::.ctor(host=REDACTED, #certs=0)
 30 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#62287651
 31 System.Net Information: 0 : [5156] HttpWebRequest#45035759 - Request: HEAD /v1/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED HTTP/1.1
 32 System.Net Information: 0 : [5156] SecureChannel#50704654::.ctor(hostname=REDACTED, #clientCertificates=0, encryptionPolicy=RequireEncryption)
 33 System.Net Information: 0 : [5156] SecureChannel#50704654 - Left with 0 client certificates to choose from.
 34 System.Net Information: 0 : [5156] Using the cached credential handle.
 35 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = (null), targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 36 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffer length=0, Out-Buffer length=159, returned code=ContinueNeeded).
 37 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 38 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 39 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 40 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 41 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 42 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 43 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 44 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=134, returned code=ContinueNeeded).
 45 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 46 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 47 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 48 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=OK).
 49 System.Net Information: 0 : [5156] Remote certificate: [Version]
 50 V3
 51 REDACTED
 
 52 [Signature Algorithm]
 53 sha256RSA(1.2.840.113549.1.1.11)
 
 54 [Public Key]
 55 Algorithm: RSA
 56 Length: 2048
 57 Key Blob: REDACTED
 58 System.Net Information: 0 : [5156] ProcessAuthentication(Protocol=Tls, Cipher=Aes128 128 bit strength, Hash=Sha1 160 bit strength, Key Exchange=44550 256 bit strength).
 
 59 System.Net Information: 0 : [5156] ConnectStream#62287651 - Sending headers
 60 {
 61 x-auth-user: REDACTED
 62 x-auth-expiry: 2015-08-16T01:22:43Z
 63 x-auth-salt: REDACTED
 64 x-auth-ticket: REDACTED
 65 x-auth-dev-ticket: REDACTED
 66 Accept: text/xml
 67 Content-Type: text/xml
 68 User-Agent: REDACTED/version-2.4501/GetReponoAuthURL/gaggleid:REDACTED
 69 Host: REDACTED
 70 }.
 71 System.Net Information: 0 : [5156] Connection#14267127 - Received status line: Version=1.1, StatusCode=200, StatusDescription=OK.
 72 System.Net Information: 0 : [5156] Connection#14267127 - Received headers
 73 {
 74 X-Timestamp: 1413849343.99091
 75 X-Preauth-Uri: /repono/3dfc19b30213a7fd7297e1fb32815b95d15e7187a91e13c84b1c423c/REDACTED?token=PREA_49cd6e580ce3460948610e05e1dba031adf5bc19b098a2a98226cf4a
 76 X-Trans-Id: tx72179a9abf4a4d7e8cc65-0055cfe147
 77 Accept-Ranges: bytes
 78 Content-Length: 5378129
 79 Content-Type: application/pdf
 80 Date: Sun, 16 Aug 2015 01:03:03 GMT
 81 ETag: 40fb56bde55a37911d253debfa002005
 82 Last-Modified: Mon, 20 Oct 2014 23:55:44 GMT
 83 }.
 84 System.Net Information: 0 : [5156] ConnectStream#10452251::ConnectStream(Buffered 0 bytes.)
 85 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#10452251
 86 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with HttpWebResponse#24816868
 87 System.Net Verbose: 0 : [5156] Exiting HttpWebRequest#45035759::GetResponse() -> HttpWebResponse#24816868
 88 System.Net Verbose: 0 : [5156] HttpWebResponse#24816868::GetResponseStream()
 89 System.Net Information: 0 : [5156] ContentLength=5378129
 90 System.Net Verbose: 0 : [5156] Exiting HttpWebResponse#24816868::GetResponseStream() -> ConnectStream#10452251
 91 System.Net Verbose: 0 : [5156] ConnectStream#10452251::Read()
 92 System.Net Verbose: 0 : [5156] Exiting ConnectStream#10452251::Read() -> 0#0
 93 System.Net Verbose: 0 : [5156] ConnectStream#10452251::Close()

If you see it right away, contact me; I would like to hire you onto the team. For those of you who don’t immediately see it, tune your eyes to line 25 in the output.

Here’s what is going on. The .NET client expects to reuse the TCP socket; when it doesn’t get an answer, it waits 10 seconds and then establishes a new HTTPS session. Once we had this isolated, we patched the .NET code to close the connection after retrieving the file and tear down the TCP session. There could be a lot of discussion about what sessions and keep-alive mean at the TCP level, but we could not adjust the Repono side and we could adjust our .NET side, so that is the approach we took.
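The patch amounts to telling the client not to count on keep-alive at all. A minimal sketch of that kind of change, assuming a WebClient subclass like the MyWebClientWithHead seen in the trace (the real code does more than this):

using System;
using System.Net;

class MyWebClientWithHead : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = (HttpWebRequest)base.GetWebRequest(address);
        // Send "Connection: close" so the socket is torn down after each
        // transfer instead of waiting on a keep-alive the server abandons.
        request.KeepAlive = false;
        return request;
    }
}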

This correction is very important, but it doesn’t easily answer the question of what changed. Through a lot more troubleshooting work, we noticed that the load balancer had changed its behavior, specifically around long-running sessions. Patching the load balancer and our .NET code has put us in a better place.


Why you should embrace a RabbitMQ client library

Recently I had to re-run a lot of documents into one of our applications. The app lives on AWS and ingesting content involves the use of a RabbitMQ queue.

I’ve often used the amqp-tools/rabbitmq-c for quick ad-hoc work in the past, so I wrote a very terse bash script to feed the list of documents to the queue. That script worked just fine, but I was in a hurry, and I added quite a few queue clients to get the work done more quickly.

I stalled out in terms of rate and when I looked a bit more closely I found that my bash script wasn’t able to keep the queue fed sufficiently and my clients were going idle.

I also have some Ruby code using the bunny library and decided to re-write my feed script using that.
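The crucial difference is that the library opens one connection and one channel and pushes every message through them, instead of paying connection setup on every message. A minimal sketch of that pattern with bunny (the queue name and input file are made up for illustration):

require "bunny"

conn = Bunny.new          # connection details default to localhost in this sketch
conn.start
channel = conn.create_channel
queue   = channel.queue("ingest", durable: true)

# one persistent connection feeds every message
File.foreach("documents.txt") do |line|
  queue.publish(line.chomp, persistent: true)
end

conn.close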

The results were startling.

Pushing 100,000 messages to the queue using the bash approach took about 28 minutes.

The Ruby version using a RabbitMQ library with a persistent connection did the same work in 35 seconds!

During a later run I pushed 1 million messages to RabbitMQ from a single client using the Ruby code. That run took 6.5 minutes, for an effective rate of 2500 messages per second. The server runs on an r3.large, and with that push plus all the clients reading from it, the load rose to only around 1.5. That is a stark contrast to the bash version of the script, during which I would see the load rise to 4+.

I didn’t take the time to dig deeply into whether this was due to process spawning in the bash script or to the overhead of connection setup/teardown with RabbitMQ. Given the load impact on the RabbitMQ server from the bash script (which ran on a different system), I’m confident that it’s not process spawning, but instead a lot of extra burden on RabbitMQ to deal with all those connection requests.

In the end it just speaks to the practicality of using a client library the right way when interactions with RabbitMQ are going too slowly.

@matthias

Beware cURL Plus Files

Sometimes a queue worker needs to be no more glamorous than a shell script. If your queues are HTTP and so are the other services, it’s easy to reach for the shell and the venerable cURL. cURL is the UNIX operator’s default user agent: if it doesn’t work in cURL, there’s a good chance it won’t work in other situations.

We have a queue worker that interacts with several web services. It follows this rough outline…

1. Check for work
2. Get configs
3. Coordinate, communicate
4. Do work

Repeat forever, which could be “a mighty long time,” as Prince once told us.

The last step is the most interesting, but a little more background…

It is not hard to envision some generic functions in the shell. A generic logger function handles logging. Here are some near real world snippets of code.

1. RSTOPWATCHBEGIN=$(date +"%s.%N")
2. curl -ski -H "x-auth-user: ${RUSER}" -H "x-auth-expiry: ${AEXPIRY}" "${THIS_RURL}${BUCKETNAME}/${USAFE}" -XPUT --upload-file "${SOURCE}" > $RRESULTS 2>$SPECIAL_ERR_LOG
3. RSTOPWATCHEND=$(date +"%s.%N")

You can see from this example that the time to interact with this service is the difference between RSTOPWATCHEND (line 3) and RSTOPWATCHBEGIN (line 1). Because these timestamps are more granular than a second, you will need to do floating-point math, commonly in awk or bc (or hope your shell supports it; most do not). Passing the difference to the logger function records it for evaluation later.
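For example, the elapsed time can be computed with bc and handed off for logging (the logger function name here is hypothetical):

# floating-point subtraction has to happen outside the shell, e.g. in bc
RDURATION=$(echo "${RSTOPWATCHEND} - ${RSTOPWATCHBEGIN}" | bc)
logger_func "PUT took ${RDURATION}s"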

cURL is a rock star. In this worker script, the work of getting configs and communicating over HTTP routinely completes in hundredths of a second. The way the script is set up, that includes the time to invoke cURL.

Here is the output of some of those results…

5320103 GETs
0.016 seconds per transaction

But when that interaction involves grabbing a file that is available locally through an NFS mount, the results go south quickly.

Here are those results…

961375 total seconds
507016 PUTs
1.896 seconds per transaction

What can it be? Clearly, it should not be cURL; too many other services are being interacted with over HTTP with expected results. It must be the web service; it is just slower and more complicated than the other HTTP services.

Here is where the story could have ended.

For a number of reasons, we had other code running against this same service. One was using the Mojolicious library. The average interaction time with the same service doing the same work was 0.5 seconds. That is not insignificant when you do an operation 3 million times a day. But this worker was PUTting files already in memory, so it is not quite the same.

A different worker was built using Python and the Requests library for HTTP. This code had a much smaller transaction time with the web service too.
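For comparison, a minimal sketch of that style of worker. The URL, header names, and paths mirror the shell snippet above, but the values are placeholders, not our real ones:

import requests

# Placeholder values; the real worker reads these from its config.
RUSER = "svc-account"
AEXPIRY = "2015-08-16T01:22:43Z"
THIS_RURL = "https://storage.example.internal/v1/repono/"
BUCKETNAME = "bucket"
USAFE = "object-name"
SOURCE = "/nfs/path/to/file.pdf"

session = requests.Session()  # one persistent connection, reused across PUTs
session.headers.update({"x-auth-user": RUSER, "x-auth-expiry": AEXPIRY})

with open(SOURCE, "rb") as fh:
    response = session.put(THIS_RURL + BUCKETNAME + "/" + USAFE, data=fh)
response.raise_for_status()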

Here are those results…

21180 total seconds
127479 PUTs
0.166 seconds per transaction

The timing calls are isolated to the same transaction. The files are still fetched over NFS. The service is still authenticated. The service is still using SSL. Finally, the most important thing is that the Python code was running on the same machines as the scripts using cURL. We can comfortably use the phrase, “holding all other variables equal…”

What can account for the 1.6 second difference?

Now it is hard to ignore cURL. We suspect that there is more overhead than we anticipate for cURL to spawn a child process and pull that file into the PUT. Other influencers may include slower authentication responses or less efficient SSL libraries.

If you love and use cURL, you may want to dig into the logs and check your performance. It might be worth using a different tool for the heavy lifting.

Development and Engineering

As a smaller company, we all wear many different hats. Over the last couple of years, I’ve finally been able to understand at least two of the mindsets within our development team. These two mindsets are almost like different hats. I call them development and engineering, and here’s how I see the two coming together during this project.

We wanted to move from a complicated scale-out process to something simpler. Currently we have a language identification tool supplied as a feature of a bigger data manager. The language identification is in essence a library that is not separable from the data manager. Thus, getting the language of a document requires interaction with the data manager. If we wanted to make this simpler, we had to ask a question: is it possible to identify language outside of the data manager? Of course the answer is “surely, it must be possible,” but as a small business, we can’t just know it’s possible out in the world; it has to be possible for our skill sets and has to be something we can afford to spend the time on to make it work.

This first task we like to call feasibility. It combines a little bit of engineering and a little bit of development to search for answers. The engineering mindset has to look for something that would work here, with our platform and our constraints. The developer mindset has to know the code can be manipulated for our needs. This may be through simple process control or through modification to tailor it to the way things work now. The results of this search are often captured in a document format based on SCQA that we call the SCPA. Our difference is that we feel the question is often self-evident, so we present a proposal to the situation and complication, along with alternatives if we can find them.

If we move forward with the SCPA, the developer’s next job is to get something working.  This often acts like a proof of concept.  Our newer developers always feel very proud when they complete this stage.  This is often what the idealized version of development looks like in our dreams.

If the results of getting something working confirm the assertions in the SCPA, we’ll take the next step which makes the developer return to their engineering mindset.  They have to start thinking about operational issues.  How fast is it?  How big is it?  What’s likely to break?  How will we know if it is working correctly?  We answer many of the questions through functional testing and unit testing, but it’s hard to answer some of the other questions without running the software in the real world or something that gets really close to approximating the real world (this is way harder than it sounds).

While working on the language identifier, I was happy to see it working. I worked with the Product Manager to complete acceptance testing; in this case, did it give us the right answers? We set up a list of over 20K documents whose language we knew and began the testing. The tests seemed to be taking a long time, but I had plenty of other stuff to do, so it was OK to let them run. I bundled up the results and passed them to the Product Manager so the analysis could begin, but I had to go back and address the speed issues. Again, the engineering and development mindsets have to work in concert. I collaborated with my team and collected a lot of good advice, including doing the smallest things possible to get acceptable speed without having to drift toward thread management and spawning. We all agreed that loading the language model for every document was a lot of overhead, and that if we could load it once and run many documents through the same process we should see an appreciable improvement. If this wasn’t going to be enough, then we would go back and revisit threading or multiprocessing options.

The original code was invoked by letting the OS pass the files in one at a time. This of course meant reloading the very large model each time the script was invoked.

$ for D in ../corpus/docs/9/*body; do argot.py -d $D -o ~/out; done

real	44m12.313s
user	40m35.763s
sys	3m21.840s

The revision handles the OS work within the code and loads the model only once.  The results were astonishing.

$ argot.py -p ../corpus/docs/9/ -o ~/out

real 0m10.558s
user 0m10.272s
sys 0m0.265s

This is the same set of documents:

$ ls -1 ../corpus/docs/9/*body | wc -l
    2095

We went from 44m12s to 10s. These details are critical to our operations, and they combine the development and engineering mindsets to get the best results.
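The shape of the revision is simple: parse a directory, load the model once, and loop. A sketch of the pattern in Python (the model-loading calls are hypothetical stand-ins; argot.py’s real internals differ):

import argparse
import glob
import os

# load_model / identify_language are made-up names standing in for the real
# language identification library; loading the model is the expensive step.
from langid_model import load_model, identify_language

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--path", required=True, help="directory of *body files")
    parser.add_argument("-o", "--out", required=True, help="output directory")
    args = parser.parse_args()

    model = load_model()  # done once per process, not once per document

    for doc in glob.glob(os.path.join(args.path, "*body")):
        lang = identify_language(model, doc)
        with open(os.path.join(args.out, os.path.basename(doc) + ".lang"), "w") as fh:
            fh.write(lang + "\n")

if __name__ == "__main__":
    main()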

Does anyone like queues?

We have to deal with lots of documents. When customers give us one document, we may end up creating two or three different versions. That is a quick multiplier: one million documents can become three million documents easily. Since we own our colocation hardware, dynamic scaling like AWS is only partly real to us. We have to build a useful capacity and then decide if that level serves the business well enough. Build it too big and a lot of CapEx is wasted. Build it too small and it won’t survive the demands. The right size for us is often not the high-water mark of demand; it’s somewhere beneath that.

What happens when demand exceeds the capacity? A familiar queue steps in to buffer the work. As soon as you put in a queue, you confront issues of how the queue should run: FIFO (first in, first out), priorities, equality, or any other ideas. With many customers consuming the same infrastructure, there’s not a single answer that works well for queues. FIFO breaks down if Customer A has four million docs and Customer B has only 100 documents. Customer B has to wait behind Customer A until their job completes.

One of our home-grown queueing systems is backed by MySQL. Jobs come into a pretty simple table. Our first approach was to add an ordering table. We can manipulate the ordering table and combine it with the jobs table to sort the work. This would look something like:

SELECT a.qid FROM view_queue a, queue_job_size b WHERE a.jobid=b.jobid and a.block = 0 order by b.cnt asc LIMIT 1

At this point, we had a decent working countermeasure with the queue_job_size table helping out with ordering. Customer B wouldn’t be living in the shadow of Customer A.

Then the circumstances changed again. At one point we had too many small jobs and we weren’t getting around to the large jobs. Our system favored anything that looked like Customer B and Customer A was at the bottom of the list. Ugh.

We added an override table. Through a web interface, a technician could signal to the queue that one job they knew about needed more attention. The signal was captured in the queue_job_priority table:

+--------+----------+------------+--------+
| jobid  | priority | ts_update  | active |
+--------+----------+------------+--------+
|  12345 |     1000 | 1394030636 |      0 |
| 473435 |    10000 | 1400627124 |      0 |
| 477280 |    10000 | 1401408608 |      0 |
| 482175 |      500 | 1403140692 |      0 |
| 484328 |      500 | 1403140692 |      0 |
| 484466 |      500 | 1403140692 |      0 |
| 485264 |    10000 | 1403192993 |      0 |
+--------+----------+------------+--------+

Now we could alter the ordering table and take into account special circumstances. Updating the table was accomplished through cron.

update queue_job_size a, queue_job_priority b set a.cnt=b.priority where a.jobid=b.jobid and b.active=1;

This countermeasure was released and we felt pretty good that we could work on the small jobs, get around to the big jobs, and then pay attention to special exceptions.

Except that it didn’t work well when people were sleeping. Customers can load data into our platform without anyone being around to help, and they do just that. If a customer loaded a large job and no one was around to prioritize it, we ran into the same problems as before.

We devised a slightly different strategy. On even hours of the day, we manipulate the ordering table to favor the smallest jobs. On odd hours of the day, we manipulate the ordering table to favor the oldest jobs.
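Mechanically this is just two alternating cron entries that rewrite the ordering table (the schedule syntax is real; the script name and flags are hypothetical):

# minute 0 of every even hour: favor the smallest jobs
0 0-22/2 * * *  /usr/local/bin/reorder_queue --by-size
# minute 0 of every odd hour: favor the oldest jobs
0 1-23/2 * * *  /usr/local/bin/reorder_queue --by-age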

Here’s the ordering table on an even hour:

+--------+------+------------+
| jobid  | cnt  | ts_update  |
+--------+------+------------+
| 499925 |  278 | 1406227561 |
| 499913 |  413 | 1406227561 |
| 499915 |  434 | 1406227561 |
| 499939 |  450 | 1406227561 |
| 499973 |  660 | 1406227561 |
| 499923 |  677 | 1406227561 |
| 499927 |  848 | 1406227561 |
| 499933 |  878 | 1406227561 |
| 499931 | 1023 | 1406227561 |
| 499910 | 1153 | 1406227561 |
| 497980 |  100 | 1406215802 |
| 498048 |  100 | 1406216187 |
+--------+------+------------+

Here’s the same ordering table on an odd hour, with the oldest work prioritized:

+--------+-------+------------+
| jobid  | cnt   | ts_update  |
+--------+-------+------------+
| 498048 | 10000 | 1406228521 |
| 498106 | 10005 | 1406228521 |
| 498113 | 10010 | 1406228521 |
| 498154 | 10015 | 1406228521 |
| 498228 | 10020 | 1406228521 |
| 498237 | 10025 | 1406228521 |
| 498293 | 10030 | 1406228521 |
| 498339 | 10035 | 1406228521 |
| 498346 | 10040 | 1406228521 |
| 498349 | 10045 | 1406228521 |
| 497980 | 10000 | 1406215802 |
+--------+-------+------------+

The ordering table still respects the exception driven priorities.

Will queues ever please anyone? I suspect not. This current system is an attempt to thread the needle and please most people.