So Many Operations

Often, I will reduce a complex problem to a set of abstract computer “ops”. These ops aren’t meant to be an exact description of how the computer or the network would carry out the task, but a logical abstraction. Let me walk through a short example.

How does a file get to Repono (storage)?

Ignoring all of the steps that got us to the point of putting the file in Repono, here’s how I would think of the operational cost of Repono (as of June 2016):

  • pfiles calls stork, and stork pushes a job into ins-queue for every file, operational cost is 2*n (one call plus one push)
  • ins-queue records the job in a queue table, operational cost is 1*n
  • r2 asks for work from ins-queue and marks it WIP, operational cost is 1*n
  • r2 commits the file to repono, operational cost is 1*n
  • repono writes to disk, operational cost is 2*n (primary copy plus replica)
  • r2 logs the outcome to logserver, operational cost is 1*n
  • r2 asks to remove the record from ins-queue, operational cost is 1*n

As such, if we have 1,000,000 documents, this overview alone accounts for 9,000,000 operations (9n).
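To make the bookkeeping concrete, here is the same tally as a trivial shell calculation (the grouping in the comments is mine):

 # Ops per document, from the list above: stork call+push (2) +
 # queue insert (1) + WIP mark (1) + commit (1) + disk incl. replica (2) +
 # log (1) + dequeue (1) = 9
 n=1000000
 echo $((9 * n))   # => 9000000 operations for a million documents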

But there is much more in the details…

What’s going on in ins-queue?

ins-queue was developed in house. That could make it well tuned to our needs, or it could make it just another tool whose every operational aspect we have to keep up with ourselves. Your perspective will likely determine how you see this sort of software. Regardless, here are some current stats from ins-queue (June 2016):

 +----+------------+-----------+
 | id |       tool |       cnt |
 +----+------------+-----------+
 | 44 |    roz_kvp |     62649 |
 | 43 |    roz_mdb |     62839 |
 | 1  |        roz |     62931 |
 | 57 | roz_merged |     79696 |
 | 5  |     grimes |    364235 |
 | 6  |     pdfopt |   3825390 |
 | 3  |   smithers |  10607081 |
 | 4  |    s3queue | 667402231 |
 +----+------------+-----------+

Yes, that is 667M documents that have flowed through ins-queue (s3queue) and (mostly) on to Repono. That also means we have done nearly 11M data overlays (smithers) and have optimized 3.8M PDFs (pdfopt). But there’s a little more: how do we know how many records we’ve handled? Every time we take a record out of s3queue, we ask to increment this counter. That’s one additional operation; if we were at 9n before, we’re now at 10n. The chained events that update the counters also update another logging effort that keeps track of activity in deletion mode, so our new count is 11n. That means that for the 667M documents listed here, we have taken roughly 7.3B (yes, billion) operations to store those documents in Repono, and that doesn’t count optimizing, scanning, inventory, or a full drill-down on all of the logging. At a truly granular level, I suspect the act of storing a document with logging is probably 20n.
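The headline figure is easy to sanity-check in the shell:

 # 11 operations per document across the 667,402,231 documents in s3queue
 echo $((11 * 667402231))   # => 7341424541, roughly 7.3 billion operations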

What’s going on in logserver?

We have a REST-based generic logging tool called logserver, or Ticket Tool. It is very simple. It was discussed previously, and the source is here.

As adoption has increased, so has our desire to create more detailed logs. This platform is one of the busiest in our operations (although that is a crowded field of busy platforms). It can fill up quickly, and the long-term value of the information decays quickly. So, to save storage, we have cron’d some administration to take the older files and zip them up in place (a sketch follows). A custom error handler will find the compressed file if you have a link to the original file.
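The administration itself can be as small as a cron entry; a sketch of the idea, with a made-up path and a made-up 30-day window:

 # crontab: at 03:00, compress logserver files older than 30 days in place
 0 3 * * * find /data/logserver -name '*.log' -mtime +30 -exec gzip {} \;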

That was going pretty well until we looked at our replication hosts. Yesterday, they went into alarm; so much data had accumulated.

banshee, our monitoring system, has thresholds on operational limits.

The comfortable and generic sync-everything replication was never updated to reflect the compression job that came with broader adoption of the logging service. As a result, the replication hosts carried both the original file and the zipped copy. After removing the original files (with a one-liner of the sort below), we recovered over 60% of this resource.
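The fix reduces to deleting every original that already has a compressed twin. A sketch, again with an assumed path:

 # Remove any .log that already has a .log.gz sitting next to it.
 find /data/logserver -name '*.log' -print0 |
   while IFS= read -r -d '' f; do
     [ -e "$f.gz" ] && rm -- "$f"
   done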

All of this is happening behind the scenes and adds to the operational costs, choices, and scale of working in a distributed platform.

The Not So Small Stuff

Part of our application retrieves files from an internal storage system that behaves similarly to AWS S3. The system, named Repono, has four heads and is fronted by a network load balancer. A client of the storage system first retrieves an authentication ticket from our vending system; Repono then does an authorization lookup, and if everything is OK, the resource (aka “the file”) is handed out over HTTPS. Normally, the performance of this system is really fantastic and looks like this chart:

Response time as recorded by our internal monitoring system.
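For a sense of the exchange, here is a rough curl re-creation of a ticketed fetch. The x-auth-* header names come from the trace later in this post; the host, path, and ticket are stand-ins:

 # Hypothetical ticketed Repono fetch over HTTPS
 curl -sS \
   -H 'x-auth-user: svc-middleware' \
   -H 'x-auth-expiry: 2015-08-16T01:22:43Z' \
   -H 'x-auth-ticket: TICKET-FROM-VENDING-SYSTEM' \
   -H 'Accept: text/xml' \
   -o file.pdf \
   'https://repono.internal/v1/repono/ACCOUNT/CONTAINER/n/1/6/DOC'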

About a month ago, the performance of this system really changed. We went from routine performance under 1 second to sporadic responses clocking in over 10 seconds.

Uh oh! A different monitoring system, using ELK, measuring the response time of our middleware to Repono.

This problem did not reveal itself easily. The major client of Repono had recently been upgraded to a newer version of Microsoft .NET, so one step we took was to revert the change and downgrade the version of .NET. For around 20 to 30 minutes, the performance looked good, and then it reverted to the lumpy 10-second behavior.

What else could have changed? This was not easy to see. The HTTPS traffic did not allow easy inspection, and alternative testing with our go-to baseline client, cURL, could not reproduce the results. The problem was clearly affecting production, and we couldn’t easily swap out our .NET middleware for a set of shell-based testing scripts built on cURL.

We needed to see inside the HTTP and TCP conversation of the .NET HTTP client, but the normal amount of debugging and logging did not record the details.
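What finally gave us that detail is .NET’s built-in System.Net tracing, switched on in the client’s app.config. A minimal sketch (the listener name and log path here are our own choices):

 <system.diagnostics>
   <sources>
     <source name="System.Net" tracemode="includehex" maxdatasize="1024">
       <listeners>
         <add name="SystemNetLog"/>
       </listeners>
     </source>
   </sources>
   <switches>
     <add name="System.Net" value="Verbose"/>
   </switches>
   <sharedListeners>
     <add name="SystemNetLog"
          type="System.Diagnostics.TextWriterTraceListener"
          initializeData="system.net.log"/>
   </sharedListeners>
   <trace autoflush="true"/>
 </system.diagnostics>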

Thanks to the big’ol Internet, we stumbled on this helpful article from Mike Hadlow that got us to that level of information out of .NET and the HTTP client. The only problem is that this logging doesn’t include time stamps, and there was too much traffic to keep up with and untangle from the audit logs. By adjusting the load balancer, we could force most of the traffic through the software patched with the new logging. This greatly reduced the volume of data to sift, and we tried to watch in real time. This is what we saw:

 1 System.Net Verbose: 0 : [5156] ConnectStream#43140910::Close()
 2 System.Net Verbose: 0 : [5156] Exiting ConnectStream#43140910::Close()
 3 System.Net Verbose: 0 : [5156] Exiting MyWebClient#1109663::DownloadData() -> Byte[]#50053605
 4 System.Net Verbose: 0 : [5156] MyWebClientWithHead#30034512::DownloadData(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 5 System.Net Verbose: 0 : [5156] MyWebClientWithHead#30034512::DownloadData(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 6 System.Net Verbose: 0 : [5156] WebRequest::Create(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED)
 7 System.Net Verbose: 0 : [5156] HttpWebRequest#45035759::HttpWebRequest(https://REDACTED/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED#1553812164)
 8 System.Net Verbose: 0 : [5156] Exiting HttpWebRequest#45035759::HttpWebRequest()
 9 System.Net Verbose: 0 : [5156] Exiting WebRequest::Create() -> HttpWebRequest#45035759
 10 System.Net Verbose: 0 : [5156] HttpWebRequest#45035759::GetResponse()
 11 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ServicePoint#17057466
 12 System.Net Information: 0 : [5156] Associating Connection#31630514 with HttpWebRequest#45035759
 13 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#66385004
 14 System.Net Information: 0 : [5156] HttpWebRequest#45035759 - Request: HEAD /v1/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED HTTP/1.1
 
 15 System.Net Information: 0 : [5156] ConnectStream#66385004 - Sending headers
 16 {
 17 x-auth-user: REDACTED
 18 x-auth-expiry: 2015-08-16T01:22:43Z
 19 x-auth-ticket: REDACTED
 20 Accept: text/xml
 21 Content-Type: text/xml
 22 User-Agent: REDACTED/version-2.4501/GetReponoAuthURL/gaggleid:REDACTED
 23 Host: REDACTED
 24 }.
 25 System.Net Error: 0 : [5156] Exception in the HttpWebRequest#45035759:: - The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
 26 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ServicePoint#17057466
 27 System.Net Information: 0 : [5156] Associating Connection#14267127 with HttpWebRequest#45035759
 28 System.Net Information: 0 : [5156] Connection#14267127 - Created connection from REDACTED:56478 to REDACTED:443.
 29 System.Net Information: 0 : [5156] TlsStream#34867337::.ctor(host=REDACTED, #certs=0)
 30 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#62287651
 31 System.Net Information: 0 : [5156] HttpWebRequest#45035759 - Request: HEAD /v1/repono/REDACTED/REDACTED/n/1/6/REDACTED/REDACTED HTTP/1.1
 32 System.Net Information: 0 : [5156] SecureChannel#50704654::.ctor(hostname=REDACTED, #clientCertificates=0, encryptionPolicy=RequireEncryption)
 33 System.Net Information: 0 : [5156] SecureChannel#50704654 - Left with 0 client certificates to choose from.
 34 System.Net Information: 0 : [5156] Using the cached credential handle.
 35 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = (null), targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 36 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffer length=0, Out-Buffer length=159, returned code=ContinueNeeded).
 37 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 38 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 39 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 40 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 41 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 42 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 43 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 44 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=134, returned code=ContinueNeeded).
 45 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 46 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=ContinueNeeded).
 47 System.Net Information: 0 : [5156] InitializeSecurityContext(credential = System.Net.SafeFreeCredential_SECURITY, context = 2496950:2dabe60, targetName = REDACTED, inFlags = ReplayDetect, SequenceDetect, Confidentiality, AllocateMemory, InitManualCredValidation)
 48 System.Net Information: 0 : [5156] InitializeSecurityContext(In-Buffers count=2, Out-Buffer length=0, returned code=OK).
 49 System.Net Information: 0 : [5156] Remote certificate: [Version]
 50 V3
 51 REDACTED
 
 52 [Signature Algorithm]
 53 sha256RSA(1.2.840.113549.1.1.11)
 
 54 [Public Key]
 55 Algorithm: RSA
 56 Length: 2048
 57 Key Blob: REDACTED
 58 System.Net Information: 0 : [5156] ProcessAuthentication(Protocol=Tls, Cipher=Aes128 128 bit strength, Hash=Sha1 160 bit strength, Key Exchange=44550 256 bit strength).
 
 59 System.Net Information: 0 : [5156] ConnectStream#62287651 - Sending headers
 60 {
 61 x-auth-user: REDACTED
 62 x-auth-expiry: 2015-08-16T01:22:43Z
 63 x-auth-salt: REDACTED
 64 x-auth-ticket: REDACTED
 65 x-auth-dev-ticket: REDACTED
 66 Accept: text/xml
 67 Content-Type: text/xml
 68 User-Agent: REDACTED/version-2.4501/GetReponoAuthURL/gaggleid:REDACTED
 69 Host: REDACTED
 70 }.
 71 System.Net Information: 0 : [5156] Connection#14267127 - Received status line: Version=1.1, StatusCode=200, StatusDescription=OK.
 72 System.Net Information: 0 : [5156] Connection#14267127 - Received headers
 73 {
 74 X-Timestamp: 1413849343.99091
 75 X-Preauth-Uri: /repono/3dfc19b30213a7fd7297e1fb32815b95d15e7187a91e13c84b1c423c/REDACTED?token=PREA_49cd6e580ce3460948610e05e1dba031adf5bc19b098a2a98226cf4a
 76 X-Trans-Id: tx72179a9abf4a4d7e8cc65-0055cfe147
 77 Accept-Ranges: bytes
 78 Content-Length: 5378129
 79 Content-Type: application/pdf
 80 Date: Sun, 16 Aug 2015 01:03:03 GMT
 81 ETag: 40fb56bde55a37911d253debfa002005
 82 Last-Modified: Mon, 20 Oct 2014 23:55:44 GMT
 83 }.
 84 System.Net Information: 0 : [5156] ConnectStream#10452251::ConnectStream(Buffered 0 bytes.)
 85 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with ConnectStream#10452251
 86 System.Net Information: 0 : [5156] Associating HttpWebRequest#45035759 with HttpWebResponse#24816868
 87 System.Net Verbose: 0 : [5156] Exiting HttpWebRequest#45035759::GetResponse() -> HttpWebResponse#24816868
 88 System.Net Verbose: 0 : [5156] HttpWebResponse#24816868::GetResponseStream()
 89 System.Net Information: 0 : [5156] ContentLength=5378129
 90 System.Net Verbose: 0 : [5156] Exiting HttpWebResponse#24816868::GetResponseStream() -> ConnectStream#10452251
 91 System.Net Verbose: 0 : [5156] ConnectStream#10452251::Read()
 92 System.Net Verbose: 0 : [5156] Exiting ConnectStream#10452251::Read() -> 0#0
 93 System.Net Verbose: 0 : [5156] ConnectStream#10452251::Close()

If you see it right away, contact me; I would like to hire you onto the team. For those of you who don’t immediately see it, tune your eyes to line 25 of the output.

Here’s what is going on. The .NET client is expecting to reuse the TCP socket, and when it doesn’t get an answer, it waits 10 seconds and then establishes a new HTTPS session. Once we had this isolated, we patched the .NET code to close the connection after retrieving the file and tear down the TCP session (in HttpWebRequest terms, something like setting KeepAlive to false, so each request carries a Connection: close header). There could be a lot of discussion about what sessions and HTTP mean at the TCP level, but we could not adjust the Repono side, and we could adjust our .NET side, so that is the approach we took.
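The distinction is easy to demonstrate with curl, which only reuses a connection within a single invocation (one reason our one-shot curl tests never reproduced the stall). A hypothetical illustration against a stand-in URL:

 URL='https://repono.internal/v1/repono/ACCOUNT/CONTAINER/DOC'   # stand-in

 # Two requests over one kept-alive connection, as the .NET client assumed:
 curl -sS -w '%{time_total}\n' -o /dev/null "$URL" -o /dev/null "$URL"

 # The patched behavior: ask for the connection to close after each response.
 curl -sS -H 'Connection: close' -w '%{time_total}\n' \
      -o /dev/null "$URL" -o /dev/null "$URL"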

This correction is very important, but it doesn’t easily answer what changed. Through a lot more work in the troubleshooting domain, we noticed that the load balancer had changed behavior, specifically around long-running sessions. Patching both the load balancer and our .NET code has put us in a better place.

logstash tuning on AWS

During a recent test in which I put up 125+ AWS instances to do some work, I ran into an issue. All of the instances push their logs via logstash-forwarder to a load-balanced logstash cluster. Things were running fine, but then logging failed. It was nice to learn that a logging failure doesn’t bring the instances to a halt, but the logs are being centrally collected for a reason.

After some digging, I found that the issue was overrunning the memory of the logstash recipients. Basically, the logs were flooding in at 4,500-5,500 events per second, which exceeded what logstash could process. The events piled up and, boom, not enough memory:

Error: Your application used more memory than the safety cap of 4G.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

The logstash instances are running on the c3.xlarge instance type, and I decided to run some tests by throwing 250,000 events at them to see how fast they would be processed. It came in at about 1,800 events per second. That number seemed low, so I started playing around with the logstash settings.

Since we are using the elasticsearch_http output from logstash, I experimented with the number of workers (default 1) for that output plugin. Two was the sweet spot, and I managed to increase the throughput to around 2,100/s. Not a great improvement, but I figured more should be possible.

The c3.xlarge comes with 4 cores and 7.5GB of RAM, but while I was testing, the load stayed very low at around 0.5. Clearly I wasn’t getting the full value out of the box.

Logstash can also adjust the number of filter workers via the -w flag. I figured the filters might be where things were getting stuck, so I re-ran my tests with various combinations of filter workers and elasticsearch_http workers.

I’ll skip all the details, but for the c3.xlarge instance type I ended up reaching an ingestion rate of about 3,500/s, nearly double the original. That rate was achieved by using

  • filter workers = 8
  • elasticsearch_http workers = 4

Changing either of these up or down reduced the overall rate. It also pushed the load to around 3.7+.
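For reference, the winning combination amounts to something like the following (logstash 1.4-era syntax; the Elasticsearch host is a stand-in):

 # indexer.conf, trimmed to the relevant output section
 output {
   elasticsearch_http {
     host    => "es.internal"   # hypothetical Elasticsearch endpoint
     workers => 4               # output plugin workers
   }
 }

and logstash started with eight filter workers on the command line:

 logstash agent -f indexer.conf -w 8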

I think a lot more experimentation could be done with different instance types and counts, but for now I’m pretty happy with the new throughput, which lets me run a lot fewer instances to get to the target rate of 30,000 events/s. For now, I think I have a decent formula to drive any further tests:

  • filter workers = 2 * number of cores
  • elasticsearch_http workers = filter workers / 2

There is still a concern over load balancing the logstash service: it runs as a TCP service, and connections from the forwarders persist. That means that if the right number of forwarding instances, all tied to the same endpoint, start pushing a lot, we might still overrun the box. There are some good ideas around putting a Redis or RabbitMQ layer in between, but that’s an experiment for another day.

@matthias

why you should embrace a rabbitmq client library

Recently I had to re-run a lot of documents into one of our applications. The app lives on AWS, and ingesting content involves a RabbitMQ queue.

I’ve often used amqp-tools (rabbitmq-c) for quick ad-hoc work in the past, so I wrote a very terse bash script (sketched below) to feed the list of documents to the queue. That script worked just fine, but I was in a hurry, and I added quite a few queue clients to get the work done more quickly.
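The script was of roughly this shape (broker URL and routing key are made up). The point to notice is that every message forks a fresh amqp-publish process, each with its own AMQP connection setup and teardown:

 #!/bin/bash
 # Feed one message per line to RabbitMQ via amqp-publish (amqp-tools).
 # One process and one connection per message: the hidden cost.
 while IFS= read -r doc; do
   amqp-publish --url='amqp://user:pass@mq.internal/vhost' \
                --routing-key='ingest' --body="$doc"
 done < documents.txt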

I stalled out in terms of rate, and when I looked a bit more closely, I found that my bash script wasn’t able to keep the queue fed; my clients were going idle.

I also have some Ruby code using the bunny library, so I decided to re-write my feed script with that.

The results were startling.

Pushing 100,000 messages to the queue using the bash approach took about 28 minutes.

The Ruby version, using a RabbitMQ library with a persistent connection, did the same work in 35 seconds!

During a later run, I pushed 1 million messages to RabbitMQ from a single client using the Ruby code. That run took 6.5 minutes, an effective rate of about 2,500 messages per second. The server is running on an r3.large, and with that push and all the clients reading from it, the load rose to only around 1.5. That is a stark contrast to the bash version of the script, during which I would see the load rise to 4+.
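That quoted rate is just the arithmetic:

 # 1,000,000 messages in 6.5 minutes (390 seconds)
 echo $((1000000 / 390))   # => 2564 messages per second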

I didn’t take the time to dig deeply into whether this was due to process spawning in the bash script or to overhead in connection setup/teardown with RabbitMQ. Given the load impact on the RabbitMQ server from the bash script (which ran on a different system), I’m confident that it’s not process spawning, but instead the extra burden on RabbitMQ of dealing with all those connection requests.

In the end, it just speaks to the practicality of using a client library the right way when things are going too slow interacting with RabbitMQ.

@matthias