So Many Operations

Often, I will reduce a complex problem to a set of abstract computer “ops”. These ops aren’t meant to be an exact description of how the computer or the network would carry out the task, but a logical abstraction. Let me walk through a short example.

How does a file get to repono (storage)?

Ignoring all of the steps that got us to the point of putting the file in Repono, here’s how I would think of the operational cost of Repono (as of June 2016):

  • pfiles calls stork, and stork pushes a job into ins-queue for every file, operational cost is 1*n
  • ins-queue records the job in a queue table, operational cost is 1*n
  • r2 asks for work from ins-queue and marks it WIP, operational cost is 1*n
  • r2 commits the file to repono, operational cost is 1*n
  • repono writes to disk, operational cost is 2*n (the file plus its replica)
  • r2 logs the outcome to logserver, operational cost is 1*n
  • r2 asks to remove the record from ins-queue, operational cost is 1*n

So, if we have 1,000,000 documents, this overview comes to 9,000,000 operations.
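The tally can be written down as a small Python sketch. The step labels and function names are mine, chosen for illustration; only the per-step costs come from the list. Note that the costs exactly as listed add up to 8 per file, one short of the 9n used in the text, so one of the steps is presumably counted as two operations there (that reading is an assumption, not something the list states).

```python
# Per-file cost of each step, copied from the list above (abstract "ops").
# The labels are shorthand for illustration only.
STEPS = [
    ("pfiles/stork push a job into ins-queue", 1),
    ("ins-queue records the job in a queue table", 1),
    ("r2 asks for work and marks it WIP", 1),
    ("r2 commits the file to repono", 1),
    ("repono writes to disk (with replica)", 2),
    ("r2 logs the outcome to logserver", 1),
    ("r2 asks to remove the record from ins-queue", 1),
]


def ops_per_file(steps=STEPS):
    """Sum the per-file cost over every step in the pipeline."""
    return sum(cost for _, cost in steps)


def total_ops(n_files, per_file):
    """Scale a per-file cost to a whole batch of files."""
    return n_files * per_file


if __name__ == "__main__":
    per_file = ops_per_file()
    print(f"listed steps: {per_file} ops per file")
    print(f"1,000,000 files at {per_file}n: {total_ops(1_000_000, per_file):,}")
    print(f"1,000,000 files at 9n: {total_ops(1_000_000, 9):,}")
```

Keeping the steps in a list makes it cheap to add or re-weight a step as more of the detail comes into view.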

But there is much more in the details…

What’s going on in ins-queue?

ins-queue was developed in house. That could make it well tuned for our needs, or it could make it one more tool where we have to keep up with every operational aspect ourselves. Your perspective will likely determine how you see this sort of software. Regardless, here are some of the current stats of ins-queue (June 2016):

 | id |       tool |       cnt |
 | 44 |    roz_kvp |     62649 |
 | 43 |    roz_mdb |     62839 |
 | 1  |        roz |     62931 |
 | 57 | roz_merged |     79696 |
 | 5  |     grimes |    364235 |
 | 6  |     pdfopt |   3825390 |
 | 3  |   smithers |  10607081 |
 | 4  |    s3queue | 667402231 |

Yes, that is 667M documents that have flowed through ins-queue (s3queue) and (mostly) on to Repono. That also means we have done nearly 11M data overlays (smithers) and have optimized 3.8M PDFs (pdfopt).

But there’s a little more… how do we know how many records there are? Every time we take a record out of s3queue, we ask to increment this counter. That’s one additional operation. If we were at 9n before, we’re now at 10n. But the chained events that update the counters also update another logging effort that keeps track of activity in deletion mode. Our new count is 11n. So for the 667M documents listed here, we’ve taken 7.3B (yes, billion) operations to store those documents in Repono (this doesn’t count optimizing, scanning, inventory, or even a full drill down on all of the logging). If we went to a granular level, I suspect that the act of storing a document with logging is probably closer to 20n.
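To make the arithmetic concrete, here is a short Python sketch that applies the multipliers quoted above to the s3queue count from the table. The function names and the labels are illustrative; the 20n entry is only the guess from the text, not a measurement.

```python
S3QUEUE_DOCS = 667_402_231  # s3queue count from the stats table (June 2016)

# Per-document multipliers as quoted in the text.
MULTIPLIERS = {
    "overview": 9,
    "plus counter increment": 10,
    "plus deletion-mode activity log": 11,
    "granular guess": 20,
}


def estimate_ops(documents, per_document):
    """Total abstract operations for a number of documents."""
    return documents * per_document


def humanize(count):
    """Format a count with one decimal and a K/M/B suffix."""
    for threshold, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if count >= threshold:
            return f"{count / threshold:.1f}{suffix}"
    return str(count)


if __name__ == "__main__":
    for label, k in MULTIPLIERS.items():
        ops = estimate_ops(S3QUEUE_DOCS, k)
        print(f"{label:32s} {k:>2}n = {humanize(ops)}")
```

At 11n this comes to 7,341,424,541 operations, which is the 7.3B figure; at the guessed 20n it would be a little over 13B.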

What’s going on in logserver?

We have a generic REST-based logging tool called logserver, or Ticket Tool. It is very simple. It was discussed previously and the source is here.

As adoption has increased, so has our desire to create more detailed logs. This platform is one of the busiest in our operations (although that is a crowded field). It can fill quickly, and the long-term value of the information decays pretty quickly. So, to save storage, we have a cron job that takes the older files and zips them up in place. A custom error handler will find the compressed file if you have a link to the original file. That goes pretty well until we look at our replication hosts.
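A minimal Python sketch of that housekeeping, under stated assumptions: it uses gzip rather than zip, a hypothetical 30-day cutoff, and a `.gz` suffix appended to the original name. The names `compress_old_logs` and `resolve_log` are made up for illustration; this is not the actual cron job or error handler.

```python
import gzip
import shutil
import time
from pathlib import Path


def compress_old_logs(log_dir, max_age_days=30, now=None):
    """Gzip files older than max_age_days in place and remove the originals."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    compressed = []
    # sorted() materializes the listing first, so files created below
    # are not picked up by the same pass.
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file() or path.suffix == ".gz":
            continue
        if path.stat().st_mtime >= cutoff:
            continue  # still recent, leave it alone
        target = path.with_name(path.name + ".gz")
        with open(path, "rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()
        compressed.append(target)
    return compressed


def resolve_log(path):
    """Stand-in for the error handler: fall back to the compressed copy."""
    path = Path(path)
    if path.exists():
        return path
    candidate = path.with_name(path.name + ".gz")
    return candidate if candidate.exists() else None
```

The point of the fallback is that existing links to the original file name keep working after the file has been compressed.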

Yesterday, it went into alarm.  So much data had accumulated.


Our monitoring has thresholds on operational limits.

The comfortable, generic sync of all data was never updated to reflect how we had reacted to the broader adoption of the logging service. As a result, the replication hosts held both the original file and the zipped file. After removing the original files, we recovered over 60% of this resource.
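The cleanup amounts to finding every original that still sits next to its compressed copy. A hedged Python sketch, assuming the compressed copy is the original name plus a `.gz` suffix (the actual naming and tooling may differ, and both function names are hypothetical):

```python
from pathlib import Path


def find_redundant_originals(root, suffix=".gz"):
    """Return originals that still exist alongside their compressed copy."""
    redundant = []
    for compressed in Path(root).rglob("*" + suffix):
        if compressed.name == suffix:
            continue  # nothing left once the suffix is stripped
        original = compressed.with_name(compressed.name[: -len(suffix)])
        if original.is_file():
            redundant.append(original)
    return sorted(redundant)


def reclaim(root, dry_run=True):
    """Report, and optionally delete, redundant originals; return bytes freed."""
    freed = 0
    for original in find_redundant_originals(root):
        freed += original.stat().st_size
        if not dry_run:
            original.unlink()
    return freed
```

Running with `dry_run=True` first gives the number of bytes that would come back before anything is deleted.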

All of this is happening behind the scenes and adds to the operational costs, choices, and scale of working in a distributed platform.