Development and Engineering

As a smaller company, we all wear many different hats. Over the last couple of years, I’ve finally come to understand at least two of the mindsets within our development team, which are almost like two different hats. I call them development and engineering, and here’s how I see the two coming together during this project.

We wanted to move from a complicated scale-out process to something simpler. Currently we have a language identification tool supplied as a feature of a bigger data manager. The language identification is in essence a library that is not separable from the data manager, so getting the language of a document requires interaction with the data manager. If we wanted to make this simpler, we had to ask a question: is it possible to identify language outside of the data manager? Of course the answer is “surely, it must be possible,” but as a small business we can’t just know it’s possible somewhere out in the world. It has to be possible with our skill sets, and it has to be something we can afford to spend the time on to make work.

This first task is what we like to call feasibility. It combines a little bit of engineering and a little bit of development to search for answers. The engineering mindset has to look for something that would work here, with our platform and our constraints. The developer mindset has to know the code can be adapted to our needs, whether through simple process control or through modification to fit the way things work now. The results of this search are often captured in a document format based on SCQA that we call the SCPA. Our difference is that we feel the question is often self-evident, so we present a proposal for the situation and complication, along with alternatives if we can find them.

If we move forward with the SCPA, the developer’s next job is to get something working.  This often acts like a proof of concept.  Our newer developers always feel very proud when they complete this stage.  This is often what the idealized version of development looks like in our dreams.

If the results of getting something working confirm the assertions in the SCPA, we’ll take the next step, which returns the developer to the engineering mindset. They have to start thinking about operational issues. How fast is it? How big is it? What’s likely to break? How will we know if it is working correctly? We can answer many of those questions through functional and unit testing, but the rest are hard to answer without running the software in the real world, or in something that comes very close to approximating it (which is way harder than it sounds).

While working on the language identifier, I was happy to see it working. I worked with the Product Manager to complete acceptance testing: in this case, did it give us the right answers? We set up a list of over 20K documents whose language we already knew and began the testing. The tests seemed to be taking a long time, but I had plenty of other things to do, so it was fine to let them run. I bundled up the results and passed them to the Product Manager so the analysis could begin, but I had to go back and address the speed issue. Again, the engineering and development mindsets have to work in concert. I collaborated with my team and collected a lot of good advice, including doing the smallest thing possible to reach acceptable speed before drifting toward thread management and process spawning. We all agreed that loading the language model for every document was a lot of overhead; if we could load it once and run many documents through the same process, we should see an appreciable improvement. If that wasn’t enough, we would go back and revisit threading or multiprocessing options.

The original code was invoked by letting the shell pass the files in one at a time, which of course meant reloading the very large model every time the script was invoked.

$ for D in ../corpus/docs/9/*body; do argot.py -d $D -o ~/out; done

real	44m12.313s
user	40m35.763s
sys	3m21.840s

The revision handles the directory traversal inside the code and loads the model only once. The results were astonishing.

$ argot.py -p ../corpus/docs/9/ -o ~/out

real 0m10.558s
user 0m10.272s
sys 0m0.265s

This is the same set of documents:

$ ls -1 ../corpus/docs/9/*body | wc -l
    2095

We went from 44m12s to 10s. These details are critical to our operations, and it takes both the development and engineering mindsets to get the best results.
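
The shape of the fix is worth a sketch. This is not the real argot.py; load_model and identify are stand-ins for the actual model loading and classification calls, and the command-line handling is simplified to positional arguments. The point is only that the expensive load happens once, outside the per-document loop.

import sys
from pathlib import Path

def load_model():
    # Stand-in for the expensive step: reading the large language
    # model from disk. In the original loop this happened per document.
    return object()

def identify(model, text):
    # Stand-in for the per-document language identification call.
    return "unknown"

def main(corpus_dir, out_dir):
    model = load_model()  # paid once per run, not once per document
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc in sorted(Path(corpus_dir).glob("*body")):
        lang = identify(model, doc.read_text(errors="ignore"))
        (out / (doc.name + ".lang")).write_text(lang + "\n")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])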

Milestones

This week, we crossed an interesting milestone in operations: the creation of our 500,000th ticket. A long time ago, we ran things by email. Moving to a ticket-based system for tracking work was not trivial. What was a ticket? An email seemed easy to understand: whatever was in the email was in the email. But what should a ticket be? We decided to go with the loose idea that a ticket tracked a unit of work, and we didn’t make the definition any more specific than that.

After our people got used to using tickets, we started hooking our software up to the ticketing system. We integrated monitoring first, and then came status updates from various software jobs. We stumbled here a bit because a ticket is not the same as a log from a process. If we treated the ticket as a log, we could end up with tickets holding 20K entries. That wasn’t making the tickets more useful, just noisier.

So we came up with a different idea: the ticket tool. The ticket tool is a very simple PHP application that accepts a ticket number, a task, and a note, and appends them to a text file. It was written a long time ago, so it does things we probably wouldn’t do now, such as returning status codes in the HTML body instead of using status codes in the HTTP header. It’s also old enough to have been started in CVS. (Redacted source at the end of this post.)

With the invention of Ticket Tool, the view of the ticket changed subtly. Instead of being the place to track the details of a unit of work, the ticket became the hub for finding all of the details. The secret was simply recording links to Ticket Tool entries inside the ticket. Now it’s not uncommon for our tickets to have five to ten different tools recording details in Ticket Tool and posting links back to the ticket.

Capturing events and details

We have ticket integration baked in everywhere in operations. We have hooks for mail, bash, Python, Windows, and probably everything else too. We use the ticketing system’s API, but we have also written our own layer that does more than the original API. We have a system that extracts the records from the ticket database and converts them to XML for loading into a full-text system, which gives us powerful searching of the ticket history. Our use of tickets will likely continue to grow.
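
To give a feel for what one of those hooks can look like, here’s a rough sketch of a Python client for the ticket tool. The endpoint URL and field names are assumptions for illustration, not the tool’s real interface.

import urllib.parse
import urllib.request

# Hypothetical endpoint and field names; the real ticket tool differs.
TICKET_TOOL_URL = "http://tickettool.internal/append.php"

def ticket_note(ticket, task, note):
    data = urllib.parse.urlencode(
        {"ticket": ticket, "task": task, "note": note}
    ).encode()
    req = urllib.request.Request(TICKET_TOOL_URL, data=data)
    with urllib.request.urlopen(req) as resp:
        # The tool reports status in the HTML body rather than the
        # HTTP header, so the caller inspects the body text.
        return resp.read().decode()

# Example: a job records a step, then posts the Ticket Tool link in the ticket.
# ticket_note("500000", "nightly-load", "load complete")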

Here’s the monthly count of tickets created since we started.

[Chart: monthly ticket count]

Our ticketing system is UserScape’s HelpSpot. We’ve had great success with Ian and his team.

Ticket Tool Source

Does anyone like queues?

We have to deal with lots of documents. When customers give us one document, we may end up creating two or three different versions of it. That’s a quick multiplier: one million documents can easily become three million. Since we own our colocation hardware, dynamic scaling in the AWS style is only partly real for us. We have to build a useful capacity and then decide whether that level serves the business well enough. Build it too big and a lot of CapEx is wasted. Build it too small and it won’t survive the demands. The right size for us is often not the high-water mark of demand; it’s somewhere beneath that.

What happens when demand exceeds capacity? A familiar queue steps in to buffer the work. As soon as you put in a queue, you confront questions of how it should run: FIFO (first in, first out), priorities, equality, or any number of other schemes. With many customers consuming the same infrastructure, there’s no single answer that works well. FIFO breaks down if Customer A has four million documents and Customer B has only 100; Customer B has to wait behind Customer A until that whole job completes.

One of our home-grown queueing systems is backed by MySQL. Jobs come into a pretty simple table. Our first approach was to add an ordering table that we could manipulate and join against the jobs table to sort the work. That looks something like:

SELECT a.qid FROM view_queue a, queue_job_size b WHERE a.jobid = b.jobid AND a.block = 0 ORDER BY b.cnt ASC LIMIT 1

At this point, we had a decent working countermeasure, with the queue_job_size table helping out with ordering. Customer B wouldn’t be living in the shadow of Customer A.
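
To show how that query gets used, here’s a rough sketch of a worker claiming the next unit of work, assuming the pymysql driver and the tables above. The claim step is simplified; a real worker needs locking so two workers can’t grab the same qid.

import pymysql

# The ordering query from above: smallest job first, skipping blocked rows.
NEXT_QID = """
SELECT a.qid FROM view_queue a, queue_job_size b
WHERE a.jobid = b.jobid AND a.block = 0
ORDER BY b.cnt ASC LIMIT 1
"""

def claim_next(conn):
    with conn.cursor() as cur:
        cur.execute(NEXT_QID)
        row = cur.fetchone()
        if row is None:
            return None  # queue is empty
        # Mark the row so other workers skip it (simplified, not race-free).
        cur.execute("UPDATE view_queue SET block = 1 WHERE qid = %s", row)
    conn.commit()
    return row[0]

if __name__ == "__main__":
    conn = pymysql.connect(host="db", user="queue", password="...", database="queue")
    print(claim_next(conn))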

Then the circumstances changed again. At one point we had too many small jobs and we weren’t getting around to the large ones. Our system favored anything that looked like Customer B, and Customer A sat at the bottom of the list. Ugh.

We added an override table. Through a web interface, a technician could signal to the queue that a job they knew about needed more attention. The signal was captured in the queue_job_priority table:

+--------+----------+------------+--------+
| jobid  | priority | ts_update  | active |
+--------+----------+------------+--------+
|  12345 |     1000 | 1394030636 |      0 |
| 473435 |    10000 | 1400627124 |      0 |
| 477280 |    10000 | 1401408608 |      0 |
| 482175 |      500 | 1403140692 |      0 |
| 484328 |      500 | 1403140692 |      0 |
| 484466 |      500 | 1403140692 |      0 |
| 485264 |    10000 | 1403192993 |      0 |
+--------+----------+------------+--------+

Now we could alter the ordering table to take special circumstances into account. Updating the table was handled by cron:

UPDATE queue_job_size a, queue_job_priority b SET a.cnt = b.priority WHERE a.jobid = b.jobid AND b.active = 1;

This countermeasure was released, and we felt pretty good: we could work on the small jobs, get around to the big jobs, and still pay attention to special exceptions.

Except that it didn’t work well while people were sleeping. Customers can load data into our platform without anyone being around to help, and they do just that. If a customer loaded a large job and no one was around to prioritize it, we ran into the same problems as before.

So we devised a slightly different strategy. On even hours of the day, we rebuild the ordering table to favor the smallest jobs. On odd hours, we rebuild it to favor the oldest jobs. (A sketch of the cron job follows the tables below.)

Here’s the ordering table on an even hour:

+--------+------+------------+
| jobid  | cnt  | ts_update  |
+--------+------+------------+
| 499925 |  278 | 1406227561 |
| 499913 |  413 | 1406227561 |
| 499915 |  434 | 1406227561 |
| 499939 |  450 | 1406227561 |
| 499973 |  660 | 1406227561 |
| 499923 |  677 | 1406227561 |
| 499927 |  848 | 1406227561 |
| 499933 |  878 | 1406227561 |
| 499931 | 1023 | 1406227561 |
| 499910 | 1153 | 1406227561 |
| 497980 |  100 | 1406215802 |
| 498048 |  100 | 1406216187 |
+--------+------+------------+

Here’s the same ordering table on an odd hour, with the oldest work prioritized:

+--------+-------+------------+
| jobid  | cnt   | ts_update  |
+--------+-------+------------+
| 498048 | 10000 | 1406228521 |
| 498106 | 10005 | 1406228521 |
| 498113 | 10010 | 1406228521 |
| 498154 | 10015 | 1406228521 |
| 498228 | 10020 | 1406228521 |
| 498237 | 10025 | 1406228521 |
| 498293 | 10030 | 1406228521 |
| 498339 | 10035 | 1406228521 |
| 498346 | 10040 | 1406228521 |
| 498349 | 10045 | 1406228521 |
| 497980 | 10000 | 1406215802 |
+--------+-------+------------+

The ordering table still respects the exception-driven priorities.
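
The hourly flip itself is just a cron job that rewrites queue_job_size one way or the other and then reapplies the overrides. Here’s a rough sketch, again assuming pymysql; the jobs table and its doc_count and age_rank columns are hypothetical stand-ins for however job size and age actually get computed.

import datetime
import pymysql

# Hypothetical rebuild statements: even hours order by size, odd hours by age.
ORDER_BY_SIZE = """
UPDATE queue_job_size a, jobs b
SET a.cnt = b.doc_count, a.ts_update = UNIX_TIMESTAMP()
WHERE a.jobid = b.jobid
"""
ORDER_BY_AGE = """
UPDATE queue_job_size a, jobs b
SET a.cnt = 10000 + b.age_rank * 5, a.ts_update = UNIX_TIMESTAMP()
WHERE a.jobid = b.jobid
"""
# Same statement as the cron job above: exceptions win regardless of the hour.
APPLY_OVERRIDES = """
UPDATE queue_job_size a, queue_job_priority b
SET a.cnt = b.priority
WHERE a.jobid = b.jobid AND b.active = 1
"""

def rebuild(conn):
    hour = datetime.datetime.now().hour
    sql = ORDER_BY_SIZE if hour % 2 == 0 else ORDER_BY_AGE
    with conn.cursor() as cur:
        cur.execute(sql)
        cur.execute(APPLY_OVERRIDES)
    conn.commit()

if __name__ == "__main__":
    conn = pymysql.connect(host="db", user="queue", password="...", database="queue")
    rebuild(conn)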

Will queues ever please anyone? I suspect not. The current system is an attempt to thread the needle and please most people.