Does anyone like queues?

We have to deal with lots of documents. When customers give us one document, we may end up creating two or three different versions. That is a quick multiplier, one million documents can become three million documents easily. Since we own our colocation hardware, dynamic scaling like AWS is only partly real to us. We have to build a useful capacity and then decide if that level services the business well enough. Build it too big and it a lot of CapEx is wasted. Build it too small and it won’t survive the demands. The right size for us is often not the high watermark of demand, it’s somewhere beneath that.

What happens when demand exceeds the capacity? A familiar queue steps in place to buffer the work. As soon as you put in a queue, you confront issues of how the queue should run: FIFO (first in, first out), priorities, equality, or any other ideas. With many customers consuming the same infrastructure, there’s not a single answer that works well for queues. FIFO breaks down if Customer A has four million docs and and Customer B has only 100 documents. Customer B has to wait behind Customer A until their job completes.

One of our home grown queueing systems is backed with MySQL. Jobs come into a pretty simple table. Our first approach was to add an ordering table. We can manipulate the ordering table and combine it with the jobs table to sort the work. This would look something like:

SELECT a.qid FROM view_queue a, queue_job_size b WHERE a.jobid=b.jobid and a.block = 0 order by b.cnt asc LIMIT 1

At this point, we had a decent working countermeasure with the queue_job_size table helping out with ordering. Customer B wouldn’t be living in the shadow of Customer A.

Then the circumstances changed again. At one point we had too many small jobs and we weren’t getting around to the large jobs. Our system favored anything that looked like Customer B and Customer A was at the bottom of the list. Ugh.

We added an override table. Through a web interface, a technician could signal to the queue that one job they knew about needed more attention. The signal was captured in the queue_job_priority table

+--------+----------+------------+--------+
| jobid  | priority | ts_update  | active |
+--------+----------+------------+--------+
|  12345 |     1000 | 1394030636 |      0 |
| 473435 |    10000 | 1400627124 |      0 |
| 477280 |    10000 | 1401408608 |      0 |
| 482175 |      500 | 1403140692 |      0 |
| 484328 |      500 | 1403140692 |      0 |
| 484466 |      500 | 1403140692 |      0 |
| 485264 |    10000 | 1403192993 |      0 |
+--------+----------+------------+--------+

Now we could alter the ordering table and take into account special circumstances. Updating the table was accomplished through cron.

update queue_job_size a, queue_job_priority b set a.cnt=b.priority where a.jobid=b.jobid and b.active=1;

This countermeasure was released and we felt pretty good that we could work on the small jobs, get around to the big jobs, and then pay attention to special exceptions.

Except that it didn’t work well when people were sleeping. Customers can load data into our platform without any one being around to help, and they do just that. If a customer loaded a large job and no one was around to prioritize it, we ran into the same problems as before.

We devised a slightly different strategy. On even hours of the day, we manipulate the ordering table for the smallest jobs. On odd hours of the day, we manipulate the ordering table to work on the oldest jobs.

Here’s the ordering table on an even hour:

+--------+------+------------+
| jobid  | cnt  | ts_update  |
+--------+------+------------+
| 499925 |  278 | 1406227561 |
| 499913 |  413 | 1406227561 |
| 499915 |  434 | 1406227561 |
| 499939 |  450 | 1406227561 |
| 499973 |  660 | 1406227561 |
| 499923 |  677 | 1406227561 |
| 499927 |  848 | 1406227561 |
| 499933 |  878 | 1406227561 |
| 499931 | 1023 | 1406227561 |
| 499910 | 1153 | 1406227561 |
| 497980 |  100 | 1406215802 |
| 498048 |  100 | 1406216187 |
+--------+------+------------+

Here’s the same ordering table on an odd hour, with the oldest work prioritized

+--------+-------+------------+
| jobid  | cnt   | ts_update  |
+--------+-------+------------+
| 498048 | 10000 | 1406228521 |
| 498106 | 10005 | 1406228521 |
| 498113 | 10010 | 1406228521 |
| 498154 | 10015 | 1406228521 |
| 498228 | 10020 | 1406228521 |
| 498237 | 10025 | 1406228521 |
| 498293 | 10030 | 1406228521 |
| 498339 | 10035 | 1406228521 |
| 498346 | 10040 | 1406228521 |
| 498349 | 10045 | 1406228521 |
| 497980 | 10000 | 1406215802 |
+--------+-------+------------+

The ordering table still respects the exception driven priorities.

Will queues ever please anyone? I suspect not. This current system is an attempt to thread the needle and please most people.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s