At a smaller company, we all have jobs that involve wearing many different hats. Over the last couple of years, I’ve finally come to understand at least two of the mindsets within our development team. These two mindsets are almost like different hats. I call them development and engineering, and here’s how I saw the two mindsets come together during this project.
We wanted to move from a complicated scale-out process to something simpler. Currently, our language identification tool is supplied as a feature of a bigger data manager. The language identification is in essence a library that is not separable from the data manager. Thus, getting the language of a document requires interaction with the data manager. To make this simpler, we had to ask a question: is it possible to identify language outside of the data manager? Of course the answer is “surely, it must be possible,” but as a small business, it isn’t enough to know it’s possible somewhere out in the world. It has to be possible with our skill sets, and it has to be something we can afford to spend the time on to make it work.
We like to call this first task feasibility. It combines a little bit of engineering and a little bit of development to search for answers. The engineering mindset has to look for something that would work here, with our platform and our constraints. The developer mindset has to know the code can be manipulated for our needs. This may be through simple process control or through modification to fit the way things work now. The results of this search are often captured in a document format we call the SCPA, which is based on SCQA (situation, complication, question, answer). Our difference is that we feel the question is often self-evident, so we present a proposal for the situation and complication, along with alternatives if we can find them.
If we move forward with the SCPA, the developer’s next job is to get something working. This often acts like a proof of concept. Our newer developers always feel very proud when they complete this stage. This is often what the idealized version of development looks like in our dreams.
If the results of getting something working confirm the assertions in the SCPA, we’ll take the next step which makes the developer return to their engineering mindset. They have to start thinking about operational issues. How fast is it? How big is it? What’s likely to break? How will we know if it is working correctly? We answer many of the questions through functional testing and unit testing, but it’s hard to answer some of the other questions without running the software in the real world or something that gets really close to approximating the real world (this is way harder than it sounds).
While working on the language identifier, I was happy to see it working. I worked with the Product Manager to complete acceptance testing. In this case, that meant asking: did it give us the right answers? We set up a list of over 20K documents that we knew the language for and began the testing. The tests seemed to be taking a long time, but I had plenty of other stuff to do, so it was OK to let them run. I bundled up the results and passed them to the Product Manager so the analysis could begin, but I had to go back and address the speed issues. Again, the engineering and development mindsets had to work in concert. I collaborated with my team and collected a lot of good advice, including doing the smallest things possible to get acceptable speed without drifting towards thread management and spawning. We all agreed that loading the language model for every document was a lot of overhead, and that if we could load it once and run many documents through the same process, we should see an appreciable improvement. If that wasn’t going to be enough, we would go back and revisit threading or multiprocessing options.
The original code was invoked by letting the OS pass the files in one at a time. This of course meant reloading the very large model each time the script was invoked.
$ for D in ../corpus/docs/9/*body; do argot.py -d $D -o ~/out; done
real 44m12.313s
user 40m35.763s
sys 3m21.840s
The revision handles the OS work within the code and loads the model only once. The results were astonishing.
$ argot.py -p ../corpus/docs/9/ -o ~/out
real 0m10.558s
user 0m10.272s
sys 0m0.265s
Both runs used the same set of documents:
$ ls -1 ../corpus/docs/9/*body | wc -l
2095
We went from 44m12s to 10s. These details are critical to our operations, and it takes both the development and engineering mindsets to get the best results.
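To put those timings in per-document terms, here is a quick back-of-the-envelope calculation. It uses only the `real` (wall-clock) figures and the document count from the runs above. The helper `parse_time` is a name made up for this sketch, and dividing evenly across documents is a simplifying assumption.

```python
import re


def parse_time(stamp):
    """Convert a shell `time` value such as '44m12.313s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", stamp).groups()
    return int(minutes) * 60 + float(seconds)


before = parse_time("44m12.313s")  # one invocation per document
after = parse_time("0m10.558s")    # one invocation for the whole directory
docs = 2095

print(f"speedup: {before / after:.0f}x")
print(f"per document before: {before / docs:.3f} s")
print(f"per document after: {after / docs * 1000:.2f} ms")
```

That works out to roughly a 251x speedup: about 1.27 seconds per document before and about 5 milliseconds after. This is consistent with the team's hunch that most of the original time went to starting the script and reloading the model rather than to identifying the language.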