Tuesday, April 10, 2012

Some musings on the status quo of programming in the professional world

I occasionally get quite frustrated with the status quo of people's programming practices. Lately, the amount of manual work that goes into programming has seemed excessive. First, some back story.

One of the niches I've carved out for myself at work is in the area of converting statistical models created in SAS. The general process is as follows.
  1. Our statistical modelers create a logistic regression model that predicts something -- credit worthiness, ability/willingness to repay debt, likelihood of fraud, etc. -- and they give us the SAS code for that model.
  2. We convert the model from SAS to ECL.
  3. Once converted, the engineer (me) and the modeler each process a large number of records, in ECL and SAS respectively. Results are compared to find errors in the ECL, which are then fixed.
  4. When the ECL produces the same results as the SAS, the model is fully validated and can be put into a production release.
When I started doing this five years ago, this process was extremely manual. I would go through the SAS code line-by-line, typing up ECL that I created based on my semantic interpretation of the SAS code. Models at the time were usually several hundred lines of code (LOC), maybe a thousand or two on occasion. Since that time, the size of our models has increased dramatically. For one of our flagship products, RiskView, a newer model easily exceeds 10k LOC.

As models started increasing in size, I worked on what became a fairly good 80/20 rule improvement to the process: a sizable portion of the SAS code could be quickly, correctly, and most importantly, automatically, converted to ECL. This allowed me to halve the time it took engineering to convert a model. Since then, we've developed internal tools to improve this process even further.
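To give a flavor of what "automatic conversion" of the easy cases looks like, here is an illustrative toy in Python. This is not our internal tool; the function name and the regex are mine, and it assumes only trivial SAS statements of the form `var = expr;`:

```python
import re

def sas_assignment_to_ecl(line):
    """Translate a simple SAS assignment like 'x = a + b;' into an
    ECL definition 'x := a + b;'. Purely illustrative -- real SAS
    code (IF/ELSE blocks, missing values, functions) needs an
    actual parser, not a regex."""
    m = re.match(r"\s*(\w+)\s*=\s*(.+?);\s*$", line)
    if m:
        name, expr = m.groups()
        return f"{name} := {expr};"
    return None  # not a simple assignment; falls through to manual conversion

print(sas_assignment_to_ecl("age_sq = age * age;"))  # -> age_sq := age * age;
```

Even a rule this crude covers a surprising number of lines in a typical model, which is where the 80/20 gain comes from.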

It struck me as odd that "the way we've always done it" was so manual, tedious, and time-consuming. It seemed no one had ever thought to automate this work away.

Even with a fairly mature SAS to ECL converter, we still run into incorrect semantic translation (step two above), so we still require work on the validation (step three). I only recently found out that this process -- which is a burden that is put almost squarely on the modeler -- is about as manual as it can be.

This process is archaeology. For a given input record, SAS calculates a score and ECL calculates a score. If these values match up, great! If not, then there's some ECL code that doesn't quite do what it's supposed to (since we've defined SAS as being correct). Maybe it's an expression that lacks proper grouping, yielding a different order of operations. Maybe it's a floating point error. Maybe it's something with SAS' missing values, a notion for which ECL has no analog. Whatever the reason, the modeler then traces back what values go into the final score. Let's say there are a half dozen of them. How many of those values match between SAS and ECL? If only one of those values is off, then what values go into that value? They backtrack through the problem until they find a variable for which SAS and ECL differ but whose input values match.
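The backtracking itself is mechanical, which is what makes doing it by hand so galling. A sketch of the idea, assuming we had (hypothetically) a dependency map of each variable's inputs and both systems' values for one record:

```python
# Hypothetical sketch of the "archaeology": starting from the final
# score, walk back through each variable's inputs to find variables
# that disagree between SAS and ECL even though all of their own
# inputs agree -- those are where the ECL bugs live.
def find_culprits(var, inputs, sas, ecl, tol=1e-9):
    if abs(sas[var] - ecl[var]) <= tol:
        return []                      # this variable matches; nothing to chase
    bad_inputs = [i for i in inputs.get(var, [])
                  if abs(sas[i] - ecl[i]) > tol]
    if not bad_inputs:
        return [var]                   # differs, but inputs agree: a culprit
    culprits = []
    for i in bad_inputs:               # recurse into mismatched inputs
        culprits.extend(find_culprits(i, inputs, sas, ecl, tol))
    return culprits

# Toy record: the score disagrees because age_m disagrees, even
# though age_m's only input (age) matches in both systems.
inputs = {"score": ["age_m", "inc_m"], "age_m": ["age"], "inc_m": ["inc"]}
sas = {"score": 2.0, "age_m": 0.0,      "inc_m": 2.0, "age": 0, "inc": 50}
ecl = {"score": 3.2, "age_m": 1.243567, "inc_m": 2.0, "age": 0, "inc": 50}
print(find_culprits("score", inputs, sas, ecl))  # -> ['age_m']
```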

A contrived example might be something like this SAS code:
if age <= 0 then age_m = 0.000000;
else if age < 18 then age_m = 1.243567;
else if age < 26 then age_m = 2.345678;
else if age < 42 then age_m = 3.456789;
else if age < 60 then age_m = 4.567890;
else age_m = 5.678900;
And almost semantically equivalent ECL:
age_m := map(
   age < 0  => 0.000000,
   age < 18 => 1.243567,
   age < 26 => 2.345678,
   age < 42 => 3.456789,
   age < 60 => 4.567890,
   5.678900);
The very minor difference between these expressions is in the comparison against zero: less-than-or-equal-to (<=) in the SAS but only less-than (<) in the ECL. In this example, if both languages are given an age of zero, they will each report a different value for age_m -- 0.000000 for SAS and 1.243567 for ECL.
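The boundary bug is easy to demonstrate with a quick Python transcription of both versions (illustrative only; the function names are mine):

```python
# sas_age_m mirrors the SAS (first branch uses <=);
# ecl_age_m mirrors the buggy ECL translation (uses <).
def sas_age_m(age):
    if age <= 0:   return 0.000000
    elif age < 18: return 1.243567
    elif age < 26: return 2.345678
    elif age < 42: return 3.456789
    elif age < 60: return 4.567890
    else:          return 5.678900

def ecl_age_m(age):
    if age < 0:    return 0.000000   # the bug: < instead of <=
    elif age < 18: return 1.243567
    elif age < 26: return 2.345678
    elif age < 42: return 3.456789
    elif age < 60: return 4.567890
    else:          return 5.678900

# The two agree everywhere except the single boundary value age == 0.
print([a for a in range(-5, 100) if sas_age_m(a) != ecl_age_m(a)])  # -> [0]
```

One wrong character, and only records with age exactly zero ever expose it -- which is why this class of bug survives casual spot-checking.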

After much digging, the statistical modeler reports to us that the age_m calculation is problematic, and the engineer looks at the SAS and the ECL, finds the problem and fixes it.

On a 10k LOC model with a few hundred intermediate values, there is a lot of archaeology. Yet for my past five years and probably for years before that, this has been the process. Comparing results for model validations has been painstaking.

These inefficiencies bother me. We have so much work to do that wasting time doing tedious digging into 500MB of raw intermediate data should be out of the question. Unfortunately, it's not; it's the status quo, "the way we've always done it." I've therefore spent a good portion of the last two days at work metaprogramming a solution.

Unfortunately, ECL (at this time) has no introspection. Any kind of metaprogramming is done at compile time. That limits me, but not so much that I can't come up with a solution. In this case, I found a way to do a comparison of SAS and ECL intermediate variables, report on the differences, and even provide a metadata summary of those differences. With essentially two LOC, I can see where my ECL differs from the SAS and on what proportion of our sample data. I reduce a half day of statistical modeler manual time to twenty seconds of Thor time.
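The shape of that summary report is simple to sketch. Assuming (hypothetically) we can get each record's intermediate values out of both systems as parallel dicts, the per-variable mismatch proportions fall out of a single pass:

```python
# Hypothetical sketch of the automated comparison: given parallel
# lists of per-record intermediate values (one list from SAS, one
# from ECL), report what fraction of records disagree on each
# variable. The real thing runs as ECL on Thor, not Python.
def mismatch_report(sas_records, ecl_records, tol=1e-9):
    n = len(sas_records)
    counts = {}
    for s_rec, e_rec in zip(sas_records, ecl_records):
        for var in s_rec:
            if abs(s_rec[var] - e_rec[var]) > tol:
                counts[var] = counts.get(var, 0) + 1
    return {var: c / n for var, c in sorted(counts.items())}

sas = [{"age_m": 0.0,      "score": 2.0}, {"age_m": 2.345678, "score": 3.1}]
ecl = [{"age_m": 1.243567, "score": 3.2}, {"age_m": 2.345678, "score": 3.1}]
print(mismatch_report(sas, ecl))  # -> {'age_m': 0.5, 'score': 0.5}
```

A report like this points the modeler straight at the variables worth digging into, and the proportion column hints at whether the cause is a boundary case (small fraction) or a systematic translation error (everything mismatches).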

Now, coming back to the original point: Why are so many of these inefficiencies in place? I occasionally hear the phrase, "that's how we've always done it" -- never in a derisive tone, just usually as a shrugging off of new ideas as too alien or unproven.

Developing, translating and implementing logistic regression models is one portion of what we do, but I've run into this kind of thing many times over. Some of the most fundamental philosophies of computer science seem to get lost in practice: code reuse, the cost of developer time versus processor time, encapsulation, abstraction, to name a few.

There are reasons why we're always so ridiculously busy, and I think one of those reasons -- and not an unimportant one -- is because we don't sit down and think. We should think about how we're doing things, how they should be done, and whether the effort to change is worth it. More often than not, I think it is worth it.
