Insidious Bug or Comedy of Errors?

A client presented me with an obvious and significant problem that required immediate attention. I worked on the problem and helped them solve it. Along the way, I discovered a whole bunch of things that merit further examination by software developers.

The names and facts are changed and I will present the code as pseudo-code. I never compromise client confidences and the technology doesn’t matter: this could have happened anywhere.

This is pretty much bread-and-butter backroom application software for a large enterprise that processes lots of orders for lots of dollars...

I was presented with a PDF of a Purchase Order that had been emailed to a Vendor. The problem? No prices. Yikes. This could be a huge problem. The company emails thousands of Purchase Orders to Vendors every day, full of data supporting critical legal and mission-critical transactions. The fundamental data elements are Part Number, Quantity, and Price. How could the price be missing? And how could it be missing from only one (or a few) out of thousands of Purchase Orders?

I started by doing what any digital sleuth would do: I tried to recreate the problem. Fortunately, this worked on the first try. I reprinted the Purchase Order and sure enough, no prices. I reprinted several others and there were prices.

The next step was to isolate the problem, debugging backwards. Output Record? Blank price. Variable feeding Output Record? Null value. Price on Purchase Order data base record? Fine. Hmmm. Next I examined the logic pulling the data from the data base and placing it in the output variable. It was looking for the Price in Column 22, the column for Foreign Currency Price. On an order to a California Vendor? OK, I was onto something.

I zeroed in on these two lines in the print program:

CurrencyCode = PORec[45]

if CurrencyCode = "USD" then PriceCol = 21 else PriceCol = 22

What was in Column 45 of this PO Record for this California Vendor? "USD" and a bunch of delimiters. Hmmm. Since "USD" followed by delimiters is not equal to "USD", the test fails and PriceCol becomes 22 when we obviously want it to be 21. The Price was in Column 21 but we were pulling a null out of Column 22. Bingo.
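Here is a minimal Python sketch of why the comparison fails. The `*` delimiter, the dict-based record, the sample price, and the function name `price_col` are all assumptions made for illustration; the real system and its delimiter are deliberately not disclosed here.

```python
# Hypothetical delimiter; the real system's delimiter is not given in this article.
DELIM = "*"


def price_col(po_rec):
    """Mirror the two lines from the print program."""
    currency_code = po_rec[45]
    # An exact match is required, so "USD" plus delimiters falls into the else branch.
    return 21 if currency_code == "USD" else 22


# Illustrative records only: Column 21 holds the price, Column 22 is empty.
clean_po = {21: "19.95", 22: "", 45: "USD"}
split_po = {21: "19.95", 22: "", 45: "USD" + DELIM * 3}

print(price_col(clean_po))  # 21
print(price_col(split_po))  # 22
print(repr(split_po[price_col(split_po)]))  # '' (the missing price)
```

The price is sitting in Column 21 the whole time; the print program simply looks in the wrong place.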

The customers are screaming. The business is suffering. Now what?

Stupid way out: Get the Currency Code from the Vendor record, not the PO Record.

Lazy way out: Strip the delimiters from PORec[45].

Right way out: Find out what's putting delimiters into Column 45 of the PO Record.

Long term solution: See below.

The right way out can be very difficult with a large code base. First I isolated the 614 programs that had been promoted into production in the last 90 days. (I figured that the problem was new so the culprit program must be fresh.) I searched for the string "45". 42 hits. Nothing suspicious. Next I looked at data dictionaries and canned functions that provided potential synonyms for Column 45 of the PO Record. I found four possibilities. Then I searched the 614 programs for each of these. Nothing. Hmmm. Standards that no one follows. OK.

Then I simply scoured the list of 614 programs. One name caught my eye: "PoSplitter". Brand new. Written by a contractor who didn't know the whole application. Promoted 3 weeks ago. I read the whole program. No reference to "45", "Foreign Currency", or anything seemingly related. But one variable looked suspicious: DatasetCols. What was this? A list of columns in the PO Record that had matching multiple values, one for each Part on the PO. DatasetCols was a global variable passed down by a master routine. I read that routine and (bingo!) found 45 in the list of DatasetCols. I traced the mods back to 2005 when it was added to the list.

I double-checked the data dictionaries and the common functions. All said that Column 45 of the PO Record must be a single Foreign Currency Code defaulted from the Vendor Record and joined to a preset table. On the other hand, the master PO routine had it in a dataset list. A dataset list that had never been referenced by any other program until that contractor used it in PoSplitter. So, as soon as his program went into production, for every Purchase Order that was "split", Column 45 kept its original Foreign Currency Code along with a delimiter for each Part on the PO. That in turn caused the PO Print program to fail to match "USD" and automatically default to Foreign Currency (note that this bug would never affect foreign orders, which are supposed to use Column 22 anyway).
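The effect can be reproduced in a few lines of Python. This is not PoSplitter's actual logic, which I am not showing; it only mimics the outcome described above. The `*` delimiter, the function name `pad_for_split`, and Columns 10 and 11 are hypothetical stand-ins. Only Column 45 comes from the story.

```python
# Hypothetical delimiter and column list; only Column 45 comes from the article.
DELIM = "*"
DATASET_COLS = [10, 11, 45]  # 45 is the entry that should never have been here


def pad_for_split(po_rec, part_count, dataset_cols=DATASET_COLS):
    """Return a copy of po_rec with one delimiter per part appended
    to every column that the list claims is a dataset column."""
    new_rec = dict(po_rec)
    for col in dataset_cols:
        new_rec[col] = str(po_rec.get(col, "")) + DELIM * part_count
    return new_rec


po = {10: "", 11: "", 45: "USD"}
after = pad_for_split(po, part_count=3)

print(after[45])           # USD***
print(after[45] == "USD")  # False
```

The splitter does exactly what the list tells it to do. The list is what is wrong.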

The immediate (right) solution:

1. Remove Column 45 from the variable "DatasetCols" in the master routine. Recompile all affected programs.

2. Clean up the data base.
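For step 2, a cleanup pass might look something like the following Python sketch. It assumes the damage is limited to trailing delimiters in Column 45, that `*` stands in for the real delimiter, and that PO Records can be modeled as dicts keyed by a PO id. The names `clean_up` and `po_records` are hypothetical.

```python
# Hypothetical delimiter; the real one is not given in this article.
DELIM = "*"


def clean_up(po_records):
    """Strip trailing delimiters from Column 45 of every PO Record.

    Modifies the records in place and returns the ids of the records
    that were changed, so the affected POs can be reviewed afterwards.
    """
    changed = []
    for po_id, rec in po_records.items():
        cleaned = rec[45].rstrip(DELIM)
        if cleaned != rec[45]:
            rec[45] = cleaned
            changed.append(po_id)
    return changed


po_records = {
    "PO-1": {45: "USD"},      # never split, already clean
    "PO-2": {45: "USD***"},   # split by the new program
}
print(clean_up(po_records))    # ['PO-2']
print(po_records["PO-2"][45])  # USD
```

Returning the list of changed ids is a design choice on my part: it gives you a record of which Purchase Orders went out wrong.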

The long term solution:

1. The data dictionary must be the Bible. Have no other code, variables, or functions that can possibly say something else. Variables like "DatasetCols" must never be hardcoded, but must be populated from the data dictionary. All synonyms must also be defined in the data dictionary, not scattered across other routines.

2. Don't use datasets. Normalize your data. (Enough said).

3. Don't have hanging conditionals. Will If...then cover all possibilities? No? Then make a Case, catching any errors. ("USD***" is NOT a valid Foreign Currency Code!)

4. If something breaks, break it! The first time an error was encountered (see #3 above), the PO Print program should have stopped and demanded a help desk intervention. But since errors weren't being captured at the point of failure, 3500 Purchase Orders were printed without prices for three weeks before anybody who cared noticed.

5. Learn the app before you change it. I realize that this is easier said than done, but I'd like to think that the contractor should have understood all the columns in the PO Record that he was changing. He simply trusted the variable "DatasetCols". Do you imagine that a senior developer would have caught that Column 45 was inconsistently documented in the existing code base? I don't know, but it's an interesting question.

6. Parallel test. The Split Line enhancement was big enough to run an automated parallel test. Column 45 of the POs from the test data base would not have matched those from the Control data base. This would have stuck out like a sore thumb if anyone had bothered to check.

7. Regression test. The stuff that should have changed did change as expected, but did everything that should not have changed stay the same? (I know, I know, how do we test for "everything else"?) There's no easy answer for this, but doing nothing is the worst possible alternative.
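Items 1, 3, and 4 can be sketched together in Python. Everything named here is an assumption for illustration: the `DATA_DICTIONARY` layout, the `multi_valued` flags on Columns 21 and 22, the sample currency codes standing in for the preset table, and the `PoPrintError` exception. The only facts taken from the story are that Column 45 is a single Foreign Currency Code, that Column 21 holds the Price, and that Column 22 holds the Foreign Currency Price.

```python
# Hypothetical data dictionary: the single source of truth for the PO Record.
DATA_DICTIONARY = {
    21: {"name": "Price", "multi_valued": True},
    22: {"name": "ForeignCurrencyPrice", "multi_valued": True},
    45: {"name": "ForeignCurrencyCode", "multi_valued": False},
}

# Stand-in for the preset table of valid codes.
VALID_CURRENCY_CODES = {"USD", "EUR", "JPY"}


class PoPrintError(Exception):
    """Raised when a PO cannot be printed safely."""


def dataset_cols(dictionary=DATA_DICTIONARY):
    """Item 1: derive the dataset column list instead of hardcoding it."""
    return sorted(col for col, meta in dictionary.items() if meta["multi_valued"])


def price_col(po_rec):
    """Items 3 and 4: cover every case explicitly and stop on bad data."""
    code = po_rec[45]
    if code == "USD":
        return 21
    if code in VALID_CURRENCY_CODES:
        return 22
    # Anything else is corrupt data, not a foreign order.
    raise PoPrintError(f"Invalid Foreign Currency Code in Column 45: {code!r}")


print(dataset_cols())          # [21, 22]
print(price_col({45: "USD"}))  # 21
print(price_col({45: "EUR"}))  # 22
try:
    price_col({45: "USD***"})
except PoPrintError as err:
    print(err)  # Invalid Foreign Currency Code in Column 45: 'USD***'
```

With a check like this in place, the very first split Purchase Order would have stopped the print run instead of quietly going out without prices.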

What else would you add to my Long Term Solution list?