Look for Horses, Not Zebras

Apr 29, 2014 | Coding debuggingzebras
Featured Image

I have two quotes from old friends that have stuck in my head in certain situations. The first is, "dancing is like standing still, only faster." As a runner, I have adapted that to, "running is like falling down, only slower." The second is, "when you hear hoofbeats, think of horses, not zebras." As a programmer, I have found there is not much better advice than this second quote.

You would think that the more experience you have, the more likely you are to look for the obvious answer. I’ve found that the opposite is true – while you are going to make "stupid" mistakes less frequently, the majority of your mistakes are still going to be "stupid". Unfortunately with great power comes great ego, and the most difficult bugs are caused by the occurrence of something that was "never going to happen". I happened to stumble through 3 of these self-induced hair-pulling-out exercises in a matter of days. To make matters worse, I don't really code at work anymore. This was a hobby project that I can only work on for an hour or two each night.

First was the infinite loop (and subsequent stack overflow) – one of the few errors that can't be captured with standard error logging techniques. This problem only happened on the production server so instead of doing the obvious and duplicating the data down to dev so I could debug, I decided to add debug logging until I found the problem. This was the first symptom of my bravado. My application is a state machine "engine" that runs in a loop. As things happen and the state changes, the loop eventually completes. I narrowed down the problem to the state where the error occurred but had trouble getting closer than that. I was convinced that some dark magic was at play and some data anomaly so I spent a few nights adding debugging, publishing, running, repeating.

Finally, I stopped looking for zebras. Instead of continuing to try to zero in on the line of code through trial and error addition of debug logging, I put 3 lines of code in the most obvious places the problem could occur – the 3 while loops that could get called inside my main loop. That immediately pointed me to the loop in question so then I threw an exception when a condition causing an infinite loop occurred. It probably took me all of 5 minutes to determine the state that caused the problem.

So next I had to figure out how the code got into that state. All of the information I needed to solve the rest of the problem, I saw within minutes – and completely ignored. I saw that the state in question could only come from the database because the key data was never set anywhere by code. Therefore, the code that allowed the infinite loop condition must have been in the code that loaded the data from the database. However, rather than looking right where all indications pointed, I looked for zebras. Since I knew how to find the problem, I could now duplicate it on my dev box and debug. That meant I could spend time inspecting variables at breakpoints to see why the problem was happening. After all, this infinite loop was caused by a recursive linked-list relationship – something that was "never going to happen". There couldn't have been an obvious reason for something like this to happen so I had dig for a needle in the haystack instead of following the thread attached to the needle. In spite of myself, I did eventually figure out that the problem could absolutely be solved when the data was loaded into the engine. Technically this is after the data is extracted from the database, but the recursive situation doesn't matter until it gets to the engine so that’s where I put the fix to truncate the recursive linked list since the duplicate reference shouldn't have been there anyway.

Fast forward a few days. With my problem solved, I was able to run my engine for a few days and then started to look at the output. Then I stumbled on a new error – "An error occurred while reading from the store provider’s data reader. See the inner exception for details" and the inner exception was "Arithmetic overflow error for data type tinyint, value = 2288". So, the expert programmer has managed to break the Entity Framework. Of course I checked the "obvious" problem – that my model had the wrong data type. Then I went off the deep end. Unfortunately, there are people who have had legitimate problems with EF or linq-to-SQL mapping parameters incorrectly (2288 was a parameter, not a database value). I converted my linq expression into a SQL command and then realized I was once again looking for zebras. Instead of looking at the columns in my database that would map to tinyint, I assumed the problem was in the Entity Framework – code written by Microsoft and used by thousands if not millions of people – and not in MY code. When I stepped through my code and looked at the parameters, I noticed there was a parameter that should’t have even been there. That was my problem – I passed the parameters into my method in the wrong order! Instead of passing "int, null, int, true" I was passing "int, int, null, true". The 2nd and 3rd parameters were both nullable ints so as far as the compiler was concerned this was all good. Unfortunately it meant it passed my very large primary key into a parameter that was expecting a much smaller value – something that should have been very obvious from the error.

This post has been sitting in draft for so long that I have now had several more "events". The most recent was while doing a demo on AngularJS directives with my team at work. I was trying to illustrate scope isolation and no matter what I tried I couldn't figure out why the attribute from the directive specified in my HTML wasn't making its way to my link function in the directive code. When I turned off scope isolation and pointed at the parent scope, everything worked fine. Even my more Angular-savvy team members were stumped. Then came the "duh" moment – I actually had 2 directives nested inside each other. I was looking at the code in the inner directive and passing data into the outer directive. The HTML of the inner directive wasn't passing the value from the parent. Yet another zebra chase.

So in conclusion, drop the attitude! When you have problems, they are most likely the most basic and amateur issues. Reign in those horses and let the zebras smack you in the head, don't go looking for them.

UPDATE: 29 January 2019

I have done it again! I was working on a new framework feature and couldn't get my test to work. I just KNEW it was a problem with the third-party software I was using. More specifically, I wasn't sure what I was trying to do was even supported. Then somehow I managed to get everything to work directly with said software. After days of struggling, I discovered I was writing the final portion of my test immediately after testing deleting the exact thing I am trying to test. As usual, my code was doing EXACTLY what I told it to do.

On a side note, since the original post, I was introduced to the concept of rubber duck code reviews. I highly recommend this and your coworkers will thank me.