
Abort, Retry, or Fail

I’ve had numerous technical support issues in the last few weeks, everything from phone line challenges to problems with two HP desktop machines. (More on that in a future blog post.) However, through all of this, I realized that we’ve not really progressed from the DOS days of the late 80s, when it was common for the system to come back with a prompt that notified you of some error and then asked whether you wanted to “Abort, Retry, or Fail” the operation. I even remember that some inventive folks created a TSR (Terminate and Stay Resident) program that would automatically select retry for you.

I realized that somehow this got woven into the way that we do business in IT. When my hard drive started failing, it started retrying the reads automatically, without telling me. Suddenly the performance of the hard drive dropped to 2MB/s (less than 1/10th of the performance it should have). I’m happy it was retrying rather than having me lose all my data; however, by the same token, I wish I had gotten a notification that there was a problem. SMART (Self-Monitoring, Analysis, and Reporting Technology) is supposed to do this, but as it happens there was an issue with the motherboard BIOS version I was using, so the SMART tests weren’t really being performed. So I got automatic retry, without any control over it. It’s better than automatic failure, but still not really helpful.

On my Internet connection issues, I realized there was a relatively small amount of packet loss. This is normal: TCP (Transmission Control Protocol) is specifically designed to automatically retransmit packets that are lost. That’s why I’ve had switches that were dropping packets for a long time before I realized it. It took really digging into performance issues before I realized what was going on. The lost packets would simply be retransmitted. In small quantities, that’s not a big deal. In larger quantities, it creates a real performance issue, one that can be difficult to find.
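For what it’s worth, on a Linux machine the kernel does keep count of those retransmissions; it just doesn’t tell anyone. Here is a minimal sketch, assuming a Linux system where /proc/net/snmp exposes the OutSegs and RetransSegs TCP counters, that turns them into a single ratio you could watch. The function names are my own, and what counts as “too high” is a judgment call for your own network.

```python
def tcp_retransmit_ratio(snmp_text):
    """Return RetransSegs / OutSegs from the text of /proc/net/snmp.

    Returns None if the counters are missing or nothing has been sent yet.
    """
    # The file holds pairs of lines per protocol: a header line of counter
    # names followed by a line of values, both prefixed with "Tcp:".
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        return None
    header, values = tcp_lines[0], tcp_lines[1]
    counters = dict(zip(header[1:], values[1:]))
    try:
        sent = int(counters["OutSegs"])
        retransmitted = int(counters["RetransSegs"])
    except (KeyError, ValueError):
        return None
    return retransmitted / sent if sent else None


def read_tcp_retransmit_ratio(path="/proc/net/snmp"):
    """Read the counters from the running system (Linux only)."""
    with open(path) as handle:
        return tcp_retransmit_ratio(handle.read())
```

The counters are totals since boot, so a single reading only tells you the long-run average. Taking two readings a few minutes apart and comparing the differences gives a better picture of what is happening right now.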

Complicating this is that even if we could see the retries, most folks don’t know what to do about them. The AT&T UVerse issue I was having showed up as an outbound dialing issue. This led to a series of tests, including one that identified crosstalk on the line. I got a ton of different answers about what was found and what it meant. One answer was that the UVerse service generally doesn’t run on the same POTS line as another number. Other technicians said that was fine. The latest guy explained that the crosstalk they saw could have been in my house. (I don’t even want to touch the terminology and isolation issues here.) The short version is that the whole system functions because it doesn’t actually have to work perfectly. It just has to be close enough that the retries don’t become noticeable.

So what’s wrong with this? Well, to some extent, nothing. It’s brilliant architecture. The architect figured out that the system could not work reliably, so extra margin was built in for errors to happen. The problem isn’t that the architecture or design is bad for accepting faults. It’s bad because we don’t have any reporting on when a problem is occurring. It’s bad because it has caused the training to become so poor that no one really knows how things are supposed to work; observationally speaking, they’ve made it “work” several different ways. Which is right? Which is best? Who is to say, when there’s no way to really measure it?
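To make the reporting idea concrete, here is a minimal sketch of a retry that still happens automatically but leaves a trail. Everything in it is hypothetical and for illustration only: the names RetryStats and call_with_retry, the three-attempt limit, and the 10% warning threshold are my assumptions, not anything a real drive, switch, or phone line does.

```python
import logging
import time

log = logging.getLogger("retry")


class RetryStats:
    """Running totals so the retries can be reported, not just performed."""

    def __init__(self):
        self.calls = 0
        self.retries = 0

    def retry_rate(self):
        # Average number of retries per call so far.
        return self.retries / self.calls if self.calls else 0.0


def call_with_retry(operation, stats, attempts=3, delay=0.0, warn_rate=0.1):
    """Run operation(), retrying on failure, and record every retry."""
    stats.calls += 1
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == attempts:
                # Out of retries: report it and let the caller see the failure.
                log.error("giving up after %d attempts: %s", attempts, exc)
                raise
            stats.retries += 1
            log.warning("attempt %d failed (%s); retrying", attempt, exc)
            if stats.retry_rate() > warn_rate:
                log.warning("retry rate is %.0f%%; something may be failing",
                            100 * stats.retry_rate())
            time.sleep(delay)
```

The caller still gets the benefit of the margin, since a flaky read succeeds on the second or third try. The difference is that the count of retries is sitting in stats and in the log, where someone can notice that it is climbing.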

I’m not advocating that we don’t do the retries… it would just be nice to know when they’re happening so we can do something about them — or at least find someone who knows about it — if we can.