Minority Opinions

Not everyone can be mainstream, after all.

Perfectly Faulty Code


I’m constantly hearing about problems with products I maintain.  Depending on the type of issue, I go straight to the database, or straight to the code.  Given enough information from the user, I can usually figure out what’s actually going on, and why it’s perceived as buggy.  There are cases, though, where that approach has completely failed me; cases in which I’m absolutely positive that the code is correct, the data is correct, and what’s happening isn’t possible.  Enlightenment takes serious effort, and frequently leaves me feeling foolish.

The case of the broken database

One day, the help desk tells me that nobody can use the site.  They can log in and navigate around, but the primary functionality is fundamentally broken.  It keeps complaining with an “Invalid request” error.  That’s really odd, because it’s an error that should only come up when someone has been manually fiddling with the URLs.  I also have a natural distrust for overly broad problem statements, but quickly verify that the specific instances are in fact completely busted.

The code looks perfect, and has not been changed recently; surely, such a problem could not have gone undetected for long.  It still works on the development and staging platforms, so only production manifests the issue.  So I start adding a few logging calls directly on the production server; bad practice, but I couldn’t break it any further, so why not?

The logging shows that a certain record is never found when the code tries to fetch it.  But that’s impossible, because it had just been created by the previous page.  It was even getting requested by its ID, as collected from a foreign key in another table.  The database’s consistency appeared to be seriously compromised.  A manual check of the database reveals that it’s there, but the code simply can’t find it.

Because the code is looking on another server.  Almost a year earlier, some performance issues led us to attempt many questionable optimizations, including splitting some heavy read-only queries to a slave database.  They were on the same subnet, so replication was almost always instantaneous, and I had forgotten about my original misgivings.  I had forgotten quite a bit about the whole episode, except the resolution to the real underlying problem.

Anyway, a quick configuration change switched those queries back to the master database.  Then, we identified the slave’s hangup, removed the row that had accidentally been inserted on the slave (in an entirely unrelated database, no less), and fixed the monitors that should have notified us about the replication failure.
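
For illustration, here is a minimal sketch of the kind of check that could have sent those reads back to the master automatically.  Everything in it is an assumption rather than what we actually ran: the SlaveStatus fields, the five-second threshold, and the function names are all hypothetical, and a real monitor would fill in the status from whatever the database server reports about its own replication.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlaveStatus:
    """Hypothetical snapshot of what a slave reports about its replication."""
    applying_changes: bool            # False once replication has stopped
    seconds_behind: Optional[float]   # None when the lag cannot be measured


def slave_is_healthy(status: SlaveStatus, max_lag_seconds: float = 5.0) -> bool:
    """Trust the slave only if it is replicating and reasonably caught up."""
    if not status.applying_changes:
        return False
    if status.seconds_behind is None:
        return False
    return status.seconds_behind <= max_lag_seconds


def choose_read_target(status: SlaveStatus, max_lag_seconds: float = 5.0) -> str:
    """Route read-only queries to the slave, falling back to the master."""
    return "slave" if slave_is_healthy(status, max_lag_seconds) else "master"
```

With something like this in place, a stalled slave costs some performance instead of making freshly created rows invisible.  It does nothing for the short window of ordinary replication lag, which is the misgiving I had to begin with.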

The case of the empty webservice

A particular product has just been extended to work with multiple service platforms, switching between them as needed.  Things look good, and work almost perfectly in the platforms I’m able to thoroughly test.  It’s the third one, the one with the important new service line, the one hosted by a third party, that gives us all the trouble.  Even after everything works on the staging platform, production doesn’t let users access their material.  The login system tells them that it’s simply not there.

Permissions are all set up properly, when viewed by an admin user.  Logging shows that the relevant identifiers are now correct, though they hadn’t all been at first.  The code seems to be making all the right calls.  Just the answers are wrong.

Or they seem to be, anyway.  The answers get returned as XML, with zero or more elements of a certain tag.  Zero means the requested item doesn’t exist, in the successful case.  But the code didn’t verify that the XML reported success, just that it was well-formed.

Once we started identifying and logging webservice failures, the authentication error was obvious.  Our credentials had previously been correct, but had been changed slightly at some point between the start of development and the final release.  A simple configuration issue, quick to correct, once we noticed that it was a problem.
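
The missing check is small enough to sketch.  The response layout here is invented for the example: I’m assuming a root element with a status attribute, an optional error element, and zero or more item elements, which is not the third party’s actual format.  The function name and the exception class are hypothetical too.

```python
import xml.etree.ElementTree as ET
from typing import List


class ServiceError(Exception):
    """Raised when the webservice answers with well-formed XML that reports a failure."""


def extract_items(xml_text: str) -> List[str]:
    """Return the text of each <item>, but only if the response reports success."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError on malformed XML
    status = root.get("status")     # assumed attribute; adjust to the real format
    if status != "ok":
        message = root.findtext("error", default="no error message supplied")
        raise ServiceError(f"webservice reported {status!r}: {message}")
    # An empty list now really does mean that nothing matched.
    return [item.text or "" for item in root.findall("item")]
```

The point is that an empty list and a failed request take different paths, so an authentication error gets logged instead of masquerading as missing material.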

The case of the missing feature

One of our less lucrative products has finally reached the top of the priority queue, so I’ve implemented a few features requested by our client.  One of them is fairly tricky, so it got the most attention in the unit tests.  There are a few corner cases that I know I should get to, but it’s good enough to satisfy the immediate demands.  Almost as soon as it’s released, I start getting complaints that it’s not working at all.

Not many complaints, since it’s not that widely used a product.  But pressure mounts on certain people, and they eventually unload on me.  I double-check my test cases, add and implement a few of the corner cases I had put off, and check that the reported behavior has been fixed.  Except that it hasn’t.

In fact, things are even worse than I had believed.  One of the major reasons for our update was simply not working, even on my development box.  We had redesigned the interface around this feature, and it’s been busted in production for months now.  No wonder we were getting pushback.

But the unit tests were still passing, so all my debugging has to be manual, using the full interface.  With several temporary logging calls, I soon discover the root of the problem:  The new feature is all on the server side, but the client side isn’t using it.  The relevant service calls had never been updated to include the new parameter, so they were still triggering the old behavior.  That’s what I get for failing to test the hard parts of the stack.  I’m still not really sure how to do so.
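
One partial answer, sketched here under heavy assumptions, is to test the request that the client builds rather than the server code behind it.  The names below (fetch_report, the mode parameter, the /report path) are made up for the example, and the real client side may not be Python at all.  The idea is only that a fake transport records what would have been sent, so a test can fail when the new parameter is missing.

```python
from typing import Callable, Dict, List, Tuple

# A transport takes a path and query parameters, and returns the response body.
Transport = Callable[[str, Dict[str, str]], str]


def fetch_report(transport: Transport, report_id: str, mode: str) -> str:
    """Hypothetical client call that must pass the new 'mode' parameter along."""
    params = {"id": report_id, "mode": mode}
    return transport("/report", params)


class RecordingTransport:
    """Test double that records each request instead of touching the network."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, str]]] = []

    def __call__(self, path: str, params: Dict[str, str]) -> str:
        self.calls.append((path, dict(params)))
        return "stub response"


def test_client_sends_new_parameter() -> None:
    transport = RecordingTransport()
    fetch_report(transport, "42", mode="summary")
    path, params = transport.calls[0]
    assert path == "/report"
    # This is the assertion that would have caught the stale service calls.
    assert params.get("mode") == "summary"
```

It still doesn’t exercise the whole stack, but it covers the seam where the two halves disagreed.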

The case of the lost work

The user complains to a supervisor that everything they did yesterday has disappeared.  They then claim that it seems to have been happening quite a bit this month, which is why they don’t seem to have done anything lately.  This time, though, the supervisor seems to recall having seen them do the work, so the complaint gets escalated to me.

There’s nothing in the database.  There’s no record of the user logging in or out during the times in question.  There’s no progress in the backups.  There are no rows marked as deleted.  The database user doesn’t even have permission to truly DELETE rows.  There are no related accounts of the type that frequently gets used instead of the main one.  There’s no way I’m going to sift through the mounds of progress by thousands of other users within the vague timeframe to find something that may have been done by the user in question.

This isn’t even the first time.  It won’t be the last.  I can’t completely confirm that the user is lying, though there is plenty of motivation to do so.  Neither can I completely rule out the possibility of some huge gaping hole in my system.  I can’t even tell whether the user had been using another account, intentionally or otherwise, the whole time.  I’m getting all of this fourth-hand, without any real details other than the user’s name and occasionally an account ID.  Sometimes even less.

This time, though, I’ve figured out exactly what I want to see before I go chasing down another wild goose.  I want screenshots of the work in progress before it disappears.  With the full URL, including the ID that will tell me where it has gone.  Yes, that’s probably impossible for the user to generate at the time of complaint.  Almost exactly as impossible as it is for me to divine it from thin air.

Maybe I’ll still feel foolish for this one.  Whether it’s my fault for letting a bug remain in the code, or my fault for wasting so much time on a bunch of little lies.  I’m hoping that time will tell.


Written by eswald

18 Oct 2011 at 11:23 pm

Posted in Technology
