Minority Opinions

Not everyone can be mainstream, after all.

Database Lessons I’ve Learned

leave a comment »

Almost as soon as I started working at my first real full-time job, I was assigned to work on a website that was causing major concerns for the company’s largest client.  It had grown organically from supplemental material to critical information, and parts of it were so slow that it was almost unusable.  Fortunately, I had already used MySQL and PHP, the most important pieces of the site, so I was able to make some headway.  However, some of the lessons I’ve since learned would have made the whole process much simpler and faster.

Index your tables properly

The first major speedup was very simply to add indexes to the database tables, based on the main queries executed against them.  For a minor performance penalty on insert and update operations, you get huge benefits for select statements.  I wish I could take credit for this one; instead, it was one of the main contributions of a contractor who was with us for only a few weeks.

Parse text once, not repeatedly

The most aggravating page on the site took several seconds to load, and I quickly discovered that the bulk of its time was spent parsing a set of massive text fields.  Each field in the column contained multiple records, the more recent ones with timestamps, and I needed to know whether any contained new, unprocessed information.  I should have immediately parsed it into a new table with an indexed column pointing to the original one, despite all the work it would take to rewrite everything that used it.  Instead, I made minor adjustments to render it almost fast enough for comfort.

Switch to a new system gradually

When it became clear just how flawed the current system was, and just how much effort it would take to revamp it, I was asked point-blank whether it would be faster to patch it or rewrite from scratch, given the list of requirements demanded by our client.  At that point, I saw far too much mess to clean up, so I recommended the rewrite.  Fortunately, we soon had a clean slate with enough speed and functionality to save our bacon.  Unfortunately, all of our users and most of our data were still in the old format, and the two were completely incompatible.  The worst part was redirecting users to different login screens based on when they registered.

Now, I have the skills required to migrate the data and revise the pages piece by piece.  It would have been a tedious process, and I wouldn’t have been able to receive as much help, but it could have been done.  Almost all of the data got migrated into the new format anyway, so it would probably have been better to consider its needs in the first place.  Then again, I’m still maintaining the new system, and I’m not sure a gradually improved one would have been quite so understandable by this point.

Use a database upgrade system

We had a professional database architect thrust upon us for a week to design our new back end.  Practical concerns required us to make serious changes almost as soon as he was gone.  Before I was placed in charge of the database, it had already been fiddled with several times, and the production machine was severely out of sync with our development environments.

Since then, I’ve written my own simple database upgrade script, after hearing about the good ones that were available in other languages.  It changes sometimes, and still bears quirks from earlier design decisions, but overall has been a big help.  Only with such a system could I have ever done a gradual migration.

Cache the results of common slow queries

The new clean system has had its own ups and downs, particularly after collecting a dozen times as many users as the old one ever had.  A few pages in particular tend to be troublesome, partly because they’re used by the more vocal category of user, and partly because they display big mounds of data, some of it based on expensive calculations.  After a while, optimizing the calculation hit a limit, and we had to start caching the results.  In one notable case, we were able to add five columns to an existing table: four to store results, and one with a timestamp to indicate when they were calculated.  To invalidate the data, we could simply clear the timestamp; in some systems, it would be worthwhile to recalculate at that point, but this application triggered invalidation several times in succession, before requiring the results several times in a row, so we found it more effective to recalculate on demand.

Sometime, we will probably also switch to memcache storage for sessions, as our next big database bottleneck.  Some research indicates that it still needs a database backup to avoid random session loss, but we could at least reduce disk writes to one every few pages instead of two or three per page.

Outside of the main site, there are some report services that make extensive use of a warehouse database to store intermediate results.  Some of those results duplicate calculations of the website in an odd sort of NIH syndrome, but others are best guesses at information that wasn’t properly collected until after discovering that we needed it.

Micro-optimizations aren’t always worthwhile

During a crisis moment, we brought in a database expert to help us handle the sudden influx of new traffic.  He helped us rewrite some of our heavier queries, but in the process introduced a temporary table to cache a piece held in common by three queries.  A year later, when disk space proved to be a problem, we switched to innodb_file_per_table so that one problem table could be fixed without taking down the server.  Suddenly, performance tanked; we had been hit by a nasty bug that was extremely tricky to diagnose, because it slowed down queries only after the relevant one finished.  Even crazier, it seemed that rewriting the queries to each recalculate the cached piece was no slower on average than using the temporary table, so we removed them entirely.

Similarly, when I learned about the INSERT ... ON DUPLICATE KEY UPDATE syntax, I immediately spotted a few places where it could reduce a round trip to the database.  After implementing one, however, replication broke.  Although the statement works just fine on the master, it fails on the slave.  We worked around the failing query, and it broke again.  After the third time, we decided to revert the whole thing, going back to separate SELECT, INSERT, and UPDATE queries.  Sadly, this appears to be a known problem, due to conflicting statements within the binlog format.

Log only the important events

At one point, we were told to log enough information to duplicate the user’s experience exactly.  That much information quickly flooded the database, and even slowed down the user experience dramatically.  Even dialing it back to the main trigger events proved to chew up disk space far too rapidly for comfort.

It turns out that the really important events were far fewer than I expected:  Log in, log out (or at least last seen), entering certain major portions of the application, and obvious error conditions.  The first two are better handled by updating a session row than by scanning for the last appearance of a user identifier in a mess of logs.  Once you have that session row, linking a few other events to it can make all the difference.  With a bit of effort, it even gives you an accurate estimate of concurrent users.

Unfortunately, that has frequently left me with too little information when requested for more information about a potential bug.  I can at least enable excessive verbosity on my development machine, but that only helps when the user can give me steps to reproduce the issue.  Not even the QA department is always that helpful, so I’m stuck trying to balance reliability against reportability.


Written by eswald

25 Oct 2011 at 10:45 pm

Posted in Employment, Technology

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s