Minority Opinions

Not everyone can be mainstream, after all.

Breaking Spacing


I’m not a typographer.  I barely notice Comic Sans, and my interaction with ligatures is mostly limited to decomposing them into their component letters.  I put two spaces between sentences simply because that’s how I was taught.  I’m aware of arguments for single-spacing sentences, but it still feels wrong to me.

HTML, originally defined as an application of SGML, collapses each run of whitespace into a single space, so blogs like this need a bit of trickery to faithfully represent the spaces as I type them.  WordPress handles this by switching one space out of each pair to a non-breaking space, Unicode code point U+00A0.  Unfortunately, its current implementation changes the second space of each pair.  Most of the time, that works well.  But when a paragraph happens to be just wide enough to break the line between two sentences separated by a space/nbsp sequence, the browser puts the non-breaking space at the beginning of the next line, shifting the first character of the new sentence just enough to notice.

It occurred to me at one point that WordPress is open source, so I could probably fix it myself.  Unfortunately, a few quick searches failed to reveal the point where the substitution takes place.  It seems to be in PHP, at least, and appears to substitute a UTF-8 character instead of its HTML entity.  That leads me to expect something on the order of str_replace("  ", " \xc2\xa0", $content) somewhere in the pipeline.  All I wanted to do was switch the order of the replacement, to put the non-breaking space first, where it would be at the end of a line instead of the beginning.

If that’s so easy, then why hasn’t someone already done it?  Because it’s wrong.

It took a bit to understand why the replacement goes this way.  The issue is with odd-length runs of three or more consecutive spaces.  They don’t come up often in prose, but they do in programming.  At one point, I honestly preferred three-space indentation.  Or maybe five; I don’t really remember, because it was before I started using Python on a regular basis.  It was probably back when I used PFE instead of Vim.

Even with tabs for indentation, or even-numbered indentation levels, spaces are often used to line up bits within a line, particularly comments.  On average, about as many of those inline spacers would have odd numbers of spaces as even, and we wouldn’t want each odd-spaced sequence to lose a space, would we?

Okay, forget about the fact that a <pre> block would be better for such code, and work with me here.  Or give an example that doesn’t rely on monospaced type over multiple lines.

Anyway, replacing the second space of each pair works reliably for any number of spaces:

( )
( _)
( _ )
( _ _)
( _ _ )
( _ _ _)
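The table above can be reproduced with a few lines of PHP.  This is only a sketch: the helper names replace_second() and show() are made up for illustration, and the str_replace() call is my guess at what WordPress does, not its actual code.

```php
<?php
// Hypothetical helper: swap the second space of each pair for U+00A0,
// mirroring the guessed str_replace() call (not actual WordPress code).
function replace_second(string $content): string {
    return str_replace("  ", " \xc2\xa0", $content);
}

// Hypothetical helper: draw each non-breaking space as an underscore.
function show(string $content): string {
    return str_replace("\xc2\xa0", "_", $content);
}

// One line per run length, from one space up to six.
for ($n = 1; $n <= 6; $n++) {
    echo show(replace_second("(" . str_repeat(" ", $n) . ")")), "\n";
}
```

No pair of plain spaces survives in any row, which is the property that matters once the browser collapses whitespace.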

Replacing the first space, on the other hand, can leave space pairs behind, because the replacement engine searches the string left-to-right and ignores characters matched by a previous replacement:

( )
(_ )
(_  )
(_ _ )
(_ _  )
(_ _ _ )
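The leftover pairs are easy to detect mechanically.  Here is a small sketch, again with a made-up helper name (replace_first()) and the same guessed str_replace() call with its replacement flipped:

```php
<?php
// Hypothetical helper: swap the FIRST space of each pair for U+00A0.
function replace_first(string $content): string {
    return str_replace("  ", "\xc2\xa0 ", $content);
}

for ($n = 1; $n <= 6; $n++) {
    $result = replace_first(str_repeat(" ", $n));
    // A surviving pair of plain spaces would still collapse in the browser.
    $status = strpos($result, "  ") !== false ? "pair left behind" : "ok";
    echo $n, ": (", str_replace("\xc2\xa0", "_", $result), ") ", $status, "\n";
}
```

Only the odd runs of three or more end with two plain spaces, since str_replace() never revisits text it has already replaced.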

There are a few potential ways to fix it, each with downsides.  First, we could reverse the string before and after the replacement:

$content = strrev($content);
// strrev() reverses bytes, so insert U+00A0 with its UTF-8 bytes swapped.
$content = str_replace("  ", " \xa0\xc2", $content);
$content = strrev($content);

( )
(_ )
( _ )
(_ _ )
( _ _ )
(_ _ _ )
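One caveat with the reversal approach, as far as I can tell: strrev() reverses bytes rather than characters, so the two bytes of the non-breaking space have to go in swapped if they are to come out as valid UTF-8 after the second reversal.  The sketch below checks that, using made-up helper names and preg_match() with the u modifier as a rough validity test.

```php
<?php
// Hypothetical wrapper around the reversal approach.
// $nbsp_reversed is the byte string inserted while the content is reversed.
function replace_reversed(string $content, string $nbsp_reversed): string {
    $content = strrev($content);
    $content = str_replace("  ", " " . $nbsp_reversed, $content);
    return strrev($content);
}

// preg_match() with the u modifier fails on subjects that are not valid UTF-8.
function is_valid_utf8(string $s): bool {
    return preg_match('//u', $s) === 1;
}

// Sample text with multi-byte apostrophes, to show they survive the round trip.
$sample = "That\u{2019}s one.  That\u{2019}s two.";
var_dump(is_valid_utf8(replace_reversed($sample, "\xc2\xa0"))); // bytes end up as A0 C2
var_dump(is_valid_utf8(replace_reversed($sample, "\xa0\xc2"))); // bytes end up as C2 A0
```

Existing multi-byte characters are reversed and then restored untouched; only the newly inserted bytes need the swap.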

However, the content of a post isn’t exactly short, so the reversal could take longer than a server cares to spend.  We can trim that to two function calls by handling the end of the sequence specially, before running the generic replacement.

$content = preg_replace("/  ([^ ])/", "\xc2\xa0 \\1", $content);
$content = str_replace("  ", " \xc2\xa0", $content);

( )
(_ )
( _ )
( __ )
( _ _ )
( _ __ )

This sequence produces pairs of non-breaking spaces, but always leaves real spaces at the end, and uses real spaces at the beginning whenever possible.  However, it uses a heavier PCRE call.  It might be better to use a simple second pass to catch the leftover ends:

$content = str_replace("  ", "\xc2\xa0 ", $content);
$content = str_replace("  ", "\xc2\xa0 ", $content);

( )
(_ )
(__ )
(_ _ )
(_ __ )
(_ _ _ )

This uses more non-breaking spaces than strictly necessary, but appears to solve all of the problems.  Whether it actually performs better than a pair of string reversals is subject to benchmarking.
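For anyone who wants to run that benchmark, a rough harness might look like the following.  Everything here is an assumption for illustration: the function names, the sample text, and the iteration counts are made up, and hrtime() requires PHP 7.3 or later.

```php
<?php
// Two of the candidate fixes, wrapped in hypothetical function names.
function fix_by_reversal(string $content): string {
    // The bytes of U+00A0 are swapped because strrev() works on bytes.
    return strrev(str_replace("  ", " \xa0\xc2", strrev($content)));
}

function fix_by_two_passes(string $content): string {
    $content = str_replace("  ", "\xc2\xa0 ", $content);
    return str_replace("  ", "\xc2\xa0 ", $content);
}

// Made-up sample post: many sentences separated by two spaces.
$post = str_repeat("The quick brown fox jumps over the lazy dog.  ", 2000);

foreach (['fix_by_reversal', 'fix_by_two_passes'] as $fix) {
    $start = hrtime(true);               // monotonic clock, in nanoseconds
    for ($i = 0; $i < 200; $i++) {
        $fix($post);
    }
    $ms = (hrtime(true) - $start) / 1e6; // total milliseconds for 200 runs
    printf("%-18s %.2f ms\n", $fix, $ms);
}
```

The numbers will vary by machine and PHP version, so treat any single run as a hint rather than a verdict.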

Now, unless you can raise a better argument for the current behavior, I just need to track down some code.


Written by eswald

8 Jan 2013 at 6:33 pm

Posted in Technology
