Every profession has its war stories, those challenges that separate the newcomers from the grizzled veterans. In the library world, those of us who have to tangle deep in the weeds, close to the bare metal, to get at our data all seem to have a story about what is colloquially known as "wtf-8".
I'm not sure who coined the term "wtf-8", other than that I'm pretty sure I first heard it one fateful day in the #code4lib freenode channel. The term is a cunning play on the messed-up, crazy mish-mashes of encoding standards we see in the library world. It is usually applied to some mangled mess of characters that got mortally wounded in the conversion from marc-8 to utf-8.
Some encoding basics
For those of you who follow this blog in the hopes of seeing pictures of my fuzzy dog and for some reason are still reading, here's a crash course on character sets and encodings. As with most crash courses and simplifications, it's not entirely true. The rest of you, feel free to skip this section.
Data on a computer is stored as a series of 0s and 1s. In order to store and represent the text you're reading, the computer associates certain binary numbers with certain characters. Since binary numbers are really, really long, programmers often write these mappings using hex digits (0-f). So the letter e is stored in ASCII as 0x65 (the 0x indicates a hex number). There's a full table in the ASCII Wikipedia article.
48656C6C6F21 is the hex for the characters Hello!
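You can check that mapping yourself with a couple of lines of Python (my own illustration, not anything from the library world's tooling):

```python
# ord() gives a character's codepoint; for ASCII characters
# that codepoint is exactly the byte that gets stored.
text = "Hello!"
print("".join(f"{ord(c):02X}" for c in text))  # prints 48656C6C6F21
```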
Now, ASCII is one of the great-grandaddies of encoding formats. Before it, almost every computer would have a different mapping of numbers to characters. ASCII became one of the well-known standards and eventually the dominant one.
However, ASCII only maps out English characters, which is a problem for records that need to be in other languages. That's why, way back in the early days of computing, libraries developed a character set standard with more of an international focus: marc-8.
The rest of the computer world used a variety of approaches, frequently with a proliferation of mappings for every language. In the past twenty years or so, however, there has been a push to establish an international standard: Unicode. Unicode actually makes the mapping a little abstract, defining "codepoints" and then a family of encodings, including utf-8 & utf-16, that use different numbers for those codepoints.
The marc-8 encoding allows for mixing English and one other language in a single record, whereas the Unicode approach is more flexible, allowing characters from all sorts of languages to mix in one document. The joke runs that there's enough space reserved in the mapping for when we contact an alien race. A bonus is that Unicode is certainly better supported in general software than marc-8, and a lot of library software and records have been migrated away from marc-8 to one of the Unicode encodings.
How some of our data went from utf-8 to wtf-8
A few years ago my workplace converted from marc-8 to utf-8, and the records in today's story weathered the transition well. The problem involves a workflow whose data mostly lives in SQL Server with an Access front-end. Occasionally, though, people use a form in the Access database that calls a stored procedure in SQL Server, which in turn does a query against the Oracle database via a Linked Server and use of OPENQUERY.
However, Ex Libris (Endeavor at the time) stores the information in the database marked as what they call AMERICAN_AMERICAN.US7ASCII. Now, I think this is the default for Oracle, but it may also have been a conscious decision; I'm not sure. I do know that to get it to work, even with the data now in utf-8, you still use US7ASCII in configurations. The strings actually stored in the Oracle database for titles seem to be utf-8, though (I peeked at the raw hex of the characters using RAWTOHEX). Oracle apparently lets you stuff anything into US7ASCII.
SQL Server can store string data in two main ways, one meant for Unicode (nvarchar) and the other for non-Unicode character sets (varchar). We first created this database with a set of already-existing Unicode records and used the Unicode type. Then we started running into complaints of "strange characters and boxes", a tell-tale sign that either the font can't render the character or that there's really something messed up with the character. Sadly, a little testing proved the latter.
Here's what happens. SQL Server seems to, maybe based off the US7ASCII setting, attempt to convert the string from ASCII to UCS-2, which is similar but not identical to utf-16. UCS-2 was implemented in Microsoft products from an early draft of what would become UTF-16; forgive me if I accidentally call UCS-2 UTF-16 once or twice in this post. UCS-2 stores each character as two bytes (four hex digits) of information, where ASCII only stores one byte. However, in a bit of cleverness, the designers of utf-16 made the mapping of the English characters the same as ASCII's, just with 00 added. So ...
       H    e    l    l    o
ASCII: 48   65   6C   6C   6F
UCS-2: 4800 6500 6C00 6C00 6F00
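You can reproduce that zero-extension with Python's codecs (a quick sketch; the utf-16-le codec matches the little-endian byte order SQL Server uses, so the bytes read in the same pairs as the table above):

```python
# ASCII bytes vs. their zero-extended UCS-2/UTF-16 form.
word = "Hello"
print(word.encode("ascii").hex())      # prints 48656c6c6f
print(word.encode("utf-16-le").hex())  # prints 480065006c006c006f00
```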
You can see the "raw hex" in SQL Server by converting a varchar or nvarchar to varbinary, like select convert(varbinary(4000), titles) from bibinfo;.
The Voyager data that is getting converted like this isn't ASCII, though; it's utf-8. Utf-8 is a very clever encoding. In ASCII, you always know that a character is going to be two hex digits, or one byte. In UCS-2, a character is always four hex digits, or two bytes. This means the same information stored in a file encoded as UCS-2 rather than ASCII will always be twice as large. Also, UCS-2 can only represent the characters in Unicode's Basic Multilingual Plane; UTF-16 extends the scheme with surrogate pairs to cover the rest.
Utf-8 gets around these two problems. Utf-8 is a variable-length encoding where each character might be one to four bytes long. You know how many bytes make up a character from the leading byte. If that byte is 7F or smaller in hex, the computer knows the character to look up will be in the one-byte table, and that table is pretty much equivalent to the ASCII table.
So in other words, if you use only ASCII characters, a utf-8 file and the ASCII file are byte-for-byte identical. Score! The catch is that this trick means that for the non-ASCII characters, the utf-8 hex and the UCS-2 hex are not the same. Also, text full of characters that need three or four bytes in utf-8 ends up larger than it would in UCS-2 or UTF-16.
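A little Python makes the variable lengths visible (my illustration; the sample characters are just ones I picked from different ranges of Unicode):

```python
# One codepoint can take one to four bytes in utf-8; the leading
# byte tells the decoder how many bytes follow.
for ch in ["e", "ö", "€", "𝄞"]:
    b = ch.encode("utf-8")
    print(ch, len(b), b.hex())  # e.g. "e 1 65" ... "𝄞 4 f09d849e"
```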
So let's look at a "Heavy Metal" hello: Hellö. That's got the o with an umlaut that heavy metal bands seem to like so much. Imagine it was a title we got from Voyager, like Hellö: The Mick Shrimpton Story. Here's how that Hellö should look in utf-8 and UCS-2:

       H    e    l    l    ö
utf-8: 48   65   6C   6C   C3B6
UCS-2: 4800 6500 6C00 6C00 F600

However, because SQL Server just treats the whole string as if it were ASCII and doesn't warn when there's a non-ASCII character, what actually gets stored is...

4800 6500 6C00 6C00 C300 B600

And anything accessing this will think that there are now two characters, C300 & B600, which happen to map to valid UCS-2 characters (Ã and ¶) rather than the single ö you wanted.
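Here's the whole mangling sketched in Python. I'm using latin-1 as a stand-in for SQL Server's one-byte-to-two-byte widening, since it zero-extends each byte the same way; the actual code page on a real server may differ:

```python
title = "Hellö"

# The utf-8 bytes as they arrive from Oracle over the linked server.
utf8_bytes = title.encode("utf-8")
print(utf8_bytes.hex())                   # prints 48656c6cc3b6

# SQL Server widens each byte to two, as if the input were a
# single-byte character set (latin-1 stands in for that here).
mangled = utf8_bytes.decode("latin-1")
print(mangled)                            # prints HellÃ¶
print(mangled.encode("utf-16-le").hex())  # prints 480065006c006c00c300b600
```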
How we solved it
In this particular case, I just wimped out. I am storing the data as a varchar and not as an nvarchar. That keeps the conversion from happening, and the other components of the process are tweaked to take this varchar string and treat its contents as utf-8. Probably the better solution would be not to rely on the linked server but rather on a process that takes in the Voyager data and properly converts it to UCS-2 for storage in an nvarchar field. We have some other applications that have a similar conversion process. This particular workflow is a short-term solution destined for a longer-term revamp from the ground up.
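If you do end up with already-mangled strings, the damage from this particular mistake is reversible, since no bytes were lost, only widened. A sketch in Python (again with latin-1 standing in for the single-byte code page, which is an assumption about the server's collation):

```python
# Reversing the damage: narrow the widened characters back to
# single bytes, then decode those bytes as the utf-8 they
# always were.
mangled = "HellÃ¶"
fixed = mangled.encode("latin-1").decode("utf-8")
print(fixed)  # prints Hellö
```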
- As mentioned for Oracle, you can do select rawtohex(field) from table; to see the hex of the data stored in that field.
- For SQL Server, select convert(varbinary(4000),field) from table; will do the same
- To see the encoding of actual files, I use the command-line tool xxd on Linux.
- If you want to type a Unicode character in Ubuntu/Gnome, press Control-Shift-U and then type the codepoint.
- For Windows, hold the Alt key, press + on the numeric keypad, enter the codepoint, and release the Alt key.
Some further reading
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
This essay has a nice summary of Unicode and character-encoding issues. It's old, but still good for a review. There's some technical jargon, but if you ignore the programming stuff he probably explains the Unicode stuff better than I do.
- Michael Doran, Systems Librarian Extraordinaire and Voyager Hacker Supreme, describes some of the issues with Voyager and character sets.
- I like the fileformat website for looking up particular characters. I know I've had others recommended to me, but I always forget them. Google seems to like fileformat, so I end up using it.
- The Unicode Consortium site. I find it a bit confusing, but you'll be able to download various mappings of all sorts of strange characters to put into your files.
Update - some minor fixes & changes
I realized I hadn't run spell-check one last time (still getting used to the spell-check in my blog software) and made some minor spelling corrections. I'd also slipped into the royal we in my last section; fixed that. And I made a workplace reference a little more generic.