Feeds

Blog Feed ( rss | atom | rdf )

Blog Comments ( rss | atom | rdf )

« Unicode fun | Main | Info Addictions: Podcasts »
Tuesday
Jan312012

One byte, two bytes - A journey into wtf-8 encoding.

Every profession has its war stories.  Those challenges that separate the newcomers to the grizzled veterans.  In the library world those of us who have to tangle deep in the weeds close to the bare metal to get at our data all seem to have a story about what is colloquially known as "wtf-8".

I'm not sure who coined the term "wtf-8", aside from that I'm pretty sure I heard it one fateful day in the #code4lib freenode channel.  The term is a cunning play on the messed up, crazy mish-mashes of encoding standards we see in the library world.  It is usually applied to some mangled mess of characters that got mortally wounded in the conversion from marc-8 to utf-8. 

Some encoding basics

For those of you who follow this blog out of the hopes of seeing pictures of my fuzzy dog and for some reason are still reading, here's a crash course on character sets and encodings.  As with most crash courses and simplifications, it's not entirely true. For the rest of you, feel free to skip this section.

Data on a computer is stored in a series of 0 and 1s.  In order to store and represent the text you're reading, the computer associates certain binary numbers with certain characters.  Since binary numbers are really, really long, programmers often display mappings using hex numbers (0-f) to characters.  So the letter e is stored in ASCII as 0x65 (0x indicates a hex number).   There's a table on the ASCII Wikipedia article.

48656C6C6F21 is the hex for the characters  Hello!

Now, ASCII is one of the great grandaddies of encoding formats.  Before it, almost every computer would have a different mapping of numbers to different characters.  ASCII became one of the well known standards and eventually the dominate one.

However,  ASCII only maps out English characters, making it a problem in records that need to be in other languages.  That's why libraries developed a character set standard, marc-8 that had more of an international focus way back in the early days of computing. 

The rest of the computer world used a variety of approaches, with frequently a proliferation of mappings for every language.  However, in the past twenty years or so there has been a push to establish an international standard, Unicode.  Unicode actually makes the mapping a little abstract, having "codepoints" and then a family of encodings include utf-8 & utf-16 that use different numbers for those codepoints .

The marc-8 encoding allows for mixing in one record English and one other language whereas the Unicode approch is more flexible, allowing for mixing characters from all sorts of languages in one document.  The joke runs that there's enough space in the mapping reserved for when we contact an alien race.  A bonus is that Unicode is certainly better supported in general software than marc-8.  A lot of library software and records have been migrated away from marc-8 to one of the Unicode encodings.

 

How some of our data went from utf-8 to wtf-8

A few years ago my workplace converted from marc-8 to utf-8.  The records in today's story weathered the transition well.  The problem is we have a workflow that has some data that mostly lives in SQL Server with an Access front-end.  However, occasionally people use a form in the Access database that calls a stored procedure in SQL Server, which in turns does a query against the Oracle database via a Linked Server and use of OPENQUERY.

However, Ex Libris (Endeavor at the time), stores the information in the database marked as what they call AMERICAN_AMERICAN.US7ASCII.  Now, I think this is the default for Oracle, but it may also have been a concious decision, I'm not sure.  But I do know to get it to work, even with the data now in utf8, you still use US7ASCII in configurations. The string that seems to be stored in the Oracle database for titles seems to be utf-8 though.  (I peeked at the raw hex of the characters by using the RAWTOHEX option).  Oracle apparently lets you stuff anything into US7ASCII.

SQL Server can store string data in two main ways, one meant for unicode and the other for non-unicode character sets.  We first created this database with a set of already existing unicode records and used the Unicode type.  Then we started running into complaints of "strange characters and boxes", a tell-tale sign that either the font can't render the character or that there's really something messed up with the character.  Sadly, a little testing proved the later.

Here's what happens.  SQL Server seems to, maybe based off the US7ASCII, attempt to convert the string from ASCII to UCS-2, which is similar but not identical to utf-16.  UCS-2 was implemeted in Microsoft products off of a early draft of what would become UTF-16. Forgive me if I accidentally call USC-2 UTF-16 once or twice in this post.. UCS-2 stores characters as two-bytes (or 4 hex numbers) of information where ASCII only stores it as one byte.  However, in a bit of cleverness, the designers of utf-16 had it so that the mapping of the English characters is the same of that as ASCII, just with 00 added.  So ...

Hello in ASCII and UTF-8, with spaces for readability


ASCII UCS-2
Characters
H  e  l  l  o 
H    e    l    l    o
Hex
48 65 6C 6C 6F
4800 6500 6C00 6C00 6F00

You can see the "raw hex" in SQL Server by converting a varchar or nvarchar to varbinary(), like select convert(varbinary(4000),titles) from bibinfo ;.

The Problem

 The Voyager data that is getting converted like this isn't ASCII, but utf-8.  Utf-8 is a very clever encoding.  In ASCII, you always know that a character is going to be two hex numbers, or one byte.  In UCS-2 a character is  always four hex numbers, or two bytes.  This means the same information stored in a file encoded as UCS-2 rather than ASCII will be always be twice as large.  Also, many but not all Unicode characters can actually be encoded in UCS-2/UTF-16.

Utf-8 gets around these two problems.  Utf-8 is a variable-length encoding where each character might be one to four bytes long.  You know how many bytes are going to make a character up by the leading byte.  If the byte is 7E or smaller in hex, the computer knows the character to lookup up will be in the one byte table.  That table is pretty much equivalent to the ASCII table. 

So in other words, if you just use the ASCII characters, a utf-8 file and the ASCII file is the same.  Score!  However, the problem is this trick means that for the non-ASCII characters, the utf-8 hex and the UCS-2 hex is not the same.  Also, if you have characters that are all 4 bytes, you end up with a much larger file than using UCS-2 or UTF-16.

So let's look at a "Heavy Metal" hello: Hellö.  That's got a o with an umlaut that heavy metal bands seem to like so much. Imagine it was as title we got from Voyager, like Hellö: The Mick Shrimpton Story. Here's how that Hello should be in utf-8 and UCS-2

Hello in ASCII and UTF-8, with spaces for readability


UTF-8 UCS-2
Characters
H  e  l  l  ö
H    e    l    l    ö  
Hex
48 65 6C 6C C3BF
4800 6500 6C00 6C00 F600

 

However, because SQL Server just treats the whole string like it ASCII and doesn't warn when there's a non-ASCII character, what actually will get stored is...

 

4800 6500 6C00 6C00 C300 BF00

 

And anything accessing this will think that there are now two characters, C300 & BF00, which may or may not map to valid UCS-2 characters.

How we solved it

In this particular case, We I just wimped out.  We are I am storing the data as a varchar and not as a nvarchar.  That keeps the conversion from happening and the other future components from the process are tweaked to take this varchar string and treat the stuff in it as utf-8.  Probably the better solution would be not to rely on the linked server but rather a process that will take in the Voyager data and properly convert it to ucs-2 for storage in the nvarchar field.  We have some other applications that have as similar conversion process.  The workflow in this particular case is a short-term solution to a workflow that is destined for a longer-term revamp from the ground up.

Some tips

  • As mentioned for Oracle, you can do select rawtohex(field) from table ; to see the hex of the data stored in that field.
  • For SQL Server, select convert(varbinary(4000),field) from table; will do the same
  • To see the encoding of actual files, I use the command-line tool xxd in linux.
  • If you want to type a unicode character in Ubuntu/Gnome, type Control - Shift - U and type the codepoint
  • For Windows, hold the Alt key, press the + on the numeric keypad and enter the codepoint and release the Alt key.

Some further reading

 

 

Update - some minor fixes & changes

I realized I hadn't run spell-check one last time (still getting used to the spell-check on my blog software) and made some minor spelling corrections.  Also switched to using the royal we in my last section, fixed that.  Also made a workplace reference a little more generic.

 The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

PrintView Printer Friendly Version

EmailEmail Article to Friend

References (61)

References allow you to track sources for this article, as well as articles that were written in response to this article.
  • Response
    Response: or
    Thank you, I have recently been looking for information about this subject for a while and yours is the greatest I have came upon till now. But, what about the bottom line? Are you sure about the supply?
  • Response
    Response: on
    I have recently started a web site, the information you offer on this website has helped me greatly. Thank you for all of your time & work. "The inner fire is the most important thing mankind possesses." by Edith Sodergran.
  • Response
    Response: much
    Once I originally commented I clicked the -Notify me when new feedback are added- checkbox and now every time a comment is added I get four emails with the same comment. Is there any manner you possibly can remove me from that service? Thanks!
  • Response
    Response: hot
    I like this post, enjoyed this one regards for posting. "When you make a world tolerable for yourself, you make a world tolerable for others." by Anais Nin.
  • Response
    Response: great
    I like this web site very much, Its a rattling nice situation to read and find information. "You have to lead people gently toward what they already know is right." by Philip.
  • Response
    Response: were
    Howdy! I know this is kind of off topic but I was wondering which blog platform are you using for this site? I'm getting sick and tired of Wordpress because I've had problems with hackers and I'm looking at options for another platform. I would be great if you could point ...
  • Response
    Response: on
    Wonderful web site. A lot of useful information here. I am sending it to a few friends ans also sharing in delicious. And obviously, thanks for your effort!
  • Response
    Response: through
    Hi my friend! I want to say that this article is amazing, nice written and include almost all significant infos. I’d like to see more posts like this.
  • Response
    Response: can
    Thanks for your publication. One other thing is that often individual states in the United states of america have their very own laws in which affect householders, which makes it very difficult for the the legislature to come up with a brand new set of recommendations concerning home foreclosure on householders. ...
  • Response
    Response: three
    I gotta favorite this site it seems handy handy
  • Response
    Response: three
    Thanks for your information on this blog. 1 thing I would like to say is that purchasing electronics items over the Internet is not something new. In truth, in the past ten years alone, the market for online electronics has grown substantially. Today, you will find practically just about any electronic ...
  • Response
    Response: this
    I genuinely enjoy studying on this internet site , it contains wonderful articles . "Literature is the orchestration of platitudes." by Thornton.
  • Response
    Response: three
    Thanks for the suggestions you have provided here. Additionally, I believe there are several factors which really keep your car insurance premium all the way down. One is, to take into account buying cars that are from the good listing of car insurance businesses. Cars which can be expensive are more ...
  • Response
    Response: that
    Really wonderful info can be found on website . "Politics is applesauce." by Will Rogers.
  • Response
    Response: end
    Good post. I study something more difficult on totally different blogs everyday. It would all the time be stimulating to learn content material from other writers and observe just a little something from their store. I’d desire to make use of some with the content material on my weblog whether or ...
  • Response
    Response: they
    What I have seen in terms of laptop memory is that often there are technical specs such as SDRAM, DDR and so forth, that must go with the technical specs of the motherboard. If the pc's motherboard is rather current while there are no operating-system issues, improving the storage space literally ...
  • Response
    Response: or
    As I website possessor I believe the content material here is rattling magnificent , appreciate it for your hard work. You should keep it up forever! Good Luck.
  • Response
    Response: large
    I’m not sure where you're getting your info, but good topic. I needs to spend some time learning more or understanding more. Thanks for excellent info I was looking for this info for my mission.
  • Response
    Response: sentence
    whoah this blog is magnificent i love reading your articles. Keep up the great work! You know, a lot of people are looking around for this info, you can help them greatly.
  • Response
    Response: for
    I'd personally also like to express that most of those who find themselves without the need of health insurance are usually students, self-employed and people who are not working. More than half with the uninsured are under the age of 35. They do not feel they are looking for health insurance ...
  • Response
    Response: put
    certainly like your web-site but you have to check the spelling on quite a few of your posts. Many of them are rife with spelling issues and I to find it very troublesome to inform the truth then again I will certainly come again again.
  • Response
    Response: name
    It is my belief that mesothelioma can be the most deadly cancer. It has unusual traits. The more I look at it a lot more I am confident it does not conduct itself like a true solid tissue cancer. If perhaps mesothelioma is really a rogue viral infection, then there is ...
  • Response
    Response: large
    Terrific post but I was wondering if you could write a litte more on this subject? I'd be very grateful if you could elaborate a little bit more. Appreciate it!
  • Response
    Response: was
    I know of the fact that nowadays, more and more people are attracted to video cameras and the issue of images. However, as a photographer, you have to first shell out so much time period deciding the exact model of photographic camera to buy plus moving from store to store just ...
  • Response
    Response: does
    Wow, marvelous blog layout! How long have you been blogging for? you make blogging look easy. The overall look of your site is excellent, let alone the content!
  • Response
    Response: tell
    Hey! I'm at work browsing your blog from my new iphone 3gs! Just wanted to say I love reading your blog and look forward to all your posts! Carry on the excellent work!
  • Response
    Response: read
    Excellent website. Lots of useful info here. I'm sending it to some buddies ans also sharing in delicious. And of course, thanks for your sweat!
  • Response
    Response: by
    I've noticed that fixing credit activity really needs to be conducted with tactics. If not, you will probably find yourself destroying your rating. In order to reach your goals in fixing your credit score you have to confirm that from this moment in time you pay all your monthly costs promptly ...
  • Response
    Response: set
    I figured out more new things on this weight reduction issue. Just one issue is a good nutrition is very vital when dieting. A tremendous reduction in fast foods, sugary food items, fried foods, sweet foods, beef, and bright flour products could possibly be necessary. Keeping wastes parasites, and wastes may ...
  • Response
    Response: move
    What i don't realize is in truth how you're no longer really a lot more neatly-liked than you may be now. You're very intelligent. You already know therefore significantly in terms of this matter, produced me in my opinion consider it from a lot of varied angles. Its like men and ...
  • Response
    Thanks for expressing your ideas. The first thing is that learners have a solution between fed student loan plus a private student loan where its easier to go for student loan debt consolidation reduction than through the federal student loan.
  • Response
    Response: added input
    obviously like your website however you have to test the spelling on quite a few of your posts. A number of them are rife with spelling problems and I find it very bothersome to inform the reality nevertheless I will definitely come back again.
  • Response
    Response: similar story
    Hi there! This post couldn't be written any better! Reading through this post reminds me of my previous room mate! He always kept talking about this. I will forward this write-up to him. Pretty sure he will have a good read. Many thanks for sharing!
  • Response
    Response: added input
    Thanks for these tips. One thing I also believe is credit cards supplying a 0% apr often appeal to consumers in with zero interest rate, instant endorsement and easy on the web balance transfers, but beware of the top factor that will certainly void your current 0% easy street annual percentage ...
  • Response
    Response: similar story
    Simply a smiling visitant here to share the love (:, btw outstanding style and design .
  • Response
    Response: more reflection
    I’m not that much of a internet reader to be honest but your sites really nice, keep it up! I'll go ahead and bookmark your website to come back in the future. Many thanks
  • Response
    Response: deeper analysis
    Howdy just wanted to give you a quick heads up. The text in your post seem to be running off the screen in Internet explorer. I'm not sure if this is a format issue or something to do with web browser compatibility but I figured I'd post to let you know. ...
  • Response
    Response: more reflection
    Thanks for your post here. One thing I would really like to say is always that most professional job areas consider the Bachelor's Degree as the entry level standard for an online degree. Although Associate Certifications are a great way to get started, completing ones Bachelors reveals many entrance doors to ...
  • Response
    Response: more reflection
    Regards for helping out, good info .
  • Response
    Response: highlighted points
    I am no longer sure where you're getting your info, but good topic. I must spend some time finding out much more or figuring out more. Thank you for fantastic info I was searching for this info for my mission.
  • Response
    Response: in depth research
    I am really loving the theme/design of your blog. Do you ever run into any browser compatibility problems? A number of my blog readers have complained about my blog not working correctly in Explorer but looks great in Firefox. Do you have any ideas to help fix this issue?
  • Response
    Response: added input
    Hello would you mind sharing which blog platform you're using? I'm going to start my own blog in the near future but I'm having a hard time deciding between BlogEngine/Wordpress/B2evolution and Drupal. The reason I ask is because your design seems different then most blogs and I'm looking for something unique. ...
  • Response
    Response: in depth research
    Hello there! Do you know if they make any plugins to safeguard against hackers? I'm kinda paranoid about losing everything I've worked hard on. Any recommendations?
  • Response
    Response: more data
    Merely a smiling visitor here to share the love (:, btw outstanding layout.
  • Response
    Response: in depth research
    But wanna comment on few general things, The website style is perfect, the content material is rattling fantastic : D.
  • Response
    Response: similar story
    I was just looking for this info for a while. After 6 hours of continuous Googleing, at last I got it in your web site. I wonder what's the lack of Google strategy that don't rank this type of informative sites in top of the list. Usually the top sites are ...
  • Response
    Response: more data
    I am very happy to read this. This is the kind of manual that needs to be given and not the random misinformation that is at the other blogs. Appreciate your sharing this greatest doc.
  • Response
    Response: deeper analysis
    I delight in, result in I found just what I was taking a look for. You have ended my four day lengthy hunt! God Bless you man. Have a great day. Bye
  • Response
    Response: added input
    Another important component is that if you are a mature person, travel insurance intended for pensioners is something you ought to really think about. The mature you are, greater at risk you're for making something bad happen to you while overseas. If you are definitely not covered by quite a few ...
  • Response
    Response: in depth research
    Oh my goodness! an amazing article dude. Thanks Nevertheless I am experiencing concern with ur rss . Don’t know why Unable to subscribe to it. Is there anybody getting an identical rss problem? Anybody who knows kindly respond. Thnkx
  • Response
    Response: more data
    Only wanna comment on few general things, The website design and style is perfect, the content is rattling superb : D.
  • Response
    Response: added input
    Greetings from Florida! I'm bored to death at work so I decided to browse your blog on my iphone during lunch break. I love the info you present here and can't wait to take a look when I get home. I'm shocked at how fast your blog loaded on my cell ...
  • Response
    Response: deeper analysis
    An additional issue is video games are normally serious as the name indicated with the major focus on understanding rather than leisure. Although, it comes with an entertainment part to keep your children engaged, every single game is normally designed to focus on a specific experience or program, such as math ...
  • Response
    Response: different opinion
    My spouse and i were quite peaceful that Michael could finish off his survey while using the precious recommendations he had through your web site. It is now and again perplexing to simply be handing out hints which most people might have been making money from. Therefore we do know we've ...
  • Response
    Response: more reflection
    I genuinely enjoy reading through on this site, it has got great posts. "The living is a species of the dead and not a very attractive one." by Friedrich Wilhelm Nietzsche.
  • Response
    Response: in depth research
    I went over this internet site and I conceive you have a lot of good information, saved to fav (:.
  • Response
    Response: more data
    Howdy, i read your blog occasionally and i own a similar one and i was just curious if you get a lot of spam responses? If so how do you protect against it, any plugin or anything you can recommend? I get so much lately it's driving me insane so any ...
  • Response
    Response: added input
    Thank you, I have just been searching for information about this subject for ages and yours is the best I have discovered so far. But, what about the conclusion? Are you sure about the source?
  • Response
    Response: more reflection
    Nice read, I just passed this onto a friend who was doing a little research on that. And he actually bought me lunch because I found it for him smile Thus let me rephrase that: Thank you for lunch!
  • Response
    Response: more data
    Greetings! I've been reading your site for some time now and finally got the courage to go ahead and give you a shout out from Lubbock Tx! Just wanted to tell you keep up the great job!
  • Response
    Response: deeper analysis
    A further issue is that video gaming has become one of the all-time most important forms of entertainment for people of nearly every age. Kids enjoy video games, plus adults do, too. The particular XBox 360 is one of the favorite games systems for those who love to have hundreds of ...

Reader Comments

There are no comments for this journal entry. To create a new comment, use the form below.

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>