Saturday, Feb 25, 2012

People do start to resemble their pets; or, how I love Halloween.

So since the past few blogs have been rather technical I figured I'd switch gears and do a lighter post.  Not too long ago I noticed in Facebook that my profile's photo series demonstrates how closely Veronica, my dog,  and I have grown to look alike:

Ok, ok.  So I'm wearing makeup.  Still, striking, no?  The two photos on the left are from my Halloween costume this year.  I was under the weather and feeling rather miserable this year, but I figured if chickenpox didn't stop me from getting dressed for Halloween when I was a kid, no flu was going to either.  My chickenpox year I ended up going as a pink vampire due to the calamine lotion and had to stand at the curb while my mother begged treats on my behalf.  Not surprisingly, pretending to be undead while sick doesn't take too much acting ability. 

I did make someone scream at my one Halloween excursion this year, a bonfire with s'mores over at a friend's house.  It helped that I wasn't moving much when a teenage girl came near and said something along the lines of "Oh look at the clow.....IEEEEEEEEE."  The skull headed cane is pretty freaky all by itself.

The wonderful makeup job on my face was mostly done by Colleen.  It has been a long time since I've had contacts and so I can't see my own face with my glasses off. One of these years I need to try just putting a white base on my face and black around my sockets as I can see them.  The result would likely be closer to the latest Joker.

Colleen has a picture of the full outfit on flickr, minus my skull headed cane.

If you're wondering who/what I'm supposed to be, I can't really say. This costume isn't based on any particular character but rather a mix of imagery involving a fool with a skull mask.  I suspect at least part of the imagery comes from various images of the Danse Macabre.  For example, take the following thumbnail (found the image at the blog Artstor)....

 

The clown from Twisted Metal probably had some unconscious influence on my mental picture, and probably some tattoo/punk/counter-culture type art did as well.  I thought the skull & fool's cap motif was more common, but I can't seem to find too many examples.

If someone really pushed me for a character, I did have a fallback option of claiming Yorick, which was probably just as obscure.

Perhaps next year I'll alter the makeup scheme just a little and go as V.  Of course, then the question comes, can I make V look like me?  Given how little she has liked Halloween costumes years past I don't think that one is very likely to happen.

Saturday, Feb 4, 2012

Unicode fun 

Amusingly enough, not long after my last post, there was a BoingBoing post on the Pile Of Poo Unicode character.  This is actually one of the "symbol" characters.  I must admit I didn't know that the Pile of Poo character existed, but I am dubious that most fonts actually have a rendering of that character. 


Some of my favorite symbol characters that I've found in the past have been:

So, if you want to find more fun characters I'd recommend going to the Unicode Code Charts and exploring some of the different charts. Some good places to start are Miscellaneous Symbols and Miscellaneous Symbols and Pictographs.
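If you'd rather explore from a prompt than from the PDF charts, here is a small sketch using Python's standard unicodedata module.  Python is my choice for illustration here, and the little range I loop over is just an arbitrary slice of the Miscellaneous Symbols block.  It prints only names and codepoints, so it works even in a terminal whose font can't render the characters.

```python
import unicodedata

# The Pile of Poo character, written as an escape so no special font is needed.
poo = "\U0001F4A9"
print(f"U+{ord(poo):04X}", unicodedata.name(poo))

# Look a character up by its official name instead of its codepoint.
snowman = unicodedata.lookup("SNOWMAN")
print(f"U+{ord(snowman):04X}", unicodedata.name(snowman))

# Walk the first few codepoints of the Miscellaneous Symbols block (starts at U+2600).
for codepoint in range(0x2600, 0x2608):
    name = unicodedata.name(chr(codepoint), "<unnamed>")
    print(f"U+{codepoint:04X}", name)
```

Which names come back depends on the version of the Unicode database bundled with your Python, so newer characters may show up as unnamed on older installs.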

 

 

Tuesday, Jan 31, 2012

One byte, two bytes - A journey into wtf-8 encoding.

Every profession has its war stories.  Those challenges that separate the newcomers from the grizzled veterans.  In the library world those of us who have to tangle deep in the weeds close to the bare metal to get at our data all seem to have a story about what is colloquially known as "wtf-8".

I'm not sure who coined the term "wtf-8", aside from that I'm pretty sure I heard it one fateful day in the #code4lib freenode channel.  The term is a cunning play on the messed up, crazy mish-mashes of encoding standards we see in the library world.  It is usually applied to some mangled mess of characters that got mortally wounded in the conversion from marc-8 to utf-8. 

Some encoding basics

For those of you who follow this blog out of the hopes of seeing pictures of my fuzzy dog and for some reason are still reading, here's a crash course on character sets and encodings.  As with most crash courses and simplifications, it's not entirely true. For the rest of you, feel free to skip this section.

Data on a computer is stored as a series of 0s and 1s.  In order to store and represent the text you're reading, the computer associates certain binary numbers with certain characters.  Since binary numbers are really, really long, programmers often write those numbers in hex (digits 0-f) when showing the mapping to characters.  So the letter e is stored in ASCII as 0x65 (0x indicates a hex number).   There's a table on the ASCII Wikipedia article.

48656C6C6F21 is the hex for the characters  Hello!
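If you want to check that mapping yourself, here is a quick sketch in Python (my pick for illustration, nothing in the workflow below depends on it).  It turns the hex string into bytes, reads the bytes as ASCII, and then goes back the other way.

```python
# Turn the hex digits into raw bytes, then interpret those bytes as ASCII text.
hex_string = "48656C6C6F21"
raw = bytes.fromhex(hex_string)
text = raw.decode("ascii")
print(text)

# Going the other way: each ASCII character becomes one byte, i.e. two hex digits.
hex_again = text.encode("ascii").hex().upper()
print(hex_again)
```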

Now, ASCII is one of the great granddaddies of encoding formats.  Before it, almost every computer would have a different mapping of numbers to different characters.  ASCII became one of the well known standards and eventually the dominant one.

However,  ASCII only maps out English characters, making it a problem in records that need to be in other languages.  That's why libraries developed a character set standard, marc-8 that had more of an international focus way back in the early days of computing. 

The rest of the computer world used a variety of approaches, frequently with a proliferation of mappings for every language.  However, in the past twenty years or so there has been a push to establish an international standard, Unicode.  Unicode actually makes the mapping a little abstract, having "codepoints" and then a family of encodings, including utf-8 & utf-16, that use different numbers for those codepoints.
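To make the codepoint versus encoding split concrete, here is a small Python sketch (again, just my illustration language).  One character has one codepoint, but the bytes that end up on disk depend on which encoding you pick.

```python
# One character, written as an escape so the example does not depend on file encoding.
ch = "\u00f6"  # o with an umlaut

# The codepoint is the abstract Unicode number for the character.
codepoint = ord(ch)
print(f"codepoint: U+{codepoint:04X}")

# The encodings are different byte-level spellings of that same codepoint.
utf8_hex = ch.encode("utf-8").hex().upper()
utf16le_hex = ch.encode("utf-16-le").hex().upper()
utf16be_hex = ch.encode("utf-16-be").hex().upper()
print("utf-8:    ", utf8_hex)
print("utf-16-le:", utf16le_hex)
print("utf-16-be:", utf16be_hex)
```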

The marc-8 encoding allows for mixing in one record English and one other language whereas the Unicode approach is more flexible, allowing for mixing characters from all sorts of languages in one document.  The joke runs that there's enough space in the mapping reserved for when we contact an alien race.  A bonus is that Unicode is certainly better supported in general software than marc-8.  A lot of library software and records have been migrated away from marc-8 to one of the Unicode encodings.

 

How some of our data went from utf-8 to wtf-8

A few years ago my workplace converted from marc-8 to utf-8.  The records in today's story weathered the transition well.  The problem is we have a workflow that has some data that mostly lives in SQL Server with an Access front-end.  However, occasionally people use a form in the Access database that calls a stored procedure in SQL Server, which in turn does a query against the Oracle database via a Linked Server and use of OPENQUERY.

However, Ex Libris (Endeavor at the time) stores the information in the database marked as what they call AMERICAN_AMERICAN.US7ASCII.  Now, I think this is the default for Oracle, but it may also have been a conscious decision, I'm not sure.  But I do know to get it to work, even with the data now in utf8, you still use US7ASCII in configurations. The string that seems to be stored in the Oracle database for titles seems to be utf-8 though.  (I peeked at the raw hex of the characters by using the RAWTOHEX option).  Oracle apparently lets you stuff anything into US7ASCII.

SQL Server can store string data in two main ways, one meant for Unicode and the other for non-Unicode character sets.  We first created this database with a set of already existing Unicode records and used the Unicode type.  Then we started running into complaints of "strange characters and boxes", a tell-tale sign that either the font can't render the character or that there's really something messed up with the character.  Sadly, a little testing proved the latter.

Here's what happens.  SQL Server seems to, maybe based off the US7ASCII, attempt to convert the string from ASCII to UCS-2, which is similar but not identical to utf-16.  UCS-2 was implemented in Microsoft products off of an early draft of what would become UTF-16. Forgive me if I accidentally call UCS-2 UTF-16 once or twice in this post. UCS-2 stores each character as two bytes (or 4 hex digits) of information where ASCII only stores it as one byte.  However, in a bit of cleverness, the designers of utf-16 had it so that the mapping of the English characters is the same as that of ASCII, just with 00 added.  So ...

Hello in ASCII and UCS-2, with spaces for readability

            ASCII             UCS-2
Characters  H  e  l  l  o     H    e    l    l    o
Hex         48 65 6C 6C 6F    4800 6500 6C00 6C00 6F00

You can see the "raw hex" in SQL Server by converting a varchar or nvarchar to varbinary(), like select convert(varbinary(4000),titles) from bibinfo ;.
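If you don't have a SQL Server handy, the same comparison can be sketched in Python.  One assumption to flag: I'm using Python's utf-16-le codec as a stand-in for UCS-2, which gives the same bytes for any character in this example.  The spaced_hex helper is just something I made up for readable output.

```python
def spaced_hex(data: bytes, group: int) -> str:
    """Return upper-case hex with a space after every `group` bytes."""
    chunks = [data[i:i + group] for i in range(0, len(data), group)]
    return " ".join(chunk.hex().upper() for chunk in chunks)

word = "Hello"

# One byte per character in ASCII.
ascii_hex = spaced_hex(word.encode("ascii"), 1)
# Two bytes per character in UCS-2 (utf-16-le used as a stand-in).
ucs2_hex = spaced_hex(word.encode("utf-16-le"), 2)

print("ASCII:", ascii_hex)
print("UCS-2:", ucs2_hex)
```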

The Problem

The Voyager data that is getting converted like this isn't ASCII, but utf-8.  Utf-8 is a very clever encoding.  In ASCII, you always know that a character is going to be two hex digits, or one byte.  In UCS-2 a character is always four hex digits, or two bytes.  This means the same information stored in a file encoded as UCS-2 rather than ASCII will always be twice as large.  Also, many but not all Unicode characters can actually be encoded in UCS-2 (UTF-16 reaches the rest by spending four bytes on them).

Utf-8 gets around these two problems.  Utf-8 is a variable-length encoding where each character might be one to four bytes long.  You know how many bytes are going to make a character up by the leading byte.  If the byte is 7F or smaller in hex, the computer knows the character to look up will be in the one byte table.  That table is pretty much equivalent to the ASCII table.

So in other words, if you just use the ASCII characters, a utf-8 file and the ASCII file are the same.  Score!  However, the problem is this trick means that for the non-ASCII characters, the utf-8 hex and the UCS-2 hex are not the same.  Also, if your text is mostly characters that take three bytes in utf-8 (much of Chinese, Japanese and Korean, for example), you end up with a larger file than using UCS-2 or UTF-16, where those same characters take two.
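Here is a rough sketch of that leading-byte trick in Python.  The function name is my own invention, and the ranges are the ones for well-formed utf-8 (a byte from 80 through C1, or above F4, can't start a character at all).  It only looks at the first byte and doesn't validate the bytes that follow.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a utf-8 character occupies, judged by its first byte."""
    if lead_byte <= 0x7F:
        return 1  # the one byte table, same as ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"0x{lead_byte:02X} cannot start a utf-8 character")

# e, o with umlaut, the euro sign, and Pile of Poo, written as escapes.
samples = ["e", "\u00f6", "\u20ac", "\U0001F4A9"]
lengths = [utf8_sequence_length(ch.encode("utf-8")[0]) for ch in samples]
print(lengths)
```

The euro sign is a handy example of the size tradeoff: three bytes in utf-8, but only two in UCS-2.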

So let's look at a "Heavy Metal" hello: Hellö.  That's got an o with an umlaut that heavy metal bands seem to like so much. Imagine it was a title we got from Voyager, like Hellö: The Mick Shrimpton Story. Here's how that Hellö should be in utf-8 and UCS-2

Hellö in UTF-8 and UCS-2, with spaces for readability

            UTF-8             UCS-2
Characters  H  e  l  l  ö     H    e    l    l    ö
Hex         48 65 6C 6C C3B6  4800 6500 6C00 6C00 F600

However, because SQL Server just treats the whole string like it is ASCII and doesn't warn when there's a non-ASCII character, what actually will get stored is...

4800 6500 6C00 6C00 C300 B600

And anything accessing this will think that there are now two characters, C300 & B600, where there should be one.  In this case both happen to be valid UCS-2 characters (Ã and ¶), so the title shows up as HellÃ¶.  Other byte pairs may or may not map to valid UCS-2 characters.
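You can reproduce the damage without any database at all.  The sketch below is my model of what happens, not SQL Server's actual code: I'm assuming each utf-8 byte gets widened to its own two-byte character, which Python's latin-1 codec mimics because it maps byte N straight to codepoint N.  (A real server would go through its configured code page, which can differ from latin-1 for bytes 80 through 9F.)

```python
# The title as it should be, with the umlaut written as an escape.
title = "Hell\u00f6"

# What the Oracle side hands over: utf-8 bytes labelled as if they were ASCII.
utf8_bytes = title.encode("utf-8")

# Assumed conversion: every byte becomes its own character (latin-1 does exactly that).
mangled = utf8_bytes.decode("latin-1")

# Then each of those characters is stored as two bytes.
stored = mangled.encode("utf-16-le")

print("utf-8 bytes:", utf8_bytes.hex().upper())
print("stored as:  ", stored.hex().upper())
print("characters: ", len(title), "became", len(mangled))
```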

How we solved it

In this particular case, I just wimped out.  I am storing the data as a varchar and not as an nvarchar.  That keeps the conversion from happening, and the other future components of the process are tweaked to take this varchar string and treat the stuff in it as utf-8.  Probably the better solution would be not to rely on the linked server but rather a process that will take in the Voyager data and properly convert it to UCS-2 for storage in the nvarchar field.  We have some other applications that have a similar conversion process.  The workflow in this particular case is a short-term solution to a workflow that is destined for a longer-term revamp from the ground up.
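For what it's worth, here is a sketch of what the "treat the stuff in it as utf-8" step and an after-the-fact repair could look like in Python.  Both function names are hypothetical, and the repair assumes the damage followed the one-byte-per-character pattern described above.  This is not the code from our actual workflow.

```python
def decode_voyager_bytes(raw: bytes) -> str:
    """Interpret bytes pulled from the varchar column as utf-8 text."""
    return raw.decode("utf-8")

def repair_wtf8(mangled: str) -> str:
    """Undo a one-byte-per-character mis-conversion; return the input if it doesn't fit."""
    try:
        # Squeeze each character back into the single byte it came from,
        # then read the resulting bytes as the utf-8 they always were.
        return mangled.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not the pattern we know how to fix, so leave the text alone.
        return mangled

fixed = repair_wtf8("Hell\u00c3\u00b6")      # the mangled six-character title
untouched = repair_wtf8("Hell\u00f6")        # already correct, left as-is
from_column = decode_voyager_bytes(bytes.fromhex("48656C6CC3B6"))
print(fixed == untouched == from_column)
```

The repair is a heuristic: a string that legitimately contains something like Ã followed by ¶ would be "fixed" too, so it is safer to run it only on rows known to have come through the bad conversion.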

Some tips

  • As mentioned for Oracle, you can do select rawtohex(field) from table ; to see the hex of the data stored in that field.
  • For SQL Server, select convert(varbinary(4000),field) from table; will do the same
  • To see the encoding of actual files, I use the command-line tool xxd in linux.
  • If you want to type a unicode character in Ubuntu/Gnome, type Control - Shift - U and type the codepoint
  • For Windows, hold the Alt key, press the + on the numeric keypad and enter the codepoint and release the Alt key.
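If you're somewhere without xxd, a bare-bones stand-in is easy to sketch in Python.  The hex_dump function below is my own toy, and its layout only loosely resembles xxd's output.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a simple offset / hex / printable-text dump of `data`."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}: {hex_part.ljust(width * 3 - 1)}  {text_part}")
    return "\n".join(lines)

# Dump the utf-8 bytes of the heavy metal hello.
dump = hex_dump("Hell\u00f6".encode("utf-8"))
print(dump)
```

To look at a real file, read it in binary mode first, for example with open(path, "rb"), and pass the bytes to the function.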

Some further reading

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Update - some minor fixes & changes

I realized I hadn't run spell-check one last time (still getting used to the spell-check on my blog software) and made some minor spelling corrections.  Also switched to using the royal we in my last section, fixed that.  Also made a workplace reference a little more generic.

Saturday, Jan 14, 2012

Info Addictions: Podcasts

 

Lately I've been listening to a lot of podcasts.  Indeed, listening to podcasts is more natural to me than watching tv.  I can move around, do cleaning, walk the dog, and slowly drive Colleen nuts as I wander around laughing at comments she can't hear.  That's a nice change of pace from hearing her laughing at things I can't hear.  So below are a few of my favorites.

If you are brave, download the blogroll and add all of them to your favorite podcatcher.  Personally, I have been using BeyondPod for a while now, although I've also heard good things about DoggCatcher.

For the obsessive-compulsive, these aren't in any particular order.  Sorry.

Tech

I'm a geek and I love learning about technology.  So a fair amount of my podcast listening involves technology.

This Week in Tech (TWiT)

Usually headed by Leo Laporte of The Screen Savers fame, this is a weekly discussion show about recent events in the sphere of technology.  Informative, but occasionally the humor dominates the show. To me it's most interesting when guests like Jerry Pournelle offer both a historical analysis of the technology and commentary on the current state of the industry.

FLOSS Weekly

FLOSS as in Free, Libre, and Open Source Software, not dental floss.

I'm a great believer in open source software and this show is an excellent way to find out about both established and brand-new projects.  The only issue is that my "want-to-do" list has exploded since I started listening to this show.  FLOSS has even interviewed some Code4Libbers on the open source Evergreen ILS.   They have a video cast as well, but I almost always listen to the audio version.

Linux Outlaws

Dan & Fab can be a bit crude and explosive at times.  Ok, that's mainly Fab.  But both are genuine and engaging as hosts.  Their love of open-source and linux comes through clearly.  This show covers a lot of tech news besides just Linux and open source stuff.  It's a good complement to TWiT.

Security Now!

Hosted by Steve Gibson, usually with Leo Laporte as the co-host.  Covers weekly security news, but also delves into a large range of topics that are associated with computer security, such as how the Internet works and identity systems.  Very useful podcast to listen to, even if you're not a computer expert.  After a while, you might just find yourself becoming one.

Tech News Today

This show I'm a little hesitant to recommend just for the fact it is a daily show and runs about an hour. If you're a hardcore geek and want to know a lot of detail about technology, particularly the business side, it's worth at least listening to a few episodes a week.

Music

Rathole Radio

Dan from the above-mentioned Linux Outlaws plays a variety of music in this podcast, mostly from folks releasing songs with licenses that allow sharing and downloading.  There's a lot of cool stuff out there.  It's a bit of an eclectic mix though, so some episodes might be absolutely excellent while the next I might end up feeling neutral towards.

PC Podcast

This is a music podcast that I just started listening to, but the best of 2011 episodes were so good I added it to this list.  I'm looking forward to catching up on more of them.  Again, a mix of music from artists you may never have heard of, but should.

 

Board Games

The Dice Tower

The Dice Tower is one of the longest-running and most respected podcasts about card & board games.  The hosts are Tom Vasel & Eric Summerer, but the show features segments from a variety of contributors that cover topics such as reviews, history of gaming, and the science behind games.

The Little Metal Dog

I've only been listening for a few months, but I consider this one of the best gaming podcasts and something every hardcore game buff should listen to at least once.  While most other game podcasts focus on talking about particular games, Michael Fox interviews a huge variety of people involved in the game industry.  He has interviews with designers like Peter Olotka, Richard Garfield, Martin Wallace and many others.  He doesn't just do interviews with designers but finds other interesting people involved with games such as Rich Sommer of Mad Men fame.

Game On! with Cody & John

An entertaining podcast "for the common gamer" by Cody & John.  They cover a variety of types of games and do so quite well.  Cody & John are enthusiastic and engaging hosts and it gives a very personal feeling to the podcast.

The Spiel

This is hosted by Stephen Conway and David Coleson.  This one can run a little long, which might be the only weakness.  I tend to sometimes skip over some of the detailed rules explanation, although it can really help me understand a game.  By far my favorite part of the show is when they explain some of the history surrounding a game theme or the history of the game itself.

Ludology

This show features one of my favorite contributors to the Dice Tower, Geoff Engelstein.  This show has been a bit hit or miss for me so far.  The episodes that feel more like Geoff's segment on the Dice Tower, "GameTek", are really fascinating.  One such recent episode, Episode 18, had an interview with a scientist who is sponsoring an AI design competition around simulating learning strategies.   However, I end up zoning out during the discussion about terminology in gaming and the more general discussions between Geoff and Ryan Sturm.

Just For Fun

Basic Brewing

This is the podcast on this list that I've been listening to the longest.  It revolves around homebrewing and beer.  James Spencer is an excellent host.  If you are interested in homebrewing, I'd definitely recommend checking out some of the early episodes, particularly some of the ones on sanitation and cleaning.  I find myself skipping episodes from the past year a little more frequently, but it's still a pretty solid show.

Radiolab

This is an actual radio show that gets published as a podcast. It is a bit educational and a bit wacky.  Lots of interesting topics and people, although the intro sets Colleen's teeth on edge.   The hosts explore a variety of topics with professional sound editing.  It's difficult to convey how the sound editing interacts with the topics, you just need to listen to the show.

In Our Time

A BBC podcast that brings in panels of experts to talk on various historical topics and aspects of history.  There's a broad range of always interesting topics.

The Splendid Table

Another radio show, an NPR regular that gets re-published for downloads.  There are interviews with cookbook authors, reviews of the best road food, and call-in segments asking questions about cooking.  It usually manages to make me hungry and inspires me to cook more.

FSL Tonight

If you are a science-fiction or fantasy fan and love puns, wordplay, and silliness, this is a hysterical podcast. It is an actual fantasy sports league where all the members are characters from some of your favorite novels, tv shows, and movies.  The Mordor Crows currently loom high over the Canton Jaynes, but who knows what weakness could unmake the dark shadows that pour from Mordor.  Perhaps even the All-Seeing eye can be distracted at a crucial time.

Tales from the Liberry"CAST"

I'm a little torn about posting this one.  The shows are based on one of my favorite blogs, Tales from the Liberry.  The narrator, Juice, is the original writer of Tales from the Liberry.  The podcast is always good for a laugh.  (In fact, be careful listening while driving).

There are occasional sound-editing issues. Also, I've been meaning to make this post for a while and recently Juice started a contest to give away mugs to people who direct traffic to his site.  I'm pretty sure I would have posted this without any such bribery, but feel compelled to mention it.   This is one I'd recommend listening to from the beginning.

 

Final Comments

If it seems like I'm listening to a lot, yeah, I am.  I listen to a lot of podcasts at 1.5 or 2 times the normal play speed.  It was pretty easy to get used to, but perhaps that is just my high school experience of debate serving me well.  I doubt I'd be able to listen to as many if I didn't speed it up.

So what podcasts are other people listening to? 

 

Thursday, Jan 12, 2012

Obligatory First Post

Hi there.  Not going to spend too much time, but obviously you've stumbled across my site one way or the other.  If you want to know more about me, wander over to my about me page.  You should be able to see some links to the various social networks I'm involved with over on the right.

Expect to see some posts about technologies, libraries, and the occasional personal note like how wonderful my wife is or something cute my dog has done. You know, like stop for a belly rub in the middle of an agility run.

I had a blog for a while over at http://codexmonkey.blogspot.com/.  I thought about importing those posts into this blog, but I decided since I hadn't posted in over a year and a half I would do a fresh start.  I might import certain posts from time to time in here as a look back at the classics.
