I don't tend to follow a lot of video podcasts, mostly because I don't watch TV as regularly as I listen to podcasts.  I can listen to audio while waiting for the bus or walking to lunch, but I'm never sure when I'm going to watch TV.  Getting a Roku has helped, but I still find myself more often going on watch marathons of a TV show rather than watching the tech channels.

Still, I've been watching more and more of Revision3. I first started with Ben Heck's show, but now I've been watching Hak5 and also their quick tips, called HakTip.  (I'll probably try to catch some episodes of Scam School soon as well, but that's a bit of an aside.)

If you're a fan of Security Now over at the TWiT network, I can highly recommend watching some of the Hak5 episodes.  There's great info on security.  I've been watching some of their old episodes on encryption techniques and the inner workings of SSL.

There's also an excellent amount of material on Linux.  The episodes are a really welcome change from some of the more boring tech video podcasts consisting of one guy talking into a camera with some screen-capture software.

Currently I'm working through the Linux 101 HakTip episodes.  Mostly it's material I already know, but they're pretty short and a useful reminder of some things I'd forgotten, with just enough new stuff to keep watching episodes. If you've always wanted to get better on the command line, they're a great place to start.



Author: Jon Gorman

Just a quick note.  It seems there are still a lot of folks out there who haven't heard of Ed Summers' excellent little tool, http://jobs.code4lib.org.  It's a semi-automated harvesting system that's pulling in a ton of job postings related to libraries.  The @code4lib twitter posts that start with Job: are also supplied by this nifty little system.  And like many code4lib projects, it's open source and can be found as shortimer on GitHub.

I know Ed was hoping it might even lead to part-time and contract work, where someone might post a short-term need and a programmer with a lot of library knowledge could fill it.  Say, adapting Evergreen so it interacts with some digital library software that it doesn't currently work with.  I haven't seen many of those yet, but hopefully we'll see them become more common. It would make earning a little extra money consulting far easier.

Author: Jon Gorman

Those following me in my various social circles are probably already sick of hearing about this, but unglue.it launched yesterday.  Unglue.it, started by Eric Hellman (aka @GlueJar) and other folks with connections to Code4Lib, is an effort to release copyrighted books to the world.  It works with rights-holders, Kickstarter-style, to raise enough money to license an ebook under a Creative Commons non-commercial license.  It's a way of "front-loading" profits so the author can be compensated for their work, but the world gets access.

They also have a mechanism for adding books to a "wishlist", which gives them an indication of which works people want and which rights-holders they should track down.

This is a brilliant way to deal with some of those really important and hard-to-find out-of-print books.  For example, I've wishlisted a very good biography of Philip K. Dick by Greg Rickmann called To the High Tower.  It's a work that I stumbled across and that's sought out by a small circle of Philip K. Dick fans.

It's not clear if there's enough demand for another printing, but unglue.it offers a chance that the book could be made easily available again while also giving the author profit he's otherwise not going to see from this long-since-sold-out book.  I've had the good fortune to read it because I work at a major library with access to a ton of books, but I know many a sci-fi fan who doesn't have the resources I do.

I also must admit that I'm interested to see if this model works.  I've thought about trying to run a small hobby side business making value-added public domain works, and perhaps doing something similar to unglue.it with near-orphaned copyright works.  However, tracking down rights-holders has proved troublesome enough that it's remained in my large bucket o' ideas I'd like to do someday.  I'm hoping unglue.it takes off enough that it'll create an infrastructure that makes it easier to do projects of this nature.

Here's what I'm pledging to:


Author: Jon Gorman


There's an excellent series of posts with some Solr hacking over at Robot Librarian by Bill Dueber.  If you're at all interested in how systems like VuFind and Blacklight search our records, it's worth a read. The series inspired me to get off my duff and write about a useful set of tools, YAZ, that not enough people seem to know about.

Anyone dealing with the cataloging side of librarianship will at some point have a pile of records that needs conversion.  It might be MARC-8 records that need to be converted to UTF-8, or perhaps a pile of MARCXML records that need to be converted to MARC.

I've seen people try to use MarcEdit or the Perl MARC::Record libraries to solve these problems. MarcEdit is a wonderful tool, but it's difficult to automate.  Using the Perl libraries can take a while and there's a risk of bugs, particularly with complicated issues like character sets.  Many of these simple tasks can be handled deftly by YAZ. 

YAZ is centered around the Z39.50 protocol for searching and retrieving metadata records.  The library offers programmers a lot of hooks for working with a Z39.50 server, or even setting up their own.  However, the YAZ packages also offer a set of command-line tools for working with MARC records. (If you're curious about the Z39.50 tool, Appendix I in my article Respect My Authority has an example.)

Don't let the fact that these YAZ tools are command-line scare you away.  There are two strengths of the command-line we want to take advantage of here:

  • it's very flexible in specifying which files should be modified
  • it's very easy to automate


Play along and get some records

The Internet Archive has an entire section devoted to records, Open Library Data.  For example, you can download some MARC records from the San Francisco Public Library.  I decided to download the file SanFranPL12.out pretty much at random.  One word of warning: most of these collections are rather large and so might take some time to download.

The next few sections require you to have a terminal open if you want to follow along.  If you don't know what a terminal is, jump down to "Getting to the command-line" at the bottom of this post. You'll also need to follow the YAZ install instructions.  If you're a Linux user, I'd recommend compiling from source or installing the libyaz and yaz packages from IndexData.  Most Linux distributions seem to have an ancient version of the program in their package repositories.

I downloaded the file SanFranPL12.out to ~/blog/yaz_examples and typed cd ~/blog/yaz_examples. (The ~ is a shortcut for the home directory on most Linux/Unix systems.)


Quickly viewing records

Typing yaz-marcdump SanFranPL12.out | more gives a readable version of the file you can page through by hitting the space bar.  You can quit by hitting q or control-c.  By default, yaz-marcdump converts MARC records into a MARC Breaker-style format.  The | more sends the results of the conversion to the "more" program for paging.  (Normally I'd use the pager less, which has more features, but Windows systems don't usually have less installed.)

The results look something like...

02070ccm  2200433Ia 4500
001 ocmocm53093624
003 OCoLC
005 20040301153445.0
008 030926s2003    wiumcz         n    zxx d
020    $a 0634056603 (pbk.)
028 32 $a HL00313227 $b Hal Leonard
040    $a OCO $c OCO $d ORU $d OCoLC $d UtOrBLW
048    $a ka01
049    $a SFRA
050    $a M33.5.L569 $b K49 2003
092    $f SCORE $a 786.4 $b L779a
100 1  $a Lloyd Webber, Andrew, $d 1948-
240 10 $a Musicals. $k Selections; $o arr.
245 10 $a Andrew Lloyd Webber : $b [18 contemporary theatre classics] / $c [arranged by Phillip Keveren].
260    $a Milwaukee, WI : $b Hal Leonard, $c [2003?]
300    $a 64 p. of music ; $c 31 cm.

The first line is the leader and the rest of the lines are parts of the first MARC record in the set, since this is one file composed of multiple MARC records.

Position 9 in the leader seems blank for all the records I randomly sampled, which means they're encoded in MARC-8.

Converting from MARC-8 to UTF-8 with yaz-marcdump

(Quick reminder if you're following along and tried the above, hit q or control-c to exit more)

Converting a file to UTF-8 is pretty easy; just type the following:

yaz-marcdump -f marc-8 -t utf-8 -o marc -l 9=97 SanFranPL12.out > SanFranPL12_utf8.mrc

Let's break down the various options

  • -f marc-8: The input is MARC-8.
  • -t utf-8:     The output should be UTF-8.
  • -o marc:    The output should be in MARC. (Other commonly used options include line-format and marcxml.)
  • -l 9=97:    Position 9 of the leader should be set to a. (97 is the decimal character code for a.)
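If you ever need the decimal code for a different leader value, any language's character functions will tell you. A quick Python check (my illustration, not part of the YAZ toolchain):

```python
# Decimal character codes for values you might want in leader position 9.
print(ord('a'))  # 97 -> 'a' in leader 09 marks the record as Unicode
print(ord(' '))  # 32 -> a blank in leader 09 marks the record as MARC-8
```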

Now try doing yaz-marcdump SanFranPL12_utf8.mrc | more, and you'll see the character 'a' in leader position 09.   There's also an argument -i where you can supply the input format, but this defaults to marc. The documentation says you can use a character like -l 9=a instead of the decimal character code, but I've never gotten that to work.

Converting from MARC to MARCXML with yaz-marcdump

Converting to MARCXML is just a matter of changing the output format from -o marc to -o marcxml:

yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 SanFranPL12.out > SanFranPL12_utf8.xml

This can be really, really handy, as there are many processes that can manipulate MARCXML but can't touch binary MARC.

Debugging a record

Some systems do not give a very detailed error message when they reject a MARC record. This is where yaz-marcdump's verbose mode can come in useful.  I've taken one of the MARC records from the SanFranPL12.out file and inserted some characters without adjusting the directory in the record, which will cause errors in some systems.

The record is for "Reality Check!", Volume 2, and I added "CodexMonkey was here!" to the beginning of the 245 subfield $a. This causes the information in the leader and directory to be wrong for this record. If you download the bad record and run yaz-marcdump bad_record_mod.mrc | more, you'll get warnings about separators being in unexpected places. You can download the unmodified record and run yaz-marcdump bad_record_orig.mrc | more, and you'll notice you won't get the warnings.

Adding the command option -v produces verbose output that shows how the file is being parsed by the yaz-marcdump program.  This generates a lot of information, but it can be really useful if you want to understand how programs parse MARC records. Let's look at some snippets from yaz-marcdump -v bad_record_mod.mrc | more and yaz-marcdump -v bad_record_orig.mrc | more.

(Directory offset 132: Tag 092, length 0018, starting 00232)
(Directory offset 144: Tag 100, length 0019, starting 00250)
(Directory offset 156: Tag 245, length 0097, starting 00269)
(Directory offset 168: Tag 260, length 0037, starting 00366)

This occurs early in the output, when yaz-marcdump is actually parsing through the directory, the part of a MARC record that describes how long each variable field will be. Any parser will expect tag 245 to be 97 bytes long, but I added a bunch more bytes by just typing them in via the vi editor.

Let's first look at the non-modified record when it gets to the 245 tag.

(Tag: 245. Directory offset 156: data-length 97, data-offset 269)
245 10 $a Reality Check! $n Volume 2 / $c by Rikki Simons ; & [illustrations by] Tavisha Wolfgarth-Simons.
(subfield: 61 52 65 61 6C 69 74 79 20 43 68 65 63 6B 21)
(subfield: 6E 56 6F 6C 75 6D 65 20 32 20 2F)
(subfield: 63 62 79 20 52 69 6B 6B 69 20 53 69 6D 6F 6E 73 ..)
(Tag: 260. Directory offset 168: data-length 37, data-offset 366)
260    $a Los Angeles : $b Tokyopop, $c c2003.

It got to the 245 tag and pulled out the 97 characters that comprise the field.  You'll notice the parser is breaking the field into its subfields.  The hex numbers are the characters in the subfield, including the subfield code.  (61 52 65 61 6C = aReal)

Now a look at the one that's been modified:

(Tag: 245. Directory offset 156: data-length 97, data-offset 269)
245 10 $a CodexMonkey was here! Reality Check! $n Volume 2 / $c by Rikki Simons ; & [illustrations by] Tav
(subfield: 61 43 6F 64 65 78 4D 6F 6E 6B 65 79 20 77 61 73 ..)
(subfield: 6E 56 6F 6C 75 6D 65 20 32 20 2F)
(subfield: 63 62 79 20 52 69 6B 6B 69 20 53 69 6D 6F 6E 73 ..)
(No separator at end of field length=97)
(Tag: 260. Directory offset 168: data-length 37, data-offset 366)
260 sh $  Wolfgarth-Simons.

The parser gets to field 245.  After all the subfields have been parsed, yaz-marcdump complains that it could not find the separator that should be there after 97 bytes to indicate the field actually ended.  This ends up messing up the following 260 and each field after it. In this case the parser can't be sure if the directory is off or the separator character just happens to be missing.

Splitting a MARC file into several MARC files

The yaz-marcdump tool also has some features that can make dealing with MARC records easier.  I ran into a case recently where a process couldn't handle the very large XML file produced by converting an entire file of MARC records into one giant XML document.

Thankfully, the yaz-marcdump tool provides the ability to split an input file into several output files, also called chunking by software geek types.  Unfortunately it only seems able to do this with an input type of marc.  So let's say I decided I wanted to split the original file into more manageably sized files, each with only 10,000 records, and convert those to XML.

Splitting is easy, but doing some of the other steps requires some advanced command-line foo that does not work on Windows. I'll need to do this in a couple of steps. ($ is the prompt; don't type it.  I'm just using it to make clear where new commands start.)

$ mkdir split_files
$ cd split_files
$ yaz-marcdump -s sfpl -C 10000 ../SanFranPL12.out > /dev/null
$ find . -name 'sfpl*' | xargs -n 1 -I{} sh -c 'yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 {} > {}.xml'
$ mkdir ../xml
$ mv *xml ../xml
$ cd ..

Now if you do ls -1 xml/* you should see something like...


Let's break down the command yaz-marcdump -s sfpl -C 10000 ../SanFranPL12.out > /dev/null

  • -s sfpl:  This tells yaz-marcdump to split the input into multiple files and what prefix to use for each generated file.
  • -C 10000:  This is the number of records per file.  It defaults to one.  Also notice that it is an upper-case C, not c.   Case matters.
  • ../SanFranPL12.out:  Since we're down in the split_files directory, we need to tell the tool that the SanFranPL12.out is located in the parent directory
  • > /dev/null: For some reason, this program will still print the records to the terminal, even though it's also writing them to the files.  This redirects that output to /dev/null, essentially a file that never retains any data.  You can also use the command-line option -n to suppress output, but then you'll still get some output as yaz tries to correct issues it sees with various records.

The really complicated line after that, find . -name 'sfpl*' | xargs -n 1 -I{} sh -c 'yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 {} > {}.xml', finds all the files with the prefix and hands them to a program called xargs, which calls the yaz-marcdump command to do a conversion on each file.  It's the same as doing...

yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000000 > sfpl0000000.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000001 > sfpl0000001.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000002 > sfpl0000002.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000003 > sfpl0000003.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000004 > sfpl0000004.xml

If the designers of yaz-marcdump had included an option for an output file name, the line would have been a bit less ugly.

Getting to the command-line

I feel a little silly writing this section, but when teaching and training people in the past, I've had some who were really confused about how to get to the command-line.  If you're running Mac OS X, you want to launch the Terminal application, which at least used to be in Utilities.  In Windows, go to Start -> Run and type cmd.  Both of these will launch a terminal window that you can type in.



Author: Jon Gorman

Recently a newcomer to the Code4Lib mailing list, Cliff, posted a question asking for information about sharing code, and also about possible ethical considerations, since some of the shared code might be based off of others' efforts.

I wrote a short response that focused more on the first part of his query, covering some thoughts about code sharing in the Code4Lib community, which I'm cleaning up and posting here.

There seems to have been a push over the past few years in Code4Lib to share more and more code, even with small projects. There are a lot of individuals scattered about the library world writing code to accomplish similar tasks, small and large.  One common example is the glue between certain academic enterprise systems and our catalogs.  This code, particularly in the past, got developed in little pockets without ever getting shared.  Occasionally code sharing flourishes as a gated community surrounding a particular vendor, but I think these communities suffer from just not being large enough.  There seems to be a conscious push against the tendency of isolated development by releasing often and without regard to size.  GitHub in particular has made it really easy and painless to share smaller chunks of code and offer patches to projects.

I have been bad about releasing and sharing source myself. This has been a hindrance as I find myself creating similar code in different internal projects instead of taking a step back and generalizing the code.  If I did, not only could the code be shared among my projects, it could be shared with the community.

There is also a barrier in our lawyers. I have not put in the energy needed to get the attention of the office that decides whether or not to release code as open source.  That office also does not make it easy or comfortable to ask questions.  I suspect, from what I've heard, that one really needs to call or visit in person, something I tend to subconsciously avoid in my typical approaches to communication.

On a community level, it feels like Code4Lib is starting to see tension about releasing lots of small projects and code, and that tension manifests in a variety of ways.

  • There is the perception that some projects have been abandoned or just don't have the level of support and community necessary to sustain development.
  • Large-scale adoption of code/projects by people who don't have the technical skills to contribute patches and need help to use the project.
  • Competition among projects that share goals and need to compete with each other for community.  I think choices are good, but choices introduce tension, and too many choices can lead to people choosing nothing.  I don't think the library software world has hit that point, but I can see a future not too far away where this is more of a problem.

There have been a couple of articles in the Code4Lib Journal over the years that describe these topics in more detail than the general approach I've taken here and are worth reading.

First, an argument from Dale Askey on why to just put stuff out there, and why so often we seem to fail to: COLUMN: We Love Open Source Software. No, You Can’t Have Our Code

On the other hand, see Terry Reese's excellent article in the latest issue presenting an argument why one should be prepared to support the code published: Purposeful Development: Being Ready When Your Project Moves From ‘Hobby’ to Mission Critical

Finally, Michael Doran gave an excellent talk a few years back that really stuck in my head with the very issue I've been reluctant to put more effort into: lawyers and code: The Intellectual Property Disclosure: Open Source in Academia. (Powerpoint slides)

In re-reading the original post, I realized I glossed over the ethical part, which is a shame.  There are some fascinating issues concerning the ethical dimension of sharing code that was based on and inspired by other code. Of course, on one level there are the legal issues involved with copyright and derivative works, depending on exactly what "based on" entails.

However, I'm more interested in the learning and sharing aspects of code development. It is extremely useful for me to read code developed by others.  Like critical reading of prose, you can learn a lot by not just trying to figure out what the code does, but thinking about how the code you are reading communicates to the reader.  Does it flow?  Does it jump around?  Are abstractions employed that make it easier to conceptualize?  It's a fascinating topic and really deserves longer treatment in another post.

My thanks go out to Peter Murray (aka @DataG) who shared a link to my email.  Also thanks to Becky Yoose (aka @yo_bj) for retweeting. In doing so they made me realize perhaps it would be worth revising and posting the email as a blog post.



Author: Jon Gorman

Every profession has its war stories, those challenges that separate the newcomers from the grizzled veterans.  In the library world, those of us who have to tangle deep in the weeds, close to the bare metal, to get at our data all seem to have a story about what is colloquially known as "wtf-8".

I'm not sure who coined the term "wtf-8", though I'm pretty sure I heard it one fateful day in the #code4lib freenode channel.  The term is a cunning play on utf-8 and the messed-up, crazy mish-mashes of encoding standards we see in the library world.  It is usually applied to some mangled mess of characters that got mortally wounded in the conversion from marc-8 to utf-8.

Some encoding basics

For those of you who follow this blog in the hopes of seeing pictures of my fuzzy dog and for some reason are still reading, here's a crash course on character sets and encodings.  As with most crash courses and simplifications, it's not entirely true. For the rest of you, feel free to skip this section.

Data on a computer is stored as a series of 0s and 1s.  In order to store and represent the text you're reading, the computer associates certain binary numbers with certain characters.  Since binary numbers are really, really long, programmers often display these mappings using hex numbers (0-f).  So the letter e is stored in ASCII as 0x65 (0x indicates a hex number).   There's a table on the ASCII Wikipedia article.

48656C6C6F21 is the hex for the characters  Hello!
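If you want to check that mapping yourself, a couple of lines of Python (my illustration, not from the original example) will reproduce it:

```python
# Encode a string as ASCII and show its hex: one byte (two hex digits) per character.
text = "Hello!"
print(text.encode("ascii").hex().upper())  # 48656C6C6F21
```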

Now, ASCII is one of the great granddaddies of encoding formats.  Before it, almost every computer had a different mapping of numbers to characters.  ASCII became one of the well-known standards and eventually the dominant one.

However, ASCII only maps out English characters, making it a problem for records that need to be in other languages.  That's why libraries developed their own character set standard, marc-8, with more of an international focus way back in the early days of computing.

The rest of the computer world used a variety of approaches, frequently with a proliferation of mappings for every language.  However, in the past twenty years or so there has been a push to establish an international standard, Unicode.  Unicode actually makes the mapping a little abstract, defining "codepoints" and then a family of encodings, including utf-8 and utf-16, that use different numbers for those codepoints.

The marc-8 encoding allows mixing English and one other language in a record, whereas the Unicode approach is more flexible, allowing characters from all sorts of languages to mix in one document.  The joke runs that there's enough space in the mapping reserved for when we contact an alien race.  A bonus is that Unicode is certainly better supported in general software than marc-8.  A lot of library software and records have been migrated away from marc-8 to one of the Unicode encodings.


How some of our data went from utf-8 to wtf-8

A few years ago my workplace converted from marc-8 to utf-8.  The records in today's story weathered the transition well.  The problem is we have a workflow with some data that mostly lives in SQL Server with an Access front-end.  Occasionally, though, people use a form in the Access database that calls a stored procedure in SQL Server, which in turn queries the Oracle database via a Linked Server and use of OPENQUERY.

However, Ex Libris (Endeavor at the time) stores the information in the database marked as what they call AMERICAN_AMERICAN.US7ASCII.  Now, I think this is the default for Oracle, but it may also have been a conscious decision; I'm not sure.  But I do know that to get it to work, even with the data now in utf-8, you still use US7ASCII in configurations. The strings stored in the Oracle database for titles seem to be utf-8, though.  (I peeked at the raw hex of the characters by using the RAWTOHEX function.)  Oracle apparently lets you stuff anything into US7ASCII.

SQL Server can store string data in two main ways: one meant for Unicode and the other for non-Unicode character sets.  We first created this database with a set of already existing Unicode records and used the Unicode type.  Then we started running into complaints of "strange characters and boxes", a tell-tale sign that either the font can't render the character or there's really something messed up with the character.  Sadly, a little testing proved the latter.

Here's what happens.  SQL Server seems to, maybe based on the US7ASCII marking, attempt to convert the string from ASCII to UCS-2, which is similar but not identical to utf-16.  UCS-2 was implemented in Microsoft products off an early draft of what would become utf-16; forgive me if I accidentally call UCS-2 utf-16 once or twice in this post.  UCS-2 stores characters as two bytes (or four hex digits) of information, where ASCII only stores one byte.  However, in a bit of cleverness, the designers of utf-16 made the mapping of the English characters the same as that of ASCII, just with 00 added.  So...

Hello in ASCII and UCS-2, with spaces for readability

H  e  l  l  o 
H    e    l    l    o
48 65 6C 6C 6F
4800 6500 6C00 6C00 6F00

You can see the "raw hex" in SQL Server by converting a varchar or nvarchar to varbinary(), like select convert(varbinary(4000),titles) from bibinfo ;.
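Outside the database, you can check the same byte layout with a little Python; here utf-16-le stands in for UCS-2, which is identical for these characters (my sketch, not part of the original workflow):

```python
# Compare the byte layout of plain ASCII against two-byte UCS-2/UTF-16LE code units.
text = "Hello"
print(text.encode("ascii").hex(" "))      # 48 65 6c 6c 6f
print(text.encode("utf-16-le").hex(" "))  # 48 00 65 00 6c 00 6c 00 6f 00
```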

The Problem

The Voyager data that is getting converted like this isn't ASCII, but utf-8.  Utf-8 is a very clever encoding.  In ASCII, you always know that a character is going to be two hex digits, or one byte.  In UCS-2 a character is always four hex digits, or two bytes.  This means the same information stored in a file encoded as UCS-2 rather than ASCII will always be twice as large.  Also, UCS-2 can encode many but not all Unicode characters.

Utf-8 gets around these two problems.  Utf-8 is a variable-length encoding where each character might be one to four bytes long.  You know how many bytes make up a character from the leading byte.  If the byte is 7F or smaller in hex, the computer knows the character to look up will be in the one-byte table.  That table is pretty much equivalent to the ASCII table.

So in other words, if you just use the ASCII characters, a utf-8 file and the ASCII file are the same.  Score!  However, the problem is this trick means that for the non-ASCII characters, the utf-8 hex and the UCS-2 hex are not the same.  Also, if your text consists mostly of characters that need three or four bytes in utf-8, you can end up with a larger file than using UCS-2 or utf-16.
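You can see the variable lengths directly; this little Python loop (again, my illustration) prints how many bytes utf-8 needs for characters from different ranges:

```python
# utf-8 is variable-length: one byte for ASCII, more for everything else.
for ch in ["e", "ö", "€", "𝄞"]:  # ASCII, Latin-1, BMP, beyond the BMP
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex(" "))  # character, byte count, hex bytes
```
The leading byte of each sequence is what tells a parser how many bytes follow.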

So let's look at a "Heavy Metal" hello: Hellö.  That's got the o with an umlaut that heavy metal bands seem to like so much. Imagine it was a title we got from Voyager, like Hellö: The Mick Shrimpton Story. Here's how that Hellö should look in utf-8 and UCS-2:

Hellö in UTF-8 and UCS-2, with spaces for readability

H  e  l  l  ö
H    e    l    l    ö
48 65 6C 6C C3B6
4800 6500 6C00 6C00 F600


However, because SQL Server just treats the whole string like it's ASCII, and doesn't warn when there's a non-ASCII character, what actually gets stored is...


4800 6500 6C00 6C00 C300 B600


And anything accessing this will think that there are now two characters, C300 & B600, which in this case map to Ã and ¶ rather than ö.
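A short Python sketch (my reconstruction of the failure, assuming each utf-8 byte gets naively widened into its own two-byte code unit) reproduces the mangling:

```python
# Take the utf-8 bytes of the title and pad each byte with a trailing 00,
# mimicking an ASCII-to-UCS-2 conversion that never checks for non-ASCII bytes.
title = "Hellö"
utf8_bytes = title.encode("utf-8")  # 48 65 6c 6c c3 b6
mangled = bytes(b for byte in utf8_bytes for b in (byte, 0x00))
print(mangled.hex(" "))             # 48 00 65 00 6c 00 6c 00 c3 00 b6 00
print(mangled.decode("utf-16-le"))  # 'Hell' + '\u00c3\u00b6', i.e. A-tilde and pilcrow
```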

How we solved it

In this particular case, I just wimped out.  I am storing the data as a varchar and not as an nvarchar.  That keeps the conversion from happening, and the other components of the process are tweaked to take this varchar string and treat the stuff in it as utf-8.  Probably the better solution would be not to rely on the linked server, but rather on a process that takes in the Voyager data and properly converts it to UCS-2 for storage in an nvarchar field.  We have some other applications that have a similar conversion process.  The workflow in this particular case is a short-term solution, destined for a longer-term revamp from the ground up.

Some tips

  • As mentioned for Oracle, you can do select rawtohex(field) from table ; to see the hex of the data stored in that field.
  • For SQL Server, select convert(varbinary(4000),field) from table; will do the same
  • To see the encoding of actual files, I use the command-line tool xxd in linux.
  • If you want to type a unicode character in Ubuntu/Gnome, type Control - Shift - U and type the codepoint
  • For Windows, hold the Alt key, press the + on the numeric keypad and enter the codepoint and release the Alt key.

Some further reading



Update - some minor fixes & changes

I realized I hadn't run spell-check one last time (still getting used to the spell-check in my blog software) and made some minor spelling corrections.  I'd also slipped into the royal we in my last section; fixed that.  And I made a workplace reference a little more generic.

 The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Author: Jon Gorman