Sunday
Apr 22, 2012

Yaz-marcdump: Simple but powerful MARC batch tool.

 

There's an excellent series of posts over at Robot Librarian by Bill Dueber with some Solr hacking. If you're at all interested in how systems like VuFind and Blacklight are searching our records, it's worth a read. The series inspired me to get off my duff and write about a useful set of tools, YAZ, that not enough people seem to know about.

Anyone dealing with the cataloging side of librarianship will at some point have a pile of records that needs conversion.  It might be MARC-8 records that need to be converted to UTF-8, or perhaps a pile of MARCXML records that need to be converted to MARC.

I've seen people try to use MarcEdit or the Perl MARC::Record libraries to solve these problems. MarcEdit is a wonderful tool, but it's difficult to automate.  Using the Perl libraries can take a while and there's a risk of bugs, particularly with complicated issues like character sets.  Many of these simple tasks can be handled deftly by YAZ. 

YAZ is centered around the Z39.50 protocol for searching and retrieving metadata records. The library offers programmers a lot of hooks for working with a Z39.50 server or even setting up their own. However, the YAZ packages also offer a set of command-line tools for working with MARC records. (If you're curious about the Z39.50 tool, Appendix I in my article Respect My Authority has an example.)

Don't let the fact that these YAZ tools are command-line scare you away. There are two strengths of the command-line we want to take advantage of here:

  • being very flexible in specifying what files should be modified
  • being very easy to automate

 

Play along and get some records

The Internet Archive has an entire section devoted to records, Open Library Data. For example, you can go download some MARC records from San Francisco Public Library. I decided to download the file SanFranPL12.out pretty much at random. One word of warning: most of these collections are rather large and so might take some time to download.

The next few sections require you to have a terminal open if you want to follow along. If you don't know what the terminal is, jump down to "Getting to the command-line" at the bottom of this post. You'll also need to follow the YAZ install instructions. If you're a Linux user, I'd recommend compiling from source or installing the libyaz and yaz packages from IndexData. Most Linux distributions seem to have an ancient version of the program in their package repositories.

I downloaded the file SanFranPL12.out to ~/blog/yaz_examples and typed cd ~/blog/yaz_examples. (The ~ is a shortcut for home directory in most Linux/Unix systems).

 

Quickly viewing records

Typing yaz-marcdump SanFranPL12.out | more gives a readable version of the file that you can page through by hitting the space bar. You can quit by hitting q or control-c. By default, yaz-marcdump converts MARC records into a marc-breaker type format. The | more sends it to the "more" program for paging through the results of the conversion. (Normally I'd use the pager less, which has more features, but Windows systems don't usually have less installed.)

The results look something like...

02070ccm  2200433Ia 4500
001 ocmocm53093624
003 OCoLC
005 20040301153445.0
008 030926s2003    wiumcz         n    zxx d
020    $a 0634056603 (pbk.)
028 32 $a HL00313227 $b Hal Leonard
040    $a OCO $c OCO $d ORU $d OCoLC $d UtOrBLW
048    $a ka01
049    $a SFRA
050    $a M33.5.L569 $b K49 2003
092    $f SCORE $a 786.4 $b L779a
100 1  $a Lloyd Webber, Andrew, $d 1948-
240 10 $a Musicals. $k Selections; $o arr.
245 10 $a Andrew Lloyd Webber : $b [18 contemporary theatre classics] / $c [arranged by Phillip Keveren].
260    $a Milwaukee, WI : $b Hal Leonard, $c [2003?]
300    $a 64 p. of music ; $c 31 cm.

The first line is the leader, and the rest of the lines are fields from the first MARC record in the file. (The file is a single file composed of many MARC records, one after another.)

Position 9 of the leader is blank in all the records I randomly sampled, which means that they're encoded in marc-8.
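If you'd rather check that byte directly than eyeball the dump, here is a minimal shell sketch. It assumes the file begins with a MARC record, so the first 24 bytes are the leader, and it only looks at the first record. The name leader_encoding is made up for illustration; it is not a yaz command.

```shell
# Report the character coding scheme of the first record in a MARC file.
# leader_encoding is a hypothetical helper name, not part of yaz.
leader_encoding() {
    # Leader position 9 (counting from 0) is the 10th byte of the record.
    flag=$(head -c 10 "$1" | tail -c 1)
    case $flag in
        a)   echo "utf-8" ;;
        ' ') echo "marc-8" ;;
        *)   echo "unknown" ;;
    esac
}

# Usage: leader_encoding SanFranPL12.out
```

A blank means marc-8 and an a means utf-8; anything else is reported as unknown rather than guessed at.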

Yaz-marcdump converting from marc-8 to utf-8.

(Quick reminder if you're following along and tried the above, hit q or control-c to exit more)

Converting a file from marc-8 to utf-8 is pretty easy. Just type the following:

yaz-marcdump -f marc-8 -t utf-8 -o marc -l 9=97 SanFranPL12.out > SanFranPL12_utf8.mrc

Let's break down the various options:

  • -f marc-8: The input is marc-8.
  • -t utf-8: The output should be utf-8.
  • -o marc: The output should be in marc. (Other commonly used values include line and marcxml.)
  • -l 9=97: Position 9 of the leader should be set to a. (97 is the decimal character code for a in ASCII and utf-8.)

Now try doing yaz-marcdump SanFranPL12_utf8.mrc | more. You'll see that the leader now has the character 'a' in position 09. There's also an argument -i where you can supply the input format, but this defaults to marc. The documentation says you can use a character like -l 9=a instead of the decimal character code, but I've never gotten that to work.
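If you ever need the decimal code for some other character, the shell can look it up for you. This is a small sketch using the standard printf utility; char_code is just a name I picked for the example.

```shell
# Print the decimal character code of a single character.
# A leading single quote makes printf treat the argument as a character
# and print its numeric value.
char_code() {
    printf '%d\n' "'$1"
}

char_code a    # prints 97, the value used in -l 9=97
```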

Yaz-marcdump converting from marc to marcxml.

Converting to marcxml is just a matter of changing the output format from -o marc to -o marcxml.

yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 SanFranPL12.out > SanFranPL12_utf8.xml

This can be really, really handy as there are many processes that can manipulate MARCXML that can't touch MARC.

Debugging a record

Some systems do not give a very detailed error message when they reject a MARC record. This is where yaz-marcdump's verbose mode can come in useful.  I've taken one of the marc records from the SanFranPL12.out file and inserted some characters and didn't adjust the directory in the record, which will cause errors in some systems.

The record is for "Reality Check!", Volume 2, and I added "CodexMonkey was here!" to the beginning of the 245 subfield $a. This causes the information in the leader and directory to be wrong for this record. If you download the bad record and run it through yaz-marcdump bad_record_mod.mrc | more, you'll get warnings about separators being in unexpected places. You can download the unmodified record and run yaz-marcdump bad_record_orig.mrc | more, and you'll notice you won't get the warnings.

Adding the command option -v produces a verbose output that shows how the file is being parsed by the yaz-marcdump program.  This generates a lot of information but can be really useful if you want to understand how programs understand marc records. Let's look at some snippets from yaz-marcdump -v bad_record_mod.mrc | more and yaz-marcdump -v bad_record_orig.mrc | more

(Directory offset 132: Tag 092, length 0018, starting 00232)
(Directory offset 144: Tag 100, length 0019, starting 00250)
(Directory offset 156: Tag 245, length 0097, starting 00269)
(Directory offset 168: Tag 260, length 0037, starting 00366)

This occurs early in the output, when yaz-marcdump is parsing the directory, the part of a MARC record that says where each variable field starts and how long it is. Any parser will expect Tag 245 to be 97 bytes long, but I added a bunch more bytes by just typing them in via the vi editor.
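To make the directory layout concrete, here is a rough shell sketch that prints the directory entries of the first record in a file, imitating the lines above. It relies on the standard MARC layout: a 24-byte leader whose positions 12-16 give the base address of the data, followed by 12-byte directory entries (3 bytes of tag, 4 of field length, 5 of starting offset). The helper names to_int and marc_directory are made up for this example and have nothing to do with how yaz-marcdump itself is implemented.

```shell
# Strip leading zeros so the shell does not read the number as octal.
to_int() {
    n=${1#"${1%%[!0]*}"}
    printf '%s\n' "${n:-0}"
}

# List the directory entries of the first record in a MARC file.
# marc_directory is a hypothetical helper, not part of yaz.
marc_directory() {
    # Leader bytes 12-16 hold the base address, where the field data starts.
    base=$(to_int "$(head -c 17 "$1" | tail -c 5)")
    # The directory sits between the 24-byte leader and the one-byte
    # field terminator that comes just before the base address.
    dir=$(head -c $((base - 1)) "$1" | tail -c +25)
    i=0
    while [ "$i" -lt "${#dir}" ]; do
        entry=$(printf '%s' "$dir" | cut -c$((i + 1))-$((i + 12)))
        # Each 12-byte entry is tag (3), field length (4), start offset (5).
        tag=$(printf '%s' "$entry" | cut -c1-3)
        len=$(printf '%s' "$entry" | cut -c4-7)
        start=$(printf '%s' "$entry" | cut -c8-12)
        echo "Directory offset $((24 + i)): Tag $tag, length $len, starting $start"
        i=$((i + 12))
    done
}

# Usage: marc_directory bad_record_mod.mrc
```

Note that the sketch only reads the directory. It never looks at the field data, which is exactly why a stale directory goes unnoticed until something tries to use it.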

Let's first look at the non-modified record when it gets to the 245 tag.

(Tag: 245. Directory offset 156: data-length 97, data-offset 269)
245 10 $a Reality Check! $n Volume 2 / $c by Rikki Simons ; & [illustrations by] Tavisha Wolfgarth-Simons.
(subfield: 61 52 65 61 6C 69 74 79 20 43 68 65 63 6B 21)
(subfield: 6E 56 6F 6C 75 6D 65 20 32 20 2F)
(subfield: 63 62 79 20 52 69 6B 6B 69 20 53 69 6D 6F 6E 73 ..)
(Tag: 260. Directory offset 168: data-length 37, data-offset 366)
260    $a Los Angeles : $b Tokyopop, $c c2003.

It got to the 245 tag and pulled out the 97 characters that comprise the field.  You'll notice the parser is breaking the field into the subfields.  The hex numbers are the characters in the subfield, including the subfield flag.  (615265616C = aReal)
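If you want to convince yourself of that mapping, you can go the other way and turn a string into hex. This is a small sketch using od; the function name to_hex is my own.

```shell
# Show the bytes of a string as upper-case hex, like the -v output does.
to_hex() {
    # -An drops the offset column, -v stops od from collapsing repeats.
    printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | tr 'a-f' 'A-F'
    echo
}

to_hex aReal    # prints 615265616C
```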

Now a look at the one that's been modified:

(Tag: 245. Directory offset 156: data-length 97, data-offset 269)
245 10 $a CodexMonkey was here! Reality Check! $n Volume 2 / $c by Rikki Simons ; & [illustrations by] Tav
(subfield: 61 43 6F 64 65 78 4D 6F 6E 6B 65 79 20 77 61 73 ..)
(subfield: 6E 56 6F 6C 75 6D 65 20 32 20 2F)
(subfield: 63 62 79 20 52 69 6B 6B 69 20 53 69 6D 6F 6E 73 ..)
(No separator at end of field length=97)
(Tag: 260. Directory offset 168: data-length 37, data-offset 366)
260 sh $  Wolfgarth-Simons.

The parser gets to field 245. After all the subfields have been parsed, yaz-marcdump complains that it could not find the separator that should be there after 97 bytes to indicate the field actually ended. This ends up messing up the following 260 and each field after it. In this case the parser can't be sure if the directory is off or the separator character just happens to be missing.
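The same check can be sketched in a few lines of shell: for each directory entry, look at the last byte the directory claims for the field and see whether it is the field separator (hex 1E). This is only an illustration of the idea under the standard MARC layout, not how yaz-marcdump is actually written. The helper names and the message text are my own, and it only looks at the first record in the file.

```shell
# Strip leading zeros so the shell does not read the number as octal.
to_int() {
    n=${1#"${1%%[!0]*}"}
    printf '%s\n' "${n:-0}"
}

# Check that every field ends where the directory says it should.
# check_field_ends is a hypothetical helper, not part of yaz.
check_field_ends() {
    base=$(to_int "$(head -c 17 "$1" | tail -c 5)")
    dir=$(head -c $((base - 1)) "$1" | tail -c +25)
    i=0
    status=0
    while [ "$i" -lt "${#dir}" ]; do
        entry=$(printf '%s' "$dir" | cut -c$((i + 1))-$((i + 12)))
        tag=$(printf '%s' "$entry" | cut -c1-3)
        len=$(to_int "$(printf '%s' "$entry" | cut -c4-7)")
        start=$(to_int "$(printf '%s' "$entry" | cut -c8-12)")
        # The field length includes the separator, so it is the last byte.
        last=$((base + start + len - 1))
        byte=$(dd if="$1" bs=1 skip="$last" count=1 2>/dev/null |
            od -An -v -tx1 | tr -d ' \n')
        if [ "$byte" != "1e" ]; then
            echo "Tag $tag: no separator at end of field length=$len"
            status=1
        fi
        i=$((i + 12))
    done
    return $status
}

# Usage: check_field_ends bad_record_mod.mrc
```

On a record like the modified one, every field after the edit would be expected to fail this check too, since all the later offsets are shifted by the same number of bytes.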

Splitting a MARC file into several MARC files.

The yaz-marcdump tool also has some features that can make dealing with MARC records easier. I ran into a case recently where a process couldn't handle the one giant XML file produced by converting a large file of MARC records.

Thankfully, the yaz-marcdump tool provides the ability to split an input file into several output files, also called chunking by software geek types. Unfortunately it only seems able to do this with an input type of marc. So let's say I wanted to split the original file into more manageable files of 10,000 records each and convert those to XML.

Splitting is easy, but doing some of the other steps requires some advanced command-line fu that does not work on Windows. I'll need to do this in a couple of steps. ($ is the prompt, don't type it. I'm just using it to make clear where new commands start.)

$ mkdir split_files
$ cd split_files
$ yaz-marcdump -s sfpl -C 10000 ../SanFranPL12.out > /dev/null
$ find . -name 'sfpl*' | xargs -n 1 -I{} sh -c 'yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 {} > {}.xml'
$ mkdir ../xml
$ mv *xml ../xml
$ cd ..

Now if you do ls -1 xml you should see something like...

sfpl0000000.xml
sfpl0000001.xml
sfpl0000002.xml
sfpl0000003.xml
sfpl0000004.xml

Let's break down the command yaz-marcdump -s sfpl -C 10000 ../SanFranPL12.out > /dev/null:

  • -s sfpl:  This tells yaz-marcdump to split the input and what prefix to give each generated file.
  • -C 10000:  This is the number of records per file.  It defaults to one.  Also notice that it is an upper-case C, not c.   Case matters.
  • ../SanFranPL12.out:  Since we're down in the split_files directory, we need to tell the tool that the SanFranPL12.out is located in the parent directory
  • > /dev/null: For some reason, this program will still output the records to the terminal, even though it's also writing them to the files.  This redirects the output to /dev/null, essentially a file that never retains any data.  You can also use the command-line option -n to suppress the output, but then you'll still get some output as yaz tries to correct issues it sees with various records.
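A quick sanity check after splitting is to count the records in each chunk. Each MARC record ends with a record terminator byte (hex 1D, octal 035), so counting those bytes gives the record count. This sketch assumes that byte never shows up inside field data, which should hold for well-formed marc-8 and utf-8 records. The name count_records is made up.

```shell
# Count the records in a MARC file by counting record terminators (0x1D).
# count_records is a hypothetical helper, not part of yaz.
count_records() {
    # Delete every byte except the terminator, then count what is left.
    LC_ALL=C tr -cd '\035' < "$1" | wc -c | tr -d ' '
}

# Usage: count_records sfpl0000000
```

With -C 10000, every chunk except possibly the last should report 10000.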

The really complicated line after that, find . -name 'sfpl*' | xargs -n 1 -I{} sh -c 'yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 {} > {}.xml', finds all the files with the prefix and gives that to a program called xargs, which calls the yaz-marcdump command to do a conversion for each file.  It's the same as doing...

yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000000 > sfpl0000000.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000001 > sfpl0000001.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000002 > sfpl0000002.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000003 > sfpl0000003.xml
yaz-marcdump -f marc-8 -t utf-8 -o marcxml -l 9=97 sfpl0000004 > sfpl0000004.xml

If the designers of yaz-marcdump had included an option for an output file name, the line would have been a bit less ugly.
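If the find and xargs line is too cryptic, a plain loop does the same job and is easier to read. This is a sketch, not a drop-in replacement: convert_chunks and the MARCDUMP variable are names I invented (the variable just lets you point at a different yaz-marcdump binary), and skipping files that already end in .xml is an extra I added so the loop can be rerun safely.

```shell
# Convert every split file with the given prefix to MARCXML.
# convert_chunks and MARCDUMP are hypothetical names for this example.
convert_chunks() {
    prefix=$1
    marcdump=${MARCDUMP:-yaz-marcdump}
    for f in "$prefix"*; do
        # Skip output from an earlier run and anything that is not a file.
        case $f in *.xml) continue ;; esac
        [ -f "$f" ] || continue
        "$marcdump" -f marc-8 -t utf-8 -o marcxml -l 9=97 "$f" > "$f.xml"
    done
}

# Usage, from inside split_files: convert_chunks sfpl
```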

Getting to the command-line

I feel a little silly writing this section, but when teaching/training people in the past I've had some people really confused on how to get to the command-line.  If you're running Mac OS X, you want to launch the Terminal application, which at least used to be in Utilities.  In Windows, go to Start -> Run and type cmd.  Both of these will launch a terminal window that you can type in. 

 


Thursday
Apr 19, 2012

Fun Photo Friday: Allerton, Episode I.

I've got a post coming soon on some basics of using the yaz-marcdump tool. Meanwhile, to tide you over, I have not just one, but two photos. Last week's photo reminded me of some photos I took at Allerton last May when we went there for the first time.

It was pretty muddy on the hiking trails, so we ended up first going to see the Foo Dogs. In the fenced-in part with ivy, V suddenly got obsessed with one spot. I figured I'd take a picture of her sniffing, when all of a sudden there was some movement. I managed to snap a picture quickly...

After Colleen and I realized it was quite a young fawn (which was quite hard to see from human height), we got the V-dog away as we didn't want to spook the mother.

And after all that excitement, V got to try out to be a Foo Dog. Didn't have the heart to tell her that Foo Dogs are actually lions. At least the one she posed next to seemed to be impressed by her.

Friday
Apr 13, 2012

Fun Photo Friday: Allerton 

 

 

Yay for hiking trips! Colleen, V and I went to Allerton last Sunday so I could unwind from a morning of doing updates.

 

 

Friday
Apr 06, 2012

Fun Photo Friday: Veronica napping after hard work napping.

Welcome to Fun Photo Friday!  Recently I was looking through some of my photos and realized it would make a good regular feature to make sure this blog never gets too serious. 

Today's photo was taken a week ago as I was waiting at home for a quote for a water heater repair or replacement. The V dog spent most of the morning sleeping on the floor. It was so exhausting she had to take a nap on the couch afterwards.

Tuesday
Mar 20, 2012

Sharing Code

Recently a newcomer to the Code4Lib mailing list, Cliff, posted a question asking for information about sharing code and also about possible ethical considerations, as some of the shared code might be based on others' efforts.

I did a short response that focused more on the first part of his query covering some thoughts about code sharing in the Code4Lib community, which I'm cleaning up and posting here.

There seems to have been a push over the past few years in Code4Lib to share more and more code, even with small projects. There are a lot of individuals scattered about in the library world writing code to accomplish similar tasks, small and large.  One common example is the glue between certain academic enterprise systems and our catalogs.  This code, particularly in the past, got developed in little pockets without ever getting shared.  Occasionally code sharing flourishes as a gated community surrounding a particular vendor, but I think these communities suffer by just not being large enough.  There seems to be a conscious push against the tendency of isolated development by releasing often and without regard to size.  GitHub in particular has made it really easy and painless to share smaller chunks of code and offer patches to projects.

I have been bad about releasing and sharing source myself. This has been a hindrance as I find myself creating similar code in different internal projects instead of taking a step back and generalizing the code.  If I did, not only could the code be shared among my projects, it could be shared with the community.

There is also a barrier in our lawyers. I have not put in the energy needed to get the attention of the office that makes decisions on whether or not to release code as open source.  That office also does not make it easy or comfortable to ask questions.  I suspect from what I've heard that one really needs to call or try to visit in person, something I tend to sub-consciously avoid in my typical approaches to communication.

On a community level, it feels like Code4Lib is starting to see tension about releasing small projects and lots of code that manifests in a variety of ways. 

  • There is the perception that there are projects that have been abandoned or just don't have the level of support and community necessary to sustain development.
  • Large-scale adoption of code/projects by people who don't have the technical skills to contribute patches and need help to use the project.
  • Competition among projects that share goals and need to compete with each other for community.  I think choices are good, but choices introduce tension and too many choices can lead to people choosing nothing.  I don't think the library software world has hit that point, but I can see a future not too far away where this is more of a problem.

There have been a couple of articles over the years on these topics in the Code4Lib Journal that are worth reading and that describe them in more detail than the general approach I've taken here.

First, an argument on why to just put stuff out there, and why we so often seem to fail to, by Dale Askey: COLUMN: We Love Open Source Software. No, You Can’t Have Our Code

On the other hand, see Terry Reese's excellent article in the latest issue presenting an argument why one should be prepared to support the code published: Purposeful Development: Being Ready When Your Project Moves From ‘Hobby’ to Mission Critical

Finally, Michael Doran gave an excellent talk a few years back that really stuck in my head with the very issue I've been reluctant to put more effort into: lawyers and code: The Intellectual Property Disclosure: Open Source in Academia. (Powerpoint slides)

In re-reading the original post, I realized I glossed over the ethical part, which is a shame.  There are some fascinating issues concerning the ethical dimension of sharing code that was based on and inspired by other code. Of course, on one level there are the legal issues involved with copyright and derivative works, depending on exactly what "based on" entails.

However, I'm more interested in the learning and sharing aspect of code development. It is extremely useful for me to read code developed by others.  Like critical reading of prose, you can learn a lot by not just trying to figure out what the code does, but thinking about how the code you are reading communicates to the reader.  Does it flow?  Does it jump around?  Are abstractions employed that make it easier to conceptualize?  It's a fascinating topic and really deserves longer treatment in another post.

My thanks go out to Peter Murray (aka @DataG) who shared a link to my email.  Also thanks to Becky Yoose (aka @yo_bj) for retweeting. In doing so they made me realize perhaps it would be worth revising and posting the email as a blog post.