The Collation

Research and Exploration at the Folger

Folger Tooltips: Making a spreadsheet from raw Hamnet data

Hamnet, the Folger’s online catalog, is more than just a searchable inventory of printed books, manuscripts, engravings, paintings, and other resources in the collection. It is also a giant data set, freely available for machine analysis. But there’s a catch: library catalog data is encoded in MARC (MAchine Readable Cataloging), the coelecanth of the digital world. Developed in the 1960s, this data standard is now a living fossil. It still functions for its original purpose, but doesn’t easily lend itself to analysis by digital humanists.

Raw MARC records are basically eye-readable, but they don’t make any sense until they’re parsed. The long string of digits at the start of the record tells the system how to parse what follows:

Image of block of undifferentiated numbers and letters
Example of raw MARC

There are currently two main ways for researchers to get raw MARC data from Hamnet:

  1. Search Hamnet and extract selected records yourself  using the “Print/Save” button, as described in Folger Tooltips: Getting raw Hamnet data (which only works for relatively small batches of a few hundred records at most)
  2. Ask, and we’ll do our best to send you what you need.

That’s the easy part. Converting that raw MARC data into something useful outside the context of a library catalog is where it gets tricky. This post demonstrates how to create a simple spreadsheet by extracting selected MARC data in tab delimited format using MarcEdit, a free utility developed by Terry Reese, Head of Digital Initiatives at The Ohio State University. 1

To begin with, you need to have MarcEdit installed on your computer, and you need a set of MARC records. I will be using MarcEdit version 6.0 on a Windows 7 machine, and a file containing all 5,405 Folger bilbiographic records for books published in the British Isles or in the English language through the year 1603 (records that I extracted back in January and happen to have handy; I’ve also uploaded it here as a zip file called Pre-1604_STCs, if you’d like to download and follow along). Note that because the MARC file contains just the “bibliographic” records, not their associated “holdings” records, the record count bears no relationship to the number of copies of each book the Folger happens to have. It’s just a count of titles.

By default, MarcEdit’s main window only shows the most commonly used tools, so you will either have to open the “Export Tab Delimited” program from the drop-down menu (Tools > Export > Export Tab Delimited Records) or use the little gear-wheel icon in the lower right to add the Export Tab Delimited icon to the window, like I’ve done on my machine:

MarcEdit home screen

When you first open the Export Tab Delimited utility, you’ll see that “Step 1” is to fill in four blanks: the file name and location of your MARC data, a file name and location for your soon-to-be-created extract from that file, a Field Delimiter (“Tab” in this example, but you could also have picked “Comma” or “Pipe”), and an In Field Delimiter (could be any character, but I like to use a pipe because I know it’s not going to show up anywhere else in the data):

Export Tab Delmited Step 1
Step 1: Selecting the source file and naming a destination file in the two “Set File Paths” boxes

Next comes “Step 2: Define Fields/Subfields to Export.” Here, it’s important to know which MARC fields and subfields contain the data you’d like to analyze. See MARC 21 Format  for Bibliographic Data (Library of Congress Network Development and MARC Standards Office) and Bibliographic Formats and Standards (OCLC Support & Training) for the full array of choices, but note that relatively few fields are required, and that any library’s practice will have varied over the years.

For this exercise, I’m going to look at the primary languages found in the Folger’s collection of pre-1604 STCs, so I will extract two fields:

AddFields

After entering the fields I want in my data set, I click “Export” and wait for the “Items have been exported…” confirmation message (there’s no hourglass icon or spinning beach ball to indicate something is being processed, so be patient; if you have a large file or a slow computer, this could take a long time).

Now that I have an exported file of tab-delimited data, I’m going to import it to an Excel spreadsheet in order to play with it. In Excel, I go to the “Data” tab and select “Get External Data” > “From Text”:

GetExternalData

Then I select the file I exported from MarcEdit and click the “Import” button:

Import External Data
Import External Data

The data is tab delimited, so I accept the default file type, “Delimited”:

TextImportStep1

Clicking “Next” lets me pick the type of delimiter, which is “Tab” in this case:

TextImportStep2

Lastly, I select all columns and identify them as “Text” format instead of the default, “General.” This ensures that Excel doesn’t strip any leading zeros, try to convert eight-digit numbers into dates, etc.

TextImportStep3

Clicking “Finish” shows the resulting data in Excel (I’ve used a fixed-width font to make it easier to see how the data in the MARC 008 lines up in columns):

Results1

Now I need to isolate the “primary language” portion of the MARC 008, the three character code near the end that shows up in the screen shot as “eng” over and over again (meaning “English” — not surprisingly, most STC publications are in English). I’ll do this using the “Text to Columns” function on the “Data” tab:

TextToColumns1

After selecting “Fixed width,” I click “Next” …

TextToColumns2

After clicking on either side of “eng” to make a column for those three characters, I click “Next”to view the results:

TextToColumns3

Now I delete the two columns I don’t need, leaving me with just the MARC 001 (the unique identifier for each record) and the three-digit language code, which I’ve titled “Language”:

Texttocolumns4

Now I’m going to insert a “pivot table” in order to count how many records there are for each language in the list (even though only “eng” shows up in the screen shots, I know there are other languages, too).

PivotTable1

In the “Create Pivot Table” box, I select the data I want to look at (columns A and B), and name a place to put the resulting table (column E):

PivotTable2

Next I check “001” and “Language” in the “Choose fields to add to report” box, and drag “001” from the “Row labels” box to the “Values” box. This automatically creates a table giving the count of how many rows there are for each language code:

PivotTable3

Some language codes are easy to figure out: “eng” is English, “lat” is Latin, etc. Others are trickier: “roa” is Romance: Other (in this case, almost always Law French). The full list of MARC language codes can be found on the Library of Congress website.

Here’s how the languages of pre-1604 titles in the Folger’s STC collection break down:

Dutch 5
English 4667
English, Old (ca. 450-1100) 1
French 10
Greek, Ancient (to 1453) 2
Greek, Modern (1453-) 10
Italian 24
Latin 496
Multiple languages 2
Romance (other) 177
Scots 4
Spanish 6
Welsh 1

This is a relatively small sample set, but you get the idea. Needless to say, there’s plenty more you could look at, but trying to show that with screen shots in a blog post would be madness.

  1. MarcEdit is optimized to work with Windows, but it is also available for Linux and Mac operating systems (though the Mac version is still under development, and does not contain the full tool set yet).
  2. I could also export the MARC Bibliographic 041: Language code data element, one or more three-letter codes used when the primary language isn’t the only language associated with the material, but I want to keep this example simple.

One Comment


  • Fascinating, Erin. What strikes me is that although we have five pre-1604 STC’s in Dutch, there are none in German.


Leave a Reply

  • (will not be published)