An Introduction to Web Archiving at the Folger

As a resident Digital Archivist at the Folger, I’ve been tasked with managing the Folger’s web archiving efforts.

The Folger Shakespeare Library web collecting mission.

Now, you might be asking: what is web archiving exactly? The International Internet Preservation Consortium (IIPC) defines web archiving as the process of “collecting portions of the World Wide Web, preserving the collections in an archival format [most often via the WARC file format], and then serving the archives for access and use.” There are a number of ways to achieve this, from home-grown technical processes relying on a combination of open-source tools to vendor services that package popular web collecting and organization methods together. No matter the route, the under-the-hood mechanics of the collection process remain virtually the same. Generally, “web crawlers” (such as the popular Heritrix tool) systematically access and gather content from designated URLs, a process referred to as crawling. The results of these crawls are captures of web content that can then be archived and curated into organized collections.
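To make the crawling process a little more concrete, here is a minimal sketch in Python. It is not how Heritrix or any production crawler is implemented; it only illustrates the general idea of starting from seed URLs, following links, and keeping the captures that fall in scope. The `FAKE_WEB` dictionary, the `crawl` function, and the example.org addresses are all hypothetical, and the "web" here is an in-memory dictionary so the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical stand-in for the live web: URL -> HTML body.
FAKE_WEB = {
    "http://example.org/": '<a href="/about">About</a> <a href="http://other.example/">Out</a>',
    "http://example.org/about": '<a href="/">Home</a> <a href="/staff">Staff</a>',
    "http://example.org/staff": "<p>No links here.</p>",
    "http://other.example/": "<p>Out of scope.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch=FAKE_WEB.get, max_pages=100):
    """Breadth-first crawl from the seeds, staying on the seeds' hosts.

    Returns a dict mapping each captured URL to its content.
    """
    in_scope_hosts = {urlparse(seed).netloc for seed in seeds}
    queue = deque(seeds)
    seen = set(seeds)
    captures = {}

    while queue and len(captures) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:  # nothing came back, so there is nothing to capture
            continue
        captures[url] = body

        # Queue every in-scope link we have not already seen.
        parser = LinkExtractor()
        parser.feed(body)
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc in in_scope_hosts and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

    return captures


captures = crawl(["http://example.org/"])
print(sorted(captures))
```

In this toy run, the three example.org pages are captured and the link to other.example is skipped because it falls outside the seed's host. A real crawler would additionally write each capture to an archival format such as WARC, respect crawl limits, and handle many more edge cases.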

You might be familiar with web archives already: the Internet Archive has been archiving the web since 1996, and to date they’ve saved over 390 billion web pages, which can be accessed through the Wayback Machine. Want to explore the Folger Shakespeare Library website as it looked at its point of first capture in 1997? You can find that here. Similar large-scale web collecting projects can also be found at the Library of Congress Web Archives and the UK Web Archive.

The Folger Shakespeare Library official website as captured by the Internet Archive on February 20, 1997.

At its heart, web archiving is a means to preserve a realm of cultural heritage that is now widely available (and sometimes only available) on the internet. Despite common assumptions otherwise, digital information is very fragile. Internet content can disappear frequently and often without notice. If you’ve ever felt the frustration of encountering a dead or broken link on the internet (a problem commonly referred to as link rot), you’ll understand why archiving the web is important. Recent research has shed light on the growing issue of link rot, especially in reference to scholarly sources: for example, a recent Harvard study found that 49% of hyperlinks cited in Supreme Court opinions are no longer functional.

So how is the Folger involved in web archiving? The Folger began archiving select websites in 2011 using Archive-It. Archive-It, a creation of the Internet Archive, is a subscription service which allows partner institutions to build, manage, and provide access to their own web archive collections. You can access existing Folger Shakespeare Library collections here.

Current Folger Shakespeare Library web collections as accessed via Archive-It.

The first Folger web collection is titled Folger Shakespeare Library Websites and Social Media. This is an institutional collection in that it archives our own web presence. The collection currently contains 35 seeds (seeds are individual URLs within a web archive collection). It contains all Folger domains, blogs, and social media profiles. These sites are crawled for new content on a quarterly basis.

Our second collection, titled Shakespeare Festivals and Theatrical Companies, is a thematic collection which contains 285 seeds. The purpose of this collection is to archive official websites for theatrical companies and drama festivals which focus on Shakespeare performance. The scope of this collection is primarily limited to the United States; however, our international resources are growing in number. These sites are crawled for content on a semi-annual basis.

The content within these collections is organized using the Dublin Core metadata standard. Below is an example of a typical record for an item in the collection. Note that crawls also capture related media files found within a seed: as you can see in our record for The Collation, we are also preserving videos (provided our crawl limitations deem them in scope of our efforts). We also collect related images, documents, and PDFs as available within the URL and as needed.

The official metadata record for the Collation Blog as organized in the Folger Shakespeare Library Websites and Social Media web collection.
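For readers curious what a Dublin Core description looks like as data, here is a small sketch in Python using only the standard library. The field values are placeholders chosen for illustration and are not copied from the actual Archive-It record; the `dublin_core_record` helper and the plain `record` wrapper element are likewise hypothetical. Only the namespace URI and the element names (title, publisher, type, language, description) come from the Dublin Core element set itself.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Build a <record> element with one dc:* child per (name, value) pair."""
    record = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Placeholder values for illustration; not copied from the real record.
fields = [
    ("title", "The Collation"),
    ("publisher", "Folger Shakespeare Library"),
    ("type", "Blog"),
    ("language", "en"),
    ("description", "Placeholder description of the seed."),
]

record = dublin_core_record(fields)
print(ET.tostring(record, encoding="unicode"))
```

Each seed in a collection can carry a description along these lines, which is what makes the archived sites searchable and browsable by title, subject, and so on.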


The benefit of web archiving is the preservation of a resource which is otherwise ephemeral in nature. At the Folger, we are already seeing the rewards of our efforts. For example, in the short four months between the two most recent crawls of the Shakespeare Festivals and Theatrical Companies collection, we noticed that eight seed URLs within that collection are no longer live on the web. If we hadn’t captured those sites beforehand, they would have been lost to us forever.
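One way to spot seeds like these is to compare the results of two crawls. The sketch below, in Python, assumes each crawl can be summarized as a mapping from seed URL to HTTP status code; that summary format, the `newly_dead_seeds` function, and the `.example` addresses are all hypothetical, and treating only a 200 response as "live" is a deliberate simplification. The last line also works out what share of the collection the eight lost seeds represent, assuming the 285-seed count given earlier.

```python
def newly_dead_seeds(earlier, later):
    """Return seeds that answered 200 in the earlier crawl but not in the later one.

    Both arguments map seed URL -> HTTP status code (None if no response).
    """
    return sorted(
        seed
        for seed, status in earlier.items()
        if status == 200 and later.get(seed) != 200
    )


# Hypothetical crawl results; these are not real Folger seeds.
crawl_1 = {
    "http://festival-a.example/": 200,
    "http://company-b.example/": 200,
    "http://company-c.example/": 404,
}
crawl_2 = {
    "http://festival-a.example/": 200,
    "http://company-b.example/": None,  # no response this time
    "http://company-c.example/": 404,
}

print(newly_dead_seeds(crawl_1, crawl_2))

# Share of the collection lost between crawls: 8 of 285 seeds.
print(f"{8 / 285:.1%}")
```

Eight of 285 seeds works out to a little under 3% of the collection disappearing in four months.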

This is not to say web archiving is without its caveats. There are still strides to be made in the effective capturing of more complicated sites: for example, websites with more recent implementations of Java, or sites with scroll-down feeds. Web collecting technology and tools are continually improving to address these needs.

As the Folger moves forward in our web collecting efforts, we would like to hear which Shakespeare-related websites you, our readers, find useful in your research and interactions with Shakespeare on the web. Do you have a website in mind which would fit the scope of our current efforts? Do you have a new collection concept in mind? Web archiving is a collaborative effort: one person or team might run the day-to-day management of collections, but it takes input from the collections’ users to make sure the right sites are being saved. Please feel free to leave a comment below or contact me at jmccurry@folger.edu to continue our discussion.

Jaime McCurry was a resident Digital Archivist embedded at the Folger Shakespeare Library as part of the 2013-14 National Digital Stewardship Residency program, a nine-month grant-funded program created by the Library of Congress and the Institute of Museum and Library Services (IMLS).
