The Collation

Research and Exploration at the Folger

Tagging manuscripts: how much is too much?

When it comes to tagging, or encoding, manuscript transcriptions in XML (Extensible Markup Language) for Early Modern Manuscripts Online (EMMO), two important questions are how much we should tag and when we should do it.

With thousands of pages from a variety of genres, the “how much” question is a big one. For example, should tags be used to provide information about ink color, shifts in hand, size or ornamentation of letters, illustrations, marginalia, flourishes, indentations, spacing, symbols, quotations, layout, structure, lines, paper material, historical/literary connections, etymology, smudges, etc., etc.? The images of manuscript pages below give some idea of the challenges involved:

Summary of accounts of the offices of the tents and revels from 1550 to 1555. (L.b.315)

Book of magic, with instructions for invoking spirits, etc. (V.b.26 (1))
Topographical descriptions, regiments, and policies (V.b.182)

Should online transcriptions attempt to recreate the intriguing graphs, charts, and illustrations we see on these pages as well as the text and white space? If so, how? Should each line and symbol be tagged in relation to its position on the page? How about gradations, marks, or blots showing on the paper? Given the multitude of options available with encoding, so granular a transcription may be possible. After all, the X in XML stands for extensible. This flexibility in what is noted, along with broader accessibility for the text, explains why the Folger Shakespeare Library is doing this type of work in EMMO.

But where to start with tagging? Organizations such as the Text Encoding Initiative (TEI) provide extremely detailed and ever-widening guidelines for digital projects in the humanities (and elsewhere), and these are a helpful resource. However, the quantity of markup elements in TEI, not to mention associated attributes for the elements, might best be described as vast. Just because a huge number of choices are available does not mean all of them have to be pursued, though, or at least not all at once.

Many projects dealing with digital texts end up using a customized and/or limited set of encoding tags to make a project feasible. EMMO is no exception. With thousands of manuscripts in the (always growing) Folger collection and the skill, effort, and time required to transcribe the archaic hands in which those pages are written, some limits had to be placed on what could reasonably be accomplished in the three-year initial phase of the project.

With that in mind, the EMMO team decided we would focus on tagging the text of the manuscripts for now and let the accompanying high-resolution images provide additional information about the page. Of course, even the adjustable image will not show everything about the actual manuscript, but we think the digital representation and the transcription text together will serve as a valuable resource.

Accordingly, we set out to identify and test a tag set for primarily textual elements, but even this reduced scope contains many questions and possibilities. Would the lineation of the text be preserved? Would the original abbreviations, contractions, punctuation, and spelling (including apparent mistakes) be left as written? Would cross-outs and corrections made by the scribe be reflected in the transcription?

For answers, we turned to the methods of transcription already in use as a guide. Since semi-diplomatic transcription is a generally accepted model in the field of early modern paleography and the one taught here at the Folger Shakespeare Library, the starting tag set developed for EMMO closely reflects semi-diplomatic conventions. This simply means certain minor changes are made in the transcription for the sake of clarity and comprehension by a 21st-century readership. Semi-diplomatic transcription has set boundaries, and so do transcriptions for EMMO. Unfamiliar or just plain unusual spellings in the manuscripts (even those appearing to be mistakes) are maintained, as are punctuation (often quite different from modern usage), capitalization, insertions, cancellations, and lineation. If words or parts of words are illegible or indecipherable, they are marked as gaps. The Collation’s transcriptions follow semi-diplomatic conventions, so these may be familiar to readers.

A slightly trickier category of changes made in a semi-diplomatic standard, however, deals with abbreviated forms, contractions, and brevigraphs; in these cases, transcriptions show the expanded version of these letter constructions, since many may be unfamiliar to modern readers. Some of these abbreviations can be quite difficult (and interesting) for transcribers as they strive to recognize words with letters missing and/or raised.

For two brief examples, “wch” and “wth” commonly appear in 16th- and 17th-century manuscripts as abbreviations for “which” and “with,” though in secretary and mixed hands it is often a test to differentiate between the superscript “ch” and “th” paired with a leading “w.” In a semi-diplomatic transcription, “wch” would be entered as “which” with the “hi” tagged as an expansion and displayed in italics to signify it was expanded. Similarly, the “ch” would be tagged as lowered superscript and then displayed as regular text for readability. The abbreviated form “wth” (with) follows the same pattern. Telling the two apart often comes down to context and/or minute comparisons at high levels of magnification. Other regular abbreviations such as “ye” (the) work similarly, but in this case, the archaic thorn letter (which looks like a modern “y”) is entered as “th”, tagged as a thorn brevigraph (expansion), and displayed as “th” in italics. As with “wch,” the superscript “e” in “ye” is tagged as lowered superscript in the semi-diplomatic transcription, so the word is displayed as “the.” Dromio, our online collation/transcription tool, simplifies this work with shortcut buttons that enter frequently appearing abbreviations with the appropriate tags all in one step.
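To make the pattern concrete, here is a minimal Python sketch of how “wch” and “wth” might be encoded and then displayed. The element names are borrowed from TEI (<ex> for supplied letters, <hi rend="superscript"> for raised letters) and are an assumption; the post does not list the actual EMMO tags. The wrapping <w> element, the render function, and the underscore and caret conventions (standing in for italics and for raised letters) are likewise invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical encodings; element names are borrowed from TEI and are
# not taken from the actual EMMO tag set.
WHICH = '<w>w<ex>hi</ex><hi rend="superscript">ch</hi></w>'
WITH = '<w>w<ex>i</ex><hi rend="superscript">th</hi></w>'


def render(xml_text, expand=True):
    """Return a plain-text reading of one encoded word.

    expand=True: semi-diplomatic view. Supplied letters are wrapped in
    underscores (standing in for italics) and superscripts are lowered.
    expand=False: closer to the manuscript. Supplied letters are omitted
    and raised letters are marked with a caret.
    """
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        text = child.text or ""
        if child.tag == "ex":
            parts.append(f"_{text}_" if expand else "")
        elif child.tag == "hi" and child.get("rend") == "superscript":
            parts.append(text if expand else f"^{text}")
        else:
            parts.append(text)
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    print(render(WHICH))                # semi-diplomatic reading
    print(render(WHICH, expand=False))  # abbreviated form
    print(render(WITH))
```

Because the supplied letters and the raised letters are tagged separately, one encoded word can yield either the expanded reading or something close to the form on the page.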

An image of a short letter from the Bacon-Townshend collection contains the common abbreviations discussed above and shows how the system works in practice. In EMMO, the image will always appear with the transcription, so the former serves as the full diplomatic version and the latter serves as the semi-diplomatic version for reading and searching.

Letter from Anne Bacon (L.d.18)

A basic semi-diplomatic transcription of the letter with the expansions of these abbreviations and brevigraphs shown is below:

Madame, I thought it good for me to writ somthinge to
your Ladyship though I wrot the lesse, least by not writinge
I sholde seame to forget that dutie, which I iustly owe vnto
your Ladyship. I vnderstande allmost every weake by Mr
Stringar of your Ladyship good health & my fathers, which I am
very glad to heare of. I myself am somwhat sickly.
Some perswade me, that it is to a good end. My owne experi=
ence (as your Ladyship knoweth) is small to iudg. Suer I am, ther
is yet no certeintie of that thei saie. Yet I hope well, & I
praie to God I be not deceived of my hope. Thus with
humble remembrance of my dutie to my father I take
my leave, wisshinge your Ladyship longe to live in perfect health.
Anne bacon

In comparing the original image with the transcription, sharp eyes may notice a few other examples of expansions:

  • “yor” is expanded to “your”
  • “yt” is expanded to “that”
  • “yei” is expanded to “thei” (i.e., they)
  • “La:” is expanded to “Ladyship” (notice the colon is dropped in the semi-diplomatic transcription)
  • “pswade” and “pfect” are expanded, respectively, to “perswade” and “perfect” (the special “p” in these words has a stylized loop in the descender to signal the expansion). What we call the special p demonstrates the flexibility of early modern writing in which one brevigraph can stand for differing combinations of letters depending on style and use.
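The expansions listed above can be sketched as a simple lookup, loosely in the spirit of a shortcut that supplies a known expansion in one step. This is only an illustration: the EXPANSIONS table and expand_tokens function are hypothetical, not part of Dromio, and a fixed table cannot capture cases where one brevigraph stands for different letters depending on context.

```python
# Abbreviated forms from the list above, keyed as they would be typed.
# A hypothetical table for illustration, not Dromio's actual shortcut set.
EXPANSIONS = {
    "yor": "your",
    "yt": "that",
    "yei": "thei",
    "La:": "Ladyship",
    "pswade": "perswade",
    "pfect": "perfect",
}


def expand_tokens(line):
    """Replace each whitespace-separated token that has a known expansion."""
    return " ".join(EXPANSIONS.get(token, token) for token in line.split())


if __name__ == "__main__":
    print(expand_tokens("of yt yei saie"))
    print(expand_tokens("yor La: longe to live in pfect health"))
```

A real transcriber still has to decide which expansion a mark calls for; the table only handles forms that are unambiguous once typed.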

Signs of correction in a document are more straightforward since most people have experience with crossing out words and replacing them, even if we use word-processing programs today, but how to display such instances still involves questions: should crossed-out items be included? In EMMO transcriptions, deleted words are entered (if they can be determined) and then tagged as deletions so that they display as struck through. If a deleted word is partially or totally illegible to the transcriber, periods (“…”) representing the illegible letters are entered and then tagged as a gap. Insertions (often to replace deleted words) are similarly entered, then tagged as insertions and displayed as superscript in the line. An excerpt from the Inventories of the Townshend family shows some examples of these:

an inventory of the lobby (V.b.147)

A basic transcription is shown below with the tagged deletions and insertions.

In the Lobie.
Item one livery beadstide matt & Cord a
fether bead and boulster iij blankettes
A rogge greene of yellowy Read a canope of Cadows
Item a Trundell beadstid matt & Corde a fether
bead & ij boulster iiij a pare of blankettes a dornex  & a yelow Roge
Item a pallett bead & boulster for my Ladys Chamber
a pare of blanckettes & a dornex covering & the
Lobbye hanged Round with panted Clothe
Item ij downe pillowes
Item a Corse woole beade

The deleted and inserted words are hard to miss in the image. A few more examples of common expansions are also included in the transcription above: “Itm” becomes “Item” and the “es” terminal brevigraph appears at the end of “blankettes” and “blanckettes.”
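The treatment of deletions, insertions, and gaps described above can be sketched in a few lines of Python. The element names (<del>, <add>, <gap>) are TEI elements assumed here for illustration, since the post does not name the EMMO tags. The sample line is invented and is not a reading of V.b.147. In the plain-text output, ~~word~~ stands in for strikethrough and ^word^ for superscript.

```python
import xml.etree.ElementTree as ET

# An invented line for illustration only; it is not a reading of V.b.147.
# <del>, <add>, and <gap> are TEI element names assumed for this sketch.
LINE = ('<l>Item ij <del>old</del> <add>new</add> blankettes '
        '&amp; <del><gap/></del> pillowes</l>')


def render_corrections(elem):
    """Render an element, marking deletions, insertions, and gaps."""
    if elem.tag == "gap":
        inner = "..."  # periods stand in for illegible letters
    else:
        inner = (elem.text or "") + "".join(
            render_corrections(child) + (child.tail or "") for child in elem)
    if elem.tag == "del":
        return f"~~{inner}~~"  # struck through
    if elem.tag == "add":
        return f"^{inner}^"    # raised above the line
    return inner


if __name__ == "__main__":
    print(render_corrections(ET.fromstring(LINE)))
```

The function recurses so that a gap inside a deletion, as in the illegible cancelled word here, is rendered as struck-through periods.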

A major advantage of tagging (or encoding) letters/words in this way, of course, is that it allows not only targeted searches but also flexible display options online. With the tags in place, we can show a transcription with the expansions and lowerings or without them. Similarly, deletions can appear or not. One could also choose to view only deleted words (or expanded ones) to look for patterns.
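A small Python sketch shows why tags make these display and search options cheap to offer. As before, the TEI element names (<ex>, <del>, <add>) are assumptions rather than the documented EMMO tag set, the sample line is invented, and the view and tagged functions are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Invented sample line; element names are borrowed from TEI for illustration.
SAMPLE = ('<l>w<ex>hi</ex>ch I <del>wrot</del> <add>writ</add> '
          'to yo<ex>u</ex>r Ladyship</l>')


def _text(elem, hide):
    """Collect text, skipping any element whose tag is in `hide`."""
    if elem.tag in hide:
        return ""
    return (elem.text or "") + "".join(
        _text(child, hide) + (child.tail or "") for child in elem)


def view(xml_text, hide=()):
    """Plain-text view of a line with the given tag names suppressed."""
    root = ET.fromstring(xml_text)
    # Collapse the doubled spaces left behind by suppressed elements.
    return " ".join(_text(root, set(hide)).split())


def tagged(xml_text, tag):
    """List the text of every element with the given tag name."""
    root = ET.fromstring(xml_text)
    return ["".join(e.itertext()) for e in root.iter(tag)]


if __name__ == "__main__":
    print(view(SAMPLE))                 # everything shown
    print(view(SAMPLE, hide=("del",)))  # deletions hidden
    print(view(SAMPLE, hide=("ex",)))   # expansions hidden
    print(tagged(SAMPLE, "del"))        # only the deleted words
```

The same encoded line serves every view, so choosing what to display becomes a matter of filtering by tag rather than re-transcribing.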

The EMMO tag set in its first phase does contain a few tags to identify the presence of non-textual items like illustrations and notational marks (e.g., a manicule), and a very limited number of tags to describe certain kinds of content (e.g., for postscripts in letters). Such tags, however, have intentionally been kept to a minimum in order to keep the transcription process manageable. Producing a consistent, neutral set of textual transcriptions for reading and analysis is the goal of EMMO’s initial stage. Further tagging of the text as well as other features of manuscripts is expected and encouraged in the years ahead. Encoding for physical characteristics, iconography, or specific content, for example, could make for fascinating future projects. This is how the “when” question relates to the “how much” question, and we think the possibilities in this regard are exciting.


  • As someone who has spent a fair amount of time using Dromio, I think that on the whole it has found a great point of equilibrium in transcription, giving the transcriber a versatile but relatively simple set of tools. I think your idea that the digital images available on Luna minimize the need for an overly complex set of tags is spot on. However, I do wonder about tools to help replicate complexities of formatting – things written in double column, charts, etc., which are hard to reproduce coherently in a simple line by line transcription. The photos of L.b. 315 and V.b.26 that illustrate your post are excellent examples of these kinds of pages (V.b.26 has those tough symbols as well!)
    I’ve come across similar pages while looking for things to transcribe in Dromio, and have always just skipped them because I’ve never figured out how to create a coherent transcription in Dromio out of an intricately formatted page.
    I know these are questions about formatting more than about tagging per se, but are there plans to include richer formatting options in Dromio at some point in the future?

    • An excellent question! Although focused on the text, we are considering ways to transcribe/encode these complex manuscript pages while providing some meaningful indication as to the formatting, layout, etc. Since the words and structure of the information on the pages are often related, one of the ideas under discussion is to include attention to formatting along with transcription for part of the next Advanced Paleography Workshop (in December, 2015). Employing spatial coordinates may help, but we will be interested in suggestions of the participants as intricate pages are examined. Other special events may be organized in the future to tag certain difficult pages. Potentially, additional tags could be added to Dromio, of course, as we did in the last Advanced Paleography workshop. Also, further tagging could be done as part of a separate project on a particular manuscript or set of manuscripts.

  • This is an excellent explanation of the EMMO tag set. What it lacks is a few illustrative examples of tags, for those of us who haven’t yet seen them. Any chance of an addendum?

    • Yes, a discussion of particular encoding tags and attributes will likely show up on a future post about EMMO. You may wish to check out the link to the Text Encoding Initiative (TEI) for examples and explanations of some of the tags mentioned such as “” for expansion or “” for deletion. We are using tags that will be compliant with TEI-P5.

  • There is no real way of pre-determining the “right” or “sufficient” or a “superfluous” amount of tagging. In my consulting practice on knowledge management (which is a different animal than purer information management!) I always argue that “this lies in the eyes of the beholder”. I.e. some researcher in some distant future might want to write his/her thesis about “smudges over the centuries”. And maybe, the information would not be readily available at the time. But maybe the researcher then does an all-encompassing field study, develops specific metrics, etc. Wouldn’t it be great if he/she could then add his/her information (knowledge) to the existing corpus? This is what a learning organization would do – provide a basic structure to begin with (certainly author and title were always indispensable, even in the Alexandrian Library) but then ALLOW and ENCOURAGE other experts to be able to add their expertise and preserve it “down the ages”. HTML, doi and other markups now allow this to be done technically – all we need to do is make it desirable and possible.
