Digital Medievalist 7 (2011). ISSN: 1715-0736.
© Thomas Hansen, 2011. Creative Commons Attribution-NonCommercial licence

TEI - Keeping It Simple


Peer-Reviewed article

Accepting Editor: Christine McWebb, University of Waterloo.
Recommending Reader: Julia Flanders, Brown University.
Received: September 11, 2011
Revised: December 1, 2011
Published: February 7, 2012


Abstract

This article discusses the reasons for implementing TEI P5 in a Danish publishing project. We argue that while the standard performs well as a sustainable storage and interchange format, it is generally too complicated to operate efficiently. We show how to cope with this difficulty by introducing a template that makes the daily work easier.

Keywords: Descriptive Markup; Portability; Manuscript Description; Sustainability; Text Encoding Initiative (TEI); eXtensible Markup Language (XML).



Introduction

§ 1    Alvin Toffler, in his book The Third Wave (1980), described how civilization had evolved through what he characterized as three waves sweeping through history. The First Wave brought an end to the nomadic lifestyle of hunting and gathering and marked the beginning of the Agricultural Age, in which people settled in communities supported by agriculture and animal husbandry. The Agricultural Age lasted roughly from 8000 BC to 1650-1750, when in Europe another wave started rolling. This Second Wave was the beginning of the Industrial Age, in which the development of machines enabled mass production of everything from consumer goods to communications. Then again, around 1980, a Third Wave was gathering momentum in the US. This time, change was driven by information processing machines introducing yet another significant increase of productivity. The age of The Third Wave was the Information Age, in which we now live.

§ 2    During the First Wave, people mostly produced for self-use, and thus economy depended on access to land and labor. During the Second Wave, machines and technology introduced an increase in productivity, making the production of energy and know-how crucial parameters. In the Information Age, a state of disruption has reigned: As Information and Communication Technologies allow know-how, which was previously bound to specialists, to be captured, stored, and shared as needed, people are able to produce customized goods and services for themselves. This increase in productivity causes entire Second Wave markets of standardized, mass-produced goods to fall apart, and to give way to a multitude of standards and forms of dissemination (Toffler 1980, 196-204; Simons and Black 2009).

§ 3    Despite the inherently shareable nature of digital data, keeping them in a closed circuit of proprietary formats and tools proves counter-productive, especially if information is supposed to transcend computer environments, domains of application, and the passage of time. In this paper, we shall discuss how the commitment to creating portable data has been met as the technology of descriptive markup has matured. Section 1 describes the introduction of SGML at the Diplomatarium Danicum as a measure to prevent lock-in by word-processor formats. At this stage, markup is produced purely for self-use, first in print, then later in a Web publication. In Section 2, we consider how implementing the Text Encoding Initiative, TEI, as a document format might enable higher productivity and better sharing of data. Section 3 attempts to outline a general approach to the implementation process. While simplifying the application format might yield a product which is consistent and easy to manage, the outcome might not prove equally easy for encoders to use. To address this issue, Section 4 describes how markup routines can be rationalized by a template that may be transformed into a more richly structured TEI document. Section 5 follows with some closing remarks on the possible impact of using well-defined markup formats.

Producing for self-use

§ 4    As part of The Society for Danish Language and Literature, DSL, Diplomatarium Danicum publishes critical editions of medieval legal documents. In the late 1990s, problems with word-processor formats and their lack of semantic and pragmatic coding were starting to prevent information from flowing even to the nearest printer. In an attempt to address this portability issue, DSL introduced SGML as the storage format at Diplomatarium Danicum, and, to make for a gentle transition, a document type was modeled on the legacy print publication. The DTD defined a mere 30 element types, most of which were free-form text fields. For instance, a manuscript is described within a MANUSCRIPT wrapper subordinating two elements: first an ID (identifier) element with a siglum value, then an INF (information) element, in which all information relating to the text-witness is recorded. Similarly, a single PUBLICATIONS element recorded all bibliographic information about the manuscript. The SGML files were converted into print and proofread before being typeset.

§ 5    At first, most markup work was done by assistants, but, generally, the SGML revolution was a quiet one, and soon everybody had adopted the SGML modus operandi. As part of a community that places a premium on flexibility and integrity, the methodology of defining one's own categorization scheme added a sense of continuity, and, of course, it was also free and required little more than a text editor to operate.

§ 6    Shortly thereafter, influenced by the internet and the uptake of the growing tool chest of XML technologies, DSL saw an opportunity for a rationalization even more in keeping with the Third Wave promise of efficiency: information captured digitally should also be distributed digitally, and at a fraction of the usual cost. So, after a total of approximately 15,000 texts, the print edition was discontinued and succeeded by a Web publication in 2001. Although SGML gave way to XML, descriptive markup was still produced for self-use, primarily to let editors continue doing what they were used to, and only to be transformed into HTML in a customized web application.

§ 7    In addition to the advantage of being stored in a scalable, text-based format, the material is syntactically coherent and examples of tag abuse are rare. However, in terms of data longevity, the transition from word processing to the XML work chain had only replaced one closed circuit with another one that was less closed. The fact that the format is largely undocumented, completely idiosyncratic, and too coarsely structured to lend itself well to format conversion and query, means it is difficult to maintain. Moreover, whether the end result was a print or online publication, it was still a one way road from data capture to exposure, and it was obvious that the result was not taking advantage of everything the technology had to offer.

Introducing TEI

§ 8    In 2007, TEI released version P5 (Burnard and Bauman 2007) of a standard that had been in development since 1990. From the perspective of the Diplomatarium, a significant improvement was the incorporation of the manuscript description module. At the same time, a three-year grant from the Carlsberg Foundation had enabled the development of a repository capable of holding all the documents published under the project. Initially, the repository was supposed to support the ongoing work on some 8,500 documents from the period 1413-1450. Then, as soon as a common format had been established, the old one was to be deprecated and the existing 3,000 XML documents from the period 1401-1412 were to be converted and incorporated into the holdings. A digitization of the 15,000 printed documents, however, was not part of the plan. The first deliverables were two technical reports presenting a way of expressing the features of interest in the TEI header and text markup (Hansen 2010a, 2010b). While the tech reports explore the details of the implementation, we will assess the strategic reasons behind it. In contrast to the deprecated XML application, TEI has the advantage of being viable outside the project and of offering better possibilities for multi-purpose content. In other words, TEI provides a format which on the one hand is general and popular, and, on the other, articulate and flexible.

General and popular

§ 9    With the prospect of having to manage some 25,000 documents, the main motivation for defining a general document format is to establish a joint basis for tools and procedures operating on the material. Not having to configure software for multiple formats should, ceteris paribus, minimize the need for one-off integration and make the development and maintenance of tools and procedures less error-prone. However, more significantly, since data must also be exchangeable as static documents, and are no longer produced with the sole purpose of being consumed by custom-fit applications, a popular, well-documented format is an advantage.

§ 10    In considering possible use cases for such documents, an exchange could take place in-house with other DSL projects. For instance, the documents could be processed and used in language corpora for corpus-based dictionaries, something that would expose the material to fields like linguistics and language history. But also, given the high costs of creating digital resources, a lot of the money that used to be spent on digitization and publication is now diverted into preservation instead. Indeed, the plans for large centralized repositories, research infrastructures like CLARIN (<http://www.clarin.eu/external/>), bear testament to an incipient specialization between projects that produce data and organizations that preserve them. With potential consumers such as research infrastructures entering the field, a market for text resources emerges; and since markets (understood here simply as switchboards of goods) are likely to turn to standards for quality assessment, well-documented formats like TEI are the ones major players like CLARIN are willing to adopt.

§ 11    On the other hand, because particular extensions to the standard are less likely to be universally deployed, we have exercised some self-constraint not to deviate from the standard. While this is mainly to minimize the risk of rendering any effort obsolete, it also reflects a modest hope that such larger concentrations of standard-conformant material might stimulate further development of tools and techniques.

Articulate and flexible

§ 12    If the elements in the running text are supposed to let us infer the meaning of passages in the marked up document (as suggested in Sperberg-McQueen, Huitfeldt, and Renear 2000) then, of course, the element types have to fit the content; otherwise, tag abuse and communication breakdown might occur. The fact that TEI is developed to mark up features of any written artefact (Lavagnino 2006) means that its terminology is general enough to allow common features of an otherwise heterogeneous document material to be expressed in standard terms. With the incorporation of the so-called manuscript module in the P5 version of the standard in 2007 (Driscoll 2006), it has become particularly useful for detailed annotation of just the kind of material (European medieval manuscripts) with which the Diplomatarium Danicum deals.

§ 13    Besides reflecting the breadth and depth of coverage, TEI's comprehensive schema provides enough structure to facilitate the level of processing we want. The expressive power of XML's hierarchical content model allows many processing details to be derived from an element's place in the document hierarchy, and the ability to precisely address parts of documents by means of path expressions adds robust handles to the text. This is the reason why we have opted for a much more structured and granular markup approach than in the deprecated model.

§ 14    Finally, in terms of flexibility, TEI is designed for a wide range of implementations; a wealth of information may be given in more or less fine-grained and structured ways. However, since a schema is only fully functional if it helps avoid compromising the product with format inconsistencies (e.g. dates, language codes appearing in different shapes) and structural irregularities, some important customization details are expected to be settled on the implementation level. Basically, this boils down to a question of picking out the elements and attributes needed, and deciding how these should be populated.

Applying the TEI

§ 15    Although concise schemas like TEI Lite (Burnard and Sperberg-McQueen 2006) have been widely adopted, the Diplomatarium draws upon features not included there. On the other hand, since TEI Lite also offers features which are not needed, a functional schema is best established either by adding to subsets like tei_bare, or by stripping away from the entire tei_all schema. Either way, since we are aiming for portability, the modification should comply with TEI's conformance criteria (Burnard and Bauman 2007, ch. 23.3). According to these, documents should:

  1. be well-formed;
  2. validate against the tei_all schema;
  3. use the definitions in the Guidelines;
  4. contain only elements in the TEI namespace: http://www.tei-c.org/ns/1.0, and
  5. have a schema derived from an ODD (One Document Does it all) file.

§ 16    A good reason for using the ODD format (Burnard and Bauman 2007, ch. 22) is to have a transparent way of documenting the customization with respect to the unmodified starting point; something enabling others to know which of the 500+ elements of the tei_all schema are in use, and whether this usage accords with the Guidelines. At the same time, the ODD is a source for generating different types of schemas and documentation by means of designated tools. But, more significantly, since we are committed to working within the TEI framework and not complicating it with extensions, we regard the implementation of TEI as a process of simplification.

§ 17    The first step is an elimination of the tei_all elements not needed. Using ODD to build a schema accepting elements from the msdescription (manuscript description) module, we use an empty moduleRef element with the key attribute set to "msdescription", and add the list of the elements not needed as values of the except attribute:

<moduleRef key="msdescription" except="accMat acquisition adminInfo altIdentifier binding bindingDesc catchwords collation collection colophon custEvent custodialHist depth explicit finalRubric foliation heraldry incipit institution locus locusGrp msPart musicNotation objectType origDate origPlace origin provenance recordHist rubric scriptDesc secFol signatures source stamp surrogates textLang typeDesc watermark"/>

§ 18    Having carved out a block of elements by going over the relevant modules as sketched above, we focus on the remaining elements of the application; each one can be re-declared by an elementSpec (element specification) with identifier, module, and mode attributes. Without going into details, we will concentrate on simplifying the content model in the content element, and the list of attributes in the attList:

<elementSpec ident="dimensions" module="msdescription" mode="change">
<content> … </content>
<attList> … </attList>
</elementSpec>

§ 19    For instance, in the unmodified schema, the content model of the dimensions element allows child elements to be omitted altogether, or the element to be filled with an unlimited number of either dim elements or the elements height, depth, and width constituting the model.dimLike class. Expressed in ODD as a RELAX NG pattern, this rather wide range of possibilities looks like this:

<content>
<rng:group>
<rng:zeroOrMore>
<rng:choice>
<rng:ref name="dim"/>
<rng:ref name="model.dimLike"/>
</rng:choice>
</rng:zeroOrMore>
</rng:group>
</content>

§ 20    The model we are looking for, however, is one that requires the operator to provide exactly one height and one width element every time. So instead we may write:

<content>
<rng:ref name="height"/>
<rng:ref name="width"/>
</content>

This modification is clean, because documents validating against the modified schema also validate against the unmodified tei_all starting point.
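The subset relation behind this claim can be illustrated with a small Python sketch. It is not part of the project's tool chain; it only mimics the two content models with simple checks on child element names, ignores text content and the TEI namespace, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Children the unmodified model admits, per the description above: dim plus
# the model.dimLike members height, depth, and width, in any number and order.
ORIGINAL_CHILDREN = {"dim", "height", "depth", "width"}


def child_names(xml_text):
    """Return the names of the direct children of the parsed element."""
    return [child.tag for child in ET.fromstring(xml_text)]


def valid_original(xml_text):
    """Simplified check of the unmodified model: zero or more known children."""
    return all(name in ORIGINAL_CHILDREN for name in child_names(xml_text))


def valid_simplified(xml_text):
    """Check of the modified model: exactly one height, then one width."""
    return child_names(xml_text) == ["height", "width"]


# Accepted by both models.
strict = "<dimensions><height>17.2</height><width>24.3</width></dimensions>"
# Accepted by the unmodified model only.
loose = "<dimensions><width>24.3</width></dimensions>"
```

Anything that passes valid_simplified necessarily passes valid_original, while the reverse does not hold; this is the sense in which the change narrows the schema without extending it.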

§ 21    A look at the attributes of the dimensions element suggests further simplifications. Originally, the dimensions element ships with 27 attributes, but, for our purpose, only the unit attribute is necessary; and so deleting the rest should minimize the risk of misplacing them. Each attribute is therefore re-declared in an attDef element with the attribute name as the value of the ident (identifier) attribute, and the mode of the change set to "delete":

<attList>
<attDef ident="type" mode="delete"/>
<attDef ident="quantity" mode="delete"/>
<attDef ident="extent" mode="delete"/>
…
</attList>

§ 22    A portable data standard is not only characterized by an agreed set of elements and attributes, but also by an agreed set of permissible values for these. A document exchanged between two parties would not be mutually understandable if either or both parties used internal coding schemes to populate elements. Having deleted all but one of 27 attributes, we not only want the unit attribute to be required, but also to make "cm" (centimetres) the only applicable value of it. This is done by providing a list of values containing the value item identified by the string "cm" as a replacement for any other value list:

<attDef ident="unit" mode="change" usage="req">
<valList type="closed" mode="replace">
<valItem ident="cm">
<desc>centimetres</desc>
</valItem>
</valList>
</attDef>
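As a rough illustration of what the required attribute and its closed value list mean for an instance document, the following Python sketch performs the equivalent check by hand. In practice the generated schema does this work; the function name is hypothetical and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# The closed value list declared for the unit attribute.
ALLOWED_UNITS = {"cm"}


def unit_is_valid(xml_text):
    """True only if the unit attribute is present and holds a listed value."""
    element = ET.fromstring(xml_text)
    # A missing attribute yields None, which is not in the set, so the
    # "required" rule and the "closed list" rule are covered by one test.
    return element.get("unit") in ALLOWED_UNITS
```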

§ 23    Where we wish to constrain element values—say, have "Danish Society for Language and Literature" as the only valid content of the publisher element—we write a pattern stating it:

<elementSpec ident="publisher" module="core" mode="change">
<content>
<rng:value>Danish Society for Language and Literature</rng:value>
</content>
</elementSpec>

§ 24    The ability to translate business rules, such as to always measure the dimensions of a document and always in centimetres, into required elements and fixed values improves the consistency of the end product. Of course, since we are dealing with many different kinds of information, far from all values can be fixed this way. However, when it comes to dealing with missing information, a general coding scheme seems to make sense. For example, if a manuscript has been issued without seals, then according to the original schema the seal description (sealDesc) element may be omitted. But such omissions could also mean that the information is missing because it is irrelevant, because it is still undetermined, or simply left out by mistake. Having already been made mandatory, such elements are also assigned a set of values to help clarify this particular issue:

Value       Type            Meaning
empty       strings         Information does not exist
0           numbers         Information does not exist
1000        dates           Information does not exist
nil         strings         Information is undetermined
99999999    numbers/dates   Information is undetermined

§ 25    These values are applicable almost everywhere information is to be recorded. "Almost," since one of the few places where TEI has declared a closed set of permissible attribute values is the cert attribute, which may only be populated with "high," "medium," or "low". Had the values been added there, the modification would have been an extension. There are also compromises. In order to be able to state non-existent date information we have chosen "1000", because the W3C data type xs:date does not accept "0".
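A consumer of the documents could interpret this coding scheme along the following lines. The Python sketch is only an illustration under stated assumptions: the lowercase spellings "empty" and "nil" are taken from the template snippets, and the function name and the three status labels are hypothetical.

```python
# Reserved values from the coding scheme, keyed by the kind of field.
ABSENT = {"string": "empty", "number": "0", "date": "1000"}
UNDETERMINED = {"string": "nil", "number": "99999999", "date": "99999999"}


def information_status(value, kind):
    """Classify a value as 'absent', 'undetermined', or 'recorded'.

    kind is one of "string", "number", or "date".
    """
    text = str(value).strip()
    if text == ABSENT[kind]:
        return "absent"        # the information does not exist
    if text == UNDETERMINED[kind]:
        return "undetermined"  # the information has not been settled yet
    return "recorded"
```

A call such as information_status("nil", "string") then separates a field that still awaits the editor's decision from one that has deliberately been declared irrelevant.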

TEI by proxy

§ 26    Although modifications might yield more functional schemas with better guidance for the encoder, the real benefits seem to accrue to those managing the data. In order to make documents more manageable, they have been made structurally homogeneous, but at the same time quite verbose. Indeed, despite the help offered by schema and tools, instance documents with line upon line of deeply nested elements are not necessarily easy for encoders to use.

§ 27    To make data entry easier we devised a template implemented as an XML Schema Document, from which TEI documents are derived using an XSLT stylesheet. Appreciating the sense of continuity that a fully self-invented application can provide, the architectural principle has been to respect the existing workflow of the project. Compared to the TEI document, the template is "flat" and comprehensible, and it comes with default values "nil" and "99999999" declaring the information undetermined. While working, the editor resolves whether the features are present or irrelevant in the particular situation.

§ 28    The relation between template and TEI documents is not one-to-one. Where tasks can be automated and spare editors from unnecessary typing, this is done by the stylesheet. For instance, to keep texts and translations parallel, the numbers of words and paragraphs are computed as in the TEI snippet below:

<extent>Base text, number of words: <num n="words">535</num>, paragraphs: <num n="paragraphs">23</num>.
Translation, number of words: <num n="words">592</num>,
paragraphs: <num n="paragraphs">23</num>
</extent>
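The project computes these figures in its XSLT stylesheet. Purely as an illustration of the kind of calculation involved, a Python sketch might look as follows; it assumes that paragraphs are encoded as p elements and that words are whitespace-separated tokens, neither of which is stated in the text, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_words_and_paragraphs(xml_text):
    """Return (words, paragraphs) for all <p> descendants of the root."""
    root = ET.fromstring(xml_text)
    paragraphs = root.findall(".//p")
    # itertext() also collects text inside inline elements such as <hi>.
    words = sum(len("".join(p.itertext()).split()) for p in paragraphs)
    return words, len(paragraphs)


def is_parallel(base_xml, translation_xml):
    """True if base text and translation have the same number of paragraphs."""
    return (count_words_and_paragraphs(base_xml)[1]
            == count_words_and_paragraphs(translation_xml)[1])
```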

§ 29    When a template has been proof-read and reviewed, it is discarded; after all, template documents are meant to live and die with the project, whereas the TEI product is intended for a multitude of purposes.

§ 30    Stored in a repository, the TEI documents may be enriched with new markup by applying either automatic routines or manual procedures. For instance, when the texts have been established, incipits could be automatically generated and placed in the TEI header, while place and personal names could be recorded in semi-automatic procedures using XQuery. This ability to segment work is a strategic advantage at a time when long-term project funding is becoming increasingly difficult to obtain.

§ 31    To give an idea of what information is recorded by the project, we will walk through the template explaining how it maps to TEI P5.

editorInitials

§ 32    The editor begins by identifying himself, choosing his initials from a closed set of values. Currently there are six editors and six values. So, for example, the snippet <editorInitials> mh </editorInitials> transforms into more structured TEI syntax:

<editor> <name xml:id="mh">
<forename type="first">Markus</forename>
<surname>Hedemann</surname>
</name>
</editor>

textId

§ 33    To identify the text, the editor then fills in a number after the pattern yyyymmddxyz. The template <textId>14201127001</textId> yields TEI <idno type="dd"> 14201127001 </idno>.
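The identifier pattern lends itself to a simple well-formedness check. The following Python sketch assumes that yyyymmdd is a calendar date and that xyz is a three-digit running number; the text gives only the pattern itself, so both readings, and the function name, are assumptions.

```python
import re
from datetime import date

# Eight digits for the date followed by three digits for the running number.
TEXT_ID = re.compile(r"(\d{4})(\d{2})(\d{2})(\d{3})")


def parse_text_id(text_id):
    """Split an identifier into (date, serial); raise ValueError if malformed."""
    match = TEXT_ID.fullmatch(text_id.strip())
    if match is None:
        raise ValueError(f"not of the form yyyymmddxyz: {text_id!r}")
    year, month, day, serial = (int(part) for part in match.groups())
    # date() itself raises ValueError for impossible dates such as month 13.
    return date(year, month, day), serial
```

Identifiers for documents that cannot be dated to a single day may well follow other conventions, which this sketch would reject.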

revision

§ 34    In order to be able to track the status of a document, the editor enters initials and date in a log registering four stages: first when the document is established; then during editing at three proof-reading stages, i.e. proofFirst, proofSecond, and proofThird. In this template snippet, the document has been established and proof-read once:

<revision>
<established who="#mh" when="2010-06-02"/>
<proofFirst who="#jon" when="2010-10-10"/>
<proofSecond who="#nil" when="99999999"/>
<proofThird who="#nil" when="99999999"/>
</revision>…

§ 35    The stylesheet renders this in more human-readable TEI:

<revisionDesc>
<change when="2010-06-02" who="#mh">Document established by Markus Hedemann, June 2, 2010</change>
<change when="2010-10-10" who="#jon">Proof read once by Jonathan Adams, October 10, 2010</change>
<change when="99999999" who="#nil">nil</change>
<change when="99999999" who="#nil">nil</change>
</revisionDesc>
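The mapping from the log to the revisionDesc is carried out by the project's XSLT stylesheet. The Python sketch below only imitates it for illustration: the lookup table covers the two editors named in this article, the wording for the second and third proof-reading stages is hypothetical, and so is the function name.

```python
import calendar
import xml.etree.ElementTree as ET
from datetime import date

# Only the two editors whose names appear in the article are listed.
NAMES = {"#mh": "Markus Hedemann", "#jon": "Jonathan Adams"}
LABELS = {
    "established": "Document established",
    "proofFirst": "Proof read once",
    "proofSecond": "Proof read twice",        # hypothetical wording
    "proofThird": "Proof read three times",   # hypothetical wording
}


def revision_to_tei(template_xml):
    """Build a revisionDesc element from a template revision element."""
    desc = ET.Element("revisionDesc")
    for stage in ET.fromstring(template_xml):
        who, when = stage.get("who"), stage.get("when")
        change = ET.SubElement(desc, "change", {"when": when, "who": who})
        if who == "#nil" or when == "99999999":
            change.text = "nil"  # stage not reached yet
            continue
        day = date.fromisoformat(when)
        month = calendar.month_name[day.month]
        change.text = (f"{LABELS[stage.tag]} by {NAMES[who]}, "
                       f"{month} {day.day}, {day.year}")
    return desc
```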

textCreationTimeEarliest, textCreationTimeLatest

§ 36    Having described the file itself, the editor turns to an account of the circumstances surrounding the issuing of the original document. This starts with a terminus post quem and a terminus ante quem, which in datable manuscripts are often the same value. These are defined by means of the XML Schema built-in datatype xs:date, which accepts values of the pattern yyyy-mm-dd:

<textCreationTimeEarliest>1420-11-27</textCreationTimeEarliest>
<textCreationTimeLatest>1420-11-27</textCreationTimeLatest>

textCreationTimeCertainty

§ 37    Depending on whether the date of issuing appears explicitly, textCreationTimeCertainty is filled with one of two values: "high", indicating that the information can be read from the text-witness, or "low," stating that the information cannot be read, but has been established by other criteria:

<textCreationTimeCertainty>high</textCreationTimeCertainty>

Stating levels of certainty is a way of meeting the processing expectation of uncertain dates being rendered in square brackets.

textCreationPlace, textCreationPlaceCertainty

§ 38    Similar to the account of the document date, the place and the certainty must also be given if possible. This is done in the textCreationPlace and textCreationPlaceCertainty elements. Contrary to textCreationTimeCertainty, the textCreationPlaceCertainty element can be "switched off" by means of the empty value. A template snippet containing the previous five elements:

<textCreationTimeEarliest>1420-11-27</textCreationTimeEarliest>
<textCreationTimeLatest>1420-11-27</textCreationTimeLatest>
<textCreationTimeCertainty>high</textCreationTimeCertainty>
<textCreationPlace>Roskilde</textCreationPlace>
<textCreationPlaceCertainty>high</textCreationPlaceCertainty>

transforms into the following TEI structure:

<creation>
<date notBefore="1420-11-27"
notAfter="1420-11-27"
cert="high">1420, 27 November</date>
<placeName cert="high">Roskilde</placeName>
</creation>
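The five template elements can be folded into the creation structure roughly as in the following Python sketch, which stands in for the project's XSLT. It uses the TEI P5 attribute names notBefore and notAfter, wraps the template elements in a hypothetical root element so that they can be parsed, and derives the display text from the earliest date only; how the stylesheet words a date range is not stated in the text.

```python
import calendar
import xml.etree.ElementTree as ET
from datetime import date


def creation_to_tei(template_xml):
    """Build a creation element from the wrapped template elements."""
    fields = {child.tag: (child.text or "").strip()
              for child in ET.fromstring(template_xml)}
    creation = ET.Element("creation")
    date_el = ET.SubElement(creation, "date", {
        "notBefore": fields["textCreationTimeEarliest"],
        "notAfter": fields["textCreationTimeLatest"],
        "cert": fields["textCreationTimeCertainty"],
    })
    earliest = date.fromisoformat(fields["textCreationTimeEarliest"])
    date_el.text = (f"{earliest.year}, {earliest.day} "
                    f"{calendar.month_name[earliest.month]}")
    place = ET.SubElement(creation, "placeName")
    place.text = fields["textCreationPlace"]
    certainty = fields["textCreationPlaceCertainty"]
    if certainty != "empty":  # "empty" switches the place certainty off
        place.set("cert", certainty)
    return creation
```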

summaryText

§ 39    Finally, a summaryText corresponds directly to the TEI summary describing the "intellectual content of an item" under the msContents wrapper:

<msContents>
<summary> King Erik 7. of Pomerania summons… </summary>
</msContents>

witness

§ 40    The witness element is a wrapper for 16 elements, most of which are mapped directly to equivalent TEI P5 element types. The first five of these elements identify the text-witness in much the same way as the elements under the TEI P5 msIdentifier (manuscript identifier) element.

witnessSigil

§ 41    First, the editor provides a unique witness siglum. This value corresponds to the value of the xml:id attribute in the TEI witness element.

archivePlaceName

§ 42    Second, a value corresponding to the TEI settlement element is provided. This element is meant to contain "the name of a settlement, such as a city, town, or village, identified as a single geo-political or administrative unit".

archiveName

§ 43    The archiveName corresponds to the TEI repository element.

inventoryNumber

§ 44    Intentional value corresponding to the TEI idno element.

manuscriptName

§ 45    Intentional value corresponding to the TEI msName element. The template element <manuscriptName>Langebeks Diplomatarium, p. 117</manuscriptName> mirrors the TEI <msName>Langebeks Diplomatarium, p. 117</msName>. When processed, the template elements:

<witnessSigil>A</witnessSigil>
<archivePlaceName>Copenhagen</archivePlaceName>
<archiveName>Rigsarkivet</archiveName>
<inventoryNumber>NKR c-2732</inventoryNumber>
<manuscriptName>empty</manuscriptName>

are transformed into TEI P5 as:

<witness xml:id="A">
<msDesc>
<msIdentifier>
<settlement>Copenhagen</settlement>
<repository>Rigsarkivet</repository>
<idno>NKR c-2732</idno>
<msName>empty</msName>
</msIdentifier>

manuscriptMaterial

§ 46    Having identified the manuscript, the editor accounts for the physical description of the material in a series of nine elements. First, the manuscriptMaterial is selected from a closed set of five string values:

  1. "empty" – the manuscript material is irrelevant;
  2. "mixed" – the manuscript material is part paper, part parchment;
  3. "nil" – the manuscript material has not been determined yet;
  4. "paper" – the manuscript material is paper;
  5. "parch" – the manuscript material is parchment.

manuscriptWidth, manuscriptHeight, and manuscriptPlica

§ 47    The dimensions of the original document are given in centimetres as xs:decimal values. While the first two correspond directly to the TEI width and height elements, manuscriptPlica is not defined in TEI terms. The element describes a fold reinforcing the inferior part of the manuscript (Cárcel Ortí 1997, 127). The template snippet:

<manuscriptMaterial>parch</manuscriptMaterial>
<manuscriptHeight>17.2</manuscriptHeight>
<manuscriptWidth>24.3</manuscriptWidth>
<manuscriptPlica>0.6</manuscriptPlica>

transforms into the following chunk of TEI:

<extent> <dimensions unit="cm">
<height>17.2 (plica: 0.6)</height>
<width>24.3</width>
</dimensions>
</extent>
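Because TEI has no element for the plica, its value is folded into the text of the height element. A Python sketch of this step, standing in for the project's XSLT, might look as follows; the function name is hypothetical, and the values are passed as strings so that the decimals are reproduced exactly as the editor typed them.

```python
import xml.etree.ElementTree as ET


def dimensions_to_tei(height, width, plica):
    """Serialize height, width, and plica (strings, in cm) as a TEI extent."""
    extent = ET.Element("extent")
    dimensions = ET.SubElement(extent, "dimensions", {"unit": "cm"})
    # The plica has no TEI element of its own, so it rides along in height.
    ET.SubElement(dimensions, "height").text = f"{height} (plica: {plica})"
    ET.SubElement(dimensions, "width").text = width
    return ET.tostring(extent, encoding="unicode")
```

How a document without a plica is rendered is not described in the text and is therefore not handled here.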

conditionDescription

§ 48    The conditionDescription describes the physical state of the document and thus corresponds to the TEI condition element. The template: <conditionDescription>The document is severely damaged by fire and water</conditionDescription> transforms into TEI:

<condition>
<ab>The document is severely damaged by fire and water</ab>
</condition>

layoutDescription

§ 49    The layoutDescription holds a set of layout descriptions applicable to a manuscript; it corresponds to the TEI layoutDesc element. A template snippet: <layoutDescription> The text is arranged in two columns</layoutDescription> is transformed into:

<layoutDesc> <ab>The text is arranged in two columns</ab>
</layoutDesc>

handDescription

§ 50    The handDescription element corresponds to the TEI handNote (note on hand) element; it describes a particular style or hand distinguished within a manuscript. This template:

<handDescription>The text is written by the same scribe as
<ref target="14251102001"/>, <ref target="14251102002"/> and
<ref target="14251102003"/>
</handDescription>

transforms into TEI:

<handDesc>
<handNote>
<ab>The text is written by the same scribe as
<ref target="14251102001"/>,
<ref target="14251102002"/> and
<ref target="14251102003"/>
</ab>
</handNote>
</handDesc>

additionsToText

§ 51    An account of significant additions found within a manuscript, such as marginalia or other annotations, is delivered in the additionsToText element. It corresponds to the TEI additions element. A template such as <additionsToText> On the verso the inscription: <q>Item Hr. Peder Griis<ex>s</ex>es gaffvebreff. 1413</q></additionsToText> corresponds to TEI:

<additions>
<ab> On the verso the inscription:
<q>Item Hr. Peder Griis<ex>s</ex>es gaffvebreff. 1413</q></ab>
</additions>

seal

§ 52    Another feature of interest is the presence of seals. A seal is described by a wrapper (seal) element subordinating four elements:

  1. sealNumber;
  2. sealStatus;
  3. sealDescription; and
  4. sealReferenceWork.

sealNumber

§ 53    First, the seals are numbered from left to right with xs:integer values. If a document happens to be issued without seals, the default value "99999999" is changed to "0", stating that the information is irrelevant.

sealStatus

§ 54    Depending on whether the document has, has had, or simply was issued without seals, a value from a closed set of four values is selected:

  1. 'empty' – the document bears no traces of seals;
  2. 'missing' – the seal is missing;
  3. 'nil' – it is undetermined whether the document is sealed or not;
  4. 'pendant' – the seal is pendant.

sealDescription

§ 55    If a seal is extant, or in any way known, the information is given here. First the name of the holder is stated, then the method of sealing, and finally the material.

sealReferenceWork

§ 56    Whenever possible, a bibliographic reference to sigillographic sources is given. For instance, a document issued with seals is described in the template as:

<seal>
<sealNumber>1</sealNumber>
<sealStatus>pendant</sealStatus>
<sealDescription>Seal of Jens Olufsen in black wax. Legend: <q>S IOHANNES OLAVI</q></sealDescription>
<sealReferenceWork>DAS 1061</sealReferenceWork>
</seal>

In TEI:

<sealDesc>
<seal n="1" type="pendant">
<ab>The seal of Jens Olufsen in black wax. Legend:
<q>S IOHANNES OLAVI</q> <ref>DAS 1061</ref>
</ab>
</seal>
</sealDesc>

§ 57    A document issued without seals retains the seal element, but it is filled in with values stating explicitly that there never were seals on the document:

<seal>
<sealNumber>0</sealNumber>
<sealStatus>empty</sealStatus>
<sealDescription>empty</sealDescription>
<sealReferenceWork>empty</sealReferenceWork>
</seal>

TEI:

<sealDesc>
<seal n="0" type="empty">
<ab>empty <ref>empty</ref></ab>
</seal>
</sealDesc>
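The seal mapping shown above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the conversion actually used by the project: the function name seal_to_tei is hypothetical, and the sketch simply moves sealNumber and sealStatus into the n and type attributes and places the description and the reference inside a single ab element, as in the two examples.

```python
import xml.etree.ElementTree as ET

# The closed set of sealStatus values described in the article.
SEAL_STATUSES = {"empty", "missing", "nil", "pendant"}


def seal_to_tei(template_xml):
    """Convert one template <seal> into a TEI <sealDesc> (hypothetical helper)."""
    src = ET.fromstring(template_xml)
    number = src.findtext("sealNumber", default="").strip()
    status = src.findtext("sealStatus", default="").strip()

    int(number)  # raises ValueError if the seal number is not an integer
    if status not in SEAL_STATUSES:
        raise ValueError(f"unknown sealStatus: {status!r}")

    seal_desc = ET.Element("sealDesc")
    seal = ET.SubElement(seal_desc, "seal", n=number, type=status)
    ab = ET.SubElement(seal, "ab")

    description = src.find("sealDescription")
    if description is not None:
        # Carry over the text and any child elements such as <q>.
        ab.text = description.text
        ab.extend(list(description))

    # The bibliographic reference becomes a <ref> at the end of the <ab>.
    ref = ET.SubElement(ab, "ref")
    ref.text = src.findtext("sealReferenceWork", default="")
    return seal_desc
```

Calling ET.tostring(seal_to_tei(...), encoding="unicode") on the template examples yields a sealDesc of the same shape as the TEI shown, apart from whitespace.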

witnessHistory

§ 58    When known, facts from the history of a manuscript are recorded. The witnessHistory element corresponds to the TEI history element. The template: <witnessHistory>The letter was registered in the registry of the letters at Vallø (1541), published <ref>Thiset, Adel. Brevkister 137</ref></witnessHistory> corresponds to TEI:

<history>
<ab>The letter was registered in the registry of the letters at Vallø (1541), published <ref>Thiset, Adel. Brevkister 137</ref></ab>
</history>

filiationDescription

§ 59    In case other surviving manuscripts are related to a document, such information may be given in the filiationDescription element. This element is modeled on the TEI filiation element. Thus, a snippet like this: <filiationDescription> The document is an apograph from the document of 1388, January 21, Diplomatarium Danicum III, 331</filiationDescription> converts into TEI:

…</summary>
<msItemStruct>
<filiation>
<ab>The document is an apograph from the document of 1388, January 21, Diplomatarium Danicum III, 331</ab>
</filiation>
</msItemStruct>

bibliographicEntry

§ 60    Bibliographic information is recorded in bibliographicEntry elements, each one corresponding to a TEI bibl element; the bibl elements are wrapped in a listBibl (bibliographic list). A template series of three bibliographicEntry elements:

<bibliographicEntry>Kirkehist. Saml. V 99</bibliographicEntry>
<bibliographicEntry>Bull. Dan. 358 nr. 466</bibliographicEntry>
<bibliographicEntry>Rep. nr. 5872 (i udtog)</bibliographicEntry>

is rendered in TEI as:

<additional>
<listBibl>
<bibl>Kirkehist. Saml. V 99</bibl>
<bibl>Bull. Dan. 358 nr. 466</bibl>
<bibl>Rep. nr. 5872</bibl>
</listBibl>
</additional>

samplingMethod

§ 61    The samplingMethod element wraps three sub-elements:

  1. textCompleteness stating whether the text appears in extenso (version), or is an excerpt;
  2. sourceSiglum containing an xs:IDREF pointing to one of the witnessSiglum values described earlier;
  3. samplingNote containing an account of possible omissions.

Thus the following:

<samplingMethod>
<textCompleteness>excerpt</textCompleteness>
<sourceSiglum>A</sourceSiglum>
<samplingNote>The first three paragraphs have been omitted as they are unrelated to Danish matters</samplingNote>
</samplingMethod>

becomes in TEI:

<samplingDecl>
<ab>Excerpt from <ref>A</ref>. The first three paragraphs have been omitted because they are unrelated to Danish matters</ab>
</samplingDecl>
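As a rough illustration, the three sub-elements can be merged into one running sentence as follows. The sketch is in Python and rests on assumptions: the function name sampling_to_tei is hypothetical, and only the wording of the excerpt case is taken from the example above; how an in extenso text is phrased in the TEI output is not stated here, so the sketch simply reuses the same pattern for any value.

```python
import xml.etree.ElementTree as ET


def sampling_to_tei(template_xml):
    """Build a TEI <samplingDecl> from a template <samplingMethod> (hypothetical helper)."""
    src = ET.fromstring(template_xml)
    completeness = src.findtext("textCompleteness", default="").strip()
    siglum = src.findtext("sourceSiglum", default="").strip()
    note = src.findtext("samplingNote", default="").strip()

    decl = ET.Element("samplingDecl")
    ab = ET.SubElement(decl, "ab")
    # "excerpt" becomes the opening words "Excerpt from ".
    ab.text = f"{completeness.capitalize()} from "
    # The siglum is wrapped in <ref>; the note follows it as trailing text.
    ref = ET.SubElement(ab, "ref")
    ref.text = siglum
    ref.tail = f". {note}" if note else "."
    return decl
```

Because the note is copied verbatim, any rewording between template and TEI (such as "as" versus "because" in the example) would have to happen elsewhere.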

textLanguage

§ 62    The textLanguage element is filled in with one of an open set of enumerated values, currently five. Language codes are constructed according to BCP 47 (<http://www.rfc-editor.org/rfc/bcp/bcp47.txt>) and, where possible, follow the ISO 639-1 standard:

  1. 'gda' – Old Danish;
  2. 'gmh' – Middle High German;
  3. 'gml' – Middle Low German;
  4. 'la' – Latin;
  5. 'xno' – Anglo-Norman.

Thus, <textLanguage>la</textLanguage> transforms into TEI:

<langUsage>
<language ident="la">Main language: latin</language>
</langUsage>
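The lookup from language code to langUsage can be sketched as a small table-driven function. This is a Python illustration only: the names LANGUAGE_NAMES and language_to_tei are hypothetical, and the label text, including its capitalisation, is an assumption for every language other than the Latin example shown above.

```python
import xml.etree.ElementTree as ET

# The five codes currently in use; the set is open and may be extended.
LANGUAGE_NAMES = {
    "gda": "Old Danish",
    "gmh": "Middle High German",
    "gml": "Middle Low German",
    "la": "Latin",
    "xno": "Anglo-Norman",
}


def language_to_tei(code):
    """Build a TEI <langUsage> for one textLanguage code (hypothetical helper)."""
    if code not in LANGUAGE_NAMES:
        raise ValueError(f"unknown textLanguage value: {code!r}")
    lang_usage = ET.Element("langUsage")
    language = ET.SubElement(lang_usage, "language", ident=code)
    # The wording of the label is assumed, modelled on the Latin example.
    language.text = f"Main language: {LANGUAGE_NAMES[code]}"
    return lang_usage
```

Keeping the codes in a single table means that adding a sixth language is a one-line change.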

text

§ 63    The text element is similar to a TEI P5 div element. In the Diplomatarium template, it may only be structured by TEI p (paragraph) elements. Below paragraph level, a mixed content of text and eight TEI element types is allowed:

  1. app – critical apparatus;
  2. cit – citation;
  3. damage;
  4. ex – expansion;
  5. gap;
  6. hi – highlighted;
  7. ref – reference;
  8. supplied.

For instance:

<text>
<p> Christierno Hen<ex>n</ex>ingi presbitero Roskildensis diocesis </p>
<p> Benigno etc.</p>
<p> Cum itaque <damage>si</damage>cut exhibita nobis …</p>

</text>

translates into roughly the same, but with numbered paragraphs:

<text>
<body>
<div xml:lang="la">
<p n="a#1"> Christierno Hen<ex>n</ex>ingi presbitero Roskildensis diocesis </p>
<p n="a#2">Benigno etc.</p>
<p n="a#3"> Cum itaque <damage>si</damage>cut exhibita nobis … </p>

</div>
</body>
</text>
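The wrapping and numbering step can be sketched in Python as follows. This is an illustration under assumptions rather than the project's own conversion: the function name text_to_tei is hypothetical, the "a" prefix of the paragraph numbers is taken from the example without knowing how it is derived, and the check of allowed elements looks only at the direct children of each paragraph.

```python
import copy
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The eight element types allowed below paragraph level.
ALLOWED_IN_P = {"app", "cit", "damage", "ex", "gap", "hi", "ref", "supplied"}


def text_to_tei(template_xml, lang, prefix="a"):
    """Wrap template paragraphs in text/body/div and number them (hypothetical helper)."""
    src = ET.fromstring(template_xml)
    text = ET.Element("text")
    body = ET.SubElement(text, "body")
    div = ET.SubElement(body, "div", {XML_LANG: lang})

    for number, p in enumerate(src.findall("p"), start=1):
        # Only direct children are checked in this sketch.
        for child in p:
            if child.tag not in ALLOWED_IN_P:
                raise ValueError(f"element not allowed in p: {child.tag}")
        new_p = copy.deepcopy(p)  # leave the template tree untouched
        new_p.set("n", f"{prefix}#{number}")
        div.append(new_p)
    return text
```

The language value would normally come from the textLanguage element described earlier; here it is passed in explicitly to keep the example self-contained.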

translation

§ 64    Similar to text, the translation element has a mixed content of text and elements; however, the only two elements allowed here are note and ref (reference).

Conclusion

§ 65    Creating the kind of multi-purpose content that can be shared when needed clearly means going further than the ambiguous commitment to descriptive markup and XML. The application format also has to be commonly known in order to make sense for others; this is why we consider TEI and its documentation format ODD the best bet for a sustainable storage and exchange format. We like to think of our usage of it as a simple one: first, because it is unextended and tries to stay clear of idiosyncrasies; second, because the different TEI instance documents remain structurally the same.

§ 66    Still, regardless of the strategic reasons for adopting a standard, the format must not prevent people from being productive. For many, adopting the XML modus operandi already means entering a different work chain with special tools and texts interspersed with angle brackets. If standards also complicate matters, a successful implementation might be a long way off. Since easier use of a standard like TEI is an attainable goal, for example by means of a template, it should clearly be promoted.

§ 67    Although descriptive markup is a true Third Wave technology enabling customized applications, some of these, like DocBook and TEI, have evolved into complicated market standards best handled by specialists. However, the fact that this kind of specialization is actually happening, in projects such as the Diplomatarium Danicum and in research infrastructures, inspires confidence that the disruption which was the very hallmark of the Third Wave is perhaps starting to wear off. With a market where standard texts are checked in and out of repositories, we can hope for the development of better tools and technologies that will take the field of scholarly text processing even further.

Works cited

Burnard, L. and Bauman, S., eds. 2007. TEI P5: Guidelines for electronic text encoding and interchange. <http://www.tei-c.org/release/doc/tei-p5-doc/en/html//index.html>.

Burnard, L. and Sperberg-McQueen, C. M., eds. 2006. TEI lite: Encoding for interchange: An introduction to the TEI — revised for TEI P5 release. <http://www.tei-c.org/release/doc/tei-p5-exemplars/html/teilite.doc.html>.

Driscoll, M.J. 2006. P5-MS: A general purpose tagset for manuscript description. Digital Medievalist 2.1. Accessed December 14, 2010.

Hansen, T. 2010a. Metadata for Diplomatarium Danicum texts. Technical report. Copenhagen: Society for Danish Language and Literature. <http://diplomatarium.dk/docs/Metadata_DD_texts.pdf>.

---. 2010b. General text format and markup for Diplomatarium Danicum texts. Technical report. Copenhagen: Society for Danish Language and Literature. <http://diplomatarium.dk/docs/General_text_format_DD_texts.pdf>.

Lavagnino, J. 2006. When not to use TEI. In Electronic textual editing, eds. Lou Burnard, Katherine O'Brien O'Keeffe, and John Unsworth. Modern Language Association. <http://www.tei-c.org/About/Archive_new/ETE/Preview/lavagnino.xml>.

Ortí, María Milagros Cárcel. 1997. Vocabulaire international de la diplomatique. Valencia: Commission internationale de diplomatique.

Simons, G.F., and Black, H.A. 2009. Third wave writing and publishing. SIL Forum for Language Fieldwork 2009-005. <http://www.sil.org/SILepubs/Pubs/52287/SILForum2009-005.pdf>.

Sperberg-McQueen, C.M., Huitfeldt, C., and Renear, A.H. 2000. Meaning and interpretation of markup. Markup languages: Theory and practice 2.3:215-234. <http://cmsmcq.com/2000/mim.html>.

Toffler, A. 1980. The third wave. New York: Bantam Books.