© Fred Gibbs, 2011. Creative Commons Attribution-NonCommercial licence
Citations from the text of this article should be by paragraph number (found on the ID attribute of the p element).
New technologies and methodologies in the digital humanities can help alleviate some limitations inherent in the traditional methods of creating and publishing critical editions, especially how typical practices privilege major texts and create an artificial version of a text that obscures its textual history. I argue that those who work with manuscripts should place a greater emphasis on creating digital noncritical editions that will capture traditionally lost transcription work, harness community expertise, and create a vast interdisciplinary textual archive. This article describes some key benefits of a Platonic web-based transcription tool that will encourage large-scale collaborative transcription and editing in order to make manuscripts much more visible, accessible, connectable, correctable, and usable.
Finding manuscripts relevant to a particular research project, as well as understanding how such texts changed over time, remains a daunting task for many medievalists. This challenge persists even in the face of increasing digitization efforts because many manuscripts (at least of medieval and early modern texts) remain viewable only as images rather than accessible as full text. Regardless of kind, digital versions remain largely isolated from each other in libraries and individual project silos. Needless to say, the time, effort, and expense required to produce full-text resources seriously hinder their production. Such limitations carry several unfortunate consequences: firstly, minor texts or texts without obvious utility get short shrift; secondly, a tremendous amount of parsing, evaluating, and transcribing of understudied manuscripts gets left behind; and thirdly, invisibility and disconnectedness between manuscripts constrain our ability to build new research corpora.
New technologies and methodologies in the digital humanities can help meet some of these challenges. I do not mean that they might make traditional editing or transcription practices more efficient. Rather, I would like to argue for a greater emphasis on creating digital noncritical editions that can capture traditionally lost transcription work, harness community expertise, and create a new kind of textual archive. I offer here a theoretical justification for community transcription practices and the textual archive they will produce. In what follows, I describe some key benefits of embracing a web-based transcription tool that would provide a number of advantages over conventional textual practices. In particular, I argue that such a tool can encourage large-scale collaborative transcription and editing in order to make more manuscripts much more visible, accessible, connectable, correctable, and usable.
The venerated critical edition has served as the primary vehicle for delivering medieval and early modern manuscripts to scholars who need them. Yet two particular criticisms of traditional editorial practices, especially as clearly formulated by Jerome McGann (1983), have echoed throughout the last several decades:
Both practices obscure the textual transformations between
editions (and influences of related works) that can teach us about the production,
evolution, and transmission of the texts themselves. Rather than compress data and
redact textual variations to get the
As the availability of and access to manuscripts have grown, so too has
the desire to improve the granularity of our knowledge about which texts were
available at a certain place at a certain time. One brief example from my own field
of study, medieval medicine, can illustrate the point. An important but somewhat
enigmatic twelfth-century text on women's medicine known as the
Although the shortcomings of the critical edition have been
discussed for some time, little could be done to respond to them in practice. The
limitations and conventions of traditional publishing and scholarly practice, for
example, have made it virtually impossible to print variant editions of manuscripts,
or to edit them in large collaborative teams. So how do we make more texts visible
and available, whether for their individual value or value as part of a larger
research corpus? How can we embrace the unstable and
Edward Vanhoutte has suggested that the reason for neglect of noncritical editing in theory and practice is a "lack of a satisfactory ontology of the text on which a methodology of noncritical editing can be modeled" (Vanhoutte 2006). In my view, this problem is not sufficiently different from the one that persists with critical editing itself: an editor or transcriber must always make decisions about what constitutes the text. Instead, I would suggest a much less sophisticated answer: that there has not been any practical way to create noncritical editions that would not be prohibitively idiosyncratic and that could be used by the community at large to put more manuscripts in conversation with each other. Putting manuscripts in conversation is, of course, one of the principal reasons for noncritical editing in the first place.
The creation of noncritical texts certainly raises new issues of authority and quality. Bob Rosenberg tells us that "the most important point to be made about any digital documentary edition is that the editors' fundamental intellectual work is unchanged" (Rosenberg 2006). Arguably, creating metadata and mark-up does in fact create new intellectual challenges and choices for the editor. Regardless, I here plead for a new kind of documentary electronic edition where the fundamental intellectual goals and practices are in fact rather different: I am suggesting a shift in values from privileging the critical edition to prioritizing the creation of visible manuscript text. However, I do not argue against the critical edition so much as suggest some textual practices that can co-exist with it and provide complementary functions.
Before outlining some advantages for community transcription of noncritical editions, I should lay bare a few presumptions I have about the gap between theory and practice with respect to the future of textual analysis. First, I wholly indulge in the fantasy that centralized databases and other repositories of metadata can be obviated through the widespread application of well-standardized semantic web technologies. But this is, of course, like Tantalus's next meal, continually out of reach. One reason for this is that the technical difficulties and heavy labor requirements create a serious bottleneck for producing appropriately encoded and marked-up texts. Another reason is that there is hardly any agreement about which of several viable standards will be most usable in the long term. In both the short and long term, we need a more scalable solution than individual mark-up projects that, despite laudable goals, tend to rediscover the difficulties of text encoding. Second, despite promising recent advances, usable OCR (optical character recognition) remains significantly far off with respect to both manuscripts and even early printed texts. The variety of characters, hands, and layouts will make manuscript OCR a significant challenge for quite some time. Even once we have more reliable OCR technology, it would be nice to have an infrastructure to allow the manuscripts to be viewed together and improved by user expertise.
I want to emphasize that my interest here lies in promoting the methodological practice of archiving quick and dirty transcriptions, rather than solving all of the technical and design challenges that such a transcription tool presents (though I would argue that they are best solved in practice, anyway). An open web platform that can uniformly implement mark-up standards and avoid impossible-to-maintain hardware and software requirements will make new texts available, a boon to scholars across all disciplines. What follows is by no means an exhaustive list, but it presents some advantages of such a methodology.
Even though relatively few scholars work predominantly as textual editors, many others often engage in localized textual editing efforts as part of larger research projects. In this way, we are all part of a decentralized team working toward the same (indirect) goal of making manuscripts more usable. A quick thought experiment: imagine if all the rough transcription that scholars have done over the centuries (work that has been reduced to a few quotations in footnotes) had been more fully preserved and were easily accessible. How might our texts, and especially our interpretations based on them, differ? Recent ease of publication and distribution makes it almost trivial to create such an archive from now on. Towards this end, I suggest that the scholarly community should think less about editions and more about versions of texts, configurable texts, and working in textual communities that will help scholars leverage community experience and expertise. Thus, a web-based transcription tool gives a practical embodiment to Reiman's aging but still insightful suggestion to emphasize "versioning" over "editing" (Reiman 1987). It encourages us to shift our emphasis from idiosyncratic final texts to the processes and practices of revealing and connecting texts as a collaborative effort.
I contend that embracing the notion of textual communities will
dramatically increase the visibility and usability of manuscripts as a whole, as well
as the possibilities for interdisciplinary work. While partial transcriptions are, of
course, unsuitable for traditional publications, availability of texts no longer
needs to be bottlenecked by antiquated academic convention. Even with incomplete or
imperfect transcriptions, the resulting increase in visibility will make the rich
manuscript tradition accessible to high-level searches that scholars have come to
rely on. Furthermore, researchers will be able to create
One of the principal criticisms of community transcription has
been called the Babel objection: the idea that inferior contributions will create so
much extra noise to filter out that we won't end up with anything useful at all.
Won't we be creating essentially a black hole of data with no hope of separating the
wheat from the chaff? What do we do with all the junk? Perhaps we need, to borrow a
phrase from Bill Turkel, a methodology for the infinite archive, like better storing
and searching protocols. But the reality is that transcriptions of medieval and early
modern texts will always be far from infinite or even overabundant. Ultimately, the
I have used the term "community transcription," but surely many readers will recognize this approach as crowd-sourcing. But there is an important distinction that must not be overlooked when thinking about how to build scholarly research corpora. To think of Wikipedia (as many do, in my experience) and its highly variable article quality as representative of what will happen with community transcription is to make a category mistake. While just about anyone might feel like they can contribute to Wikipedia (indeed, that is the point), users of an online transcription tool for medieval manuscripts are a far more self-selecting group. While anyone could view texts, user registration would be required to edit texts. To assume that all work must be vetted by a firm editorial voice is to ignore the vast potential of highly trained and motivated community practitioners who want to work together to discover relevant texts.
While quality of data remains a valid concern, I side with Anselm in that something that exists in reality is better than something that exists in the mind. Practically speaking, even when transcription quality is in doubt, it will be relatively easy for a researcher to determine if a manuscript warrants further study for a particular research project. Having an unrepresentative variant of a text is far better than having no knowledge of a text's existence. That is, we ought to prioritize visibility over accuracy. Another similar concern is that scholars will be confronted with too many adequately transcribed but simply unnecessary variants. But I propose that this tool encourages just the opposite. With a light editorial hand and proper interface, the greater visibility will bring more texts into the field of view and encourage engagement with them. Gradual emergence of standard or typical readings will come from community consensus and practice. Variations will be quickly viewable, but will not stand in the way.
It must be emphasized that the goal of collaborative online
editing is not perfect diplomatic transcriptions or mark-up. Nor do I suggest that
crowd-sourced transcriptions serve the same function as, or could replace, the
time-honored critical edition. But even for cases of texts that have been heavily
edited over time, the tool provides easy ways of viewing change and particular
editions. More importantly, a transcription tool focused on community contributions
over time, even if partial and imperfect, can free scholars from the constraints of
the critical edition, and let people see texts that
The quality of transcriptions from the community at large, at
least in the short term, is perhaps not as useful for philologists or linguists, who
often require the most precise possible transcriptions, as well as transparency in
the interpretive work done between the manuscript itself and its transcription. The
general editing principles behind the tool, and the slightly more uncertain editorial
authority, make precise textual work problematic. But this tool
Even if the benefits to the community are clear, why would individuals bother to use an online tool for transcribing medieval manuscripts? At an entirely functional level, such a collaborative approach can help with transcription challenges — like making sense of unusual abbreviations, unfamiliar words, or obscure references — by drawing on the collective intelligence and experience of the community. Users could, of course, silently expand abbreviations during rough transcription (as they often do). But they could also quickly represent them with regular keyboard characters (faster than finding Unicode values), creating over time a dictionary of abbreviations that can be used to provide suggestions when transcribing.
It should be emphasized that users are not obliged to use this functionality; a transcriber need not enter abbreviations at all. Obviously, preserving arbitrary scribal characters is a huge task in itself and adds considerable time to the task. But again, because the primary goal is visibility, a fully diplomatic or complete transcription is not as crucial, especially since no single standard for capturing the many variations has ever emerged (Vander Meulen and Tanselle 1999).
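The abbreviation dictionary described above can be sketched in a few lines. What follows is only an illustration under stated assumptions: the class name `AbbreviationDictionary`, its methods, and the keyboard shorthand `p_` are all hypothetical, invented here to show how recorded expansions might accumulate and then be offered back as suggestions, ranked by how often the community has used them.

```python
from collections import Counter, defaultdict


class AbbreviationDictionary:
    """Hypothetical community-built map from keyboard shorthand to expansions."""

    def __init__(self):
        # shorthand -> Counter of expansions contributed by transcribers
        self._expansions = defaultdict(Counter)

    def record(self, shorthand, expansion):
        """Store one transcriber's expansion of a shorthand form."""
        self._expansions[shorthand][expansion] += 1

    def suggest(self, shorthand, limit=3):
        """Return recorded expansions, most frequently used first."""
        counts = self._expansions.get(shorthand)
        if counts is None:
            return []  # nobody has recorded this shorthand yet
        return [expansion for expansion, _ in counts.most_common(limit)]


# Invented sample data: three transcribers expand the same shorthand.
abbreviations = AbbreviationDictionary()
abbreviations.record("p_", "per")
abbreviations.record("p_", "per")
abbreviations.record("p_", "par")

suggestions = abbreviations.suggest("p_")
print(suggestions)
```

Ranking by frequency is only one plausible design; a real tool might instead weight suggestions by language, date, or scribal hand.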
With respect to preserving the visual and linguistic artifacts of a manuscript itself, semantic web technologies and descriptive mark-up schemas like TEI hold great promise not only for their ability to preserve document structure, but also for the way in which they can help scholars find texts relevant to their research that would otherwise remain unknown to them. But the learning curve is steep, and text encoding projects remain slow and expensive.
To improve matters, a community transcription tool will significantly lower the barrier to entry and encourage mark-up of texts. To be sure, this is a complex user interface challenge. This is not the place to hash out design solutions, however, but rather to re-orient our thinking about how and why mark-up can and should be carried out incrementally by individuals over time in order to realize the potential of text encoding and further improve visibility and connectivity of manuscripts. Of course, users would not be required to mark up texts that they transcribe, but a highly polished interface for transcription offers the perfect platform to enable basic TEI mark-up of broad structures. Admittedly, marking up a complex revision process will continue to require dedicated editors. But as with the transcriptions of the texts, mark-up completeness is not essential. The choice is not between doing it right and not doing it at all, provided that we can expect and embrace incremental advancements from the community.
Any effort to bridge theory and practice of electronic noncritical text editing must address (at least) two primary needs. First, to provide a way of maintaining a historical record of a text that has been edited by the community: who has done what, and when? Second, to mediate between authority and autonomy — that is, to allow researchers to contribute changes that they think are valuable, even to the same text at the same time — while retaining the ability for individual users to decide what they want and don't want to use, or even see.
To address both issues, I suggest that we borrow from the principles and practices of open-source software development, namely the use of Distributed Version Control (DVC). Such a system maintains versions of texts that are publicly available, and yet also allows users to create private transcriptions that may or may not then be returned to the community. Distributed version control improves on the premise of centralized version control, in which everyone must take one version of a text as the master copy, and which thus remains limited by centralized and top-down editorial authority. DVC is much more flexible in that regard. Even though the tool would provide a central repository for transcribed texts, it does not require everyone to work with the same version of the document at the same time. People can work on different parts independently, sharing or not sharing work as they go. In this way, the advantage of distributed versions over centralized versions is that on the whole they mediate between authority and autonomy. Despite some extra overhead and logistical challenges, decentralization retains crucial freedom for individual editors. At the same time, version control maintains authority: researchers can know who has done what with the transcriptions. DVC also enables citation of changing texts. Because it maintains a full history of edits, it is possible to view (and cite) the text as it was at any given time.
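To make the record-keeping claims concrete (who has done what, and the ability to cite a text as it stood at a given time), here is a minimal sketch. It models only a single linear history held in memory, not the branching and sharing between repositories that a real distributed system such as Git provides. The names `Revision`, `TranscriptionHistory`, `commit`, and `as_of`, along with the sample editors and text, are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Revision:
    revision_id: str          # content-derived identifier, usable in a citation
    parent_id: Optional[str]  # the revision this one was based on
    editor: str
    timestamp: datetime
    text: str


class TranscriptionHistory:
    """Hypothetical linear edit history for one transcription."""

    def __init__(self):
        self._revisions = []

    def commit(self, editor, text, timestamp):
        """Record a new version together with its editor and time."""
        parent_id = self._revisions[-1].revision_id if self._revisions else None
        payload = f"{parent_id}|{editor}|{timestamp.isoformat()}|{text}"
        revision_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()
        revision = Revision(revision_id, parent_id, editor, timestamp, text)
        self._revisions.append(revision)
        return revision

    def as_of(self, moment):
        """Return the latest revision made at or before `moment`, or None."""
        earlier = [r for r in self._revisions if r.timestamp <= moment]
        return max(earlier, key=lambda r: r.timestamp) if earlier else None

    def log(self):
        """List (editor, timestamp) pairs: who has done what, and when."""
        return [(r.editor, r.timestamp) for r in self._revisions]


# Invented sample data.
history = TranscriptionHistory()
first = history.commit("alice", "incipit liber",
                       datetime(2011, 3, 1, tzinfo=timezone.utc))
second = history.commit("bob", "incipit liber primus",
                        datetime(2011, 6, 1, tzinfo=timezone.utc))

cited = history.as_of(datetime(2011, 4, 15, tzinfo=timezone.utc))
print(cited.editor, cited.revision_id[:8], cited.text)
```

Because each identifier is derived from the content and its parent, a citation to a revision identifier points to exactly one state of the text, however much the transcription changes afterwards.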
Leaving aside the technical details, the workflow might go something like this. First, anyone interested in working on a particular text will get or create a version of it. When done with a discrete set of edits (smaller ones are easier to manage), they upload them to the repository. Any conflicts with others who have edited the same part of the same document in the meantime are reconciled (this happens more in software development than it will in manuscript transcription, I imagine). The edits might then be approved by one of many editors, who makes sure they are reasonable but imposes no other editorial control. They then pass into the hands of the community, where they might be reviewed, reused, or lie dormant. When contributors upload transcriptions to the repository, DVC software can automatically merge changes that are independent of each other, but direct conflicts must be resolved manually. Of course, conflicts do not need to be resolved at all. Although doing so is less practical with code, with manuscripts it will often be valuable to maintain multiple (possible) versions of a text. TEI gives us the ability to obviate conflicts by simply embedding the variants within a single edited version of the text. This model has been used successfully, if somewhat opaquely, by papyri.info, and should be extended to more complicated textual traditions as well.
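The merging behavior just described can be illustrated with a deliberately simplified sketch. It assumes that the base text and both edited versions are transcribed line for line, so that lines can be compared by position; real DVC software uses more general diffing. Independent changes are merged automatically, and a direct conflict is not resolved but embedded as variant readings using the TEI `<app>` and `<rdg>` elements. The function name `merge_aligned`, the use of the `resp` attribute to credit each contributor, and the sample Latin lines are all assumptions made for illustration, not the encoding used by any particular project.

```python
from xml.sax.saxutils import escape, quoteattr


def merge_aligned(base, ours, theirs, our_id, their_id):
    """Three-way merge of line-aligned transcriptions (hypothetical helper).

    Lines changed by only one contributor are accepted automatically;
    lines changed differently by both are kept as TEI variant readings.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch assumes line-for-line aligned texts")

    merged = []
    for base_line, our_line, their_line in zip(base, ours, theirs):
        if our_line == their_line:
            merged.append(escape(our_line))    # unchanged, or same change
        elif our_line == base_line:
            merged.append(escape(their_line))  # only they changed it
        elif their_line == base_line:
            merged.append(escape(our_line))    # only we changed it
        else:
            # Direct conflict: embed both readings instead of choosing one.
            merged.append(
                "<app>"
                f"<rdg resp={quoteattr('#' + our_id)}>{escape(our_line)}</rdg>"
                f"<rdg resp={quoteattr('#' + their_id)}>{escape(their_line)}</rdg>"
                "</app>"
            )
    return merged


# Invented sample data: two contributors edit the same base transcription.
base = ["in nomine domini", "incipit liber", "de passionibus", "mulierum"]
alice = ["in nomine domini", "incipit liber primus", "de passionibus",
         "mulierum curandarum"]
bob = ["In nomine Domini", "incipit liber", "de passionibus",
       "mulierum curandorum"]

merged = merge_aligned(base, alice, bob, "alice", "bob")
for line in merged:
    print(line)
```

In this example the first two lines merge without intervention because each was changed by only one contributor, while the last line keeps both readings side by side for later readers to weigh.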
Additionally, as the humanities rethink how to recognize digital and non-traditional scholarship, DVC can, as mentioned, track edits by particular users and thus provide a mechanism to recognize (even partial) transcription work as a serious contribution to the scholarly community. The criticism that quantification will encourage non-substantive contributions to inflate the apparent value of one's effort is unoriginal (and happens even now), and is easily mitigated through interface design and community convention.
Participating in a community transcription effort to create a textual database makes it easier to situate one's own texts in the context of other texts and discover relationships that would otherwise remain invisible. Perhaps we might benefit from a single, authoritative archive for manuscripts, but that's not what I'm arguing for here. Rather, I'm suggesting that a tool agnostic to both library and project affiliations can complement existing cataloging projects, like Manuscriptorium and the ENRICH project, to create a powerful new collaborative environment that unifies public archives and the private workspace. Indeed, the recent API Workshop held at the Maryland Institute for Technology in the Humanities spawned an impromptu session that quickly agreed upon the need for creating a generic transcription tool that could be used by various transcription projects to help connect their textual resources. While the enthusiasm was directed primarily at the value of community transcription, I want to emphasize the value not only of the transcription functionality, but also of the much larger archive of texts it could create — a feature that does not seem to be a high priority for most transcription projects.
On a broad scale, even small contributions of rough
transcriptions will, over time, vastly improve our documentary knowledge as a whole
by aggregating individual research projects. One advantage, in the case of the
Such features are heavily dependent on an unobtrusive, functional, and intuitive interface. Texts must be easily (re)configured, and variations between versions must be easily displayed or hidden. As mentioned earlier, I want to emphasize that this is a design/interface problem, not a problem with the idea of collecting as much text as possible. All data is good, as long as it can be managed. Fortunately, web interface technologies have advanced to the point where this is no longer the Sisyphean task it once was.
It is perhaps worth mentioning that a useful transcription tool would not, and perhaps should not, need to function as an image presentation platform as do promising tools like T-PEN and Scripto, at least at the outset. The transcriber might be sitting in front of the actual document, a photocopy, a microfilm machine, a PDF, or an image from elsewhere, such as a library website or even Google Books. The effort to embed text in an image (transcription as annotation) can certainly bring exciting research possibilities (Lecolinet, Robert, and Role 2002), especially with efforts toward standardization like the recent work of the Open Annotation Collaboration and projects like TILE. The idea is certainly well worth pursuing, but issues with ownership, copyright, and similar contingencies hinder practical implementation and seriously restrict the kinds of texts that could be transcribed; it might be best left for later development rather than made a prerequisite for a community transcription tool. An early focus on images may also discourage use of the tool for capturing casual transcription, regardless of the text's medium.
I have argued that embracing the notion of community-driven, noncritical transcriptions will make dramatic progress toward discovering new textual traditions. By providing incentives for both individual and community participation, a web transcription tool will help reveal relevant texts, encourage cross-disciplinary work, and illuminate the development of ideas and texts over time.
To return to the critical edition for a moment, the adoption of a community transcription tool frees scholars from the biases of the single editorial voice. Similarly, it allows freedom from authorial intention as the central editorial principle, in favor of versioned texts that could be used individually or in aggregate. A greater focus on preserving quick and dirty transcription will provide a valuable complement to canonical editions and make available more versions of manuscripts that actually existed, as well as texts that would never get a critical edition in the first place.
There is little doubt that any success of web-based
collaborative transcription will depend on embracing new technologies, practices, and
interfaces. Certainly, realizing any of the theoretical benefits will require new
workflows and overcoming complex user-interface design challenges. Exactly how the
interface(s) should look and work requires an article in its own right, and best
practices and consensus will likely emerge only after significant community
engagement with some experimental prototypes. But what I am advocating here is not
fundamentally about technology or design, but embracing transparency and openness in
the ways in which we make texts available. This means shifting values toward creating
and maintaining an archive of imperfect, but