1 Background: XML and TEI
These guidelines assume familiarity with basic concepts of XML and the TEI system. This section serves as a brief review of some key points and refers readers in need of more extensive background information to external resources.
1.1 XML
XML (“eXtensible Markup Language”) is a system for (among other things) adding structured, machine-readable metadata to text-based documents. It is maintained as an international standard by the World Wide Web Consortium (W3C).
Metadata is data about data. For example, in our context, the sentence “The text used here is B. Jowett’s translation of The Dialogues of Plato, Vol. I, Random House, New York.” is a piece of data. The information that the sentence was an end-note to page 135 of Ricœur’s “The Function Of Fiction In Shaping Reality” and that it was added by the translator is metadata.
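In XML's concrete syntax, metadata is written using tags, which are enclosed in angle brackets: an opening tag such as <date> is paired with a closing tag such as </date>, and together they mark the text between them. (This is a simplification: comments, declarations, and processing instructions in XML also use angle brackets, but are not considered tags in this sense.) For example: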
<bibl>Paul Ricoeur, "The Function Of Fiction In Shaping Reality", in Man and World 12:2 (<date subtype="thisIsOriginal" type="publication" when="1979">1979</date>), 123-141</bibl>
XML provides a shorthand for writing tags that are immediately closed: for example, writing <pb n="0" /> is equivalent to <pb n="0"></pb>.
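The equivalence of the two forms can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name describe is hypothetical and is not part of any Digital Ricœur tool.

```python
import xml.etree.ElementTree as ET

# Parse the shorthand (self-closing) form and the explicit open/close form.
short_form = ET.fromstring('<pb n="0" />')
long_form = ET.fromstring('<pb n="0"></pb>')


def describe(element):
    """Summarize an element as (name, attributes, text, number of children)."""
    return (element.tag, dict(element.attrib), element.text or "", len(element))


# Both spellings yield the same element: same name, same attributes, no content.
print(describe(short_form) == describe(long_form))  # True
```

Once parsed, nothing records which spelling was used in the file, which is why the two can be treated as interchangeable.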
While the concrete syntax of an XML document looks like a sequence of characters, much of the power of XML derives from the fact that an XML document actually specifies a tree data structure of nested elements. An element is an abstract, logical entity which may contain textual data and/or other elements.
In the example above, the whole example is a bibl element, which contains both textual data (a human-readable citation) and a date element, which marks part of the citation as specifying a publication date.
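To make the tree structure concrete, the sketch below parses the bibl example with Python's standard xml.etree.ElementTree module and inspects its parts. It is only an illustration: the variable names are assumptions, and ElementTree's convention of storing the text that follows a child element in that child's tail attribute is a detail of this particular library, not of XML itself.

```python
import xml.etree.ElementTree as ET

source = (
    '<bibl>Paul Ricoeur, "The Function Of Fiction In Shaping Reality", '
    'in Man and World 12:2 (<date subtype="thisIsOriginal" type="publication" '
    'when="1979">1979</date>), 123-141</bibl>'
)

# The root of the tree is the bibl element.
bibl = ET.fromstring(source)

# Textual data that appears before the first child element.
leading_text = bibl.text

# Child elements nested inside bibl (here, only the date element).
children = [child.tag for child in bibl]

# The date element's own text, and the text that follows it inside bibl.
date = bibl.find("date")
trailing_text = date.tail

print(bibl.tag)             # bibl
print(children)             # ['date']
print(repr(date.text))      # '1979'
print(repr(trailing_text))  # '), 123-141'
```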
Readers will notice the close relationship between elements, the abstract, logical entities, and tags, the notations in XML’s concrete syntax that mark them. In practice, “element” and “tag” are often used synonymously.
In addition to its contents, an element may have attributes, which provide additional machine-readable metadata about the element. Each attribute has a name and, when present, is assigned a value. In our example, the date element has an attribute named when with a value of "1979". This attribute encodes the date specified by the element in a standard, machine-readable format.
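As an illustration of why machine-readable attribute values are useful, the following sketch reads the when attribute with Python's standard xml.etree.ElementTree module. Looking up notAfter here is only a way to show what happens when an attribute is absent from an element.

```python
import xml.etree.ElementTree as ET

date = ET.fromstring(
    '<date subtype="thisIsOriginal" type="publication" when="1979">1979</date>'
)

# Attribute values are always strings; a program can convert them as needed.
year = int(date.get("when"))

# Asking for an attribute the element does not carry returns None.
missing = date.get("notAfter")

print(year)     # 1979
print(missing)  # None
```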
We also rely on a very minimal understanding of the XML concept of entities. In XML’s concrete syntax, the characters & and < have special meaning, and therefore are not allowed in textual data. They must be replaced with the corresponding XML entities &amp; and &lt;, respectively. (For Digital Ricœur, this is done automatically by “TEI Lint” or, if preparing a document manually, one of our command-line tools: see Getting Started for more details.) No attempt is made here to explain the other, more advanced uses of entities in XML.
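The replacement of special characters can be sketched with Python's standard library. This is not how “TEI Lint” or the project's command-line tools are implemented; it only illustrates the rule, using the escape function from xml.sax.saxutils (which also escapes the > character) and a made-up sample string.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Time & Narrative: if a < b"

# Without escaping, the text is not well-formed XML and the parser rejects it.
try:
    ET.fromstring("<p>" + raw + "</p>")
    rejected = False
except ET.ParseError:
    rejected = True

# After escaping, the special characters are written as entities.
escaped = escape(raw)
print(escaped)  # Time &amp; Narrative: if a &lt; b

# A parser turns the entities back into the original characters.
element = ET.fromstring("<p>" + escaped + "</p>")
print(rejected, element.text == raw)  # True True
```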
XML is specifically an “extensible” markup language because, beyond the common concrete syntax of tags and its interpretation as elements, attributes, and entities, it makes little attempt to specify the structure or meaning of an XML document. Those aspects are left to specific applications of XML, which can vary from recipes to entries in library catalogues. They will typically be codified in a Document Type Definition (DTD), which is a formal, machine-checkable specification for the structure of an XML document. Many projects in the humanities (including ours) use Document Type Definitions based on the TEI model, which is described below.
1.1.1 Further Reading
Many systematic introductions to XML for beginners are freely available online, such as the XML Tutorial from the website “W3Schools.” Note that many of these tutorials cover far more detail about XML than is necessary to contribute to this project.
The W3C publishes a page called XML Essentials.
1.2 TEI
As discussed above, the XML standard itself does not specify what elements exist, the semantic meanings of particular elements, or how the hierarchy of elements and textual data should be structured in a document. The Text Encoding Initiative consortium (TEI) publishes an XML-based standard (also referred to as TEI) suitable for many projects in the humanities. This standard is described at https://www.tei-c.org.
The TEI standard is what tells us, for example, that the p element means “this is a paragraph,” as well as specifying the structure for the catalog information in the teiHeader element.
However, because the TEI standard aims to define elements to meet the needs of many diverse projects (from original poetry to facsimiles of manuscripts), individual projects must define smaller, more targeted Document Type Definitions that address their precise use cases. The TEI consortium provides a variety of tools to define such customizations with relative ease.
Digital Ricœur’s specific customization of the TEI standard is known as DR-TEI.dtd. It comes with documentation automatically generated by the TEI consortium’s tools, which is available at DR-TEI_doc.html. We also impose additional requirements on our TEI documents that are not easily specified using a custom DTD: these requirements are specified in this manual and are checked by the tools described under Tools.