6. Text Encoding Initiative (TEI)

Since the late 1980s, the Text Encoding Initiative and the DH community have maintained TEI, a data standard for marking up humanities texts. TEI offers XML standards for modeling and marking up metadata, text, formatting, and other textual features of relevance to DH scholars. TEI digital editions and corpora are rich sources of information and reflect the expertise of the world’s foremost domain experts. Notable examples include the Perseus Digital Library at Tufts, which holds thousands of texts related to the ancient world, and the European Literary Text Collection (ELTeC), a corpus of TEI-annotated literature in 10 different languages.

As we work to create linguistic data for historical and low-resource languages, TEI corpora can supply both source texts and existing annotations. To make this data easier to extract, David Lassner created a standoff converter. This tool reads a TEI file and returns the plain text along with a dictionary of the annotations and their attributes.
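To make the idea of standoff conversion concrete, here is a minimal sketch that uses only Python's standard library. It is not the standoff converter's own API: the function name `to_standoff`, the sample document, and the output shape (a list of dictionaries with character offsets) are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# A tiny, made-up TEI fragment (hypothetical sample, not from a real corpus).
SAMPLE_TEI = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<p>Sing of <persName ref="#achilles">Achilles</persName> '
    'and his <hi rend="italic">wrath</hi>.</p>'
    '</body></text></TEI>'
)


def to_standoff(xml_string):
    """Return (plain_text, annotations) for a TEI/XML string.

    Each annotation records the element name, its character span
    in the plain text, and its attributes.
    """
    root = ET.fromstring(xml_string)
    chunks = []       # pieces of plain text, in document order
    annotations = []  # one entry per element
    offset = 0        # current position in the plain text

    def visit(elem):
        nonlocal offset
        begin = offset
        if elem.text:
            chunks.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            visit(child)
            # Text that follows a child still belongs to the parent.
            if child.tail:
                chunks.append(child.tail)
                offset += len(child.tail)
        annotations.append({
            "tag": elem.tag.replace(TEI_NS, ""),
            "begin": begin,
            "end": offset,
            "attrib": dict(elem.attrib),
        })

    visit(root)
    return "".join(chunks), annotations


plain_text, annotations = to_standoff(SAMPLE_TEI)
print(plain_text)
for ann in annotations:
    print(ann["tag"], plain_text[ann["begin"]:ann["end"]], ann["attrib"])
```

Because the annotations point into the plain text by offset rather than wrapping it in tags, the text can be passed to an NLP pipeline unchanged and the results aligned back to the original markup afterwards.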

The organizers hope that participants will make use of existing TEI corpora in their work and explore how TEI data can facilitate NLP tasks. We will also explore ways that NLP models can augment the process of annotating digital editions and other TEI projects.