4. Corpus Documentation

When building your training corpus, you will gather many types of texts from the Internet, from libraries, and from existing linguistic resources. We will tokenize and annotate these texts and eventually publish them as linguistic data. The creation of these datasets raises several important questions.

  • Could an outsider assess the provenance of your texts and how they were collected?

  • Have you gathered a diverse and representative collection of texts?

  • Is the process of data collection and curation documented?

To ensure that the answers to these questions are available both for your project and for future reuse, we recommend documenting the process of gathering and preparing your texts. A thriving area of research focuses on what may become an entirely new profession dedicated to gathering, curating, processing, and publishing research data for machine learning. Several exemplary texts are linked below and offer practical guidance.

Your corpus could easily contain material from hundreds of different sources. You’re unlikely to remember the specifics of each one, so we recommend recording the following information each time you add a new text to the corpus. In the long term, this record will give you the means to manage the kinds of texts that are in the corpus.

  1. Provenance of the text

    • Where the text was found and how it was accessed (for example, were university credentials used to access the materials?)

    • The original format of the text (PDF, image, HTML)

    • How the text was collected (downloaded, requested, emailed…)

  2. Rights and licenses of the text

  3. Sampling and processing of the text

    • What portion of the original text was added to the corpus?

    • Were any sampling or deletion techniques used?
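One lightweight way to capture these fields is to append a small metadata record to a log file every time a text enters the corpus. The sketch below uses a Python dataclass written out as JSON lines; the field names and the `log_record` helper are illustrative choices, not a fixed standard.

```python
import json
from dataclasses import dataclass, asdict, field
from pathlib import Path

@dataclass
class CorpusRecord:
    """One metadata record per text added to the corpus.

    Field names are illustrative; adapt them to your project.
    """
    title: str
    source: str            # where the text was found
    access_method: str     # e.g. "downloaded", "emailed", "university credentials"
    original_format: str   # e.g. "PDF", "image", "HTML"
    license: str           # rights and license of the text
    portion_included: str  # what portion of the original was added
    sampling_notes: str = field(default="")  # any sampling or deletion techniques

def log_record(record: CorpusRecord, log_path: Path) -> None:
    """Append the record as one JSON line, so the log grows with the corpus."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```

A JSON-lines log like this stays human-readable, is trivial to load into a spreadsheet or pandas later, and never requires rewriting earlier entries when new texts are added.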

There is an entire NEH institute dedicated to legal literacies for text mining; further information is available on the institute’s website.