4. Corpus Documentation
When building your training corpus, you will gather many types of texts from the Internet, from libraries, and from existing linguistic resources. You will then tokenize and annotate these texts and eventually publish them as linguistic data. Creating these datasets raises several vital questions.
Could an outsider assess the provenance of your texts and how they were collected?
Have you gathered a diverse and representative collection of texts?
Is the process of data collection and curation documented?
To ensure that the answers to these questions are available both for your project and for future reuse, we recommend documenting the process of gathering and preparing your texts. A thriving area of research focuses on what may be an entirely new profession dedicated to gathering, curating, processing, and publishing research data for machine learning. Several exemplary texts are linked below and offer some practical guidance.
Datasheets for Datasets
Your corpus could easily contain text from hundreds of different sources. You are unlikely to remember the specifics of each one, so we recommend recording the following information each time you add a new text to the corpus. In the long term, this record will give you the means to manage the kinds of texts that are in the corpus.
Provenance of the text
Where the text was found and how it was accessed (for example, were university credentials used to access the materials?)
The original format of the text (PDF, image, HTML)
How the text was collected (downloaded, requested, emailed…)
Rights and licenses of the text
Is the text in the public domain or in copyright?
For books published after 1975, you can search the Library of Congress copyright database
Sampling and processing of the text
What portion of the original text was added to the corpus?
Were any sampling or deletion techniques used?
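One way to keep this information is a small structured log that gains one entry each time a text is added. The sketch below is only an illustration of the questions listed above: the `TextRecord` class, its field names, the helper functions, and the JSON Lines log format are all assumptions made for this example, not a required schema.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import List


@dataclass
class TextRecord:
    """One entry per text added to the corpus (field names are illustrative)."""
    title: str
    source: str              # where the text was found (URL, library, archive)
    access_method: str       # e.g. "open web", "university credentials"
    original_format: str     # e.g. "PDF", "image", "HTML"
    collection_method: str   # e.g. "downloaded", "requested", "emailed"
    rights: str              # e.g. "public domain", "in copyright"
    portion_included: str    # what portion of the original was added
    sampling_notes: str = ""  # any sampling or deletion techniques used


def append_record(record: TextRecord, log_path: Path) -> None:
    """Append one record as a single line of JSON so the log grows with the corpus."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def load_records(log_path: Path) -> List[TextRecord]:
    """Read every non-empty line of the log back into TextRecord objects."""
    with log_path.open(encoding="utf-8") as f:
        return [TextRecord(**json.loads(line)) for line in f if line.strip()]
```

Calling `append_record(TextRecord(...), Path("corpus_log.jsonl"))` whenever a text enters the corpus keeps the log in step with the data, and `load_records` lets you later count texts by format, rights status, or collection method. The file name here is a placeholder.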
There is an entire NEH institute dedicated to legal literacies for text mining. Further information on the institute can be found here.
4.1. Copyright
Does your data violate copyright laws?
There’s no single good answer to this question, and we encourage you to seek answers that are specific to the texts in your corpus, their authors, and the copyright laws that apply to them. That said, it is common practice to omit sections of in-copyright materials so that it would be impossible to reconstruct the original work from your data. You might choose to sample a certain percentage of the text. In the United States, less than 10% of a work or no more than one chapter of a book is often considered “fair use” for educational purposes. For more specific suggestions, there is a helpful chapter on copyright in Corpus-Based Language Studies: An Advanced Resource Book.
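Sampling a percentage of a text can be sketched as follows. This is a minimal illustration, not a legal standard: the function name `sample_passages`, the whitespace-based word count, and the default limit of 10% are assumptions chosen to mirror the figure mentioned above, and the right limit depends on the texts and laws involved. The function picks passages (sentences or paragraphs) at random until adding another would exceed the word budget, then returns the kept passages in their original order.

```python
import random
from typing import List


def sample_passages(passages: List[str], max_fraction: float = 0.10, seed: int = 0) -> List[str]:
    """Randomly keep passages whose combined word count stays within
    max_fraction of the total word count of all passages."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")

    # Approximate word counts by splitting on whitespace.
    lengths = [len(p.split()) for p in passages]
    budget = max_fraction * sum(lengths)

    # Visit passages in a reproducible random order.
    order = list(range(len(passages)))
    random.Random(seed).shuffle(order)

    kept, used = [], 0
    for i in order:
        # Keep a passage only if it still fits within the word budget.
        if used + lengths[i] <= budget:
            kept.append(i)
            used += lengths[i]

    # Return the kept passages in their original document order.
    return [passages[i] for i in sorted(kept)]
```

Recording the `max_fraction` and `seed` values alongside each text makes the sampling step reproducible and documents which deletion technique was used.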