21. Practical Introduction to Model Training#

In this notebook, we will train a spaCy named entity recognition (NER) model using data from LitBank, an annotated dataset drawn from 100 works of English-language fiction.

Steps:
✅ Load annotation data from LitBank
✅ Create train and validation sets
✅ Train NER from scratch using only the EN language object
✅ Visualize the results and compare the model’s predictions against the original data
✅ Is the model sufficiently useful for research? What would need to be improved and changed?

Open In Colab

21.1. Installing dependencies & loading data#

First, we install spaCy (to train the model), scikit-learn (to split the data into training, validation, and test sets), and tqdm (for a nice progress bar).

We also clone the GitHub repo with the LitBank data.

#Install libraries
!pip install spacy scikit-learn tqdm
#Clone LitBank
!git clone https://github.com/dbamman/litbank.git
import spacy
#Show what version of spaCy we're using
print(f'Using spaCy version {spacy.__version__}')

Next, we create a list of the text files in the litbank/entities/brat directory and display how many texts were imported.

#Import the Path class from pathlib
from pathlib import Path
#Build the path to litbank/entities/brat
entities_path = Path.cwd() / 'litbank' / 'entities' / 'brat'
#Creates a list of text files in the path above
text_files = [f for f in entities_path.iterdir() if f.suffix == '.txt']
#Check that there are exactly 100 text files
assert len(text_files) == 100
#Show how many text files have been imported
print(f'[*] imported {len(text_files)} files')
[*] imported 100 files

21.2. Process LitBank data#

Here, we run each of the LitBank text files through spaCy, using only the sentencizer (not the other components of the default English pipeline, since we want to train a new model rather than reuse its existing predictions). We also extract each of the annotations from the corresponding LitBank .ann files (which mark people, places, and so on) and add them to an entity list for that text.
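For reference, each line of a brat .ann file is tab-separated: an annotation ID, then the label with start and end character offsets, then the annotated text itself. A made-up example line (not taken from LitBank) might look like this:

T1	PER 74 81	the man

The code below splits each line on tabs, takes the middle field, and splits it again on spaces to get the label and the two character offsets.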

# for each file, create a Doc object and add the annotation data to doc.ents
# our output is a list of Doc objects 
#Import spaCy, tqdm, and various utilities from spaCy
import spacy 
from tqdm.notebook import tqdm
from spacy.tokens import Span, DocBin
from spacy.util import filter_spans

#Creates a list of Doc objects that are the output from spaCy
docs = []

#Use a blank spaCy model
nlp = spacy.blank("en")
#Add the sentencizer to break the text into sentences
nlp.add_pipe('sentencizer') # used in training assessment

#With each text file, while showing a progress bar
for text_file in tqdm(text_files):
    #Read the file and run it through the pipeline
    doc = nlp(text_file.read_text())
    #Build the path to the corresponding .ann annotation file
    annotation_file = (entities_path / (text_file.stem + '.ann'))
    #Split the annotations by new lines
    annotations = annotation_file.read_text().split('\n')
    #Create a list for the entities
    ents = []
    #For each annotation (the final element is an empty string from the trailing newline, so we skip it)
    for annotation in annotations[:-1]:
        #Split the data on tab characters to separate the label, start, and end
        label, start, end = annotation.split('\t')[1].split()
        #Span is the text in the doc corresponding to the annotation
        span = doc.char_span(int(start), int(end), label=label)
        #Handles errors
        if span: # when start and end do not match a valid string, spaCy returns a NoneType span
            ents.append(span)
    #Remove duplicate or overlapping spans
    filtered = filter_spans(ents)
    #The entities we want are the filtered list
    doc.ents = filtered
    #Append the spaCy-analyzed text to the list of docs
    docs.append(doc)
    

assert len(docs) == 100
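
Before splitting the data, it can be helpful to spot-check that the annotations really ended up on the Doc objects. A minimal check (not part of the LitBank data itself) that prints the first few entities of the first document:

#Spot-check: print the first few entities of the first processed doc
for ent in docs[0].ents[:5]:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)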

21.3. Split data into sets for training and validation#

We don’t want to use all the data for training, because that would leave us without any data for checking the model’s accuracy. The training data is what the model actually learns from; the validation data is used during training to evaluate the model and choose the best version; the test data is held back until the very end, so it can serve as an unbiased final check against the human-annotated “gold standard”.

If you read general-purpose descriptions of the different data sets for model training, you may see references to hyperparameters (like the “learning rate”). spaCy’s built-in model training provides sensible defaults that you don’t necessarily need to modify, but if you’re interested in the details of what could be modified, you can check the documentation about the training config file.
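If you do want to experiment later on, individual hyperparameters can be overridden directly on the command line when we run spacy train in section 21.5, using dotted names that mirror the config file. The values below are purely illustrative, not recommendations:

#Hypothetical example: override the dropout and the maximum number of training steps
!python3 -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy --training.dropout 0.2 --training.max_steps 1000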

# Split the data into sets for training and validation 
from sklearn.model_selection import train_test_split

#Split the data into the training set (90%) and validation set (10%)
train_set, validation_set = train_test_split(docs, test_size=0.1)
#Split the validation set into the actual validation set (70%) and test set (30%)
validation_set, test_set = train_test_split(validation_set, test_size=0.3)
#Print how many docs are in each set
print(f'🚂 Created {len(train_set)} training docs')
print(f'😊 Created {len(validation_set)} validation docs')
print(f'🧪 Created {len(test_set)} test docs')
🚂 Created 90 training docs
😊 Created 7 validation docs
🧪 Created 3 test docs

21.3.1. Save the data sets#

From here, we save the training, validation, and test data sets.

#Import DocBin, a format for saving a collection of spaCy Doc objects
from spacy.tokens import DocBin

#Define a DocBin for training data
train_db = DocBin()
#For each doc in the training set
for doc in train_set:
    #Add it to the training DocBin
    train_db.add(doc)
#Save the resulting file
train_db.to_disk("./train.spacy")

# Define a DocBin for validation data, and do the same as above
validation_db = DocBin()
for doc in validation_set:
    validation_db.add(doc)
validation_db.to_disk("./dev.spacy") 

# Define a DocBin for test data, and do the same as above
test_db = DocBin()
for doc in test_set:
    test_db.add(doc)   
test_db.to_disk("./test.spacy") 

Here, we check that the files all exist and are of reasonable sizes given how we split the data (90% for training, with the remaining 10% split into 70% validation and 30% test).

!ls -al train.spacy dev.spacy test.spacy
-rw-r--r-- 1 root root  115753 Dec 23 08:20 dev.spacy
-rw-r--r-- 1 root root   53751 Dec 23 08:20 test.spacy
-rw-r--r-- 1 root root 1406959 Dec 23 08:20 train.spacy
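
If you want to go beyond checking file sizes, the saved DocBin files can also be loaded back and counted. A minimal sketch, reusing the nlp object from earlier:

#Load the training DocBin back from disk and count the docs it contains
from spacy.tokens import DocBin
reloaded = DocBin().from_disk("./train.spacy")
print(f'Reloaded {len(list(reloaded.get_docs(nlp.vocab)))} training docs')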

21.4. Create training configuration file#

Here, we create the configuration file we’ll need to actually run the training. We specify English as the language and a pipeline containing only the named-entity recognition (NER) component, and otherwise just use the defaults.

!python3 -m spacy init config ./config.cfg --lang en --pipeline ner -F
⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy

21.5. Model training#

The following code starts the training. The training output goes into a directory called output, and we define the paths to the training (train.spacy) and the validation (dev.spacy) data.

!python3 -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy
ℹ Saving to output directory: output
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2021-12-23 08:22:05,786] [INFO] Set up nlp object from config
[2021-12-23 08:22:05,792] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-23 08:22:05,794] [INFO] Created vocabulary
[2021-12-23 08:22:05,795] [INFO] Finished initializing nlp object
[2021-12-23 08:22:11,376] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00   1072.88    0.00    0.00    0.00    0.00
  2     200      18889.69  63358.78   35.62   28.97   46.21    0.36
  4     400      11414.18  27850.79   53.58   60.92   47.83    0.54
  6     600      24399.51  23338.55   54.53   64.79   47.08    0.55
  8     800      20359.37  18970.20   57.03   62.17   52.67    0.57
 11    1000       6056.26  15459.09   56.76   65.95   49.81    0.57
 13    1200       6623.85  13355.36   59.69   62.62   57.02    0.60
 15    1400       6450.33  10892.70   63.05   67.52   59.13    0.63
 17    1600       8249.34  10429.20   61.55   68.65   55.78    0.62
 20    1800       8022.00   8933.03   58.64   59.82   57.52    0.59
 22    2000      10772.98   8298.90   59.52   62.85   56.52    0.60
 24    2200       9276.31   7258.57   58.76   61.78   56.02    0.59
 26    2400       6356.31   5919.32   57.84   60.22   55.65    0.58
 28    2600       7711.42   5918.36   58.36   63.77   53.79    0.58
 31    2800      11050.54   5581.69   56.84   61.72   52.67    0.57
 33    3000       8032.93   4899.75   56.33   60.37   52.80    0.56
✔ Saved pipeline to output directory
output/model-last

21.6. Test the new model#

Finally, we can check how the model we just trained performs, using the test data set for comparison. The closer the model’s results are to the human-annotated test set, the better the model is performing. We’ll start by running the model on a random excerpt from the test set.

#Imports the random library to choose a random excerpt.
import random
#Displacy shows a nice visualization of spaCy data, including entities on text
from spacy import displacy 

#Load the model we just trained
new_nlp = spacy.load("output/model-last")
#Pick a random excerpt from the test data set.
val_doc = random.choice(test_set)
#Run the new model on the random excerpt
doc = new_nlp(val_doc.text)

#Show the first 100 tokens of the random document.
displacy.render(doc[:100], jupyter=True, style="ent")
CHAPTER I At sunset hour the forest LOC was still , lonely , sweet with tang of fir and spruce , blazing in gold and red and green ; and the man who glided on under the great trees seemed to blend with PER the colors PER and , disappearing , to have become a part of the wild woodland .
Old Baldy PER , highest of the White Mountains , stood up round and bare , rimmed bright gold in the last glow of the setting sun .
Then , as the fire dropped behind the domed peak , a change

To compare, let’s display the original, human-generated annotations.

# Display the original annotations in the same style
displacy.render(val_doc[:100], jupyter=True, style="ent")
CHAPTER I At sunset hour the forest LOC was still , lonely , sweet with tang of fir and spruce , blazing in gold and red and green ; and the man who glided on under the great trees PER seemed to blend with the colors and , disappearing , to have become a part of the wild woodland LOC .
Old Baldy LOC , highest of the White Mountains PER , stood up round and bare , rimmed bright gold in the last glow of the setting sun .
Then , as the fire dropped behind the domed peak , a change

It’s not always easy to see the differences right away: walk through the human-annotated text entity by entity, then check what the model did at the same point in the text. Common errors include getting the entity boundaries right but the label wrong (e.g. swapping LOC and PER), including too many words in an entity, and missing an entity entirely.
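
One way to make the comparison more systematic is to list the model’s predicted entities next to the human annotations for the same excerpt. This rough sketch reuses doc (the model’s output) and val_doc (the gold annotations) from above; it simply prints both sets of spans rather than aligning them token by token:

#List the model's predicted entities and the human-annotated entities
print('Model predictions:')
for ent in doc.ents[:15]:
    print(f'  {ent.text!r} -> {ent.label_}')
print('Human annotations:')
for ent in val_doc.ents[:15]:
    print(f'  {ent.text!r} -> {ent.label_}')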

21.7. Evaluation#

Is the model sufficiently useful for research? What would need to be improved and changed? To get concrete numbers, we score the trained model against the held-out test set with spaCy’s evaluate command, saving the results to litbank.json.

!python -m spacy evaluate output/model-last test.spacy --output litbank.json
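
Once the evaluation has run, the scores can be read back into the notebook. A minimal sketch, assuming the JSON contains spaCy’s usual entity metrics (overall precision, recall, and F-score under ents_p, ents_r, and ents_f, plus per-label scores under ents_per_type):

#Load the evaluation scores and print overall and per-label entity F-scores
import json
with open('litbank.json') as f:
    scores = json.load(f)
print('Precision:', scores['ents_p'])
print('Recall:', scores['ents_r'])
print('F-score:', scores['ents_f'])
for label, metrics in scores['ents_per_type'].items():
    print(label, metrics['f'])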