21. Practical Introduction to Model Training#
In this notebook, we will train a spaCy named entity recognition (NER) model using data from LitBank, an annotated dataset of 100 works of English-language fiction.
Steps:
✅ Load annotation data from LitBank
✅ Create train and validation sets
✅ Train NER from scratch using only the EN language object
✅ Visualize the results and compare the model’s predictions against the original data
✅ Consider whether the model is sufficiently useful for research, and what would need to be improved and changed
21.1. Installing dependencies & loading data#
First, we install spaCy (to train the model), scikit-learn (to split the data into training and validation sets), and tqdm (for a nice progress bar).
We also clone the GitHub repo with the LitBank data.
#Install libraries
!pip install spacy scikit-learn tqdm
#Clone LitBank
!git clone https://github.com/dbamman/litbank.git
import spacy
#Show what version of spaCy we're using
print(f'Using spaCy version {spacy.__version__}')
Next, we create a list of the text files in the litbank/entities/brat
directory and display the number of texts.
#Imports the Path library
from pathlib import Path
#Builds the path to litbank/entities/brat
entities_path = Path.cwd() / 'litbank' / 'entities' / 'brat'
#Creates a list of text files in the path above
text_files = [f for f in entities_path.iterdir() if f.suffix == '.txt']
#Check that all 100 text files are present
assert len(text_files) == 100
#Show how many text files have been imported
print(f'[*] imported {len(text_files)} files')
[*] imported 100 files
21.2. Process LitBank data#
Here, we run each of the LitBank text files through spaCy, but only using the sentencizer (i.e. not all the other components of the default English pipeline, because we want to train a new model, not use its existing predictions). We also extract each of the annotations in the LitBank text files (which should refer to people, places, etc.) and add them to an entity list for that text.
# for each file, create a Doc object and add the annotation data to doc.ents
# our output is a list of Doc objects
#Import spaCy, tqdm, and various utilities from spaCy
import spacy
from tqdm.notebook import tqdm
from spacy.tokens import Span, DocBin
from spacy.util import filter_spans
#Create an empty list to hold the processed Doc objects
docs = []
#Use a blank spaCy model
nlp = spacy.blank("en")
#Add the sentencizer to break the text into sentences
nlp.add_pipe('sentencizer') # used in training assessment
#With each text file, while showing a progress bar
for text_file in tqdm(text_files):
#Read the file and run it through the pipeline
doc = nlp(text_file.read_text())
#Build the path to the corresponding annotation (.ann) file
annotation_file = (entities_path / (text_file.stem +'.ann'))
#Split the annotations by new lines
annotations = annotation_file.read_text().split('\n')
#Create a list for the entities
ents = []
#For each annotation (the slice skips the empty string left by the file's trailing newline)
for annotation in annotations[:-1]:
#Split the data based on tab characters to separate label, start, and end
label, start, end = annotation.split('\t')[1].split()
#Span is the text in the doc corresponding to the annotation
span = doc.char_span(int(start), int(end), label=label)
#Skip annotations whose character offsets don't map to a valid span
if span: # when start and end do not match a valid string, spaCy returns a NoneType span
ents.append(span)
#Removes duplicate or overlapping spans, keeping the longest
filtered = filter_spans(ents)
#The entities we want are the filtered list
doc.ents = filtered
#Append the spaCy-analyzed text to the list of docs
docs.append(doc)
assert len(docs) == 100
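If you're curious about the raw annotation format, the optional peek below prints the first few lines of one .ann file; each line contains an ID, then the label with its start and end character offsets, then the annotated text (this is just an illustrative check, and the exact labels and offsets will vary by file).
#Optional: peek at the first few raw brat-style annotation lines for the first text file
sample_ann = text_files[0].with_suffix('.ann')
print(sample_ann.read_text().split('\n')[:5])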
21.3. Split data into sets for training and validation#
We don’t want to use all the data for training, because that would leave us without any data for checking the model’s accuracy. The training data is what the model actually learns from; the validation data is used during training to choose the best model from multiple training runs; and the test data is held back entirely, serving as the “gold standard” of “right” answers for a final check of how the model performs on text it has never seen.
If you read general-purpose descriptions of the different data sets for model training, you may see references to hyperparameters (like the “learning rate”). spaCy’s built-in model training provides sensible defaults that you don’t necessarily need to modify, but if you’re interested in the details of what could be modified, you can check the documentation about the training config file.
# Split the data into sets for training and validation
from sklearn.model_selection import train_test_split
#Split the data into the training set (90%) and validation set (10%)
train_set, validation_set = train_test_split(docs, test_size=0.1)
#Split the validation set into the actual validation set (70%) and test set (30%)
validation_set, test_set = train_test_split(validation_set, test_size=0.3)
#Print how many docs are in each set
print(f'🚂 Created {len(train_set)} training docs')
print(f'😊 Created {len(validation_set)} validation docs')
print(f'🧪 Created {len(test_set)} test docs')
🚂 Created 90 training docs
😊 Created 7 validation docs
🧪 Created 3 test docs
21.3.1. Save the data sets#
From here, we save the training, validation, and test data sets.
#Import DocBin, a format for saving a collection of spaCy Doc objects
from spacy.tokens import DocBin
#Define a DocBin for training data
train_db = DocBin()
#For each doc in the training set
for doc in train_set:
#Add it to the training DocBin
train_db.add(doc)
#Save the resulting file
train_db.to_disk("./train.spacy")
# Define a DocBin for validation data, and do the same as above
validation_db = DocBin()
for doc in validation_set:
validation_db.add(doc)
validation_db.to_disk("./dev.spacy")
# Define a DocBin for test data, and do the same as above
test_db = DocBin()
for doc in test_set:
test_db.add(doc)
test_db.to_disk("./test.spacy")
Here, we check to make sure the files all exist and are of reasonable sizes given the way we split them (90% for training, with the remaining 10% split into 70% validation and 30% test).
!ls -al train.spacy dev.spacy test.spacy
-rw-r--r-- 1 root root 115753 Dec 23 08:20 dev.spacy
-rw-r--r-- 1 root root 53751 Dec 23 08:20 test.spacy
-rw-r--r-- 1 root root 1406959 Dec 23 08:20 train.spacy
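As an extra sanity check, we can also load one of the saved files back into a DocBin and confirm it holds the expected number of docs. This is a minimal sketch that reuses the blank nlp object's vocabulary from earlier.
#Optional sanity check: reload the training DocBin and count the docs
from spacy.tokens import DocBin
reloaded = DocBin().from_disk('./train.spacy')
print(f'[*] reloaded {len(list(reloaded.get_docs(nlp.vocab)))} training docs')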
21.4. Create training configuration file#
Here, we create the configuration file we’ll need to actually run the training. We’re using the English language, the named-entity recognition (NER) pipeline, and otherwise just the defaults.
!python3 -m spacy init config ./config.cfg --lang en --pipeline ner -F
⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
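If you'd like to see the hyperparameters mentioned earlier, you can load the generated config and inspect it. The sketch below assumes spaCy's default config layout, where the learning rate sits under the [training.optimizer] block and the step limit under [training]; adjust the keys if your config differs.
#Optional: inspect a couple of hyperparameters in the generated config
#(key names assume the default config layout)
from spacy.util import load_config
config = load_config('./config.cfg')
print(config['training']['optimizer']['learn_rate'])
print(config['training']['max_steps'])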
21.5. Model training#
The following code starts the training. The training output goes into a directory called output, and we define the paths to the training (train.spacy) and the validation (dev.spacy) data.
!python3 -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy
ℹ Saving to output directory: output
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2021-12-23 08:22:05,786] [INFO] Set up nlp object from config
[2021-12-23 08:22:05,792] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-23 08:22:05,794] [INFO] Created vocabulary
[2021-12-23 08:22:05,795] [INFO] Finished initializing nlp object
[2021-12-23 08:22:11,376] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------ -------- ------ ------ ------ ------
0 0 0.00 1072.88 0.00 0.00 0.00 0.00
2 200 18889.69 63358.78 35.62 28.97 46.21 0.36
4 400 11414.18 27850.79 53.58 60.92 47.83 0.54
6 600 24399.51 23338.55 54.53 64.79 47.08 0.55
8 800 20359.37 18970.20 57.03 62.17 52.67 0.57
11 1000 6056.26 15459.09 56.76 65.95 49.81 0.57
13 1200 6623.85 13355.36 59.69 62.62 57.02 0.60
15 1400 6450.33 10892.70 63.05 67.52 59.13 0.63
17 1600 8249.34 10429.20 61.55 68.65 55.78 0.62
20 1800 8022.00 8933.03 58.64 59.82 57.52 0.59
22 2000 10772.98 8298.90 59.52 62.85 56.52 0.60
24 2200 9276.31 7258.57 58.76 61.78 56.02 0.59
26 2400 6356.31 5919.32 57.84 60.22 55.65 0.58
28 2600 7711.42 5918.36 58.36 63.77 53.79 0.58
31 2800 11050.54 5581.69 56.84 61.72 52.67 0.57
33 3000 8032.93 4899.75 56.33 60.37 52.80 0.56
✔ Saved pipeline to output directory
output/model-last
21.6. Test the new model#
Finally, we can check how the model we just trained performs, using the test data set for comparison. The closer the model’s results are to the human-annotated test set, the better the model is performing. We’ll start by running the model on a random excerpt from the test set.
#Imports the random library to choose a random excerpt.
import random
#Displacy shows a nice visualization of spaCy data, including entities on text
from spacy import displacy
#Load the model we just trained
new_nlp = spacy.load("output/model-last")
#Pick a random excerpt from the test data set.
val_doc = random.choice(test_set)
#Run the new model on the random excerpt
doc = new_nlp(val_doc.text)
#Show the first 100 tokens of the document with the model's predicted entities highlighted.
displacy.render(doc[:100], jupyter=True, style="ent")
To compare, let’s display the original, human-generated annotations.
# Display the original annotations in the same style
displacy.render(val_doc[:100], jupyter=True, style="ent")
It’s not always easy to see the differences right away: walk through the human-annotated text, entity by entity, and then check what happened with the model at that same point in the text. Some common errors include getting the entity right but the label wrong (e.g. switching LOC/PER), and including too many words in the entity, in addition to just missing the entity entirely.
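For a more systematic look, a small sketch like the one below (reusing the doc and val_doc objects from the cells above) lists the predicted and gold entities side by side, so you can scan for mismatched labels and boundaries.
#Compare the model's predicted entities against the human annotations for the same excerpt
predicted = [(ent.text, ent.label_) for ent in doc.ents]
gold = [(ent.text, ent.label_) for ent in val_doc.ents]
print('Predicted:', predicted[:15])
print('Gold:', gold[:15])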
21.7. Evaluation#
Is the model sufficiently useful for research? What would need to be improved and changed? To help answer these questions, we run spaCy’s built-in evaluation on the held-out test set and save the scores to litbank.json.
!python -m spacy evaluate output/model-last test.spacy --output litbank.json
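The evaluation writes its scores to litbank.json. As a rough sketch (assuming spaCy's standard scorer keys such as ents_p, ents_r, ents_f, and ents_per_type), we can load that file and print the overall and per-label scores, which is a good starting point for judging whether the model is accurate enough for a given research question.
#Load the saved evaluation scores
#(key names assume spaCy's standard scorer output)
import json
with open('litbank.json') as f:
    scores = json.load(f)
print('Precision:', scores['ents_p'])
print('Recall:', scores['ents_r'])
print('F-score:', scores['ents_f'])
#Per-entity-type breakdown, if present
for label, metrics in scores.get('ents_per_type', {}).items():
    print(label, metrics)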