24. 🌿 The New Language Project#
For our workshops, we’ve created a spaCy project file for you that will:
fetch your language data from GitHub
convert and prepare the data for model training
train a model for your new language
package and publish your new model
This workflow can be adapted to meet the specific needs of your project. In this section, we will walk through the various sections and scripts of the project. We’ve made some choices on your behalf. They may be right, or you may want to change things. Let’s see what’s there.
In your language team’s GitHub repository, you’ll find a newlang_project folder that contains a project.yml file. In the next cell, we’re going to clone the repository and fetch the project’s assets. If your repository is private, you’ll need to get a developer key and enter it as git_access_token. Otherwise, just enter your repository’s name under repo_name.
newlang_project
│ README.md
│ project.yml
│
└───scripts
│ convert.py
│ split.py
│ update_config.py
repo_name = "repo-template"
private_repo = False
git_access_token = ""
!rm -rf /content/newlang_project
!rm -rf $repo_name
if private_repo:
    git_url = f"https://{git_access_token}@github.com/New-Languages-for-NLP/{repo_name}/"
    !git clone $git_url -b main
    !cp -r ./$repo_name/newlang_project .
    !mkdir newlang_project/assets/
    !mkdir newlang_project/configs/
    !mkdir newlang_project/corpus/
    !mkdir newlang_project/metrics/
    !mkdir newlang_project/packages/
    !mkdir newlang_project/training/
    !mkdir newlang_project/assets/$repo_name
    !cp -r ./$repo_name/* newlang_project/assets/$repo_name/
    !rm -rf ./$repo_name
else:
    !python -m spacy project clone newlang_project --repo https://github.com/New-Languages-for-NLP/$repo_name --branch main
    !python -m spacy project assets /srv/projects/course-materials/w2/using-inception-data/newlang_project
✔ Cloned 'newlang_project' from New-Languages-for-NLP/repo-template
/srv/projects/course-materials/w2/using-inception-data/newlang_project
✔ Your project is now ready!
To fetch the assets, run:
python -m spacy project assets /srv/projects/course-materials/w2/using-inception-data/newlang_project
ℹ Fetching 1 asset(s)
✔ Downloaded asset
/srv/projects/course-materials/w2/using-inception-data/newlang_project/assets/urban-giggle
Let’s start with the project.yml file in the newlang_project folder.
You’ll find a metadata section that you can update however you like using the yaml format.
title: "Train new language model from cadet and inception data"
description: "This project template lets you train a part-of-speech tagger, morphologizer and dependency parser from your cadet and inception data."
The vars section will have some information that is specific to your team.
vars:
  config: "config"
  lang: "yi"
  treebank: "yiddish"
  test_size: 0.2
  n_sents: 10
  random_state: 11
  package_name: "Yiddish NewNLP Model May 2022"
  package_version: "0.1"
  wandb: true
  gpu: -1
The config setting is the name and location of the config file. We’ll just have config.cfg in the project directory, so nothing fancy here.
lang is the ISO-style abbreviation for your language.
treebank is the name of your language’s repository (and is usually the same as the language name).
test_size is the percentage of data that you want to set aside for model validation and testing. An 80/20 split is a good place to start, so you’ll see it set initially to 0.2. For more, this stackoverflow discussion is very informative.
To evenly distribute your texts between the training and validation datasets, we split each text into blocks of 10 sentences. This is defined by the n_sents variable.
To ensure that the test and train split is consistent and reproducible, we use a number called random_state. More here.
The package_name is used during packaging. It sets the package’s metadata name. Basically, what is the name of your language model?
Similarly, package_version sets the package metadata for version.
spaCy comes with some basic ways to log training data. However, Weights and Biases provides an excellent way to record, manage and share experiment data. You’ll need to pip install wandb, create a free account and get an API key to use wandb. Then change the [training.logger] section in your config file from
@loggers = "spacy.ConsoleLogger.v1"
# to
@loggers = "spacy.WandbLogger.v2"
project_name = "{treebank}"
remove_config_values = []
log_dataset_dir = "./assets"
model_log_interval = 1000  # optional, to save checkpoint files to wandb
Finally, model training with graphics chips (GPUs) is often faster than with a standard CPU. We recommend using Colab for their free GPUs. In such a case, you’d change -1 (CPU) to 0 (the GPU id).
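If you’d rather adjust these values programmatically than edit project.yml by hand, here is a minimal, optional sketch using PyYAML. It assumes PyYAML is installed and that project.yml sits in the newlang_project folder; it is a convenience, not part of the project workflow.
import yaml  # PyYAML, assumed to be installed (pip install pyyaml)

# Read the project file, adjust a couple of vars, and write it back out.
with open("newlang_project/project.yml", encoding="utf8") as f:
    project = yaml.safe_load(f)

project["vars"]["test_size"] = 0.2  # keep an 80/20 train/validation split
project["vars"]["gpu"] = 0          # switch from CPU (-1) to the first GPU

with open("newlang_project/project.yml", "w", encoding="utf8") as f:
    yaml.safe_dump(project, f, sort_keys=False)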
The assets section is configured to use your language repo name to fetch project data from GitHub. It will save all of that data in the assets/your-language-name folder.
The commands section is the heart of the project file. Let’s take some time to understand each command and what it does.
24.1. Install#
The install command in the next cell will read the files in the 2_new_language_object directory and install the customized spaCy language object that you created for your language in Cadet. The language object tells spaCy how to break your texts into tokens and sentence spans.
# Install the custom language object from Cadet
!python -m spacy project run install /srv/projects/course-materials/w2/using-inception-data/newlang_project
================================== install ==================================
Running command: rm -rf lang
Running command: mkdir lang
Running command: mkdir lang/yi
Running command: cp -r assets/urban-giggle/2_new_language_object/ lang/yi/yi
Running command: mv lang/yi/yi/setup.py lang/yi/
Running command: /srv/projects/course-materials/w2/venv/bin/python -m pip install -e lang/yi
Obtaining file:///srv/projects/course-materials/w2/using-inception-data/newlang_project/lang/yi
Installing collected packages: yi
Attempting uninstall: yi
Found existing installation: yi 0.0.0
Uninstalling yi-0.0.0:
Successfully uninstalled yi-0.0.0
Running setup.py develop for yi
Successfully installed yi
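As a quick, optional sanity check, you can confirm that the installed language object is now available to spaCy. This sketch assumes your language code is yi, as in the vars above, and that the install command finished without errors.
import spacy

# If the install step worked, spaCy can create a blank pipeline for the language.
nlp = spacy.blank("yi")

# Tokenization and sentence segmentation come from your Cadet settings.
doc = nlp("Replace this with a short sentence in your language.")
print([token.text for token in doc])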
24.2. Config#
The config command creates a generic config.cfg file (for more on config files, see the Config section in these course materials). It updates the train and dev settings in the config file to point to the train.spacy and dev.spacy files that are created by the split command. If you’re using Weights and Biases, it will also change the training logger.
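For reference, here is a rough sketch of the kind of edit update_config.py makes, written with spaCy’s config utilities. The actual script, file paths and logger settings in your project may differ.
from spacy.util import load_config

config = load_config("config.cfg")

# Point the training and development corpora at the files created by the split step.
config["paths"]["train"] = "corpus/train.spacy"
config["paths"]["dev"] = "corpus/dev.spacy"

# If wandb is enabled in project.yml, this is also where the logger would be swapped:
# config["training"]["logger"] = {"@loggers": "spacy.WandbLogger.v2", "project_name": "yiddish"}

config.to_disk("config.cfg")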
!python -m spacy project run config /srv/projects/course-materials/w2/using-inception-data/newlang_project
=================================== config ===================================
Running command: /srv/projects/course-materials/w2/venv/bin/python -m spacy init config config.cfg --lang yi -F
⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: yi
- Pipeline: tagger, parser, ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
Running command: /srv/projects/course-materials/w2/venv/bin/python scripts/update_config.py urban-giggle false
24.3. Convert#
The convert command will fetch your CoNLL-U and CoNLL 2002 (NER) files from the 3_inception_export folder. It creates a spaCy Doc object for each text and then splits the Doc into separate documents of 10 sentences each. For each text file, the convert script will look for a CoNLL 2002 file with the same name. If that file exists, it will add the named entity data to the existing Doc objects. It then saves all the Docs to disk using the .spacy binary format.
The outcome is a .spacy file for each text that includes the tokenization, sentences, part of speech, lemma, morphology and named entity data.
!python -m spacy project run convert /srv/projects/course-materials/w2/using-inception-data/newlang_project -F
================================== convert ==================================
Running command: /srv/projects/course-materials/w2/venv/bin/python scripts/convert.py assets/urban-giggle/3_inception_export 10 yi
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (49 documents):
corpus/converted/he_htb-ud-dev.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (50 documents):
corpus/converted/he_htb-ud-test.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (525 documents):
corpus/converted/he_htb-ud-train.spacy
24.4. Split#
The split command loads all of the .spacy files and creates a list of Doc objects. We then randomly shuffle them so that different kinds of text are evenly distributed across the corpus. Using a train_test_split function, we divide the corpus into a training set and a validation set. The split is determined by the test_size variable. The model will learn how to make accurate predictions using the training data. We then use the validation set to assess how well the model performs on completely new and unseen data. We want the model to learn general rules and patterns rather than overfitting on one particular set of data. The validation set provides a measure of model improvement as part of the training process. Because the validation data is used during training, it’s no longer useful as an independent measure of the trained model’s performance. So before we get started training, we also set aside 20% of the validation data to make a test set. This final set of totally unseen data lets us measure how well the model has learned what we’ve asked it to learn.
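For a sense of what split.py does, here is a minimal sketch using scikit-learn’s train_test_split with the test_size and random_state values from project.yml. The real script works on the Doc objects loaded from the converted .spacy files and may differ in its details.
from sklearn.model_selection import train_test_split

test_size = 0.2    # from project.yml vars
random_state = 11  # fixed seed so the split is reproducible

# Stand-in for the 624 Doc objects produced by the convert command.
docs = [f"doc_{i}" for i in range(624)]

# First split off the validation data ...
train_docs, dev_docs = train_test_split(docs, test_size=test_size, random_state=random_state)

# ... then hold back 20% of the validation docs as a completely unseen test set.
dev_docs, test_docs = train_test_split(dev_docs, test_size=0.2, random_state=random_state)

print(len(train_docs), len(dev_docs), len(test_docs))  # 499 100 25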
!python -m spacy project run split /srv/projects/course-materials/w2/using-inception-data/newlang_project -F
=================================== split ===================================
Running command: /srv/projects/course-materials/w2/venv/bin/python scripts/split.py 0.2 11 yi
🚂 Created 499 training docs
😊 Created 100 validation docs
🧪 Created 25 test docs
24.5. Debug#
The debug command runs spacy debug data, which provides a good overview of your prepared data. This can help identify problems that will lead to poor model training. It’s a good check-in and moment of reflection on the state of your data before moving forward. For more, see the spaCy docs.
!python -m spacy project run debug /srv/projects/course-materials/w2/using-inception-data/newlang_project
=================================== debug ===================================
Running command: /srv/projects/course-materials/w2/venv/bin/python -m spacy debug data ./config.cfg
============================ Data file validation ============================
✔ Pipeline can be initialized with data
✔ Corpus is loadable
=============================== Training stats ===============================
Language: yi
Training pipeline: tok2vec, tagger, parser, ner
499 training docs
100 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train a new pipeline (499)
============================== Vocab & Vectors ==============================
ℹ 130313 total word(s) in the data (15962 unique)
⚠ 30077 misaligned tokens in the training data
⚠ 5675 misaligned tokens in the dev data
ℹ No word vectors present in the package
========================== Named Entity Recognition ==========================
ℹ 0 label(s)
0 missing value(s) (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace
=========================== Part-of-speech Tagging ===========================
ℹ 15 label(s) in train data
============================= Dependency Parsing =============================
ℹ Found 4714 sentence(s) with an average length of 27.6 words.
ℹ Found 112 nonprojective train sentence(s)
ℹ 36 label(s) in train data
ℹ 52 label(s) in projectivized train data
⚠ Low number of examples for label 'dislocated' (10)
⚠ Low number of examples for label 'csubj' (1)
⚠ Low number of examples for label 'discourse' (2)
⚠ Low number of examples for 13 label(s) in the projectivized
dependency trees used for training. You may want to projectivize labels such as
punct before training in order to improve parser performance.
================================== Summary ==================================
✔ 6 checks passed
⚠ 10 warnings
24.6. Train#
The train command is the moment we’ve all been waiting for. Go ahead and press the launch button! 🚀 This step will train the model using the settings in the config file.
When training begins, you’ll see a bunch of numbers. Let’s make sense of what they’re saying.
You’ll see a list of the components currently being trained: Pipeline: ['tok2vec', 'tagger', 'parser', 'ner']
The tok2vec component produces token embeddings: numerical representations of tokens that the other components can use efficiently. The tagger learns to predict part-of-speech values for your tokens. The parser learns to predict grammatical structure. The ner component learns to predict named entities in the text.
For each of these components, spaCy will print training metrics. So let’s dive into this pile of forbidding verbiage and numbers.
E # LOSS TOK2VEC LOSS TAGGER LOSS PARSER LOSS NER TAG_ACC DEP_UAS DEP_LAS SENTS_F ENTS_F ENTS_P ENTS_R SCORE
The E column refers to the epoch. An epoch is one complete pass of all the data through the model. You can set the number of epochs to complete during training or let spaCy decide automatically when to stop (this is the default).
The # column counts optimization steps (batches of examples). By default, spaCy evaluates the pipeline and prints a new row of scores every 200 steps.
LOSS refers to training loss. Loss is a measure of error. During training, the model tries to improve its predictions, so decreasing loss suggests that the model is learning. If the loss flattens or plateaus, the model has probably stopped learning or reached the best result for a given set of parameters and data. There is a loss measure for each pipeline component being trained: LOSS TOK2VEC, LOSS TAGGER, LOSS PARSER and LOSS NER. If the loss varies greatly and looks like a zigzag, the model is struggling to improve its predictions in a consistent way.
TAG_ACC refers to the accuracy of the tagger component. Accuracy is the number of correct predictions divided by the total number of predictions made.
DEP_UAS and DEP_LAS are the unlabeled attachment score (UAS) and labeled attachment score (LAS) for the dependency parser. UAS measures how often the model attached a token to the correct head; LAS additionally requires the dependency label to be correct.
SENTS_F gives the model’s f-score for sentence segmentation.
ENTS_F, ENTS_P and ENTS_R relate to the model’s predictions of named entities: precision, recall and their harmonic mean, the f-score.
Finally, spaCy logs a SCORE for the model’s predictions overall. This gives a rough number for the model’s overall accuracy. As a general rule, increasing numbers mean that the model is improving. By default, spaCy will end training when the score stops rising.
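To make precision, recall and the f-score concrete, here is a small worked example with invented numbers:
# Suppose the model predicted 80 entities, 60 of which were correct,
# and the evaluation data contains 100 gold-standard entities.
true_positives = 60
predicted = 80
gold = 100

precision = true_positives / predicted  # 0.75: how many predictions were right
recall = true_positives / gold          # 0.60: how many gold entities were found
f_score = 2 * precision * recall / (precision + recall)  # harmonic mean, about 0.67

print(f"P={precision:.2f} R={recall:.2f} F={f_score:.2f}")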
!python -m spacy project run train /srv/projects/course-materials/w2/using-inception-data/newlang_project
=================================== train ===================================
Running command: /srv/projects/course-materials/w2/venv/bin/python -m spacy train config.cfg --output training/urban-giggle --gpu-id -1 --nlp.lang=yi
ℹ Saving to output directory: training/urban-giggle
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2021-12-30 21:32:00,090] [INFO] Set up nlp object from config
[2021-12-30 21:32:00,097] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser']
[2021-12-30 21:32:00,100] [INFO] Created vocabulary
[2021-12-30 21:32:00,101] [INFO] Finished initializing nlp object
[2021-12-30 21:32:04,936] [INFO] Initialized pipeline components: ['tok2vec', 'tagger', 'parser']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'parser']
ℹ Initial learn rate: 0.001
E # LOSS TOK2VEC LOSS TAGGER LOSS PARSER TAG_ACC DEP_UAS DEP_LAS SENTS_F ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------ ----------- ----------- ------- ------- ------- ------- ------ ------ ------ ------
0 0 0.00 145.60 431.08 22.31 3.87 3.16 0.09 0.00 0.00 0.00 0.09
0 200 2489.76 11513.39 26848.70 51.98 22.62 15.10 60.33 0.00 0.00 0.00 0.24
0 400 4685.01 6464.16 22632.98 56.50 25.11 19.89 77.27 0.00 0.00 0.00 0.26
✔ Saved pipeline to output directory
training/urban-giggle/model-last
24.7. Evaluate#
The evaluate command takes the trained model and tests it with the test data. Recall that these are examples that the model has never seen, so they provide the best measure of its performance. The output will be saved as a json file in the metrics folder.
# Evaluate the model
!python -m spacy project run evaluate /srv/projects/course-materials/w2/using-inception-data/newlang_project
================================== evaluate ==================================
Running command: /srv/projects/course-materials/w2/venv/bin/python -m spacy evaluate ./training/urban-giggle/model-best ./corpus/converted/test.spacy --output ./metrics/urban-giggle.json --gpu-id -1
ℹ Using CPU
================================== Results ==================================
TOK 80.58
TAG 55.30
UAS 23.93
LAS 18.56
SENT P 74.31
SENT R 85.60
SENT F 79.55
SPEED 19651
=============================== LAS (per type) ===============================
P R F
det 72.41 3.12 5.97
nsubj 48.22 27.78 35.25
flat:name 26.79 32.97 29.56
root 61.47 56.80 59.04
case:acc 88.46 28.40 42.99
obj 39.29 13.75 20.37
case:gen 86.79 27.88 42.20
nmod:poss 42.31 8.09 13.58
case 89.58 11.78 20.82
obl 43.40 5.24 9.35
nmod 28.00 2.32 4.28
amod 43.48 20.41 27.78
mark 70.27 14.29 23.74
acl:relcl 10.42 5.21 6.94
compound:smixut 53.19 8.71 14.97
dep 28.57 1.87 3.51
fixed 29.41 7.58 12.05
appos 4.26 6.67 5.19
nummod 95.16 52.21 67.43
cop 77.42 44.44 56.47
parataxis 0.00 0.00 0.00
advcl 0.00 0.00 0.00
advmod 59.84 38.38 46.77
ccomp 15.15 14.71 14.93
xcomp 32.20 44.19 37.25
acl 50.00 8.70 14.81
cc 53.85 3.87 7.22
conj 19.67 5.77 8.92
csubj 0.00 0.00 0.00
nsubj:cop 33.33 5.88 10.00
compound:affix 0.00 0.00 0.00
aux 62.50 38.46 47.62
✔ Saved results to metrics/urban-giggle.json
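The saved scores can be read back in for later comparison between experiments, for example:
import json

# The evaluate command wrote its scores to the metrics folder.
with open("metrics/urban-giggle.json", encoding="utf8") as f:
    metrics = json.load(f)

# Print each top-level score (the exact keys depend on the components you trained).
for name, value in metrics.items():
    print(name, value)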
24.8. Package#
Finally, the package command saves your trained model in a single tar file that can be shared and installed on other computers.
!python -m spacy package ./newlang_project/training/urban-giggle/model-last ./export
ℹ Building package artifacts: sdist
✔ Loaded meta.json from file
newlang_project/training/urban-giggle/model-last/meta.json
✔ Generated README.md from meta.json
✔ Successfully created package 'yi_pipeline-0.0.0'
export/yi_pipeline-0.0.0
running sdist
running egg_info
creating yi_pipeline.egg-info
writing yi_pipeline.egg-info/PKG-INFO
writing dependency_links to yi_pipeline.egg-info/dependency_links.txt
writing entry points to yi_pipeline.egg-info/entry_points.txt
writing requirements to yi_pipeline.egg-info/requires.txt
writing top-level names to yi_pipeline.egg-info/top_level.txt
writing manifest file 'yi_pipeline.egg-info/SOURCES.txt'
reading manifest file 'yi_pipeline.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'LICENSE'
warning: no files found matching 'LICENSES_SOURCES'
writing manifest file 'yi_pipeline.egg-info/SOURCES.txt'
running check
warning: check: missing required meta-data: url
warning: check: missing meta-data: either (author and author_email) or (maintainer and maintainer_email) must be supplied
creating yi_pipeline-0.0.0
creating yi_pipeline-0.0.0/yi_pipeline
creating yi_pipeline-0.0.0/yi_pipeline.egg-info
creating yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0
creating yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/parser
creating yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tagger
creating yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tok2vec
creating yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/vocab
copying files to yi_pipeline-0.0.0...
copying MANIFEST.in -> yi_pipeline-0.0.0
copying README.md -> yi_pipeline-0.0.0
copying meta.json -> yi_pipeline-0.0.0
copying setup.py -> yi_pipeline-0.0.0
copying yi_pipeline/__init__.py -> yi_pipeline-0.0.0/yi_pipeline
copying yi_pipeline/meta.json -> yi_pipeline-0.0.0/yi_pipeline
copying yi_pipeline.egg-info/PKG-INFO -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/SOURCES.txt -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/dependency_links.txt -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/entry_points.txt -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/not-zip-safe -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/requires.txt -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/top_level.txt -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline/yi_pipeline-0.0.0/README.md -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0
copying yi_pipeline/yi_pipeline-0.0.0/config.cfg -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0
copying yi_pipeline/yi_pipeline-0.0.0/meta.json -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0
copying yi_pipeline/yi_pipeline-0.0.0/tokenizer -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0
copying yi_pipeline/yi_pipeline-0.0.0/parser/cfg -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/parser
copying yi_pipeline/yi_pipeline-0.0.0/parser/model -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/parser
copying yi_pipeline/yi_pipeline-0.0.0/parser/moves -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/parser
copying yi_pipeline/yi_pipeline-0.0.0/tagger/cfg -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tagger
copying yi_pipeline/yi_pipeline-0.0.0/tagger/model -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tagger
copying yi_pipeline/yi_pipeline-0.0.0/tok2vec/cfg -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tok2vec
copying yi_pipeline/yi_pipeline-0.0.0/tok2vec/model -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tok2vec
copying yi_pipeline/yi_pipeline-0.0.0/vocab/key2row -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/vocab
copying yi_pipeline/yi_pipeline-0.0.0/vocab/lookups.bin -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/vocab
copying yi_pipeline/yi_pipeline-0.0.0/vocab/strings.json -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/vocab
copying yi_pipeline/yi_pipeline-0.0.0/vocab/vectors -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/vocab
Writing yi_pipeline-0.0.0/setup.cfg
creating dist
Creating tar archive
removing 'yi_pipeline-0.0.0' (and everything under it)
✔ Successfully created zipped Python package
export/yi_pipeline-0.0.0/dist/yi_pipeline-0.0.0.tar.gz
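Once the tarball exists, you can install it with pip and load the pipeline by its package name. A minimal sketch follows; the package name comes from the output above, so adjust it if you set a different lang or package_name.
# In a terminal or notebook cell, install the packaged pipeline first:
# pip install export/yi_pipeline-0.0.0/dist/yi_pipeline-0.0.0.tar.gz

import spacy

# Load the installed package by name and run it over some text.
nlp = spacy.load("yi_pipeline")
doc = nlp("Replace this with a sentence in your language.")
print([(token.text, token.pos_, token.dep_) for token in doc])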
Keep in mind that you’ll need some persistence and patience along the way. You’ll probably need to run multiple experiments before you find the right blend of data and parameters to create a final product. The instructors are happy to help along the way, and we look forward to learning together with you. Once all of the commands have run successfully, you will have converted your text annotations from INCEPTION and your language object from Cadet into a trained statistical language model that can be loaded with spaCy for a wide variety of research tasks.