24. 🌿 The New Language Project#

Open In Colab

For our workshops, we’ve created a spaCy project file for you that will:

  • fetch your language data from GitHub

  • convert and prepare the data for model training

  • train a model for your new language

  • package and publish your new model

This workflow can be adapted to meet the specific needs of your project. In this section, we will walk through the various sections and scripts of the project. We’ve made some choices on your behalf. They may be right, or you may want to change things. Let’s see what’s there.

In your language team’s GitHub repository, you’ll find a newlang_project folder that contains a project.yml file. In the next cell, we’re going to clone the repository and fetch the project’s assets. If your repository is private, you’ll need to generate a personal access token and enter it as git_access_token. Otherwise, just enter your repository’s name under repo_name.

newlang_project
│   README.md
│   project.yml    
│
└───scripts
    │   convert.py
    │   split.py
    │   update_config.py
repo_name = "repo-template"
private_repo = False
git_access_token = ""

!rm -rf /content/newlang_project
!rm -rf $repo_name
if private_repo:
    git_url = f"https://{git_access_token}@github.com/New-Languages-for-NLP/{repo_name}/"
    !git clone $git_url  -b main
    !cp -r ./$repo_name/newlang_project .  
    !mkdir newlang_project/assets/
    !mkdir newlang_project/configs/
    !mkdir newlang_project/corpus/
    !mkdir newlang_project/metrics/
    !mkdir newlang_project/packages/
    !mkdir newlang_project/training/
    !mkdir newlang_project/assets/$repo_name
    !cp -r ./$repo_name/* newlang_project/assets/$repo_name/
    !rm -rf ./$repo_name
else:
    !python -m spacy project clone newlang_project --repo https://github.com/New-Languages-for-NLP/$repo_name --branch main
    !python -m spacy project assets /srv/projects/course-materials/w2/using-inception-data/newlang_project
✔ Cloned 'newlang_project' from New-Languages-for-NLP/repo-template
/srv/projects/course-materials/w2/using-inception-data/newlang_project
✔ Your project is now ready!
To fetch the assets, run:
python -m spacy project assets /srv/projects/course-materials/w2/using-inception-data/newlang_project
ℹ Fetching 1 asset(s)
✔ Downloaded asset
/srv/projects/course-materials/w2/using-inception-data/newlang_project/assets/urban-giggle

Let’s start with the project.yml file in the newlang_project folder.

You’ll find a metadata section that you can update however you like using YAML format.

title: "Train new language model from cadet and inception data"
description: "This project template lets you train a part-of-speech tagger, morphologizer and dependency parser from your cadet and inception data."

The vars section will have some information that is specific to your team.

vars:
  config: "config"
  lang: "yi"
  treebank: "yiddish"
  test_size: 0.2
  n_sents: 10
  random_state: 11
  package_name: "Yiddish NewNLP Model May 2022"
  package_version: "0.1"
  wandb: true 
  gpu: -1
  • The config setting is the name and location of the config file. We’ll just have config.cfg in the project directory, so nothing fancy here.

  • lang is the ISO-style abbreviation for your language.

  • treebank is the name of your language’s repository (and is usually the same as the language name).

  • test_size is the percentage of data that you want to set aside for model validation and testing. An 80/20 split is a good place to start, so you’ll see it set initially to 0.2. For more, this Stack Overflow discussion is very informative.

  • To evenly distribute your texts between the training and validation datasets, we split each text into blocks of 10 sentences. This is defined by the n_sents variable.

  • To ensure that the test and train split is consistent and reproducible, we use a fixed seed value called random_state. More here.

  • The package_name is used during packaging. It sets the package’s metadata name. Basically, what is the name of your language model?

  • Similarly, package_version sets the package metadata for version.

  • spaCy comes with some basic ways to log training data. However, Weights and Biases provides an excellent way to record, manage and share experiment data. You’ll need to pip install wandb, create a free account and get an API key to use it. Then change the [training.logger] section in your config file from

@loggers = "spacy.ConsoleLogger.v1"

# to 

@loggers = "spacy.WandbLogger.v2"
project_name = "{treebank}"
remove_config_values = []
log_dataset_dir = "./assets"
model_log_interval = 1000  # optional: save checkpoint files to wandb
  • Finally, model training on graphics processing units (GPUs) is often faster than on a standard CPU. We recommend using Colab for its free GPUs. In that case, you’d change -1 (CPU) to 0 (the GPU id).
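One of the vars above, n_sents, controls how each text is grouped into blocks of sentences before splitting. The idea can be sketched in a few lines of Python (an illustration, not the project’s actual conversion code):

```python
def group_sentences(sentences, n_sents=10):
    """Split a list of sentences into blocks of n_sents; the last block may be shorter."""
    return [sentences[i:i + n_sents] for i in range(0, len(sentences), n_sents)]

# 25 sentences become blocks of 10, 10 and 5
sents = [f"sentence {i}" for i in range(25)]
print([len(block) for block in group_sentences(sents)])  # [10, 10, 5]
```

Grouping into fixed-size blocks means that long and short source texts contribute comparable units to the shuffle, so no single text dominates either side of the split.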

The assets section is configured to use your language repo name to fetch project data from GitHub. It saves all that data in the assets/your-language-name folder.

The commands section is the heart of the project file. Let’s take some time to understand each command and what it does.


24.1. Install#

The install command in the next cell will read the files in the 2_new_language_object directory and will install the customized spaCy language object that you created for your language in Cadet. The language object will tell spaCy how to break your texts into tokens and sentence spans.

# Install the custom language object from Cadet 
!python -m spacy project run install /srv/projects/course-materials/w2/using-inception-data/newlang_project

================================== install ==================================
Running command: rm -rf lang
Running command: mkdir lang
Running command: mkdir lang/yi
Running command: cp -r assets/urban-giggle/2_new_language_object/ lang/yi/yi
Running command: mv lang/yi/yi/setup.py lang/yi/
Running command: /srv/projects/course-materials/w2/venv/bin/python -m pip install -e lang/yi
Obtaining file:///srv/projects/course-materials/w2/using-inception-data/newlang_project/lang/yi
Installing collected packages: yi
  Attempting uninstall: yi
    Found existing installation: yi 0.0.0
    Uninstalling yi-0.0.0:
      Successfully uninstalled yi-0.0.0
  Running setup.py develop for yi
Successfully installed yi

24.2. Config#

The config command creates a generic config.cfg file (for more on config files, see the Config section in these course materials). It updates the train and dev settings in the config file to point to the train.spacy and dev.spacy files that are created by the split command. If you’re using Weights and Biases, it will also change the training logger.

!python -m spacy project run config /srv/projects/course-materials/w2/using-inception-data/newlang_project

=================================== config ===================================
Running command: /srv/projects/course-materials/w2/venv/bin/python -m spacy init config config.cfg --lang yi -F
⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: yi
- Pipeline: tagger, parser, ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
Running command: /srv/projects/course-materials/w2/venv/bin/python scripts/update_config.py urban-giggle false

24.3. Convert#

Convert fetches your CoNLL-U and CoNLL 2002 (NER) files from the 3_inception_export folder. It creates a spaCy Doc object for each text and then splits the Doc into separate documents of 10 sentences each. For each text file, the convert script looks for a CoNLL 2002 file with the same name. If that file exists, the script adds its named-entity data to the existing Doc objects. It then saves all the Docs to disk using the .spacy binary format. The outcome is a .spacy file for each text that includes the tokenization, sentence boundaries, part-of-speech, lemma, morphology and named-entity data.
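The name-matching step can be illustrated with a small sketch. The file extensions below are assumptions for illustration; check your 3_inception_export folder for the ones the convert script actually expects:

```python
import tempfile
from pathlib import Path

def find_ner_file(conllu_path: Path, ner_suffix: str = ".conll2002"):
    """Return the NER file that shares a name with the CoNLL-U file, or None."""
    candidate = conllu_path.with_suffix(ner_suffix)
    return candidate if candidate.exists() else None

# Demo with a throwaway folder: text_a has a matching NER file, text_b does not
with tempfile.TemporaryDirectory() as d:
    folder = Path(d)
    (folder / "text_a.conllu").write_text("...")
    (folder / "text_a.conll2002").write_text("...")
    (folder / "text_b.conllu").write_text("...")
    for conllu in sorted(folder.glob("*.conllu")):
        ner = find_ner_file(conllu)
        print(conllu.name, "->", ner.name if ner else "no NER file")
```

Texts without a matching NER file still get converted; they simply carry no named-entity annotations.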

!python -m spacy project run convert /srv/projects/course-materials/w2/using-inception-data/newlang_project -F

================================== convert ==================================
Running command: /srv/projects/course-materials/w2/venv/bin/python scripts/convert.py assets/urban-giggle/3_inception_export 10 yi
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (49 documents):
corpus/converted/he_htb-ud-dev.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (50 documents):
corpus/converted/he_htb-ud-test.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (525 documents):
corpus/converted/he_htb-ud-train.spacy

24.4. Split#

The split command loads all of the .spacy files and creates a list of Doc objects. We then randomly shuffle them so that different kinds of text are evenly distributed across the corpus. Using a test_train_split function, we divide the corpus into a training and validation set. The split is determined by the test_size variable. The model learns how to make accurate predictions using the training data. We then use the validation set to assess how well the model performs on completely new and unseen data. We want the model to learn general rules and patterns rather than overfitting on one particular set of data. The validation set provides a measure of model improvement during the training process. Because the model has seen this data during training, it’s no longer useful as a tool to evaluate the trained model’s performance. So before we start training, we also set aside 20% of the validation data to make a test set. This final set of totally unseen data lets us measure how well the model has learned what we’ve asked it to learn.
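The logic of the split can be sketched with the standard library. This is an illustration of the idea, not the project’s actual split.py (which likely uses scikit-learn’s train_test_split):

```python
import random

def train_dev_test_split(docs, test_size=0.2, random_state=11):
    """Shuffle docs reproducibly, hold out test_size for validation,
    then set aside 20% of the validation docs as a final test set."""
    docs = list(docs)
    random.Random(random_state).shuffle(docs)  # same seed -> same shuffle every run
    n_dev = int(len(docs) * test_size)
    train, dev = docs[n_dev:], docs[:n_dev]
    n_test = int(len(dev) * 0.2)
    test, dev = dev[:n_test], dev[n_test:]
    return train, dev, test

train, dev, test = train_dev_test_split(range(625), test_size=0.2, random_state=11)
print(len(train), len(dev), len(test))  # 500 100 25
```

Because the seed is fixed, rerunning the split always produces the same three sets, so experiments remain comparable.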

!python -m spacy project run split /srv/projects/course-materials/w2/using-inception-data/newlang_project -F

=================================== split ===================================
Running command: /srv/projects/course-materials/w2/venv/bin/python scripts/split.py 0.2 11 yi
🚂 Created 499 training docs
😊 Created 100 validation docs
🧪  Created 25 test docs

24.5. Debug#

The debug command runs spacy debug data, which provides a good overview of your prepared data. This can help identify problems that will lead to poor model training. It’s a good check-in and moment of reflection on the state of your data before moving forward. For more, see the spaCy docs.

!python -m spacy project run debug  /srv/projects/course-materials/w2/using-inception-data/newlang_project

=================================== debug ===================================
Running command: /srv/projects/course-materials/w2/venv/bin/python -m spacy debug data ./config.cfg

============================ Data file validation ============================
✔ Pipeline can be initialized with data
✔ Corpus is loadable

=============================== Training stats ===============================
Language: yi
Training pipeline: tok2vec, tagger, parser, ner
499 training docs
100 evaluation docs
✔ No overlap between training and evaluation data
⚠ Low number of examples to train a new pipeline (499)

============================== Vocab & Vectors ==============================
ℹ 130313 total word(s) in the data (15962 unique)
⚠ 30077 misaligned tokens in the training data
⚠ 5675 misaligned tokens in the dev data
ℹ No word vectors present in the package

========================== Named Entity Recognition ==========================
ℹ 0 label(s)
0 missing value(s) (tokens with '-' label)
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔ No entities consisting of or starting/ending with whitespace

=========================== Part-of-speech Tagging ===========================
ℹ 15 label(s) in train data

============================= Dependency Parsing =============================
ℹ Found 4714 sentence(s) with an average length of 27.6 words.
ℹ Found 112 nonprojective train sentence(s)
ℹ 36 label(s) in train data
ℹ 52 label(s) in projectivized train data
⚠ Low number of examples for label 'dislocated' (10)
⚠ Low number of examples for label 'csubj' (1)
⚠ Low number of examples for label 'discourse' (2)
⚠ Low number of examples for 13 label(s) in the projectivized
dependency trees used for training. You may want to projectivize labels such as
punct before training in order to improve parser performance.

================================== Summary ==================================
✔ 6 checks passed
⚠ 10 warnings

24.6. Train#

The train command is the moment we’ve all been waiting for. Go ahead and press the launch button! 🚀 This step will train the model using the settings in the config file.

When training begins, you’ll see a bunch of numbers. Let’s make sense of what they’re saying.

You’ll see a list of the components currently being trained: Pipeline: ['tok2vec', 'tagger', 'parser', 'ner']. The tok2vec component learns token embeddings, numerical representations of tokens that the other components can use efficiently. The tagger learns to predict part-of-speech values for your tokens. The parser learns to predict grammatical structure. The ner component learns to predict named entities in the text.

For each of these components, spaCy will print training metrics. So let’s dive into this pile of forbidding verbiage and numbers.

E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS PARSER  LOSS NER  TAG_ACC  DEP_UAS  DEP_LAS  SENTS_F  ENTS_F  ENTS_P  ENTS_R  SCORE
  • The E refers to the epoch. An epoch is one complete pass of all the data through the model. You can set the number of epochs to complete during training or let spaCy optimize the number of epochs automatically (this is the default).

  • The # column counts optimization steps (batches of examples the model has processed). By default, spaCy evaluates on the validation data and prints a new row of scores every 200 steps.

  • LOSS refers to training loss, a measure of error. During training, the model tries to improve its predictions, so decreasing loss suggests that the model is learning. If the loss flattens or plateaus, the model has probably stopped learning or reached the best result for a given set of parameters and data. If the loss varies greatly and looks like a zigzag, the model is struggling to improve its predictions in a consistent manner. You will find a loss measure for each of the pipeline components being trained: LOSS TOK2VEC, LOSS TAGGER, LOSS PARSER and LOSS NER.

  • TAG_ACC refers to the accuracy of the tagger component. Accuracy is the number of correct predictions divided by the total number of predictions made.

  • DEP_UAS and DEP_LAS are the unlabeled attachment score (UAS) and labeled attachment score (LAS) for the dependency parser. UAS measures how often the model attaches a token to its correct head; LAS additionally requires the correct dependency label on that attachment.

  • SENTS_F gives the model’s f-score for sentence prediction.

  • ENTS_F  ENTS_P  ENTS_R relate to the model’s predictions of named entities. The f-score (ENTS_F) is the harmonic mean of precision (ENTS_P) and recall (ENTS_R).

  • Finally, spaCy logs a SCORE for the model’s predictions overall. This gives a rough number for the model’s overall accuracy. As a general rule, rising numbers mean that the model is improving. By default, spaCy ends training when the score stops improving.
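Since the f-score appears in several of these columns, here is the arithmetic behind it, the harmonic mean of precision and recall:

```python
def f_score(precision, recall):
    """F1: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A component with 80% precision and 60% recall
print(round(f_score(0.80, 0.60), 4))  # 0.6857
```

The harmonic mean punishes imbalance: a model with perfect precision but near-zero recall still gets a near-zero f-score, which is why it is preferred over a simple average for these columns.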

!python -m spacy project run train /srv/projects/course-materials/w2/using-inception-data/newlang_project

=================================== train ===================================
Running command: /srv/projects/course-materials/w2/venv/bin/python -m spacy train config.cfg --output training/urban-giggle --gpu-id -1 --nlp.lang=yi
ℹ Saving to output directory: training/urban-giggle
ℹ Using CPU

=========================== Initializing pipeline ===========================
[2021-12-30 21:32:00,090] [INFO] Set up nlp object from config
[2021-12-30 21:32:00,097] [INFO] Pipeline: ['tok2vec', 'tagger', 'parser']
[2021-12-30 21:32:00,100] [INFO] Created vocabulary
[2021-12-30 21:32:00,101] [INFO] Finished initializing nlp object
[2021-12-30 21:32:04,936] [INFO] Initialized pipeline components: ['tok2vec', 'tagger', 'parser']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['tok2vec', 'tagger', 'parser']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS PARSER  TAG_ACC  DEP_UAS  DEP_LAS  SENTS_F  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  -----------  -----------  -------  -------  -------  -------  ------  ------  ------  ------
  0       0          0.00       145.60       431.08    22.31     3.87     3.16     0.09    0.00    0.00    0.00    0.09
  0     200       2489.76     11513.39     26848.70    51.98    22.62    15.10    60.33    0.00    0.00    0.00    0.24
  0     400       4685.01      6464.16     22632.98    56.50    25.11    19.89    77.27    0.00    0.00    0.00    0.26
✔ Saved pipeline to output directory
training/urban-giggle/model-last

24.7. Evaluate#

The evaluate command takes the trained model and tests it with the test data. Recall that these are examples that the model has never seen, so they provide the best measure of its performance. The output will be saved as a json file in the metrics folder.
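You can read the saved metrics back in with the standard library to compare runs. A minimal sketch; the key names below are illustrative, so inspect your own JSON file for the exact keys spaCy writes:

```python
import json
import tempfile
from pathlib import Path

# Write, then read back, a metrics file shaped like the one spacy evaluate saves
sample = {"token_acc": 0.8058, "tag_acc": 0.5530, "dep_uas": 0.2393, "dep_las": 0.1856}
with tempfile.TemporaryDirectory() as d:
    path = Path(d) / "urban-giggle.json"
    path.write_text(json.dumps(sample, indent=2))
    metrics = json.loads(path.read_text())
    for name, value in metrics.items():
        print(f"{name:10} {value:.2%}")
```

Keeping one JSON file per experiment in the metrics folder makes it easy to script a side-by-side comparison of runs later.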

# Evaluate the model 
!python -m spacy project run evaluate /srv/projects/course-materials/w2/using-inception-data/newlang_project

================================== evaluate ==================================
Running command: /srv/projects/course-materials/w2/venv/bin/python -m spacy evaluate ./training/urban-giggle/model-best ./corpus/converted/test.spacy --output ./metrics/urban-giggle.json --gpu-id -1
ℹ Using CPU

================================== Results ==================================

TOK      80.58
TAG      55.30
UAS      23.93
LAS      18.56
SENT P   74.31
SENT R   85.60
SENT F   79.55
SPEED    19651


=============================== LAS (per type) ===============================

                      P       R       F
det               72.41    3.12    5.97
nsubj             48.22   27.78   35.25
flat:name         26.79   32.97   29.56
root              61.47   56.80   59.04
case:acc          88.46   28.40   42.99
obj               39.29   13.75   20.37
case:gen          86.79   27.88   42.20
nmod:poss         42.31    8.09   13.58
case              89.58   11.78   20.82
obl               43.40    5.24    9.35
nmod              28.00    2.32    4.28
amod              43.48   20.41   27.78
mark              70.27   14.29   23.74
acl:relcl         10.42    5.21    6.94
compound:smixut   53.19    8.71   14.97
dep               28.57    1.87    3.51
fixed             29.41    7.58   12.05
appos              4.26    6.67    5.19
nummod            95.16   52.21   67.43
cop               77.42   44.44   56.47
parataxis          0.00    0.00    0.00
advcl              0.00    0.00    0.00
advmod            59.84   38.38   46.77
ccomp             15.15   14.71   14.93
xcomp             32.20   44.19   37.25
acl               50.00    8.70   14.81
cc                53.85    3.87    7.22
conj              19.67    5.77    8.92
csubj              0.00    0.00    0.00
nsubj:cop         33.33    5.88   10.00
compound:affix     0.00    0.00    0.00
aux               62.50   38.46   47.62

✔ Saved results to metrics/urban-giggle.json

24.8. Package#

Finally, the package command saves your trained model in a single tar file that can be shared and installed on other computers.

!python -m spacy package ./newlang_project/training/urban-giggle/model-last ./export 
ℹ Building package artifacts: sdist
✔ Loaded meta.json from file
newlang_project/training/urban-giggle/model-last/meta.json
✔ Generated README.md from meta.json
✔ Successfully created package 'yi_pipeline-0.0.0'
export/yi_pipeline-0.0.0
running sdist
running egg_info
creating yi_pipeline.egg-info
writing yi_pipeline.egg-info/PKG-INFO
writing dependency_links to yi_pipeline.egg-info/dependency_links.txt
writing entry points to yi_pipeline.egg-info/entry_points.txt
writing requirements to yi_pipeline.egg-info/requires.txt
writing top-level names to yi_pipeline.egg-info/top_level.txt
writing manifest file 'yi_pipeline.egg-info/SOURCES.txt'
reading manifest file 'yi_pipeline.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching 'LICENSE'
warning: no files found matching 'LICENSES_SOURCES'
writing manifest file 'yi_pipeline.egg-info/SOURCES.txt'
running check
warning: check: missing required meta-data: url

warning: check: missing meta-data: either (author and author_email) or (maintainer and maintainer_email) must be supplied

creating yi_pipeline-0.0.0
creating yi_pipeline-0.0.0/yi_pipeline
creating yi_pipeline-0.0.0/yi_pipeline.egg-info
creating yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0
creating yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/parser
creating yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tagger
creating yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tok2vec
creating yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/vocab
copying files to yi_pipeline-0.0.0...
copying MANIFEST.in -> yi_pipeline-0.0.0
copying README.md -> yi_pipeline-0.0.0
copying meta.json -> yi_pipeline-0.0.0
copying setup.py -> yi_pipeline-0.0.0
copying yi_pipeline/__init__.py -> yi_pipeline-0.0.0/yi_pipeline
copying yi_pipeline/meta.json -> yi_pipeline-0.0.0/yi_pipeline
copying yi_pipeline.egg-info/PKG-INFO -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/SOURCES.txt -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/dependency_links.txt -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/entry_points.txt -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/not-zip-safe -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/requires.txt -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline.egg-info/top_level.txt -> yi_pipeline-0.0.0/yi_pipeline.egg-info
copying yi_pipeline/yi_pipeline-0.0.0/README.md -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0
copying yi_pipeline/yi_pipeline-0.0.0/config.cfg -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0
copying yi_pipeline/yi_pipeline-0.0.0/meta.json -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0
copying yi_pipeline/yi_pipeline-0.0.0/tokenizer -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0
copying yi_pipeline/yi_pipeline-0.0.0/parser/cfg -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/parser
copying yi_pipeline/yi_pipeline-0.0.0/parser/model -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/parser
copying yi_pipeline/yi_pipeline-0.0.0/parser/moves -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/parser
copying yi_pipeline/yi_pipeline-0.0.0/tagger/cfg -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tagger
copying yi_pipeline/yi_pipeline-0.0.0/tagger/model -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tagger
copying yi_pipeline/yi_pipeline-0.0.0/tok2vec/cfg -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tok2vec
copying yi_pipeline/yi_pipeline-0.0.0/tok2vec/model -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/tok2vec
copying yi_pipeline/yi_pipeline-0.0.0/vocab/key2row -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/vocab
copying yi_pipeline/yi_pipeline-0.0.0/vocab/lookups.bin -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/vocab
copying yi_pipeline/yi_pipeline-0.0.0/vocab/strings.json -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/vocab
copying yi_pipeline/yi_pipeline-0.0.0/vocab/vectors -> yi_pipeline-0.0.0/yi_pipeline/yi_pipeline-0.0.0/vocab
Writing yi_pipeline-0.0.0/setup.cfg
creating dist
Creating tar archive
removing 'yi_pipeline-0.0.0' (and everything under it)
✔ Successfully created zipped Python package
export/yi_pipeline-0.0.0/dist/yi_pipeline-0.0.0.tar.gz

Keep in mind that you’ll need some persistence and patience along the way. You’ll probably have to run multiple experiments before you find the right blend of data and parameters to create a final product. The instructors are happy to help, and we look forward to learning together with you. Once all the commands have run successfully, you will have converted your text annotations from INCEpTION and your language object from Cadet into a trained statistical language model that can be loaded with spaCy for a wide variety of research tasks.