25. Embeddings, Do You Need Them?

25.1. Introduction

Lecture notes and slides will be added after the session.

Note

I find this visualization helps me to make sense of embeddings. It has English word2vec embeddings as well as the multilingual GNMT Interlingua embeddings.

https://projector.tensorflow.org/

25.2. Transformers

In June 2017, researchers from Google published a now famous paper titled “Attention Is All You Need.” In it, they outline a new kind of language model called a Transformer. This architecture builds on many existing methods, but as the title suggests, it gives pride of place to attention. Simply put, attention is the ability to highlight or focus on particular elements over others in the input text. For example, take the sentence: “My date left a novel in the park.” If we ask a machine to predict each token’s part of speech, it needs to look at the surrounding words to disambiguate the many possible meanings. In a sequence model, the model reads the tokens one after another, moving from the beginning to the end of the sentence. Using attention, a model can highlight those elements of the preceding sequence that aid in making predictions. I know that “novel” is most likely a book (and not an adjective) because it is the object of the preceding verb “left.” But at the start of the sentence, we have very little to go on. Is “My date” going to be “My date left” or “My date of birth”?

Transformers overcome this problem by being bidirectional: they can “see” the whole sentence when making predictions. They can also “remember” and attend to attributes across the corpus. This is where transformers become very good at generating text and reproducing the “look and feel” of human languages. The model still has no understanding of the content of the text, but it is very good at capturing the patterns and common traits of the corpus.

I find this visualization helps show what attention looks like in practice. The model is learning to translate from English to French: it attends to corresponding words in the French and English versions of the sentence. This allows it to negotiate the different word order of “zone économique européenne” and “European Economic Area.”

(source)

The training of a transformer is very different from that of traditional sequence models. Rather than providing training data with the correct answers, we train a transformer on massive amounts of unlabelled text. The model learns contextual embeddings by masking words (replacing a word with [MASK]) and using the surrounding text to make a prediction. So our example becomes “My date left a [MASK] in the park.” The model can attend to earlier mentions of novels and books in the text to predict that the date left a novel rather than some other random noun. The Google researchers’ claim that “attention is all you need” highlights how powerful context can be. With enough data (essentially the whole Internet), we can train better models by attending to context rather than by relying on supervised learning with labelled data.
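To make the masking idea concrete, here is a minimal sketch using the Hugging Face transformers library (an assumption on my part: the library is installed, and distilbert-base-uncased is just an illustrative choice of masked language model):

from transformers import pipeline

# Load a masked-language-model pipeline; DistilBERT uses "[MASK]" as its mask token.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

# Ask the model to fill in the blank in our example sentence.
for prediction in fill_mask("My date left a [MASK] in the park."):
    print(prediction["token_str"], round(prediction["score"], 3))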

Transformers are masters of autocomplete. Given a starting prompt, they can continue writing. Not only is the generated text usually legible and coherent, it also includes contextually relevant and sensible content. For live examples, I encourage you to try Write With Transformer and AI Dungeon. On other tasks, these models of language have equally impressive capabilities, even when their contributions are less visible.
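For a quick offline taste of this autocomplete behaviour, here is a small sketch with the same transformers library (assuming it is installed; plain GPT-2 is used purely as an illustration and is not one of the tools linked above):

from transformers import pipeline

# Load a text-generation pipeline with plain GPT-2.
generator = pipeline("text-generation", model="gpt2")

# Continue writing from a starting prompt.
prompt = "My date left a novel in the park, so"
result = generator(prompt, max_length=40, num_return_sequences=1)
print(result[0]["generated_text"])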


26. Using Transformers in spaCy

In the new_langproject/configs folder in your GitHub repository, you’ll find a transformer_config.cfg file. To use this configuration, change the value in your project.yml file from config: "config.cfg" to config: "transformer_config.cfg". The transformer_config.cfg file was generated by the init config command with the --gpu flag. You can also create a similar base config file using the Quickstart widget and selecting GPU under hardware.

  • One of the main changes you’ll see with this configuration is that the tok2vec component is replaced by transformer.

pipeline = ["tok2vec","tagger","parser","ner"]
# vs.
pipeline = ["transformer","tagger","parser","ner"]

Further settings can be found in the [components.transformer] section:

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "distilbert-base-multilingual-cased"
mixed_precision = false

You do not need to change anything here to fetch a pre-trained multilingual transformer from Hugging Face. However, you can use many of the models on the Hugging Face Hub. Find the line below and change it to the desired model.

name = "distilbert-base-multilingual-cased"

For example, to switch to a model trained on Harry Potter fan fiction, I’d visit the model’s page on the Hugging Face Hub.

If you click to the right of the model title, you can copy the model name: ceostroff/harry-potter-gpt2-fanfiction. Replace the name value in [components.transformer.model] and spaCy will load this model.
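After that edit, the [components.transformer.model] block in transformer_config.cfg would look something like this (only the name value changes):

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "ceostroff/harry-potter-gpt2-fanfiction"
mixed_precision = false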

Please note that not all models are compatible with spaCy’s architecture, and this should be considered a highly advanced topic. Transformer models loaded into spaCy are used only for their contextual embeddings. The Harry Potter model will work well with similar texts and domains, but it will not generate text. If you find a model that has been fine-tuned for a specific task, spaCy will inherit only the underlying model, not the task-specific layers. All of spaCy’s own pre-trained transformer pipelines should work out of the box.
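To see what “only used for their contextual embeddings” means in practice, here is a rough sketch (assuming one of spaCy’s own transformer pipelines, such as en_core_web_trf, is installed; the doc._.trf_data attribute comes from spacy-transformers, and its exact contents can vary between versions):

import spacy

# Load a transformer-based pipeline instead of a tok2vec-based one.
nlp = spacy.load("en_core_web_trf")
doc = nlp("My date left a novel in the park.")

# 'transformer' appears in the pipeline where 'tok2vec' used to be.
print(nlp.pipe_names)

# The contextual embeddings produced by the transformer are stored on the doc.
print(doc._.trf_data.tensors[0].shape)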

If your activation function’s perplexity mixes Bayesian logits against cross-entropy layers, simply reshape your tensor with a nonlinearity using Infinidash. Just joking, that’s all jargon salad.

Note

What if a transformer model does not exist for my language, domain or anything like it?

  1. You can try transfer learning by using an existing multilingual transformer or a transformer for a related language.

  2. Train a transformer from scratch using the OSCAR dataset, which contains web text in 166 languages (see the sketch after this list for loading it). Note that the amount of text available varies a lot from language to language, and you are depending on automatic language identification. Still, this is a not entirely crazy way to train your own transformer from scratch. You have been warned! AJ update 14/1: the notebook that I shared earlier trains a type of model that is not compatible with spaCy at the moment. I am going to create a notebook that does the same but trains a PyTorch RoBERTa model that should be compatible. Stay tuned and let me know if this would be useful for your project.
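Here is a rough sketch of pulling the raw text for one language from OSCAR with the Hugging Face datasets library (assumptions: the library is installed, Yoruba is just an illustrative choice, and the exact configuration names and loading details are documented on the dataset card and may differ between datasets versions):

from datasets import load_dataset

# Load the deduplicated Yoruba portion of OSCAR; configuration names are listed on the dataset card.
oscar_yo = load_dataset("oscar", "unshuffled_deduplicated_yo", split="train")

# Each record is a dictionary with an "id" and a "text" field.
print(oscar_yo[0]["text"][:200])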

Also, the notebook that I shared does not contain the code you’ll need to save the model. You’ll need to add the following:

  1. !curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash

  2. !apt-get install git-lfs

  3. !huggingface-cli login (you’ll need to use a token from Hugging Face)

  4. !huggingface-cli repo create your-language-name

  5. model.save_pretrained('your-user/your-language-name', push_to_hub=True)

  6. tokenizer.save_pretrained("your-user/your-language-name", push_to_hub=True)

When training transformer-based models, you should use GPUs or TPUs on Colab and be prepared for significantly longer training times (seven hours or more). However, for many tasks, the benefits of pre-training and transfer learning should be very clear in the model metrics.
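For orientation, running spaCy’s training command on a Colab GPU looks roughly like this (a sketch: in the course project the real command and paths are defined in project.yml, so the file names and output folder below are placeholders):

# Install spaCy with transformer support; the exact GPU/CUDA setup depends on your environment.
!pip install -U "spacy[transformers]"

# Train with the transformer config on the first GPU.
!python -m spacy train configs/transformer_config.cfg --output training/ --gpu-id 0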