10. Cadet

Each language team has a copy of Cadet for use during the workshops.

arabic.slovo.world
chinese.slovo.world 
kanbun.slovo.world
kannada.slovo.world
turkish.slovo.world
quechua.slovo.world
russian.slovo.world
tigrinya.slovo.world
yiddish.slovo.world
yoruba.slovo.world 

10.1. Overview of Cadet


10.2. Main Page


10.3. Steps One to Three


10.4. Steps Four to Six


10.5. How do you add new punctuation in spaCy?

spaCy distinguishes several types of punctuation rules that are relevant for tokenization. They are described in the spaCy documentation: https://spacy.io/usage/linguistic-features#tokenization

The tokenizer works with four kinds of rules:

Tokenizer exception: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.
Prefix: Character(s) at the beginning, e.g. $, (, “, ¿.
Suffix: Character(s) at the end, e.g. km, ), ”, !.
Infix: Character(s) in between, e.g. -, –, /, ….

In the punctuation file in Cadet, you see something like this:

_prefixes = BASE_TOKENIZER_PREFIXES
_suffixes = BASE_TOKENIZER_SUFFIXES
_infixes = BASE_TOKENIZER_INFIXES
TOKENIZER_PREFIXES = _prefixes
TOKENIZER_SUFFIXES = _suffixes
TOKENIZER_INFIXES = _infixes

You can extend each of the BASE_TOKENIZER_* lists with additional characters. To see what this looks like, compare the English punctuation rules that ship with spaCy (in the explosion/spaCy repository): there, LIST_ELLIPSES is added to _infixes. This is what LIST_ELLIPSES looks like:

LIST_ELLIPSES = [r"\.\.+", "…"]
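To make the idea of extending a list more concrete, here is a minimal, standard-library-only sketch of how a list of infix patterns might be extended and then used to split a chunk of text. It is a simplified model, not spaCy's actual tokenizer: the local BASE_TOKENIZER_INFIXES is a short stand-in for the much longer list that spaCy provides, EXTRA_INFIXES and both helper functions are hypothetical names chosen for illustration, and the real tokenizer also applies prefixes, suffixes and exceptions. In spaCy itself, joining the patterns into one regular expression is handled by spacy.util.compile_infix_regex.

```python
import re

# Stand-in for the base list that the punctuation file gets from spaCy.
# The real list is much longer; these two patterns are LIST_ELLIPSES.
BASE_TOKENIZER_INFIXES = [r"\.\.+", "…"]

# Hypothetical additions, for illustration only:
# the Ethiopic wordspace and a forward slash.
EXTRA_INFIXES = ["፡", "/"]

# Same shape as the punctuation file: extend the base list, then export it.
_infixes = BASE_TOKENIZER_INFIXES + EXTRA_INFIXES
TOKENIZER_INFIXES = _infixes


def compile_infixes(patterns):
    """Join the pattern strings into one alternation, skipping blank entries."""
    return re.compile("|".join(p for p in patterns if p.strip()))


def split_on_infixes(text, infix_re):
    """Split one whitespace-free chunk around every infix match."""
    pieces = []
    start = 0
    for match in infix_re.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


base_re = compile_infixes(BASE_TOKENIZER_INFIXES)
extended_re = compile_infixes(TOKENIZER_INFIXES)

# With only the base patterns the wordspace is not a split point;
# with the extended list it becomes a token of its own.
print(split_on_infixes("ሰላም፡ዓለም", base_re))
print(split_on_infixes("ሰላም፡ዓለም", extended_re))
print(split_on_infixes("wait...what", extended_re))
```

Each entry in these lists is a regular expression, so characters with a special meaning in regular expressions (such as ".", "(" or "?") need to be escaped, as LIST_ELLIPSES does with r"\.\.+".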