3. How to Train a spaCy NER Model

Dr. W.J.B. Mattingly
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
January 2021

3.1. Key Concepts in this Notebook

  1. the training process in spaCy

3.2. Introduction to Training a Machine Learning Model in spaCy

In the last notebook, we created a basic training set for a machine learning model using spaCy’s EntityRuler. We did this by making certain assumptions about things that are very likely or certain to fall under a specific label. Such an approach to cultivating a training set is, by its nature, problematic: it will miss some entities and falsely label others. If you wish to use this as the essential training set for a final model, I encourage a manual check. If, however, you want to use the resulting model as a baseline for cultivating a better training set via Prodigy, this method will also work.

In this notebook, we will not be interested in refining this training set, but rather in using it to train a custom spaCy machine learning NER model. The methods, therefore, will receive the chief focus in this notebook, not the results.

In notebook 03, we first met machine learning and some of its fundamentals. If you have not viewed that notebook and the videos within, I encourage you to do so before working through this one, as I will assume you have a basic understanding of machine learning.

The reason I prefer spaCy over other NLP frameworks is spaCy’s ability to scale well (to work on both small and big data) and its easy-to-use training process. An NER practitioner does not have to create a custom neural network in PyTorch/FastAI or TensorFlow/Keras, which have steep learning curves despite being among the easiest frameworks of their kind. Instead, users of spaCy can take advantage of the predesigned CNN architecture behind the spaCy training process. In version 3.0 of spaCy (available as a nightly build at the time of writing), due out in early 2021, users will also be able to customize this neural network architecture, expanding spaCy’s utility and customizability.

In order to take advantage of the spaCy training process, the user need only understand a few basic concepts, such as how the data should look going into the training process (covered in the last notebook) and a few hyperparameters (the things we adjust during training to try to find optimal results).
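In this notebook, those hyperparameters amount to just two values, both of which surface in the training function in section 3.4. Here is a minimal sketch; the constant names are my own, for illustration only:

# Illustrative names only; in section 3.4 these appear as drop=0.2 and
# the iterations argument passed to train_spacy
DROPOUT = 0.2    # chance of randomly dropping parts of the model on each update,
                 # which discourages memorizing the training data
ITERATIONS = 10  # number of full passes the model makes over the training data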

3.3. Preparing the Data

As noted in the last notebook, your input data should be in the following format:

TRAIN_DATA = [ (TEXT AS A STRING, {"entities": [(START, END, LABEL)]}) ]
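For example, a single training pair in this format looks like this (taken from the data we generate below):

TRAIN_DATA = [
    ("Treblinka is a small village in Poland.", {"entities": [(0, 9, "GPE")]})
]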

To begin, let’s bring in the code from the last video to generate our training data:

import spacy

# Load a pretrained pipeline only for its sentence segmentation
nlp = spacy.load("en_core_web_sm")
text = "Treblinka is a small village in Poland. Wikipedia notes that Treblinka is not large."
corpus = []

doc = nlp(text)
for sent in doc.sents:
    corpus.append(sent.text)

# Build a blank English model whose only component is an EntityRuler
nlp = spacy.blank("en")

ruler = nlp.create_pipe("entity_ruler")

patterns = [
    {"label": "GPE", "pattern": "Treblinka"}
]

ruler.add_patterns(patterns)
nlp.add_pipe(ruler)

# Run the ruler over each sentence and record each entity's character offsets
TRAIN_DATA = []
for sentence in corpus:
    doc = nlp(sentence)
    entities = []

    for ent in doc.ents:
        entities.append([ent.start_char, ent.end_char, ent.label_])
    TRAIN_DATA.append([sentence, {"entities": entities}])

print(TRAIN_DATA)
[['Treblinka is a small village in Poland.', {'entities': [[0, 9, 'GPE']]}], ['Wikipedia notes that Treblinka is not large.', {'entities': [[21, 30, 'GPE']]}]]
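Before training, it is worth sanity-checking that each (start, end) pair actually slices out the text it claims to label, since misaligned offsets are a common source of training errors. The snippet below is a minimal check I am adding here; it is not part of the original pipeline:

# Verify that each character span slices out the labeled entity
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, "->", text[start:end])

Both pairs above should print Treblinka as the GPE span.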

3.4. Training a spaCy Model

I find that it is always easiest to create a good, reusable function that works across different scripts. Below is the boilerplate function I use. It appears in the spaCy documentation, but I obtained it from a few different places, including Towards Data Science (a good data science blog with articles from various authors) and Medium. This function takes two arguments: the training data and the number of iterations, that is, the number of times we want the model to pass over our data.

import random

def train_spacy(TRAIN_DATA, iterations):

    # Create a blank spaCy model
    nlp = spacy.blank("en")

    # Add the ner component to the pipeline if it is not already there
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe("ner")

    # Register every label found in the training data with the ner component
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # Disable the other pipes so that training affects only the ner component
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

    # Begin training
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            for text, annotations in TRAIN_DATA:
                nlp.update(
                    [text],
                    [annotations],
                    drop=0.2,  # dropout makes it harder for the model to memorize the data
                    sgd=optimizer,
                    losses=losses
                )
    return nlp

#run function and create a trained model
trained_nlp = train_spacy(TRAIN_DATA, 10)
Starting iteration 0
Starting iteration 1
Starting iteration 2
Starting iteration 3
Starting iteration 4
Starting iteration 5
Starting iteration 6
Starting iteration 7
Starting iteration 8
Starting iteration 9

In the code above, we see the spaCy model train on our very small training set of two sentences. To demonstrate how this works, let’s now try to use our model on a sample sentence.

text = "The village of Treblinka is located in Poland."
doc = trained_nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)
Treblinka GPE

Note that we gave the machine learning NER model a new sentence and it correctly identified Treblinka as a “GPE”. But we should not get too excited. Minor alterations to this text result in a missed entity.

text = "Mark, from New York, said that he wants to go to Treblinka to speak to the locals."
doc = trained_nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)
if len(doc.ents) == 0:
    print("No entities found.")
No entities found.

Why does our model now fail? Because we have trained a machine learning model, not an EntityRuler. It knows that Treblinka is a GPE, but it has only learned to identify it as such in certain contexts, namely alongside words like village, Wikipedia, etc. Machine learning NER models improve vastly as we feed them more training data. Most importantly, however, they improve as we feed them a greater variety of training data.
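Before moving on, one practical note: you will usually want to save a trained model to disk so it can be reloaded in a later session. This step is not shown in the notebook’s code, but to_disk and spacy.load are standard spaCy calls; the directory name here is arbitrary:

# Save the trained pipeline to a directory (the name is arbitrary)
trained_nlp.to_disk("treblinka_ner_model")

# Later, reload it from that same directory and use it like any other model
reloaded_nlp = spacy.load("treblinka_ner_model")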

3.5. How to Improve the Above Model?

So, how might we improve this model? The answer is to generate more training data. This is not always as easy as it may sound. In some cases, it may be necessary to introduce data augmentation (addressed in a later notebook).
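As a rough sketch of what generating more data can look like in practice, you can reuse the EntityRuler pipeline from section 3.3 (still bound to nlp, since train_spacy built its own model internally) on additional text. The sentences below are invented for illustration:

# Run the same EntityRuler pipeline over new sentences to grow TRAIN_DATA.
# These example sentences are made up; in practice you would pull them
# from a real corpus.
more_sentences = [
    "Treblinka sits northeast of Warsaw.",
    "The train passed through Treblinka on its route.",
]

for sentence in more_sentences:
    doc = nlp(sentence)
    entities = [[ent.start_char, ent.end_char, ent.label_] for ent in doc.ents]
    if entities:
        TRAIN_DATA.append([sentence, {"entities": entities}])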

3.6. Exercise

For this notebook’s exercise, try to generate a larger quantity of training data, feed it into a spaCy model, and test that model on new, unseen texts. See how it performs.

3.7. Video

In this notebook, I rushed through this process with little exposition because this material is easier to cover in video form. Please see the video below, in which we perform similar tasks on a larger scale with the characters from the first book of Harry Potter.

%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/7Z1imsp6g10" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>