2. Introduction to spaCy

Dr. W.J.B. Mattingly
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
January 2021

2.1. Key Concepts in this Notebook

  1. frameworks

  2. libraries

  3. rules-based NLP

  4. machine learning-based NLP

  5. tokenization

  6. chunks

  7. noun extraction

  8. part-of-speech identification

  9. entity identification

2.2. What are Frameworks?

In order to engage in NLP, a researcher must first decide upon the framework they wish to use. Framework is a word that describes the software used by the researcher to engage in a specific task. A good way to think about a framework in Pythonic terms is as a library, or packaged set of usable classes and functions to perform complex tasks easily. Deciding which framework to use depends on a few variables.

First, not all frameworks support all languages and not all frameworks support the same languages equally.

Second, certain frameworks perform certain tasks better than others. While nearly all frameworks tokenize equally well, they differ in how they handle other tasks, such as reducing a word to its root: spaCy relies on lemmatization, whereas other libraries, such as NLTK, offer stemming. The choice of framework for this purpose typically matters in computational linguistics or distant reading, where the goal is to find how a word (or words) appears in texts in all of its forms (conjugated and declined).

A third common consideration is the way in which the framework performs NLP. There are essentially two methods for performing NLP: rules-based and machine learning-based. Rules-based NLP is the process by which the framework has a predetermined set of rules for how to handle specific tasks. In order to find entities in a text, for example, a rules-based method will contain a dictionary of all types of entities, or it may contain a regular expression (RegEx) for identifying patterns that match an entity.
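To make the rules-based idea concrete, here is a minimal sketch that uses only the Python standard library. The GAZETTEER dictionary, the YEAR_PATTERN expression, and the find_entities function are hypothetical names invented for this illustration; they are not part of spaCy or any other framework.

```python
import re

# Hypothetical gazetteer: a tiny dictionary mapping known names to entity types.
GAZETTEER = {
    "Martin J. Thompson": "PERSON",
    "Berlin": "GPE",
}

# Hypothetical pattern rule: four-digit years from 1000 to 2099.
YEAR_PATTERN = re.compile(r"\b(1\d{3}|20\d{2})\b")


def find_entities(text):
    """Return (start, end, matched_text, label) tuples sorted by position."""
    entities = []
    # Rule 1: dictionary lookup for every name we already know about.
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            entities.append((match.start(), match.end(), match.group(), label))
    # Rule 2: a regular expression that describes what a year looks like.
    for match in YEAR_PATTERN.finditer(text):
        entities.append((match.start(), match.end(), match.group(), "DATE"))
    return sorted(entities)


sample = "Martin J. Thompson moved to Berlin in 2019."
for start, end, matched, label in find_entities(sample):
    print(matched, label)
```

The weakness of the approach shows as soon as a text mentions a name that is not in the dictionary: the rules simply return nothing for it, and someone has to extend them by hand.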

Most frameworks today are moving away from a rules-based approach to NLP in favor of a machine learning-based approach. Machine learning-based NLP is the process by which developers use statistics to teach a computer system (known as a model) to perform a task based on past experiences (known as training). We will be speaking much more about machine learning-based NLP in a later notebook, as spaCy, the chief subject of this notebook, is a machine learning-based Python library.
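As a toy illustration of what "training" means, the sketch below learns the most frequent part-of-speech tag for each word from a handful of labelled examples. This is an assumption-laden simplification and not how spaCy's models work internally; TRAINING_DATA, train, and predict are hypothetical names made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical hand-labelled training data: (word, part-of-speech) pairs.
TRAINING_DATA = [
    ("he", "PRON"), ("is", "AUX"), ("good", "ADJ"),
    ("writing", "NOUN"), ("is", "AUX"), ("writing", "VERB"),
    ("writing", "NOUN"), ("skills", "NOUN"),
]


def train(examples):
    """Count how often each tag was seen for each word and keep the winner."""
    counts = defaultdict(Counter)
    for word, tag in examples:
        counts[word][tag] += 1
    # The "model" is simply a lookup table of the most frequent tag per word.
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def predict(model, word, default="NOUN"):
    """Return the learned tag, falling back to a default for unseen words."""
    return model.get(word.lower(), default)


model = train(TRAINING_DATA)
print(predict(model, "writing"))  # seen twice as NOUN and once as VERB
print(predict(model, "is"))
```

Unlike a rule, nothing here was written by hand for a specific word: the behavior comes entirely from the examples, so more (or different) training data changes the predictions.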

2.3. What is spaCy?

The spaCy (spelled correctly) library is a robust machine learning NLP library developed by Explosion AI, a Berlin-based team of computer scientists and computational linguists. It supports a wide variety of European languages out-of-the-box with statistical models capable of parsing texts, identifying parts-of-speech, and extracting entities. With spaCy, it is also easy to improve existing models or to train custom models from scratch on domain-specific texts.

In this notebook, we will go through the steps for installing spaCy, downloading a pretrained language model, and performing the essential tasks of NLP.

In order to download and install spaCy and the model, one must do so outside of this notebook. Please watch the video below and follow the necessary steps:

%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/yqruv_QQctI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
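Once the installation steps are complete, a quick check like the one below can confirm that Python is able to find both the library and the model. It is a minimal sketch: is_installed is a hypothetical helper written for this notebook, and it assumes the model was installed as a regular Python package named en_core_web_sm, which is what the spaCy download command normally does.

```python
import importlib.util

# Installation happens outside the notebook, typically from a terminal:
#   pip install spacy
#   python -m spacy download en_core_web_sm


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports "missing", repeat the corresponding installation step before continuing.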

2.4. Sentence Tokenization

A common essential task of NLP is known as tokenization. We looked at tokenization briefly in the last notebook, in which we wanted to break a text into individual components. This is one form of tokenization known as word tokenization. There are, however, many other forms, such as sentence tokenization. Sentence tokenization is precisely the same as word tokenization, except instead of breaking a text up into individual word and punctuation components, we break a text up into individual sentences.

If you are familiar with Python, you may be familiar with the built-in split() string method, which allows a programmer to split a text by whitespace (the default) or, by passing a string argument, at a chosen separator, e.g. split("."). A common practice (without NLP frameworks) is to split a text into sentences by simply using split(), but this is ill-advised. Let us consider the example below:

text = "Martin J. Thompson is known for his writing skills. He is also good at programming."
# Now, let's try to use split() to break the text string apart at every period.
new = text.split(".")
print(new)
['Martin J', ' Thompson is known for his writing skills', ' He is also good at programming', '']

While we were able to split the two sentences, we had the unfortunate result of also splitting at “Martin J.” The reason for this may be obvious. In English, it is common convention to indicate abbreviation with the same punctuation mark used to indicate the end of a sentence. The reason for this extends to the early Middle Ages, when Irish monks began to introduce punctuation and spacing to better read Latin (a story for another day).

The very thing that makes texts easier to read, however, greatly hinders our ability to easily split sentences. For this reason, another method is needed. This is where sentence tokenization comes into play. In order to see how sentence tokenization differs, let’s begin with our first spaCy usage.

# First, we import spaCy
import spacy

# Next, we need to load an NLP model object.
# To do this, we use the spacy.load() function.
# It takes one argument: the name of the model we wish to load.
# We will use the small English model.
nlp = spacy.load("en_core_web_sm")

# With the nlp object created, we can use it to parse a text.
# To do this, we create a doc object.
# This object will contain a lot of data on the text.
doc = nlp(text)

# Try printing the object:
print(doc)
Martin J. Thompson is known for his writing skills. He is also good at programming.
# While this looks identical to the "text" string above, it is quite different.
# To demonstrate this, let us use the sentence tokenizer.
for sent in doc.sents:
    print(sent)
Martin J. Thompson is known for his writing skills.
He is also good at programming.

Notice now that we have used the spaCy sentence tokenizer to generate the desired output: a text correctly broken into sentences. This simple demonstration reveals why using an NLP framework for performing even a basic task is not only easier, but essential. For a larger explanation of this process, please watch the video below:

%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/ytAyCO-n8tY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
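For contrast with the naive text.split(".") from the start of this section, here is a minimal sketch of what a hand-written sentence rule might look like. It is not how spaCy finds sentence boundaries; SENTENCE_BOUNDARY and split_sentences are hypothetical names, and the single rule (do not split after a lone capital letter followed by a period) is an assumption chosen to handle initials such as "J."

```python
import re

text = "Martin J. Thompson is known for his writing skills. He is also good at programming."

# Split on whitespace that follows ., ! or ? and precedes a capital letter,
# unless the period belongs to a single-letter initial such as "J.".
SENTENCE_BOUNDARY = re.compile(r"(?<!\b[A-Z]\.)(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(raw_text):
    """Split raw_text into sentences using the single hand-written rule above."""
    return SENTENCE_BOUNDARY.split(raw_text)


for sentence in split_sentences(text):
    print(sentence)

# The rule handles the initial, but a title it was never told about breaks it:
print(split_sentences("Dr. Smith arrived."))
```

The second call splits after "Dr.", which shows how quickly a rule list has to grow to cover titles, abbreviations, and other special cases.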

2.5. Named Entity Recognition

Another essential task of NLP, and the chief subject of this series, is named entity recognition (NER). I spoke about NER in the last notebook. Here, I’d like to demonstrate how to perform basic NER via spaCy. Again, we will iterate over the doc object as we did above, but instead of iterating over doc.sents, we will iterate over doc.ents. For our purposes right now, I simply want to print off each entity’s text (the string itself) and its corresponding label (note the _ after label, which returns the label as a readable string rather than as an integer ID). I will be explaining this process in much greater detail in the next two notebooks.

for ent in doc.ents:
    print(ent.text, ent.label_)
Martin J. Thompson PERSON

As we can see, the small spaCy statistical machine learning model has correctly identified that Martin J. Thompson is, in fact, an entity. What kind of entity? A person. We will explore how it made this determination in notebook 03, in which we explore machine learning NLP more closely. For a look at this process in video form, please see the video below.

%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/lxHNsXudkrY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

2.6. Part-of-Speech

In the field of computational linguistics, understanding parts-of-speech is essential. The spaCy library offers an easy way to parse a text and identify its parts of speech. Below, we will iterate across each token (word or punctuation) in the text and identify its part of speech.

for token in doc:
    print(token.text, token.pos_)
Martin PROPN
J. PROPN
Thompson PROPN
is AUX
known VERB
for ADP
his DET
writing NOUN
skills NOUN
. PUNCT
He PRON
is AUX
also ADV
good ADJ
at ADP
programming NOUN
. PUNCT

Here, we can see two vital pieces of information: the string and the corresponding part-of-speech (pos). For a complete list of the pos labels, see the spaCy documentation (https://spacy.io/api/annotation#pos-tagging). Most of these, however, should be apparent, e.g. PROPN is a proper noun, AUX is an auxiliary verb, and ADJ is an adjective. For more on this process, please see the video below.

%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/nv0pksknFxY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
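Once the tags are available, ordinary Python is enough to summarize or filter them. The sketch below is one possible way to do so; the tagged list is copied by hand from the output in this section so that the example runs without spaCy, and the variable names are hypothetical.

```python
from collections import Counter

# (token text, part-of-speech) pairs copied by hand from the output above.
# With a live doc object, the same list could be built as:
#   tagged = [(token.text, token.pos_) for token in doc]
tagged = [
    ("Martin", "PROPN"), ("J.", "PROPN"), ("Thompson", "PROPN"),
    ("is", "AUX"), ("known", "VERB"), ("for", "ADP"), ("his", "DET"),
    ("writing", "NOUN"), ("skills", "NOUN"), (".", "PUNCT"),
    ("He", "PRON"), ("is", "AUX"), ("also", "ADV"), ("good", "ADJ"),
    ("at", "ADP"), ("programming", "NOUN"), (".", "PUNCT"),
]

# Count how often each part of speech occurs in the text.
pos_counts = Counter(pos for _, pos in tagged)

# Keep only the tokens tagged as common nouns or as proper nouns.
nouns = [word for word, pos in tagged if pos == "NOUN"]
proper_nouns = [word for word, pos in tagged if pos == "PROPN"]

print(pos_counts["NOUN"], pos_counts["PROPN"])
print(nouns)
print(proper_nouns)
```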

2.7. Extracting Nouns and Noun Chunks

Often when working with a text, we need to extract nouns and noun chunks. There are a few different ways that we can do this via spaCy. To extract both, we can iterate over the doc.noun_chunks attribute.

for chunk in doc.noun_chunks:
    print(chunk.text)
Martin J. Thompson
his writing skills
He
programming

Note that we get a list of all nouns and noun chunks, i.e. “He” and “programming” being nouns and “Martin J. Thompson” and “his writing skills” being noun chunks. For more on this, please see the video below.

%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/aNKt1gKK8Lo" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

2.8. Extracting Verbs and Verb Phrases

We can do precisely the same thing with verbs and verb phrases by defining token patterns, which work much like RegEx but match on linguistic attributes, such as part of speech, rather than on characters. This is a bit more complicated and requires an understanding of linguistic patterns and the textacy library. We will try to find all instances of a specific pattern: an auxiliary verb followed by a normal verb.

# We import textacy
import textacy

# We create our pattern as a list of dictionaries, one dictionary per token
patterns = [{"POS": "AUX"}, {"POS": "VERB"}]

# We create verb_phrases, which leverages textacy to find the pattern in our doc object.
# Note: this call matches the textacy version used in this notebook; newer textacy
# releases appear to expose the same functionality as textacy.extract.token_matches.
verb_phrases = textacy.extract.matches(doc, patterns=patterns)

# We iterate across the verb_phrases
for verb_phrase in verb_phrases:
    print(verb_phrase)
is known

And we correctly find the one instance of an auxiliary verb followed by a regular verb: “is known”. For more on this, please see the video below:

%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/VgGHwIWu-kU" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
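To show what a pattern such as AUX followed by VERB is doing conceptually, here is a minimal sketch in plain Python. It is not textacy's or spaCy's implementation; find_pos_sequences is a hypothetical helper, and the tagged list is copied by hand from the part-of-speech output earlier in this notebook.

```python
# (token text, part-of-speech) pairs copied by hand from the earlier output.
tagged = [
    ("Martin", "PROPN"), ("J.", "PROPN"), ("Thompson", "PROPN"),
    ("is", "AUX"), ("known", "VERB"), ("for", "ADP"), ("his", "DET"),
    ("writing", "NOUN"), ("skills", "NOUN"), (".", "PUNCT"),
    ("He", "PRON"), ("is", "AUX"), ("also", "ADV"), ("good", "ADJ"),
    ("at", "ADP"), ("programming", "NOUN"), (".", "PUNCT"),
]


def find_pos_sequences(tagged_tokens, pos_sequence):
    """Return every run of tokens whose tags equal pos_sequence, in order."""
    width = len(pos_sequence)
    matches = []
    # Slide a window of the pattern's length across the tokens.
    for start in range(len(tagged_tokens) - width + 1):
        window = tagged_tokens[start:start + width]
        if [pos for _, pos in window] == list(pos_sequence):
            matches.append(" ".join(word for word, _ in window))
    return matches


print(find_pos_sequences(tagged, ["AUX", "VERB"]))
```

The second "is" in the text is followed by the adverb "also", not by a verb, which is why only "is known" matches.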

2.9. Lemmatization

The final item I’d like to explore in this notebook is lemmatization. Reducing words to their roots is an essential component in most NLP frameworks, though libraries approach it differently. While some libraries, such as NLTK, offer stemmers that find word stems, spaCy finds word lemmas. The two are technically a little different: a stem is what remains after affixes are stripped and need not be a real word, whereas a lemma is the dictionary form of a word. Both seek to reduce words to their roots. To find lemmas via spaCy, we use the same process as we did for finding a word’s part of speech: iterating over the tokens in the doc object.

for token in doc:
    print(token.text, token.lemma_)
Martin Martin
J. J.
Thompson Thompson
is be
known know
for for
his -PRON-
writing writing
skills skill
. .
He -PRON-
is be
also also
good good
at at
programming programming
. .

Note that most words remain the same, but notice particularly that “is” is identified as “be” and “known” becomes “know”. These are the respective lemmas for these verbs. Also notice the same effect on nouns, such as “skills”, a plural, being reduced to “skill”, the singular form. The -PRON- value is a placeholder lemma that the spaCy version used here assigns to pronouns such as “his” and “He”; later releases of spaCy no longer use it. For more on this, see the video below.

%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/YztOLsJkC3A" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
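The difference between a stem and a lemma is easier to see side by side. The sketch below is a deliberately crude illustration: crude_stem is a hypothetical suffix-stripping function (not the Porter stemmer or any library's algorithm), and lookup_lemma is a hypothetical table lookup whose entries are copied from the spaCy output in this section (it is not spaCy's lemmatizer).

```python
# Lemmas copied by hand from the spaCy output shown in this section.
LEMMA_TABLE = {"is": "be", "known": "know", "skills": "skill"}

# Suffixes the crude stemmer strips, checked in this order.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip one known suffix, leaving at least three characters behind."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Return the dictionary form if the table knows it, else the word itself."""
    return LEMMA_TABLE.get(word.lower(), word)


for word in ["is", "known", "skills", "writing", "programming"]:
    print(word, crude_stem(word), lookup_lemma(word))
```

The stemmer turns "writing" into "writ", which is not a word, and it cannot relate "is" to "be" at all, while the lemma lookup returns dictionary forms but only for words it has been given.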