Introduction to spaCy

Dr. W.J.B. Mattingly
Smithsonian Data Science Lab and United States Holocaust Memorial Museum
January 2021

2.1. Key Concepts in this Notebook

  1. frameworks

  2. libraries

  3. rules-based NLP

  4. machine learning-based NLP

  5. tokenization

  6. chunks

  7. noun extraction

  8. part-of-speech identification

  9. entity identification

2.2. What are Frameworks?

In order to engage in NLP, a researcher must first decide upon the framework they wish to use. Framework is a word that describes the software used by the researcher to perform a specific task. A good way to think about a framework in Pythonic terms is as a library, or a packaged set of usable classes and functions that make complex tasks easier to perform. I will use the word “Pythonic” throughout this book. Pythonic is a term Python programmers use to refer to the standard, or community-accepted, way to do something. A good example is the way in which one imports pandas, a library for analyzing and working with tabular data. When we import pandas, we import it as pd. Why? Because the documentation tells us to do so and, perhaps even more importantly, everyone in the community follows this convention. Deciding which framework to use depends on a few variables.

First, not all frameworks support all languages and not all frameworks support the same languages equally.

Second, certain frameworks perform certain tasks better than others. While all frameworks will (usually) tokenize equally well, the way in which they handle other tasks will vary, such as reducing words to their roots via lemmatization (spaCy) vs. stemming (e.g., the stemmers in NLTK). The choice of framework for this purpose typically matters in computational linguistics or distant reading, where the goal is to find how a word (or words) appears in texts in all of its forms (conjugated and declined).

A third common consideration is the way in which the framework performs NLP. There are essentially two methods for performing NLP: rules-based and machine learning-based. Rules-based NLP is the process by which the framework has a predetermined set of rules for how to handle specific tasks. In order to find entities in a text, for example, a rules-based method will contain a dictionary of all types of entities or it may contain a RegEx pattern for identifying strings that match an entity.
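To make the rules-based approach concrete, here is a minimal sketch in plain Python. It is not how any particular framework implements its rules; the dictionary `GAZETTEER`, the pattern `YEAR_PATTERN`, the function `find_entities`, and the labels are all hypothetical names chosen for illustration.

```python
import re

# Hypothetical lookup dictionary mapping known names to entity types.
GAZETTEER = {
    "Martin J. Thompson": "PERSON",
    "Berlin": "GPE",
}

# Hypothetical RegEx rule: a four-digit year (1000-2099) is treated as a DATE.
YEAR_PATTERN = re.compile(r"\b(?:1\d{3}|20\d{2})\b")


def find_entities(text):
    """Return sorted (start, end, text, label) tuples found by the two rules."""
    found = []
    # Rule 1: exact dictionary lookup.
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.end(), name, label))
    # Rule 2: pattern matching with a regular expression.
    for match in YEAR_PATTERN.finditer(text):
        found.append((match.start(), match.end(), match.group(), "DATE"))
    return sorted(found)


sample = "Martin J. Thompson moved to Berlin in 2019."
for start, end, span_text, label in find_entities(sample):
    print(span_text, label)
```

The weakness of this approach is visible immediately: any name missing from the dictionary, or any date not written as a four-digit year, is silently skipped.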

Most frameworks today are moving away from a rules-based approach to NLP in favor of a machine learning-based approach. Machine learning-based NLP is the process by which developers use statistics to teach a computer system (known as a model) to perform a task based on past experiences (known as training). We will be speaking much more about machine learning-based NLP in a later notebook, as spaCy, the chief subject of this notebook, is a machine learning-based Python library.
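The idea of learning from past examples, rather than from hand-written rules, can be illustrated with a deliberately tiny sketch. This is only a toy: it counts labels per word and cannot generalize to unseen words, which real statistical models (such as those in spaCy) are designed to do. The training data, labels, and function names below are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-labeled training examples: (token, label) pairs.
TRAINING_DATA = [
    ("Martin", "PERSON"), ("Thompson", "PERSON"), ("Martin", "PERSON"),
    ("Berlin", "PLACE"), ("Paris", "PLACE"), ("Berlin", "PLACE"),
    ("writing", "O"), ("programming", "O"),
]


def train(examples):
    """'Training': count how often each token appeared with each label."""
    counts = defaultdict(Counter)
    for token, label in examples:
        counts[token][label] += 1
    return counts


def predict(model, token, default="O"):
    """Predict the label seen most often for this token during training."""
    if token not in model:
        return default
    return model[token].most_common(1)[0][0]


model = train(TRAINING_DATA)
print(predict(model, "Berlin"))   # seen during training
print(predict(model, "London"))   # never seen, so the default label is used
```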

2.3. What is spaCy?

The spaCy (spelled correctly) library is a robust machine learning NLP library developed by Explosion AI, a Berlin-based team of computer scientists and computational linguists. It supports a wide variety of European languages out-of-the-box with statistical models capable of parsing texts, identifying parts-of-speech, and extracting entities. With spaCy, it is also easy to improve existing models or train custom models from scratch on domain-specific texts.

In this notebook, we will go through the steps for installing spaCy, downloading a pretrained language model, and performing the essential tasks of NLP.

In order to download and install spaCy and the model, one must do so outside of this notebook. Please watch the video below and follow the necessary steps:
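Once the installation steps are complete, a quick check like the sketch below can confirm that both spaCy and the small English model are importable. It uses only the standard library; the helper name `is_installed` is hypothetical. The two commands printed as hints (`pip install spacy` and `python -m spacy download en_core_web_sm`) are the usual terminal commands for this setup, though your environment may call for a different approach.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# The model is installed as a regular Python package named en_core_web_sm.
install_hints = {
    "spacy": "pip install spacy",
    "en_core_web_sm": "python -m spacy download en_core_web_sm",
}

for name, hint in install_hints.items():
    if is_installed(name):
        print(name, "is installed")
    else:
        print(name, "is missing; try running:", hint)
```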

<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/yqruv_QQctI" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

2.4. Sentence Tokenization

A common essential task of NLP is known as tokenization. We looked at tokenization briefly in the last notebook in which we wanted to break a text into individual components. This is one form of tokenization known as word tokenization. There are, however, many other forms, such as sentence tokenization. Sentence tokenization is precisely the same as word tokenization, except instead of breaking a text up into individual word and punctuation components, we break a text up into individual sentences.

If you are familiar with Python, you may be familiar with the built-in string method split(), which allows a programmer to split a text by whitespace (the default) or by a separator string passed as an argument, e.g. split("."). A common practice (without NLP frameworks) is to split a text into sentences by simply using split(), but this is ill-advised. Let us consider the example below:

text = "Martin J. Thompson is known for his writing skills. He is also good at programming."
#Now, let's try and use the split function to split the text object based on punctuation.
new = text.split(".")
print (new)
['Martin J', ' Thompson is known for his writing skills', ' He is also good at programming', '']

While we were able to split the two sentences, we had the unfortunate result of also splitting at Martin J. The reason for this may be obvious. In English, it is common convention to indicate an abbreviation with the same punctuation mark used to indicate the end of a sentence. This convention extends back to the early Middle Ages, when Irish monks began to introduce punctuation and spacing to better read Latin (a story for another day).

The very thing that makes texts easier to read, however, greatly hinders our ability to easily split sentences. For this reason, another method is needed. This is where sentence tokenization comes into play. In order to see how sentence tokenization differs, let’s begin with our first spaCy usage.
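Before turning to spaCy, it is worth seeing how far a hand-written rule can take us. The sketch below is an assumption-laden illustration, not a recommended approach: it uses a regular expression (the hypothetical `SENTENCE_BOUNDARY`) that splits after sentence-ending punctuation followed by a capital letter, unless the period belongs to a single-letter initial such as “J.”.

```python
import re

# Split on whitespace that follows ., ! or ? and precedes a capital letter,
# but not when the period belongs to a single capital initial such as "J.".
SENTENCE_BOUNDARY = re.compile(r"(?<!\b[A-Z]\.)(?<=[.!?])\s+(?=[A-Z])")


def naive_sentences(text):
    """Return a list of sentences using the single rule above."""
    return SENTENCE_BOUNDARY.split(text)


text = "Martin J. Thompson is known for his writing skills. He is also good at programming."
for sentence in naive_sentences(text):
    print(sentence)

# The rule is brittle: a title such as "Dr." is still treated as a sentence end.
print(naive_sentences("Dr. Thompson writes well."))
```

Each new exception (titles, “e.g.”, “U.S.”, and so on) needs another rule, which is why a dedicated sentence tokenizer is preferable.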

#First, we import spaCy
import spacy
#Next, we need to load an NLP model object.
#To do this, we use the spacy.load() function.
#This will take one argument, the model one wishes to load.
#We will use the small English model.
nlp = spacy.load("en_core_web_sm")
#With the nlp object created, we can use it to parse a text.
#To do this, we create a doc object.
#This object will contain a lot of data on the text.

doc = nlp(text)
#try printing the object:
print (doc)
Martin J. Thompson is known for his writing skills. He is also good at programming.
#While this looks identical to the "text" string above, it is quite different.
#To demonstrate this, let us use the sentence tokenizer.

for sent in doc.sents:
    print (sent)
Martin J. Thompson is known for his writing skills.
He is also good at programming.

Notice now that we have used the spaCy sentence tokenizer to generate a desired output: a text correctly broken into sentences. This simple demonstration reveals why using an NLP framework for performing even a basic task is not only easier, but essential. For a larger explanation of this process, please watch the video below:
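The difference between a plain string and a doc object can be made concrete with a short sketch. To keep it runnable without a model download, it assumes a tokenizer-only pipeline created with `spacy.blank("en")`; the names `nlp_blank` and `doc_blank` are hypothetical and chosen so as not to clash with the `nlp` and `doc` objects used in this notebook.

```python
import spacy

# spacy.blank("en") builds an English pipeline that only tokenizes.
nlp_blank = spacy.blank("en")

text = "Martin J. Thompson is known for his writing skills. He is also good at programming."
doc_blank = nlp_blank(text)

# A str is a sequence of characters; a Doc is a sequence of tokens.
print(type(text).__name__, len(text))
print(type(doc_blank).__name__, len(doc_blank))

# Slicing a Doc by token index returns a Span, not a substring.
print(doc_blank[3:5].text)

# Tokenization is non-destructive: the original string can be recovered.
print(doc_blank.text == text)
```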

<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/ytAyCO-n8tY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

2.5. Named Entity Recognition

Another essential task of NLP, and the chief subject of this series, is named entity recognition (NER). I spoke about NER in the last notebook. Here, I’d like to demonstrate how to perform basic NER via spaCy. Again, we will iterate over the doc object as we did above, but instead of iterating over doc.sents, we will iterate over doc.ents. For our purposes right now, I simply want to print off each entity’s text (the string itself) and its corresponding label (note the _ after label). I will be explaining this process in much greater detail in the next two notebooks.

for ent in doc.ents:
    print (ent.text, ent.label_)
Martin J. Thompson PERSON

As we can see, the small spaCy statistical machine learning model has correctly identified that Martin J. Thompson is, in fact, an entity. What kind of entity? A person. We will explore how it made this determination in notebook 03, in which we explore machine learning NLP more closely. For a look at this process in video form, please see the video below.
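The underscore in `label_` matters: in spaCy, attributes without the trailing underscore return integer IDs, while those with it return readable strings. The sketch below illustrates this under one loud assumption: the entity is set by hand on a tokenizer-only pipeline (`spacy.blank("en")`), not predicted by a model, so that the example runs without a model download. The names `nlp_blank` and `doc_blank` are hypothetical.

```python
import spacy
from spacy.tokens import Span

# Tokenizer-only pipeline; the entity below is assigned manually, not predicted.
nlp_blank = spacy.blank("en")
doc_blank = nlp_blank("Martin J. Thompson is known for his writing skills.")
doc_blank.ents = [Span(doc_blank, 0, 3, label="PERSON")]

for ent in doc_blank.ents:
    # label_ (with underscore) is the human-readable string.
    print(ent.text, ent.label_)
    # label (no underscore) is an integer ID stored in the vocabulary.
    print(type(ent.label).__name__)
    # The ID can be mapped back to its string form.
    print(nlp_blank.vocab.strings[ent.label])
    # Character offsets of the entity within the original text.
    print(ent.start_char, ent.end_char)
```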

<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/lxHNsXudkrY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

2.6. Part-of-Speech

In the field of computational linguistics, understanding parts-of-speech is essential. The spaCy library offers an easy way to parse a text and identify its parts of speech. Below, we will iterate across each token (word or punctuation) in the text and identify its part of speech.

for token in doc:
    print(token.text, token.pos_)
Martin PROPN
J. PROPN
Thompson PROPN
is AUX
known VERB
for ADP
his PRON
writing NOUN
skills NOUN
. PUNCT
He PRON
is AUX
also ADV
good ADJ
at ADP
programming NOUN
. PUNCT

Here, we can see two vital pieces of information: the string and the corresponding part-of-speech (pos). For a complete list of the pos labels, see the spaCy documentation (https://spacy.io/api/annotation#pos-tagging). Most of these, however, should be apparent, e.g. PROPN is proper noun, AUX is auxiliary verb, ADJ is adjective, etc. For more on this process, please see the video below.
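When a label is not self-explanatory, spaCy's built-in glossary can be queried with `spacy.explain()`, which does not require a model to be loaded. The short sketch below looks up a few of the tags seen in this notebook; the list name `tags_seen` is hypothetical.

```python
import spacy

# A few of the part-of-speech tags that appear in this notebook.
tags_seen = ["PROPN", "AUX", "VERB", "ADP", "PRON", "NOUN", "ADV", "ADJ"]

for tag in tags_seen:
    # spacy.explain() returns a short description, or None for unknown labels.
    print(tag, "=", spacy.explain(tag))
```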

<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/nv0pksknFxY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

2.7. Extracting Nouns and Noun Chunks

Oftentimes when working with a text, we need to extract nouns and noun chunks. There are a few different ways that we can do this via spaCy. To extract nouns and noun chunks, we can use the doc.noun_chunks attribute.

for chunk in doc.noun_chunks:
    print (chunk)
Martin J. Thompson
his writing skills
He
programming

Note that we get a list of all nouns and noun chunks, i.e. “He” and “programming” being nouns and “Martin J. Thompson” and “his writing skills” being noun chunks. For more on this, please see the video below.

<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/aNKt1gKK8Lo" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

2.8. Extracting Verbs and Verb Phrases

In order to extract all verbs, we can leverage the POS tagger’s output in spaCy. We can establish a for loop to iterate over all tokens in the doc object and then print off just the ones whose POS tag is either “VERB” or “AUX”. These are the two POS tags used to identify tokens in a sentence that function as verbs.

verbs = ["VERB", "AUX"]
for token in doc:
    if token.pos_ in verbs:
        print (token.text, token.pos_)
is AUX
known VERB
is AUX

Printing off verb phrases is, however, a bit trickier because of how complex verb phrases can be. It requires an understanding of linguistic patterns, as we shall see. In a previous edition of this textbook, I opted to use the library Textacy. In this edition, however, I have opted to do this with spaCy itself. We will try to find all instances of a specific pattern: an auxiliary verb followed by a normal verb.

The code below is above that of a beginner. I do not expect you to understand it at this stage, but it demonstrates how to extract verb phrases from spaCy. Let’s break down a bit of what is happening here. We are first going to import the Matcher from spaCy. Matcher allows us to match patterns in a text. It is particularly suited for finding verb phrases, but can do more robust things as well.

Next, we need to create a new NLP object. So that we don’t confuse it with the object nlp referenced earlier in this chapter, I opt to call it nlp_matcher. Next, we need to create a matcher object. This will be our spaCy Matcher. After that, we need to create a set of patterns. It’s important that you store these as lists within a list. Each pattern will be an item in the main list. With these patterns, we can find a sequence of tokens that match a specific pattern. In this case, we want to find all sequences that are an auxiliary verb followed by a verb, e.g. “is known”.

Next, we add these patterns into the matcher and then create a new doc object (doc2). Finally, we iterate over all the matches and recreate the text by using the data provided by the matcher. Note that each match is a tuple in which the start and end token positions sit at index 1 and 2. In the output below, these are 3 and 5, which we can use to slice the doc object and recreate the text.

#We import the Matcher
from spacy.matcher import Matcher

nlp_matcher = spacy.load("en_core_web_sm")

matcher = Matcher(nlp_matcher.vocab)

#We create our patterns as a list of lists of dictionaries
pattern = [
    [{"POS": "AUX"}, {"POS": "VERB"}]
]

matcher.add("verb-phrases", pattern)

doc2 = nlp_matcher(text)
matches = matcher(doc2)
for match in matches:
    print (match)
    #Each match is a tuple of (match_id, start, end)
    span = doc2[match[1]:match[2]]
    print (span)
(13004193102528962121, 3, 5)
is known

And we correctly find the one instance of an auxiliary verb followed by a regular verb: “is known”. As I stated, this is a bit more complex than finding noun chunks, but as we will see, the Matcher as well as Regular Expressions, known as RegEx (see Part 04.02 and Part 04.03), can be used to do complex pattern matching in spaCy.
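For readers who want to see the logic behind this kind of pattern matching without spaCy, here is a plain-Python sketch of the same idea. It is not how the spaCy Matcher is implemented; it only mimics the result for one simple pattern. The (text, POS) pairs are hand-written for the first sentence (the tags for “J.” and “.” are assumed), and the function name `find_pos_sequences` is hypothetical.

```python
# Hand-written (text, POS) pairs for the first sentence of our example text.
tagged_tokens = [
    ("Martin", "PROPN"), ("J.", "PROPN"), ("Thompson", "PROPN"),
    ("is", "AUX"), ("known", "VERB"), ("for", "ADP"),
    ("his", "PRON"), ("writing", "NOUN"), ("skills", "NOUN"), (".", "PUNCT"),
]


def find_pos_sequences(tokens, pos_sequence):
    """Return (start, end) pairs where consecutive POS tags match pos_sequence.

    As in the Matcher output, end is exclusive: tokens[start:end] is the match.
    """
    width = len(pos_sequence)
    spans = []
    for start in range(len(tokens) - width + 1):
        window = [pos for _, pos in tokens[start:start + width]]
        if window == list(pos_sequence):
            spans.append((start, start + width))
    return spans


for start, end in find_pos_sequences(tagged_tokens, ["AUX", "VERB"]):
    words = [word for word, _ in tagged_tokens[start:end]]
    print((start, end), " ".join(words))
```

The start and end values line up with index 1 and 2 of the Matcher output: slicing from 3 up to (but not including) 5 recovers “is known”.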

<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/VgGHwIWu-kU" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

2.9. Lemmatization

The final item I’d like to explore in this notebook is lemmatization. Lemmatization is an essential component in most NLP frameworks, though libraries approach the task of reducing words differently. While some libraries, such as NLTK, offer stemmers that find word stems, spaCy will find word lemmas. They are technically a little different, but both seek to reduce all words to their roots. To find lemmas via spaCy, we use the same process as we did for finding a word’s part of speech: iterating over the tokens in the doc object.

for token in doc:
    print(token.text, token.lemma_)

Note that we see most words remain the same, but notice particularly “is” being identified as “be” and “known” becomes “know”. These are the respective lemmas for these verbs. Also notice the same effect on nouns, such as “skills”, a plural, being reduced to “skill”, the singular form. For more on this, see the video below.
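The difference between a stem and a lemma can be sketched in plain Python. Both functions below are toys written for illustration only: `toy_stem` strips a common suffix without consulting any dictionary, while `toy_lemma` looks words up in a tiny, hypothetical table (`LEMMA_TABLE`). Neither reflects how spaCy or any other library actually implements these steps.

```python
# Hypothetical lookup table; real lemmatizers rely on far larger resources.
LEMMA_TABLE = {"is": "be", "known": "know", "skills": "skill"}


def toy_stem(word):
    """Crude stemmer: strip one common suffix without checking a dictionary."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when enough of the word would remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Crude lemmatizer: return the dictionary form if the word is known."""
    return LEMMA_TABLE.get(word, word)


for word in ["is", "known", "skills", "programming"]:
    print(word, toy_stem(word), toy_lemma(word))
```

The stemmer turns “programming” into “programm”, which is not a word, and has no way of relating “is” to “be”. A lemma, by contrast, is always a valid dictionary form.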

<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/YztOLsJkC3A" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>