1. Introduction to Named Entity Recognition¶

Dr. W.J.B. Mattingly Smithsonian Data Science Lab and United States Holocaust Memorial Museum January 2021

1.1. Introduction to this Series of Notebooks¶

These notebooks are designed for those interested in training custom named entity recognition models via the spaCy library. They are intended those who have limited coding experience and no background in natural language processing (NLP). A basic understanding of Python is necessary to partake fully in this series, however, those with no coding experience will still gain a foundational understanding of natural language processing, named entity recognition, the common problems in these fields, and solutions to those problems. For those interested in gaining a quick introduction to Python, please see my series designed for digital humanists at PythonHumanities.com.

The purposes of these notebooks is fivefold.

Introduce the reader to the core concepts of natural language processing and named entity recognition
Provide an introduction to the spaCy library for those with limited Python knowledge
Detail the problems and solutions to working with domain-specific entities
Detail the unique problems that Holocaust documents present to practitioners of NLP
Provide code that will be easily replicable for readers who wish to apply these methods to their own domains.

While these notebooks are dedicated to NER in the domain of the Holocaust, the problems we will encounter in this series are not unique. These notebooks, therefore, shall serve as a guide for those experiencing similar problems in other domains.

Key concepts and terminology will be emboldened. If there are mistakes in grammar, spelling, or code, please reach out to me via twitter (@wjb_mattingly) and I will update the notebooks accordingly.

1.2. Key Concepts in this Notebook¶

natural language processing (NLP)
named entity recognition (NER)
tokens and tokenization
multi-word tokens
spans
pipelines

1.3. What is Natural Language Processing?¶

Named entity recognition (addressed below) is a branch of natural language processing, better known as NLP. NLP is the process by which a researcher uses a computer system to parse human language and extract important metadata from texts. The purpose of NLP is to perform, among other things, distant reading.

Distant reading has a long history extending to the late-twentieth century. It is commonly used when the quantity of texts in a given corpus prevent a researcher (or a team of researchers) from reading the corpus closely in its entirety. In order to make sense of that large corpus, the researcher will often pass certain tasks to a computer with the understanding that there is a margin of error. This margin of error is accepted in exchange for the ability to gain a larger, distant understanding of that corpus. Distant reading is used to perform several significant tasks, such as:

sentiment analysis=> understanding the sentiment of a text
text classification=> classify texts into predetermined categories
named entity recognition=> extract entities from a text

The metadata from these tasks can then be used to get a sense of the texts without reading them closely, hence the term distant reading.

NLP works in tandem with two other similar branches of computational linguistics, natural language understanding, or NLU, and automatic speech recognition, or ASR. To get a better understanding of how the fields of NLP and NLU relate to one another, please see the image below.

This image is an alteration of one that is commonly shared across various NLP tutorials and for good reason. It accurately portrays the diverse field of NLP and its close partner fields of NLU and ASR. The goal of NLP is to feed a text to a computer system and have it return some sort of output. This is often achieved through a series of pipelines that perform some operations on the data at hand.

Earlier pipelines, may include a tokenizer, whose sole job is to break a text into individual tokens. Tokens are items in a text that have some linguistic meaning. They can be words, such as “Martha”, but they can also be punctuation marks, such as “,” in the relative clause “, a senior,”. Likewise, “‘nt” in the contraction “can’t” would also be recognized as a token since “‘nt” in English corresponds to the word “not”.

A common pipeline after a tokenizer is a POS tagger whose job is to identify the parts-of-speech, or POS, in the text. This is essential for the computer to understand how individual tokens are functioning in a sentence. The way in which we perform POS on different languages is not the same. In inflected languages, such as German, or highly inflected languages, such as Latin or Ancient Greek, the ending of the word contains a lot of information about it’s role in the sentence, i.e. a nominative singular or dative plural. In low inflected languages, such as English, position in the sentence holds primacy. English is a Noun-Verb-Object language (NVO). Let us consider an example sentence:

The boy took the ball to the store.

The nominative (subject), “boy”, comes first in the sentence, followed by the verb, “took”, then followed by the accusative (object), “ball”, and finally the dative (indirect object), “store”. The words “the” and “to” also contain vital information. “The” occurs twice and tells the reader that it’s not just any ball, it’s the ball; likewise, it’s not just a store, but the store. The period too tells us something important. This is a statement, not a question. For native speakers of a given language these parts-of-speech may go entirely unnoticed. We understand them intuitively. Some of us may have memories of memorizing parsing trees in 5th grade grammar, but for the most part we developed mentally and linguistically with our mother tongue in a unique way. We can use that language without thought of grammar. For those who have devoted time to learning a second language later in life, grammar is a necessity (and sometimes a bane) to learn. We do not learn languages later in life the same way we learn our mother tongue. For a computer, the same holds true. We need to allow the computer to understand parts of speech.

Named entity recognition will often times come later in a pipeline because it needs to receive a tokenized text and, in some languages, it needs to understand a words POS to perform well. As a text moves through the pipeline, it receives spans that contain valuable information, such as part of speech. Once the text reaches the NER pipeline, it is time for the machine to make some structured decisions about individual tokens.

1.4. What is Named Entity Recognition?¶

Entities are words in a text that correspond to a specific type of data. They can be numerical, such as cardinal numbers; temporal, such as dates; nominal, such as names of people and places; and political, such as geopolitical entities (GPE). In short, an entity can be anything the designer wishes to designate as an item in a text that has a corresponding label.

Named entity recognition, or NER, is the process by which a system takes an input of unstructured data (a text) and outputs structured data, specifically the identification of entities. Let us consider this short example.

Martha, a senior, moved to Spain where she will be playing basketball until 05 June 2022 or until she can’t play any longer.

In this example, we have several potential entities. First, there is “Martha”. Different NER models will have different corresponding labels for such an entity, but PERSON or PER is considered standard practice. Note here that the label is capitalized. This is also standard practice. We also have a GPE, or Geopolitical Entity, notably “Spain”. Finally, we have a DATE entity, “05 June 2022”. These are standard labels that one can expect to extract from a text. If the domain at hand, however, has additional labels, those can be extracted as well. Perhaps the client or user wants to not only extract people, GPEs, and dates, but also sports. In such a scenario “basketball” could be extracted and given the label SPORT.

Not all entities are singular. As is common with texts, sometimes entities are Multi-word Tokens, or MWT. Let us consider the same sentence as above, but with one modification:

Martha Thompson, a senior, moved to Spain where she will be playing basketball until 05 June 2022 or until she can’t play any longer.

Here, Martha now has a surname, “Thompson”. We can either extract Martha and Thompson as individual entities or, as is more common practice, extract both as a single entity, since “Martha Thompson” is a single individual. An NER system, therefore, should recognize “Martha Thompson” as a single, MWT.

As we progress through these notebooks and videos, we will learn new NER concepts. For now, I recommend watching the video below. Each notebook, including this one, will have a corresponding video lesson.

1.5. Video Explanation on NER¶

In this video, I explain these concepts and outline the future parts of this blog.

%%html
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/E9h8qVm2uNY" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>