# Learning What’s in a Name with Graphical Models

An overview of the structure and statistical assumptions behind linear-chain graphical models – effective and interpretable approaches to sequence labeling problems such as named entity recognition.

Starting prediction server…

“The UK” alone is a country, but “The UK Department of Transport” is an organization within said country. In a named entity recognition (NER) task, where we want to label each word with a name tag (organization/person/location/not a name), how can a computer model know one from the other?

In such cases, contextual information is key. In the second example, the fact that “The UK” is followed by “Department” is compelling evidence that when taken together the phrase refers to an organization. Sequence models – machine learning systems designed to take sequences of data as input, can recognize and put such relationships to productive use. Rather than making isolated predictions based on individual words, they take the given sequence as a combined unit, model the dependencies between words in that sequences, and depending on the problem can return the most likely label sequence.

In this article, we’ll explore three sequence models that are particularly successful at NER: Hidden Markov Models (HMMs), Maximum-Entropy Markov Models (MEMMs), and Conditional Random Fields (CRFs). All three are probabilistic graphical models, which we’ll cover in the next section.

Graphical modeling is a robust framework for representing probabilistic models. Complex multivariate probability distributions can be expressed with compact graphs that are vastly easier to understand and interpret.

Let’s start with a simple example with just two random variables, $A$ and $B$. Assume that $B$ is conditionally dependent on $A$. Through a canonical application of the chain rule, the joint distribution of $A$ and $B$ is:

Notation

*p(a)*to mean

*p(A = a)*, that is, the probability of variable A taking value a.

This is a simple enough example, with just 2 factors in the right hand side. Add more variables, however, and the result can get messy fast. To see this, assume that there are two more variables, $C$ and $D$, and that $D$ is conditionally dependent on $A$, $B$, and $C$. The factorization becomes:

The relationship between variables is more opaque, hidden behind second-order dependencies. For example, while it’s clear that $D$ is directly dependent on $A$, we may miss the fact that there is another, second-order dependency between the two ($D$ is dependent on $B$, which in turn is dependent on $A$).

Directed Acyclic Graphs, or DAGs, offer a natural remedy to this problem. Each factor in the equation can be represented by a node. An arrow indicates conditional dependence. The resulting graph would look like:

With this graph, it’s easier to construct a generative story of how $A$, $B$, $C$ and $D$ are sampled. The process proceeds in topological order, for example $A$ → $C$ → $B$ → $D$, to ensure that all dependencies have been resolved by the time each variable is sampled.

Below is what a sampled population of the given distributions would look like. For the sake of demonstration, some distribution parameters are modifiable – in reality these are the quantities that need to be learned from training data.

Sampled Population

Hidden Markov Models (HMMs) are an early class of probabilistic graphical models representing partially hidden (unobserved) sequences of events. Structurally, they are built with two main layers, one hidden and one observed:

States inside the hidden layer ($X_n$) can be inferred using data in the observed layer ($Y_n$).

The hidden layer is assumed to be a Markov process: a chain of events in which each event’s probability depends only on the state of the preceding event. More formally, given a sequence of $N$ random events $X_1$, $X_2$,…, $X_N$, the Markov assumption holds that:

In a graph, this translates to a linear chain of events where each event has exactly one arrow pointing towards it (except for the first event) and one pointing away from it (except for the last):

A second assumption that HMMs make is time-homogeneity: that the probability of transition from one event's state to the next is constant over time. In formal terms:

$p(x_i | x_{i-1})$ is called the transition probability and is one of the two key parameters to be learned during training.

The assumptions about the hidden layer, Markov and time-homogeneity, hold up in many time-based systems where the hidden, unobserved events occur sequentially, one after the other. Together, they help to significantly reduce the computational complexity of both learning and inference.

The hidden and observed layer are connected via a one-to-one mapping relationship. The probability of each observed event is assumed to depend only on the state of the hidden event at the same time step. Given a sequence of $N$ hidden events $X_1$, $X_2$,…, $X_N$ and observed events $Y_1$, $Y_2$,…, $Y_N$ we have:

In a graph, this one-to-one relationship looks like:

The conditional probability $p(y_i | x_i)$, called the emission probability, is also assumed to be time-homogenous, further reducing the model's complexity. It is the second key parameter to be learned, alongside the transition probability.

HMMs’ chain structure with two layers, one hidden and one observed, makes them particularly suited to sequence labeling problems like NER. Unknown name tags constitute the hidden layer, while word tokens from the input sequence make up the observed layer:

Notation

Each name tag apart from the “O” tag contains either a B- (beginning) or I- (inside) prefix. This way, when there are two named entities next to each other (“Argentine Nestor”), there is no confusion about when an entity stops and the next one begins.

There are 9 name tags, resulting in a total of 9² = 81 possible transitions between any two consecutive hidden states. Each transition path has its own associated transition probability:

– transition visualization

The number of emission probabilities is orders of magnitude larger. Each node in the observed layer can be any word in the known vocabulary, whose size ranges anywhere from thousands to hundreds of thousands of words (the one created for the HMM in this article contains 23,622 tokens). Let N be the number of word tokens in the known vocabulary. There is a total of 9N possible emission probabilities:

– emission visualization

The training step involves learning the two key sets of parameters, the transition and emission probabilities, as well as a special set of start probabilities. All can be obtained as normalized rates from the training data.

For example, to get the transition probability from state “O” to state “B-LOC”, $p(B{-}LOC | O)$, we need only two numbers: the number of times state “O” is followed by any other state (that is, it isn't the last state in the sequence), as $N_O$, and the number of times state “O” is followed by state “B-LOC”, as $N_{O→B-LOC}$. The desired transition probaility is simply $\frac{N_{O→B-LOC}}{N_O}$.

In the context of HMMs, inference involves answering useful questions about hidden states given observed values, or about missing values in a partially observed sequence. For NER, we are focused on the first type of inference. Specifically, we want to perform maximum a posteriori (MAP) inference: identifying the most likely sequence of states in the hidden layer conditioned on observed values.

The number of possible sequences in the hidden layer can easily be intractably large. For any two consecutive states, there are 81 potential transition paths. For 3 consecutive states there are 82² different paths. For 4 states there are 82³ paths, and so on.

– trellis graph w/ 4 states & potentially fade-out effect on the right

Given the massive number of potential paths, brute force algorithms are not an option. Doing so would involve calculating the probability of every possible transition path in a sequence, and then finding the path with the highest probability among them.

Luckily, there is an efficient strategy that accomplishes the same goal with substantially less complexity: the Viterbi algorithm. It is a dynamic algorithm that constructs a trellis-shaped memory structure to progressively find the most likely path as it progesses through the given sequence. For a detailed account of the Viterbi algorithm and its exact steps, refer to