Natural language is sequential but highly contextual, comprising of multiple interdependent parts. Consider the following sentences:
“I can never sing bass.”
"A light bass provided balance to the song."
"A large, beautiful bass flashed by the boat."
"Bass" means something different in each sentence and sounds different in the third sentence compared to the first two. A human reader would find the meanings to be rather intuitive given the context available in each sentence. For example, the fact that the "bass" flashed by a boat tells us that it can move and is likely underwater. Given the limited set of definitions for "bass", we quickly infer that it must be referring to a type of fish.
This chain of reasoning happens quickly and comes as second nature to human readers. But can we teach a machine to reason in a similar way? Can a language model disambiguate the role and meaning of words that are highly dependent on those around them?
Without resorting to complex neural networks, there are compact probabilistic models provide reliable and interpretable predictions in many sequence modeling problems, like part-of-speech tagging and named entity recognition.
In this article, we’ll explore three such models, each building on top of the one before: Hidden Markov Models (HMMs), Maximum-Entropy Markov Models (MEMMs), and Conditional Random Fields (CRFs). All three are probabilistic graphical models, which we’ll cover in the next section.
Probabilistic models can easily ballon in complexity as we add more and more variables to the system. This makes complex models with more than a few variables difficult to interpret, if not inscrutable.
Let’s start with a simple example with just two random variables, and . Assume that is conditionally dependent on , so we can write its distribution as . By the definition of conditional probabilities, we have:
Now, we want to model the joint distribution of and – . This appears in the equation above, so we just need to rearrange the terms to get:
If we know the values of and , we can now derive . This is a simple enough example.
Add more variables, however, and the resulting factorization can get messy, fast. Assume we have two more variables: and . is conditionally dependent on and , and on and . The factorization becomes:
The relationship between variables is noticeably more opaque, hidden behind second and third-order dependencies. This will only get worse as we add more variables. What we need is a compact way to represent this large network of conditional dependencies.
Graphs offer a natural solution to this problem. Complex, multivariate distributions can be factorized into compact graphical representations that are easier to understand and interpret.
Let's return to the first example with two variables. The joint probability distribution is, again:
In a graph, each term on the right hand side can be represented by a single node. An arrow indicates conditional dependence. To express the conditional dependence in , we add an arrow pointing from to :
We can construct a generative story of how A and B are sampled from the graphical representation:
Click "Draw Samples" to generate new random samples from A and B.