
Learning What’s in a Name with Graphical Models

An overview of the structure and statistical assumptions behind linear-chain graphical models – effective and interpretable approaches to sequence labeling problems such as named entity recognition.

Which of these words refer to a named entity?


I. Probabilistic Graphical Models

Factorizing Joint Distributions

p(a, b) = p(a) \cdot p(b|a)
p(a, b, c, d) = p(a) \cdot p(b|a) \cdot p(c) \cdot p(d|a, b, c)
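
To make the factorization concrete, here is a minimal Python sketch with made-up probability tables; the assert checks that the factored form still defines a valid joint distribution:

```python
# A minimal sketch of p(a, b) = p(a) * p(b | a), with made-up numbers.
p_a = {True: 0.3, False: 0.7}                   # p(a)
p_b_given_a = {True: {True: 0.9, False: 0.1},   # p(b | a = True)
               False: {True: 0.2, False: 0.8}}  # p(b | a = False)

def p_joint(a: bool, b: bool) -> float:
    """p(a, b) = p(a) * p(b | a)"""
    return p_a[a] * p_b_given_a[a][b]

# The factored form still sums to 1 over all four outcomes.
assert abs(sum(p_joint(a, b) for a in (True, False)
                             for b in (True, False)) - 1.0) < 1e-9
```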

Directed Acyclic Graphs

Sampled Population

II. Hidden Markov Models

The Hidden Layer

p(s_i | s_1, s_2, \ldots, s_{i-1}) = p(s_i | s_{i-1}) \quad \textrm{for all } i \in \{2, \ldots, N\}
p(s_i | s_{i-1}) = p(s_{i+1} | s_i) \quad \textrm{for all } i \in \{2, \ldots, N-1\}
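
Together, these two assumptions (the Markov property and stationarity) mean the hidden layer is fully described by an initial state distribution and a single transition matrix that is reused at every position. A minimal sketch with hypothetical BIO tags and made-up probabilities:

```python
import numpy as np

# Hypothetical tag set and numbers; A[i, j] = p(s_next = j | s_current = i).
states = ["B-PER", "I-PER", "O"]
A = np.array([[0.1, 0.6, 0.3],    # from B-PER
              [0.2, 0.4, 0.4],    # from I-PER
              [0.3, 0.0, 0.7]])   # from O (I-PER can never follow O)
pi = np.array([0.5, 0.0, 0.5])    # initial state distribution
assert np.allclose(A.sum(axis=1), 1.0)  # each row is a distribution

def p_states(seq):
    """Probability of a state sequence (a list of indices into `states`)."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev, cur]   # the same matrix A at every step (stationarity)
    return p
```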

The Observed Layer

p(o_i | s_1, s_2, \ldots, s_N) = p(o_i | s_i) \quad \textrm{for all } i \in \{1, 2, \ldots, N\}
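
Output independence adds one more table: an emission matrix giving p(word | tag). Continuing the sketch above (vocabulary and numbers again hypothetical), the three assumptions together reduce the full joint probability to a short product:

```python
# B[i, k] = p(o = vocab[k] | s = states[i]); each row is a distribution.
vocab = ["John", "lives", "in", "Paris"]
B = np.array([[0.8, 0.0, 0.0, 0.2],   # B-PER
              [0.5, 0.1, 0.1, 0.3],   # I-PER
              [0.1, 0.5, 0.3, 0.1]])  # O

def p_sequence(tags, words):
    """p(s, o) = pi[s_1] * prod_t A[s_{t-1}, s_t] * prod_t B[s_t, o_t]."""
    p = pi[tags[0]] * B[tags[0], words[0]]
    for t in range(1, len(tags)):
        p *= A[tags[t - 1], tags[t]] * B[tags[t], words[t]]
    return p

# "John/B-PER lives/O in/O Paris/O" under the toy model:
print(p_sequence([0, 2, 2, 2], [0, 1, 2, 3]))
```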

Representing Named Entities

Training
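
With tag-annotated sentences (e.g. CoNLL-2003), maximum-likelihood training of an HMM reduces to counting and normalizing transitions and emissions. A simplified sketch; a real implementation would also smooth the counts to handle unseen words:

```python
from collections import Counter, defaultdict

def normalize(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

def train_hmm(tagged_sentences):
    """tagged_sentences: list of sentences, each a list of (word, tag) pairs."""
    starts, trans, emits = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        starts[sentence[0][1]] += 1                    # initial-state counts
        for word, tag in sentence:
            emits[tag][word] += 1                      # emission counts
        for (_, prev), (_, cur) in zip(sentence, sentence[1:]):
            trans[prev][cur] += 1                      # transition counts
    return (normalize(starts),
            {t: normalize(c) for t, c in trans.items()},
            {t: normalize(c) for t, c in emits.items()})

pi, A, B = train_hmm([[("John", "B-PER"), ("lives", "O"),
                       ("in", "O"), ("Paris", "B-LOC")]])
```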

Inference
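
Given a trained model, inference means recovering the most probable tag sequence for a sentence, computed exactly by the Viterbi algorithm (Viterbi 1967; Jurafsky & Martin 2021), a dynamic program over the states-by-positions trellis. A log-space sketch that works with matrices shaped like the hypothetical ones above:

```python
import numpy as np

def viterbi(word_ids, pi, A, B):
    """Most likely state sequence for a list of word indices."""
    n_states, T = A.shape[0], len(word_ids)
    score = np.zeros((T, n_states))          # best log-prob of a path ending here
    back = np.zeros((T, n_states), dtype=int)
    with np.errstate(divide="ignore"):       # log(0) = -inf is acceptable here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    score[0] = log_pi + log_B[:, word_ids[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_A         # prev-state x cur-state scores
        back[t] = cand.argmax(axis=0)                # best predecessor per state
        score[t] = cand.max(axis=0) + log_B[:, word_ids[t]]
    path = [int(score[-1].argmax())]                 # best final state...
    for t in range(T - 1, 0, -1):                    # ...then follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```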

Results

[Interactive demo: name tag predictions by the HMM.]

Accuracy: 90.1%
Precision: 64.2%
Recall: 55.8%
F₁ Score: 59.7%

Precision

Tag  | No OOV | 1+ OOV
ORG  |  0.80  |  0.21
PER  |  0.85  |  0.62
LOC  |  0.87  |  0.06
MISC |  0.78  |  0.12
ALL  |  0.84  |  0.39

Recall

Tag  | No OOV | 1+ OOV
ORG  |  0.64  |  0.33
PER  |  0.58  |  0.59
LOC  |  0.71  |  0.05
MISC |  0.54  |  0.06
ALL  |  0.64  |  0.41

Limitations

By tag and entity length:

Tag  | Length 1 | Length 2 | Length 3 | Length 4 | Length 5
ORG  |   0.61   |   0.68   |   0.30   |   0.12   |   0.28
PER  |   0.96   |   0.67   |   0.70   |          |
LOC  |   0.88   |   0.36   |          |          |
MISC |   0.90   |   0.46   |   0.24   |   0.12   |
ALL  |   0.77   |   0.61   |   0.31   |   0.12   |   0.29
By entity length and the fraction of OOV words within the entity:

Length 1: 0.86 (OOV rate 0), 0.30 (OOV rate 1)
Length 2: 0.80 (0), 0.50 (0.5), 0.52 (1)
Length 3: 0.78 (0), 0.15 (0.33), 0.06 (0.66), 0 (1)
Length 4: 0.42 (0), 0.22 (0.25), 0 (0.5), 0 (0.75), 0 (1)
Length 5: 0.67 (0), then 0.22, 0, and 0.14 at higher OOV rates

III. Maximum Entropy Markov Models

Discriminative Structure

Modeling Word Features

b(o_t) = \begin{cases} 1 & \textrm{if $o_t$ has shape "Xxx"} \\ 0 & \textrm{otherwise} \end{cases}
f_{\langle b, s \rangle}(o_t, s_t) = \begin{cases} 1 & \textrm{if $b(o_t) = 1$ and $s_t = s$} \\ 0 & \textrm{otherwise} \end{cases}
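
A direct translation of these two definitions into Python, with a hypothetical shape test and target state:

```python
def has_shape_Xxx(word: str) -> bool:
    """b(o_t): true if the word has shape "Xxx" (initial capital, rest lowercase)."""
    return len(word) > 1 and word[0].isupper() and word[1:].islower()

def make_feature(b, s):
    """f_<b,s>(o_t, s_t): fires only when b fires on the word AND the tag is s."""
    return lambda word, tag: int(b(word) and tag == s)

f = make_feature(has_shape_Xxx, "B-LOC")
print(f("Paris", "B-LOC"))  # 1: shape and state both match
print(f("Paris", "O"))      # 0: state differs
print(f("UK", "B-LOC"))     # 0: shape differs ("XX", not "Xxx")
```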

State Transitions

P_{s'}(s | o) = \frac{1}{Z(o, s')} \exp\left(\sum_{a} \lambda_a \, f_a(o, s)\right)

Calculating P_O(B-LOC | 'UK')

Weights are retrieved from a MEMM trained on CoNLL-2003 data; numbers are rounded to 3 decimal places for clarity.
[Interactive table: each active feature-state pair a, with its weight λ_a, its feature value f_a, and their product λ_a·f_a.]

P_O(B-LOC | 'UK') = exp(Σ_a λ_a·f_a) / Z
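
The table's computation is a softmax over summed feature weights: score each candidate state, exponentiate, and normalize. A sketch with hypothetical stand-in weights rather than the trained CoNLL-2003 values:

```python
import math

def memm_prob(weights, featurize, word, prev_state, states):
    """P_{prev_state}(s | word); weights maps (feature_name, state) -> lambda_a."""
    def score(s):
        return sum(weights.get((name, s), 0.0)
                   for name in featurize(word, prev_state))
    Z = sum(math.exp(score(s)) for s in states)      # normalizer Z(o, s')
    return {s: math.exp(score(s)) / Z for s in states}

# Toy usage with made-up weights and a trivial featurizer:
weights = {("word=UK", "B-LOC"): 2.0, ("word=UK", "B-PER"): 0.5}
featurize = lambda word, prev: ["word=" + word, "prev=" + prev]
print(memm_prob(weights, featurize, "UK", "O", ["B-LOC", "B-PER", "O"]))
```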

Training & Inference
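
McCallum et al. fit the per-state distributions with Generalized Iterative Scaling, but any maximum-entropy fitter yields the same model family. A sketch that trains one multinomial logistic regression per previous state (scikit-learn here is this sketch's choice, not the paper's optimizer); decoding then runs a Viterbi-style search over the resulting conditionals:

```python
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def train_memm(examples):
    """examples: (prev_state, feature_dict, state) triples from labeled text."""
    by_prev = defaultdict(list)
    for prev, feats, state in examples:
        by_prev[prev].append((feats, state))
    models = {}
    for prev, rows in by_prev.items():
        vec = DictVectorizer()
        X = vec.fit_transform([feats for feats, _ in rows])
        y = [state for _, state in rows]  # needs >= 2 distinct states per prev
        models[prev] = (vec, LogisticRegression(max_iter=1000).fit(X, y))
    return models  # P_prev(s | o) via models[prev][1].predict_proba(...)
```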

Results

Most Informative Features for a Selected Previous State

Feature           | State | Weight
word='germany'    | B-LOC |  11.492
word='van'        | B-PER |   8.972
word='wall'       | B-ORG |   8.525
word='della'      | B-PER |   7.86
lowercase='della' | B-PER |   7.86
is_not_title_case | B-PER |  -6.949
word='de'         | B-PER |   6.781
shape='X.X.'      | O     |  -6.713
shape='xxxx'      | B-ORG |  -6.642
word='CLINTON'    | B-ORG |   6.456
[Interactive demo: name tag predictions by the MEMM.]

Accuracy: 93.1%
Precision: 72.9%
Recall: 63.5%
F₁ Score: 67.9%

Precision

Tag  | No OOV | 1+ OOV
ORG  |  0.81  |  0.36
PER  |  0.82  |  0.80
LOC  |  0.82  |  0.17
MISC |  0.74  |  0.14
ALL  |  0.80  |  0.54

Recall

Tag  | No OOV | 1+ OOV
ORG  |  0.68  |  0.12
PER  |  0.72  |  0.57
LOC  |  0.89  |  0.29
MISC |  0.78  |  0.02
ALL  |  0.79  |  0.37

Advantage Over HMMs

By entity length and the fraction of OOV words within the entity:

Length 1: 0.83 (OOV rate 0), 0.34 (OOV rate 1)
Length 2: 0.76 (0), 0.72 (0.5), 0.55 (1)
Length 3: 0.60 (0), 0.56 (0.33), 0.59 (0.66), 0.50 (1)
Length 4: 0.16 (0), then 0.20 and 0 at higher OOV rates
Length 5: 0.60 (0), 0.50 at a higher OOV rate
By tag and entity length:

Tag  | Length 1 | Length 2 | Length 3 | Length 4 | Length 5
ORG  |   0.76   |   0.69   |   0.84   |   0.36   |   0.80
PER  |   0.59   |   0.91   |   0.66   |   0.25   |
LOC  |   0.80   |   0.33   |   0.35   |   0      |
MISC |   0.82   |   0.57   |   0.29   |   0.18   |
ALL  |   0.77   |   0.70   |   0.58   |   0.16   |   0.43

References

  1. MUC-7 Named Entity Task Definition (Version 3.5)
    Nancy Chinchor. 1998. In Seventh Message Understanding Conference (MUC-7).
  2. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning
    Daphne Koller and Nir Friedman. 2009. The MIT Press.
  3. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
    Judea Pearl. 1988. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
  4. Genes, Themes and Microarrays: Using Information Retrieval for Large-Scale Gene Analysis
    Hagit Shatkay, Stephen Edwards, W. John Wilbur, and Mark Boguski. 2000. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 317–328.
  5. Information Extraction Using Hidden Markov Models
    Timothy Robert Leek. 1997. Master's Thesis, UC San Diego.
  6. Information Extraction with HMMs and Shrinkage
    Dayne Freitag and Andrew McCallum. 1999. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction (AAAI Technical Report WS-99-11), 31–36.
  7. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
    Lawrence R. Rabiner. 1989. Proceedings of the IEEE 77, 2: 257–286. https://doi.org/10.1109/5.18626
  8. An Algorithm that Learns What's in a Name
    Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. 1999. Machine Learning 34, 1: 211–231. https://doi.org/10.1023/A:1007558221122
  9. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm
    A. Viterbi. 1967. IEEE Transactions on Information Theory 13, 2: 260–269.
  10. Appendix A.4 – Decoding: The Viterbi Algorithm
    Daniel Jurafsky and James H. Martin. 2021. In Speech and Language Processing (3rd edition draft).
  11. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
    Erik F. Tjong Kim Sang and Fien De Meulder. 2003. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 142–147.
  12. Maximum Entropy Markov Models for Information Extraction and Segmentation
    Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML '00), 591–598.
  13. Maximum entropy models for antibody diversity
    Thierry Mora, Aleksandra M. Walczak, William Bialek, and Curtis G. Callan. 2010. Proceedings of the National Academy of Sciences 107, 12: 5405–5410. https://doi.org/10.1073/pnas.1001705107
  14. Human Behavior Modeling with Maximum Entropy Inverse Optimal Control
    Brian Ziebart, Andrew Maas, J. Bagnell, and Anind Dey. 2009. In Papers from the 2009 AAAI Spring Symposium, Technical Report SS-09-04, Stanford, California, USA, 92–97.
  15. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
    Andrew Ng and Michael Jordan. 2001. In Advances in Neural Information Processing Systems.
  16. Inducing Features of Random Fields
    S. Della Pietra, V. Della Pietra, and J. Lafferty. 1997. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 4: 380–393. https://doi.org/10.1109/34.588021