Learning What’s in a Name with Graphical Models

An overview of the structure and statistical assumptions behind linear-chain graphical models — effective and interpretable approaches to sequence labeling problems such as named entity recognition.

Which of these words refer to a named entity?


I. Probabilistic Graphical Models

Factorizing Joint Distributions

p(a, b) = p(a) \cdot p(b|a)

Notation: the shorthand p(a) means p(A = a), that is, the probability of variable A taking value a.

p(a, b, c, d) = p(a) \cdot p(b|a) \cdot p(c) \cdot p(d|a, b, c)
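To make the factorization concrete, the sketch below multiplies two made-up conditional probability tables for binary variables and checks that the products still form a valid joint distribution. The numbers are invented for illustration only.

```python
# Chain-rule factorization p(a, b) = p(a) * p(b | a) over two binary variables.
# The probability tables below are made up purely for illustration.
p_a = {0: 0.7, 1: 0.3}                      # p(A = a)
p_b_given_a = {0: {0: 0.9, 1: 0.1},         # p(B = b | A = 0)
               1: {0: 0.4, 1: 0.6}}         # p(B = b | A = 1)

def p_joint(a, b):
    """Joint probability recovered by multiplying the factors."""
    return p_a[a] * p_b_given_a[a][b]

# The factored terms still sum to 1 over all assignments, as a joint must.
assert abs(sum(p_joint(a, b) for a in (0, 1) for b in (0, 1)) - 1.0) < 1e-9
print(p_joint(1, 0))  # p(A=1, B=0) = 0.3 * 0.4 = 0.12
```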

Directed Acyclic Graphs

Sampled Population

II. Hidden Markov Models

The Hidden Layer

p(s_i | s_1, s_2, \ldots, s_{i-1}) = p(s_i | s_{i-1}) \quad \text{for all } i \in \{2, \ldots, N\}
p(s_i | s_{i-1}) = p(s_{i+1} | s_i) \quad \text{for all } i \in \{2, \ldots, N-1\}

The Observed Layer

p(o_i | s_1, s_2, \ldots, s_N) = p(o_i | s_i) \quad \text{for all } i \in \{1, 2, \ldots, N\}
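Taken together, the Markov, stationarity, and output-independence assumptions let the joint probability of a tag sequence and a sentence factor into one transition term and one emission term per position. Here is a minimal sketch with made-up tags, words, and probabilities; none of these values come from the trained model discussed below.

```python
# Toy HMM; all tags, words, and probabilities are hypothetical.
start = {"O": 0.8, "B-LOC": 0.2}                         # p(s_1)
trans = {"O":     {"O": 0.9, "B-LOC": 0.1},              # p(s_i | s_{i-1})
         "B-LOC": {"O": 0.7, "B-LOC": 0.3}}
emit = {"O":     {"in": 0.6, "the": 0.35, "UK": 0.05},   # p(o_i | s_i)
        "B-LOC": {"in": 0.05, "the": 0.05, "UK": 0.9}}

def joint(states, words):
    """p(s_1..s_N, o_1..o_N) under the HMM factorization."""
    p = start[states[0]] * emit[states[0]][words[0]]
    for prev, cur, word in zip(states, states[1:], words[1:]):
        p *= trans[prev][cur] * emit[cur][word]
    return p

print(joint(["O", "B-LOC"], ["in", "UK"]))  # 0.8 * 0.6 * 0.1 * 0.9 = 0.0432
```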

Representing Named Entities

Representation: rather than labeling each node with the name of the variable it represents (X₁, Y₁), as we have up to this point, we'll instead display the value of that variable (“O”, “Great”). This makes the graphs easier to read.

Training

Inference

Results

Name tag predictions by HMM:


Accuracy: 90.1%
Precision: 64.2%
Recall: 55.8%
F₁ Score: 59.7%
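The F₁ score is the harmonic mean of precision and recall, so the headline figure follows directly from the two numbers above:

F_1 = \frac{2 \cdot P \cdot R}{P + R} = \frac{2 \cdot 0.642 \cdot 0.558}{0.642 + 0.558} \approx 0.597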

Precision

Tag     No OOV    1+ OOV
ORG     0.8       0.21
PER     0.85      0.62
LOC     0.87      0.06
MISC    0.78      0.12
ALL     0.84      0.39

Recall

Tag     No OOV    1+ OOV
ORG     0.64      0.33
PER     0.58      0.59
LOC     0.71      0.05
MISC    0.54      0.06
ALL     0.64      0.41

Limitations

Tag     Entity Length 1   Entity Length 2   Entity Length 3   Entity Length 4   Entity Length 5
ORG     0.61              0.68              0.3               0.12              0.28
PER     0.96              0.67              0.7               –                 –
LOC     0.88              0.36              –                 –                 –
MISC    0.9               0.46              0.24              0.12              –
ALL     0.77              0.61              0.31              0.12              0.29

                OOV Rate
Entity Length   0       0.2     0.25    0.33    0.4     0.5     0.6     0.66    0.75    0.8     1
1               0.86    –       –       –       –       –       –       –       –       –       0.3
2               0.8     –       –       –       –       0.5     –       –       –       –       0.52
3               0.78    –       –       0.15    –       –       –       0.06    –       –       0
4               0.42    –       0.22    –       –       0       –       –       0       –       0
5               0.67    0.22    –       –       0       –       0.14    –       –       –       –

III. Maximum Entropy Markov Models

Discriminative Structure

Word Features

b(o_t) = \begin{cases} 1 & \text{if } o_t \text{ has shape “Xxx”} \\ 0 & \text{otherwise} \end{cases}
f_{\langle b, s \rangle}(o_t, s_t) = \begin{cases} 1 & \text{if } b(o_t) = 1 \text{ and } s_t = s \\ 0 & \text{otherwise} \end{cases}
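A small sketch of how these binary features behave. The `shape` and `b` implementations here are simplified stand-ins (the text above only specifies that b fires on the title-case shape “Xxx”), and the state name B-LOC is just an example:

```python
def shape(word):
    """Coarse word shape: uppercase -> 'X', lowercase -> 'x' (e.g. 'Great' -> 'Xxxxx')."""
    return "".join("X" if ch.isupper() else "x" if ch.islower() else ch for ch in word)

def b(o_t):
    """Observation predicate: 1 if the word looks title-case ('X' followed by 'x's)."""
    s = shape(o_t)
    return 1 if s[:1] == "X" and s[1:] == "x" * (len(s) - 1) else 0

def f_b_s(o_t, s_t, s="B-LOC"):
    """Feature f_<b, s>: fires only when b(o_t) = 1 AND the state s_t equals s."""
    return 1 if b(o_t) == 1 and s_t == s else 0

print(f_b_s("Great", "B-LOC"))  # 1: title-case word paired with state B-LOC
print(f_b_s("great", "B-LOC"))  # 0: predicate b does not fire
print(f_b_s("Great", "O"))      # 0: predicate fires, but the state does not match
```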

State Transitions

p_{s'}(s | o) = \frac{1}{Z(o, s')} \exp\left(\sum_{a}{\lambda_a \, f_a(o, s)}\right)

Calculating p_O(B-LOC | “UK”)

With weights retrieved from a MEMM trained on CoNLL-2003 data. Numbers are rounded to 3 decimal places for clarity.
Each feature-state pair a contributes a term λ_a f_a; summing these terms, exponentiating, and normalizing gives

p_O(\text{B-LOC} \mid \text{“UK”}) = \frac{1}{Z(\text{“UK”}, \text{O})}\exp\left(\sum_{a}{\lambda_a \, f_a}\right)
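The sketch below implements the transition distribution defined above: fix the previous state s′, sum the weights of the features that fire for the observation and each candidate next state, exponentiate, and normalize. The feature set and weights are hypothetical placeholders rather than the trained CoNLL-2003 weights used in the calculation above.

```python
import math

# Hypothetical weights lambda_a for features conditioned on previous state s' = "O".
# Keys are (observation_predicate, current_state) pairs.
weights = {
    ("shape=Xx", "B-LOC"): 2.0,
    ("shape=Xx", "O"): -1.0,
    ("word=uk", "B-LOC"): 3.0,
}

def predicates(word):
    """Binary observation predicates that fire for this word (illustrative only)."""
    fired = set()
    if word[:1].isupper():
        fired.add("shape=Xx")
    if word.lower() == "uk":
        fired.add("word=uk")
    return fired

def p_next(word, states=("O", "B-LOC", "B-PER")):
    """p_{s'}(s | o) = exp(sum_a lambda_a * f_a(o, s)) / Z(o, s')."""
    scores = {s: math.exp(sum(weights.get((a, s), 0.0) for a in predicates(word)))
              for s in states}
    z = sum(scores.values())                 # Z(o, s') normalizes over next states
    return {s: v / z for s, v in scores.items()}

print(p_next("UK"))  # most of the probability mass lands on B-LOC
```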

Training & Inference

Results

Most Informative Features when Previous State is …

Current Word Feature    Current State   Weight
word='germany'          B-LOC           11.492
word='van'              B-PER           8.972
word='wall'             B-ORG           8.525
word='della'            B-PER           7.86
lowercase='della'       B-PER           7.86
is_not_title_case       B-PER           -6.949
word='de'               B-PER           6.781
shape='X.X.'            O               -6.713
shape='xxxx'            B-ORG           -6.642
word='CLINTON'          B-ORG           6.456

Name tag predictions by MEMM:


Accuracy: 93.1%
Precision: 72.9%
Recall: 63.5%
F₁ Score: 67.9%

Precision

Tag     No OOV    1+ OOV
ORG     0.81      0.36
PER     0.82      0.8
LOC     0.82      0.17
MISC    0.74      0.14
ALL     0.8       0.54

Recall

Tag     No OOV    1+ OOV
ORG     0.68      0.12
PER     0.72      0.57
LOC     0.89      0.29
MISC    0.78      0.02
ALL     0.79      0.37

Advantage Over HMMs

                OOV Rate
Entity Length   0       0.2     0.25    0.33    0.4     0.5     0.6     0.66    0.75    0.8     1
1               0.83    –       –       –       –       –       –       –       –       –       0.34
2               0.76    –       –       –       –       0.72    –       –       –       –       0.55
3               0.6     –       –       0.56    –       –       –       0.59    –       –       0.5
4               0.16    –       0.2     –       –       0       –       –       –       –       –
5               0.6     0.5     –       –       –       –       –       –       –       –       –

Tag     Entity Length 1   Entity Length 2   Entity Length 3   Entity Length 4   Entity Length 5
ORG     0.76              0.69              0.84              0.36              0.8
PER     0.59              0.91              0.66              0.25              –
LOC     0.8               0.33              0.35              0                 –
MISC    0.82              0.57              0.29              0.18              –
ALL     0.77              0.7               0.58              0.16              0.43

Label Bias Problem

IV. Conditional Random Fields

Markov Random Fields

\phi(X, Y) = \begin{cases} 3 & \text{if } X = 1 \text{ and } Y = 1 \\ 2 & \text{if } X = 0 \text{ and } Y = 0 \\ 1 & \text{otherwise} \end{cases}
p(A, B, C) = \frac{1}{Z}\,\phi(A, B)\,\phi(B, C)\,\phi(C, A) \quad \text{where } Z \text{ is a normalization factor}
p(x_1, x_2, \ldots) = \frac{1}{Z}\prod_{c \in C}{\phi_c(x_c)} \quad \text{where } Z \text{ is a normalization factor and } C \text{ is the set of cliques in } \mathcal{G}

Calculating p(A, B, C, D, E, F)

p(1, 1, 1, 0, 0, 0)

= \frac{1}{Z} \cdot \phi_{ABC}(1,1,1) \cdot \phi_{AB}(1,1) \cdot \phi_{BC}(1,1) \cdot \phi_{AC}(1,1) \cdot \phi_{CD}(1,0) \cdot \phi_{DEF}(0,0,0) \cdot \phi_{DE}(0,0) \cdot \phi_{EF}(0,0) \cdot \phi_{DF}(0,0)

= \frac{1}{28{,}915} \cdot 3 \cdot 3 \cdot 3 \cdot 3 \cdot 1 \cdot 2 \cdot 2 \cdot 2 \cdot 2

\approx 0.0448

where every factor shares the same definition:

\phi_{ABC} = \phi_{AB} = \cdots = \phi(\mathbf{x}) = \begin{cases} 3 & \text{if } x_1 = x_2 = \cdots = 1 \\ 2 & \text{if } x_1 = x_2 = \cdots = 0 \\ 1 & \text{otherwise} \end{cases}
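The arithmetic above can be reproduced by brute force: enumerate all 2⁶ assignments of (A, …, F), multiply the nine clique potentials, sum to get Z, and normalize. A short sketch using the shared potential defined above:

```python
from itertools import product

def phi(*xs):
    """Shared potential: 3 if every argument is 1, 2 if every argument is 0, else 1."""
    if all(x == 1 for x in xs):
        return 3
    if all(x == 0 for x in xs):
        return 2
    return 1

def unnormalized(a, b, c, d, e, f):
    """Product of the potentials over the cliques of the two triangles joined by C-D."""
    return (phi(a, b, c) * phi(a, b) * phi(b, c) * phi(a, c)
            * phi(c, d)
            * phi(d, e, f) * phi(d, e) * phi(e, f) * phi(d, f))

Z = sum(unnormalized(*xs) for xs in product((0, 1), repeat=6))
print(Z)                                     # 28915
print(unnormalized(1, 1, 1, 0, 0, 0) / Z)    # ~0.0448
```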

Conditional Form

p(y | x) = \frac{1}{Z(x)}\prod_{c \in C}{\phi_c(y_c, x_c)} \quad \text{where } Z(x) \text{ is a normalization factor and } C \text{ is the set of cliques in the graph } \mathcal{G} \text{ representing the labels } y
Linear-chain CRF where the hidden layer depends on the current, previous, and future observations.

Exponential Factors

\phi_c(y_c, x_c) = \exp\left(\sum_{a}{\lambda_a \, f_a(y_c, x_c)}\right) \quad \text{where } f_a \text{ is a feature function defined for clique } c \text{ and } \lambda_a \text{ is the weight parameter for } f_a
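Putting the conditional form and the exponential factors together gives a complete, if tiny, linear-chain CRF: each clique contributes the exponential of its weighted features, and Z(x) sums the resulting scores over every possible label sequence. The labels, features, and weights below are invented for illustration.

```python
import math
from itertools import product

LABELS = ("O", "B-LOC")

# Hypothetical weights: observation features pair a label with "is the word
# capitalized?", transition features pair consecutive labels.
w_obs = {("B-LOC", True): 2.0, ("O", False): 1.0}
w_trans = {("O", "O"): 0.5, ("O", "B-LOC"): 0.2}

def score(labels, words):
    """Sum over cliques of lambda_a * f_a(y_c, x_c); exp(score) is the factor product."""
    s = sum(w_obs.get((y, w[:1].isupper()), 0.0) for y, w in zip(labels, words))
    s += sum(w_trans.get(pair, 0.0) for pair in zip(labels, labels[1:]))
    return s

def p_conditional(labels, words):
    """p(y | x) = exp(score(y, x)) / Z(x), with Z(x) summed over all labelings of x."""
    z = sum(math.exp(score(ys, words)) for ys in product(LABELS, repeat=len(words)))
    return math.exp(score(labels, words)) / z

print(p_conditional(("O", "B-LOC"), ["in", "UK"]))  # ~0.66, the highest-scoring labeling
```

Enumerating every labeling to compute Z(x) is only feasible for toy examples like this one; over a chain, the same sum is computed efficiently with dynamic programming.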

References

  1. MUC-7 Named Entity Task Definition (Version 3.5) PDF
    Nancy Chinchor. 1998. In Seventh Message Understanding Conference (MUC-7).
  2. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning
    Daphne Koller and Nir Friedman. 2009. The MIT Press.
  3. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
    Judea Pearl. 1988. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA.
  4. Genes, Themes and Microarrays: Using Information Retrieval for Large-Scale Gene Analysis
    Hagit Shatkay, Stephen Edwards, W John Wilbur, and Mark Boguski. 2000. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 317–328.
  5. Information Extraction Using Hidden Markov Models
    Timothy Robert Leek. 1997. Master’s Thesis, UC San Diego.
  6. Information Extraction with HMMs and Shrinkage PDF
    Dayne Freitag and Andrew McCallum. 1999. In Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction (AAAI Technical Report WS-99-11), 31–36.
  7. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition
    Lawrence R Rabiner. 1989. Proceedings of the IEEE 77, 2: 257–286. https://doi.org/10.1109/5.18626
  8. An Algorithm that Learns What’s in a Name PDF
    Daniel M. Bikel, Richard Schwartz, and Ralph M. Weischedel. 1999. Machine Learning 34, 1: 211–231. https://doi.org/10.1023/A:1007558221122
  9. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm
    A. Viterbi. 1967. IEEE Transactions on Information Theory 13, 2: 260–269.
  10. Appendix A.4 — Decoding: The Viterbi Algorithm PDF
    Daniel Jurafsky and James H. Martin. 2021. In Speech and Language Processing. 8–10.
  11. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition PDF
    Erik F. Tjong Kim Sang and Fien De Meulder. 2003. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 142–147.
  12. Maximum Entropy Markov Models for Information Extraction and Segmentation PDF
    Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML ’00), 591–598.
  13. Maximum Entropy Models for Antibody Diversity Link
    Thierry Mora, Aleksandra M. Walczak, William Bialek, and Curtis G. Callan. 2010. Proceedings of the National Academy of Sciences 107, 12: 5405–5410. https://doi.org/10.1073/pnas.1001705107
  14. Human Behavior Modeling with Maximum Entropy Inverse Optimal Control PDF
    Brian Ziebart, Andrew Maas, J. Bagnell, and Anind Dey. 2009. In Papers from the 2009 AAAI Spring Symposium, Technical Report SS-09-04, Stanford, California, USA, 92–97.
  15. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes PDF
    Andrew Ng and Michael Jordan. 2001. In Advances in Neural Information Processing Systems.
  16. Inducing Features of Random Fields
    S. Della Pietra, V. Della Pietra, and J. Lafferty. 1997. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 4: 380–393. https://doi.org/10.1109/34.588021
  17. Une Approche théorique de l’Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole
    Léon Bottou. 1991. Université de Paris X.
  18. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data PDF
    John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML ’01), 282–289.
  19. The Label Bias Problem Link
    Awni Hannun. 2019. Awni Hannun — Writing About Machine Learning.
  20. Discriminative Probabilistic Models for Relational Data Link
    Ben Taskar, Pieter Abbeel, and Daphne Koller. 2013. https://doi.org/10.48550/ARXIV.1301.0604
  21. Accurate Information Extraction from Research Papers using Conditional Random Fields Link
    Fuchun Peng and Andrew McCallum. 2004. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, 329–336.
  22. Discriminative Fields for Modeling Spatial Dependencies in Natural Images PDF
    Sanjiv Kumar and Martial Hebert. 2003. In Advances in Neural Information Processing Systems.
  23. Multiscale Conditional Random Fields for Image Labeling Link
    Xuming He, R.S. Zemel, and M.A. Carreira-Perpinan. 2004. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, CVPR 2004, II–II. https://doi.org/10.1109/CVPR.2004.1315232
  24. Conditional Random Fields as Recurrent Neural Networks Link
    Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. 2015. In 2015 IEEE International Conference on Computer Vision (ICCV). https://doi.org/10.1109/iccv.2015.179
  25. Convolutional CRFs for Semantic Segmentation Link
    Marvin T. T. Teichmann and Roberto Cipolla. 2018. https://doi.org/10.48550/arxiv.1805.04777
  26. RNA Secondary Structural Alignment with Conditional Random Fields Link
    Kengo Sato and Yasubumi Sakakibara. 2005. Bioinformatics 21: ii237–ii242. https://doi.org/10.1093/bioinformatics/bti1139
  27. Protein Fold Recognition Using Segmentation Conditional Random Fields (SCRFs)
    Yan Liu, Jaime Carbonell, Peter Weigele, and Vanathi Gopalakrishnan. 2006. J. Comput. Biol. 13, 2: 394–406.
  28. Introduction to Markov Random Fields
    Andrew Blake and Pushmeet Kohli. 2011. In Markov Random Fields for Vision and Image Processing. The MIT Press.