Named Entity Recognition

What

Extract and classify named entities from text: people, organizations, locations, dates, etc.

"Apple was founded by Steve Jobs in Cupertino"
 ORG                  PERSON       LOCATION

BIO tagging scheme

Standard approach for token-level NER. Each token gets a label:

  • B-{TYPE}: Beginning of an entity
  • I-{TYPE}: Inside an entity (continuation)
  • O: Outside (no entity)

Apple     → B-ORG
was       → O
founded   → O
by        → O
Steve     → B-PER
Jobs      → I-PER
in        → O
Cupertino → B-LOC

This lets models handle multi-token entities like “Steve Jobs” (B-PER, I-PER).
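The conversion from entity spans to BIO labels can be sketched in a few lines (the function name and span format here are illustrative, not from any library):

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, type) token spans to BIO labels.

    `spans` uses half-open token ranges: (start, end) covers tokens[start:end].
    """
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"           # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"           # continuation tokens
    return labels

tokens = ["Apple", "was", "founded", "by", "Steve", "Jobs", "in", "Cupertino"]
spans = [(0, 1, "ORG"), (4, 6, "PER"), (7, 8, "LOC")]
print(spans_to_bio(tokens, spans))
# ['B-ORG', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'B-LOC']
```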

Common entity types (CoNLL-2003 / OntoNotes)

Tag        Description      Examples
PER        Person           Elon Musk, Marie Curie
ORG        Organization     SpaceX, WHO, NATO
LOC        Location         Cupertino, Estonia, Mount Everest
DATE/TIME  Temporal         2002, next Monday, 3 PM
MONEY      Monetary values  $5 million, €20
PERCENT    Percentages      10%, 50 percent
PRODUCT    Products         iPhone, Windows 95
EVENT      Events           World War II, Olympics

Approaches

1. Feature-based (CRF)

Traditional approach using Conditional Random Fields. Feature engineering: word shape, prefix/suffix, POS tags, gazetteers. Works well with limited data but requires careful feature design.
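The feature engineering above can be sketched as a per-token feature function in the style that sklearn-crfsuite expects (function and feature names here are illustrative):

```python
import re

def word_shape(token):
    """Map a token to its shape: uppercase->X, lowercase->x, digit->d.
    The 'short' shape collapses repeated characters (a common CRF feature)."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    short = re.sub(r"(.)\1+", r"\1", shape)   # "Xxxxx" -> "Xx"
    return shape, short

def token_features(tokens, i):
    """Hand-crafted features for token i: shape, affixes, context word."""
    tok = tokens[i]
    shape, short = word_shape(tok)
    return {
        "word.lower": tok.lower(),
        "shape": shape,
        "short_shape": short,
        "prefix3": tok[:3],
        "suffix3": tok[-3:],
        "is_title": tok.istitle(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
    }

print(token_features(["Apple", "Inc."], 0)["shape"])  # Xxxxx
```

In a real CRF pipeline these dicts would also include POS tags and gazetteer lookups, which need external resources.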

2. BiLSTM-CRF

Bidirectional LSTM with a CRF layer on top. The BiLSTM captures sequential context in both directions, while the CRF layer models label dependencies (e.g., I-PER cannot follow B-LOC).
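The label-dependency constraint can be written down explicitly; in a CRF it is enforced by assigning illegal transitions a score of negative infinity (a sketch of the rule, not library code):

```python
def valid_bio_transition(prev, curr):
    """CRF-style hard constraint: I-X may only follow B-X or I-X."""
    if curr.startswith("I-"):
        etype = curr[2:]
        return prev in (f"B-{etype}", f"I-{etype}")
    return True  # O and B-* labels are valid after anything

print(valid_bio_transition("B-PER", "I-PER"))  # True
print(valid_bio_transition("B-LOC", "I-PER"))  # False
print(valid_bio_transition("O", "I-ORG"))      # False
```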

3. Fine-tuned transformer (BERT-based)

Current standard — fine-tune a pretrained language model:

from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
 
model = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
model("Elon Musk founded SpaceX in 2002")
# [{'entity_group': 'PER', 'word': 'Elon Musk', 'score': 0.999},
#  {'entity_group': 'ORG', 'word': 'SpaceX', 'score': 0.998}]

For Spanish, Chinese, and other languages: use a multilingual model from Hugging Face, such as the XLM-RoBERTa NER checkpoints:

# Multilingual NER
ner = pipeline("ner", model="Davlan/xlm-roberta-base-ner-hrl", aggregation_strategy="simple")

4. Chinese NER

Chinese doesn’t use whitespace tokenization — requires word segmentation first (Jieba, pkuseg), or character-level models. BERT-based models with character input + word segmentation features perform best.
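A minimal sketch of projecting word-level BIO labels onto characters, as character-level Chinese models require (the segmentation itself would come from a tool like Jieba; here the words are given):

```python
def words_to_char_bio(words, labels):
    """Project word-level BIO labels onto individual characters."""
    char_labels = []
    for word, label in zip(words, labels):
        if label == "O":
            char_labels.extend(["O"] * len(word))
        else:
            etype = label[2:]
            first = label if label.startswith("I-") else f"B-{etype}"
            char_labels.append(first)                       # first character
            char_labels.extend([f"I-{etype}"] * (len(word) - 1))  # the rest
    return char_labels

# "马云创立了阿里巴巴" segmented as: 马云 / 创立 / 了 / 阿里巴巴
words  = ["马云", "创立", "了", "阿里巴巴"]
labels = ["B-PER", "O", "O", "B-ORG"]
print(words_to_char_bio(words, labels))
# ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']
```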

5. Nested/overlapping NER

Standard BIO only handles flat entities. For nested or overlapping entities (e.g., in “Bank of China”, “China” is a LOC nested inside the ORG “Bank of China”), use:

  • Head-driven phrase structure
  • Multi-task learning (NER + entity typing)
  • Layered models (detect entity spans first, then classify)
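Layered and span-based models rest on enumerating candidate spans and scoring each one independently, which is what allows overlap; a minimal sketch of the enumeration step:

```python
def enumerate_spans(tokens, max_len=4):
    """Enumerate all candidate (start, end, text) spans up to max_len tokens.
    A span-based nested-NER model would score every span separately,
    so overlapping entities can both be predicted."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            spans.append((start, end, tokens[start:end]))
    return spans

spans = enumerate_spans(["Bank", "of", "China"], max_len=3)
# Candidates include both the ORG "Bank of China" and the nested LOC "China"
print([" ".join(t) for _, _, t in spans])
```

The quadratic number of candidates is why real systems cap span length or prune with a cheap filter first.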

Evaluation

Standard evaluation is entity-level: a predicted entity counts as correct only if both its boundaries and its type match the gold annotation. The seqeval library implements this:

from seqeval.metrics import classification_report
 
# Inputs are per-sentence BIO label sequences, as produced by any tagger
y_true = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'O', 'O']]
y_pred = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'O', 'O']]
 
print(classification_report(y_true, y_pred))

Key metrics: F1 per entity type (PER, ORG, LOC), macro F1 (unweighted average over types), micro F1 (pooled over all entities; the usual headline number).
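What seqeval computes can be reproduced by hand: extract entities from each BIO sequence and require exact boundary-plus-type matches (a sketch; seqeval additionally handles scheme variants such as BIOES):

```python
def bio_to_entities(labels):
    """Extract (start, end, type) entities from a BIO label sequence."""
    entities, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel flushes the last entity
        if lab.startswith("I-") and etype == lab[2:]:
            continue                           # entity continues
        if start is not None:
            entities.append((start, i, etype))
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]
    return entities

def entity_f1(y_true, y_pred):
    """Micro F1 over entities: a prediction counts only if boundaries
    AND type match exactly."""
    tp = fp = fn = 0
    for t_seq, p_seq in zip(y_true, y_pred):
        gold = set(bio_to_entities(t_seq))
        pred = set(bio_to_entities(p_seq))
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

y_true = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'O', 'O']]
y_pred = [['O', 'B-PER', 'O', 'O'],     ['B-ORG', 'O', 'O']]
print(entity_f1(y_true, y_pred))  # 0.5: predicting 'Steve' alone misses the boundary
```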

Datasets

Dataset        Domain                     Languages                 Entity Types
CoNLL-2003     News (Reuters)             English                   PER, ORG, LOC, MISC
OntoNotes 5.0  News, web, conversational  English, Chinese, Arabic  18 types
WiNGPT         Chinese medical            Chinese                   Medical
WikiNER        Wikipedia                  9 languages               PER, LOC, ORG
MultiCoNER     Multilingual               34 languages              33 types

Practical guidance

Data size: 1,000–10,000 labeled sentences for good BERT fine-tuning performance.

Annotation quality: Entity boundaries are critical. Use double-annotation with adjudication. Entity type consistency matters — define clear guidelines (e.g., “Tesla the car company” = ORG, “tesla the unit” = no entity).

Domain shift: NER models trained on news struggle with social media (casual language, abbreviations, memes). Use domain-adapted models or fine-tune on in-domain data.

Augmentation: Back-translation, synonym replacement, and contextual augmentation all help for NER, but they matter less than the sheer quantity of labeled data.
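Mention replacement is a simple NER-specific augmentation: swap each entity for another mention of the same type, keeping labels consistent. A sketch (the mention bank would normally be harvested from the training set; the values here are made up for illustration):

```python
import random

def mention_replacement(tokens, labels, mention_bank, seed=0):
    """Replace each entity mention with a random same-type mention.
    `mention_bank` maps entity type -> list of replacement token lists."""
    rng = random.Random(seed)
    out_toks, out_labs, i = [], [], 0
    while i < len(tokens):
        if labels[i].startswith("B-"):
            etype = labels[i][2:]
            j = i + 1
            while j < len(tokens) and labels[j] == f"I-{etype}":
                j += 1                          # consume the whole mention
            new = rng.choice(mention_bank[etype])
            out_toks.extend(new)
            out_labs.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(new) - 1))
            i = j
        else:
            out_toks.append(tokens[i])
            out_labs.append(labels[i])
            i += 1
    return out_toks, out_labs

bank = {"PER": [["Marie", "Curie"]], "ORG": [["NASA"]]}
toks = ["Elon", "Musk", "founded", "SpaceX"]
labs = ["B-PER", "I-PER", "O", "B-ORG"]
print(mention_replacement(toks, labs, bank))
# (['Marie', 'Curie', 'founded', 'NASA'], ['B-PER', 'I-PER', 'O', 'B-ORG'])
```

Note the label sequence is rebuilt to match the new mention's length, so the augmented sentence stays correctly annotated.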

Key papers

  • Bidirectional LSTM-CRF Models for Sequence Tagging (Huang et al., 2015)
  • Neural Architectures for Named Entity Recognition (Lample et al., 2016)
  • Deep Contextualized Word Representations (Peters et al., 2018) — ELMo for NER
  • BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019) — arXiv:1810.04805