Named Entity Recognition
What
Extract and classify named entities from text: people, organizations, locations, dates, etc.
"Apple was founded by Steve Jobs in Cupertino"
ORG PERSON LOCATION
BIO tagging scheme
Standard approach for token-level NER. Each token gets a label:
- B-{TYPE}: Beginning of an entity
- I-{TYPE}: Inside an entity (continuation)
- O: Outside (no entity)
Apple → B-ORG
was → O
founded → O
by → O
Steve → B-PER
Jobs → I-PER
in → O
Cupertino → B-LOC
This lets models handle multi-token entities like “Steve Jobs” (B-PER, I-PER).
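Decoding BIO tags back into entity spans can be sketched in a few lines of pure Python (the helper name is mine, not from any library):

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (entity_text, type, start, end) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last entity
        # Close the open entity on O, a new B-, or a type-mismatched I-
        if start is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != etype):
            spans.append((" ".join(tokens[start:i]), etype, start, i))
            start, etype = None, None
        # Open a new entity on B-, or on a stray I- (treated as B-)
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

tokens = ["Apple", "was", "founded", "by", "Steve", "Jobs", "in", "Cupertino"]
tags = ["B-ORG", "O", "O", "O", "B-PER", "I-PER", "O", "B-LOC"]
bio_to_spans(tokens, tags)
# [('Apple', 'ORG', 0, 1), ('Steve Jobs', 'PER', 4, 6), ('Cupertino', 'LOC', 7, 8)]
```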
Common entity types (CoNLL-2003 / OntoNotes)
| Tag | Description | Examples |
|---|---|---|
| PER | Person | Elon Musk, Marie Curie |
| ORG | Organization | SpaceX, WHO, NATO |
| LOC | Location | Cupertino, Estonia, Mount Everest |
| DATE/TIME | Temporal | 2002, next Monday, 3 PM |
| MONEY | Monetary values | $5 million, €20 |
| PERCENT | Percentages | 10%, 50 percent |
| PRODUCT | Products | iPhone, Windows 95 |
| EVENT | Events | World War II, Olympics |
Approaches
1. Feature-based (CRF)
The traditional statistical approach using Conditional Random Fields. Relies on hand-engineered features: word shape, prefix/suffix, POS tags, gazetteers. Works well with limited data but requires careful feature design.
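A sketch of typical hand-crafted features for one token, in the style consumed by CRF libraries such as sklearn-crfsuite (the feature names are illustrative):

```python
def token_features(tokens, i):
    """Hand-crafted features for token i, as used in CRF-based NER."""
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),
        # Word shape: uppercase -> X, lowercase -> x, digit -> d
        "word.shape": "".join("X" if c.isupper() else "x" if c.islower()
                              else "d" if c.isdigit() else c for c in w),
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "is_title": w.istitle(),
        "BOS": i == 0,                   # beginning of sentence
        "EOS": i == len(tokens) - 1,     # end of sentence
    }
    if i > 0:  # previous-word context feature
        feats["prev.lower"] = tokens[i - 1].lower()
    return feats

token_features(["Apple", "was", "founded"], 0)
# {'word.lower': 'apple', 'word.shape': 'Xxxxx', 'prefix3': 'App', ...}
```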
2. BiLSTM-CRF
Bidirectional LSTM with a CRF layer on top. Captures sequential context in both directions, CRF models label dependencies (e.g., I-PER cannot follow B-LOC).
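The label-dependency constraint that the CRF layer learns (or hard-codes) can be sketched as a simple validity check on BIO transitions (illustrative helper, not a library API):

```python
def valid_bio_transition(prev, curr):
    """A CRF transition matrix can forbid label pairs like B-LOC -> I-PER."""
    if curr == "O" or curr.startswith("B-"):
        return True  # O and B-* may follow any label
    if curr.startswith("I-"):
        # I-X must continue an entity of the same type X
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return False

valid_bio_transition("B-LOC", "I-PER")  # False: cannot continue a LOC as PER
valid_bio_transition("B-PER", "I-PER")  # True
```

In a BiLSTM-CRF, forbidden transitions get a transition score of -inf, so Viterbi decoding can never emit an invalid label sequence.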
3. Fine-tuned transformer (BERT-based)
Current standard — fine-tune a pretrained language model:
from transformers import pipeline
model = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
model("Elon Musk founded SpaceX in 2002")
# [{'entity_group': 'PER', 'word': 'Elon Musk', 'score': 0.999},
#  {'entity_group': 'ORG', 'word': 'SpaceX', 'score': 0.998}]
For Spanish, Chinese, and other languages, use a multilingual model from Hugging Face:
# Multilingual NER
ner = pipeline("ner", model="Davlan/xlm-roberta-base-ner-hrl", aggregation_strategy="simple")4. Chinese NER
Chinese doesn’t use whitespace tokenization — requires word segmentation first (Jieba, pkuseg), or character-level models. BERT-based models with character input + word segmentation features perform best.
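The character-level alternative can be illustrated in plain Python: each character becomes a token, so no segmenter is needed at tagging time (the example sentence and labels are hand-written for illustration):

```python
# Character-level BIO labelling for Chinese: each character is one token,
# so no word segmenter (Jieba, pkuseg) is needed before tagging.
text = "马云创立了阿里巴巴"          # "Ma Yun founded Alibaba"
chars = list(text)                  # ['马', '云', '创', '立', '了', '阿', '里', '巴', '巴']
# Gold labels at the character level (hand-written, illustrative):
tags = ["B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG"]
assert len(chars) == len(tags)      # one label per character
```

The trade-off: character models avoid segmentation errors propagating into NER, but lose explicit word-boundary information, which is why hybrid character + segmentation-feature models do best.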
5. Nested NER
Standard BIO only handles flat, non-overlapping entities. For nested entities (e.g., in "Bank of China", "China" is a LOC inside the ORG "Bank of China"), use:
- Head-driven phrase structure
- Multi-task learning (NER + entity typing)
- Layered models (detect entity spans first, then classify)
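Span-based models sidestep BIO entirely: enumerate every candidate span up to a maximum width, then classify each span independently, so nested spans can each receive a label. A sketch of the enumeration step (function name is mine):

```python
def enumerate_spans(tokens, max_len=4):
    """All candidate (start, end) spans up to max_len tokens wide.
    A span classifier then labels each span, so overlapping/nested
    entities can coexist -- unlike flat BIO tagging."""
    return [(i, j)
            for i in range(len(tokens))
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1)]

enumerate_spans(["Bank", "of", "China"])
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
# Both (0, 3) "Bank of China" (ORG) and (2, 3) "China" (LOC) are candidates.
```

The cost is O(n * max_len) candidate spans per sentence, which is why a width cap is used in practice.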
Evaluation
Standard evaluation is entity-level (e.g., via seqeval): a predicted entity counts as correct only if both its boundary and its type match the gold annotation:
from seqeval.metrics import classification_report
y_true = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'O', 'O']]
y_pred = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'O', 'O']]
print(classification_report(y_true, y_pred))
Key metrics: F1 per entity type (PER, ORG, LOC), macro F1 (unweighted average over types), micro F1 (aggregated over all entities).
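Strict entity-level F1 can also be computed by hand, which makes the boundary-plus-type matching rule explicit (the function is a sketch of mine, not part of seqeval):

```python
def entity_f1(y_true, y_pred):
    """Strict entity-level F1: a prediction counts only if its span AND type
    exactly match a gold entity."""
    def extract(seqs):
        ents = set()
        for s, tags in enumerate(seqs):
            start = None
            for i, t in enumerate(tags + ["O"]):  # sentinel flushes last entity
                if start is not None and (t == "O" or t.startswith("B-")
                                          or t[2:] != tags[start][2:]):
                    ents.add((s, start, i, tags[start][2:]))
                    start = None
                if t.startswith("B-"):
                    start = i
        return ents
    gold, pred = extract(y_true), extract(y_pred)
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

y_true = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'O', 'O']]
entity_f1(y_true, y_true)  # 1.0
# Truncating "Steve Jobs" to "Steve" is a full miss, not partial credit:
entity_f1(y_true, [['O', 'B-PER', 'O', 'O'], ['B-ORG', 'O', 'O']])  # 0.5
```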
Datasets
| Dataset | Domain | Languages | Entity Types |
|---|---|---|---|
| CoNLL-2003 | News (Reuters) | English, German | PER, ORG, LOC, MISC |
| OntoNotes 5.0 | News, web, conversational | English, Chinese, Arabic | 18 types |
| WiNGPT | Medical | Chinese | Medical entity types |
| WikiNER | Wikipedia | 9 languages | PER, LOC, ORG |
| MultiCoNER | Web, queries | 12 languages | 33 fine-grained types (v2) |
Practical guidance
Data size: 1,000–10,000 labeled sentences for good BERT fine-tuning performance.
Annotation quality: Entity boundaries are critical. Use double-annotation with adjudication. Entity type consistency matters — define clear guidelines (e.g., “Tesla the car company” = ORG, “tesla the unit” = no entity).
Domain shift: NER models trained on news struggle with social media (casual language, abbreviations, memes). Use domain-adapted models or fine-tune on in-domain data.
Augmentation: Back-translation, synonym replacement, contextual augmentation — these work for NER but matter less than the quantity of labeled data.
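One NER-specific augmentation is mention replacement: swap each entity for another mention of the same type and relabel to fit the new length. A sketch (the lexicon and function name are illustrative):

```python
import random

def swap_entities(tokens, tags, lexicon, seed=0):
    """Mention replacement: substitute each entity with a same-type mention
    from a small lexicon, re-emitting B-/I- tags to fit the new length."""
    rng = random.Random(seed)
    out_toks, out_tags, i = [], [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1                      # consume the whole entity span
            repl = rng.choice(lexicon[etype]).split()
            out_toks += repl
            out_tags += ["B-" + etype] + ["I-" + etype] * (len(repl) - 1)
            i = j
        else:
            out_toks.append(tokens[i]); out_tags.append(tags[i]); i += 1
    return out_toks, out_tags

lexicon = {"PER": ["Marie Curie"], "ORG": ["WHO"]}
swap_entities(["Steve", "Jobs", "left", "Apple"],
              ["B-PER", "I-PER", "O", "B-ORG"], lexicon)
# (['Marie', 'Curie', 'left', 'WHO'], ['B-PER', 'I-PER', 'O', 'B-ORG'])
```

This preserves label structure exactly, so the augmented sentences need no re-annotation.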
Key papers
- Bidirectional LSTM-CRF Models for Sequence Tagging (Huang et al., 2015)
- Neural Architectures for Named Entity Recognition (Lample et al., 2016)
- Deep Contextualized Word Representations (Peters et al., 2018) — ELMo
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019) — arXiv:1810.04805
Links
- BERT and Masked Language Models
- Text Preprocessing
- NLP Roadmap
- Tokenization — word vs subword tokenization matters for NER