Named Entity Recognition
What
Extract and classify named entities from text: people, organizations, locations, dates, etc.
"Apple was founded by Steve Jobs in Cupertino"
ORG PERSON LOCATION
BIO tagging scheme
Standard approach for token-level NER. Each token gets a label:
- B-{TYPE}: Beginning of an entity
- I-{TYPE}: Inside an entity (continuation)
- O: Outside (no entity)
Apple → B-ORG
was → O
founded → O
by → O
Steve → B-PER
Jobs → I-PER
in → O
Cupertino → B-LOC
This lets models handle multi-token entities like “Steve Jobs” (B-PER, I-PER).
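Decoding BIO tags back into entity spans can be sketched in a few lines of pure Python (the helper name is mine, not from any library):

```python
def bio_to_spans(tokens, tags):
    """Decode BIO tags into (entity_text, type, start, end) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last entity
        # Close the open entity on O, a new B-, or a type-mismatched I-
        if start is not None and (tag == "O" or tag.startswith("B-")
                                  or tag[2:] != etype):
            spans.append((" ".join(tokens[start:i]), etype, start, i))
            start, etype = None, None
        # Open a new entity on B-, or on a stray I- (treated as B-)
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

tokens = ["Apple", "was", "founded", "by", "Steve", "Jobs", "in", "Cupertino"]
tags = ["B-ORG", "O", "O", "O", "B-PER", "I-PER", "O", "B-LOC"]
bio_to_spans(tokens, tags)
# [('Apple', 'ORG', 0, 1), ('Steve Jobs', 'PER', 4, 6), ('Cupertino', 'LOC', 7, 8)]
```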
Common entity types (CoNLL-2003 / OntoNotes)
| Tag | Description | Examples |
|---|---|---|
| PER | Person | Elon Musk, Marie Curie |
| ORG | Organization | SpaceX, WHO, NATO |
| LOC | Location | Cupertino, Estonia, Mount Everest |
| DATE/TIME | Temporal | 2002, next Monday, 3 PM |
| MONEY | Monetary values | $5 million, €20 |
| PERCENT | Percentages | 10%, 50 percent |
| PRODUCT | Products | iPhone, Windows 95 |
| EVENT | Events | World War II, Olympics |
Approaches
1. Feature-based (CRF)
The traditional statistical approach using Conditional Random Fields. Relies on hand-engineered features: word shape, prefix/suffix, POS tags, gazetteers. Works well with limited data but requires careful feature design.
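A sketch of typical hand-crafted features for one token, in the style consumed by CRF libraries such as sklearn-crfsuite (the feature names are illustrative):

```python
def token_features(tokens, i):
    """Hand-crafted features for token i, as used in CRF-based NER."""
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),
        # Word shape: uppercase -> X, lowercase -> x, digit -> d
        "word.shape": "".join("X" if c.isupper() else "x" if c.islower()
                              else "d" if c.isdigit() else c for c in w),
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "is_title": w.istitle(),
        "BOS": i == 0,                   # beginning of sentence
        "EOS": i == len(tokens) - 1,     # end of sentence
    }
    if i > 0:  # previous-word context feature
        feats["prev.lower"] = tokens[i - 1].lower()
    return feats

token_features(["Apple", "was", "founded"], 0)
# {'word.lower': 'apple', 'word.shape': 'Xxxxx', 'prefix3': 'App', ...}
```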
2. BiLSTM-CRF
Bidirectional LSTM with a CRF layer on top. Captures sequential context in both directions, CRF models label dependencies (e.g., I-PER cannot follow B-LOC).
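The label-dependency constraint that the CRF layer learns (or hard-codes) can be sketched as a simple validity check on BIO transitions (illustrative helper, not a library API):

```python
def valid_bio_transition(prev, curr):
    """A CRF transition matrix can forbid label pairs like B-LOC -> I-PER."""
    if curr == "O" or curr.startswith("B-"):
        return True  # O and B-* may follow any label
    if curr.startswith("I-"):
        # I-X must continue an entity of the same type X
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return False

valid_bio_transition("B-LOC", "I-PER")  # False: cannot continue a LOC as PER
valid_bio_transition("B-PER", "I-PER")  # True
```

In a BiLSTM-CRF, forbidden transitions get a transition score of -inf, so Viterbi decoding can never emit an invalid label sequence.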
3. Fine-tuned transformer (BERT-based)
Current standard — fine-tune a pretrained language model:
from transformers import pipeline
model = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
model("Elon Musk founded SpaceX in 2002")
# [{'entity_group': 'PER', 'word': 'Elon Musk', 'score': 0.999},
#  {'entity_group': 'ORG', 'word': 'SpaceX', 'score': 0.998}]
For Spanish, Chinese, and other languages, use a multilingual model from Hugging Face:
# Multilingual NER
ner = pipeline("ner", model="Davlan/xlm-roberta-base-ner-hrl", aggregation_strategy="simple")4. Chinese NER
Chinese doesn’t use whitespace tokenization — requires word segmentation first (Jieba, pkuseg), or character-level models. BERT-based models with character input + word segmentation features perform best.
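The character-level alternative can be illustrated in plain Python: each character becomes a token, so no segmenter is needed at tagging time (the example sentence and labels are hand-written for illustration):

```python
# Character-level BIO labelling for Chinese: each character is one token,
# so no word segmenter (Jieba, pkuseg) is needed before tagging.
text = "马云创立了阿里巴巴"          # "Ma Yun founded Alibaba"
chars = list(text)                  # ['马', '云', '创', '立', '了', '阿', '里', '巴', '巴']
# Gold labels at the character level (hand-written, illustrative):
tags = ["B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG"]
assert len(chars) == len(tags)      # one label per character
```

The trade-off: character models avoid segmentation errors propagating into NER, but lose explicit word-boundary information, which is why hybrid character + segmentation-feature models do best.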
5. Nested NER
Standard BIO only handles flat, non-overlapping entities. For nested entities (e.g., in "Bank of China", "China" is a LOC inside the ORG "Bank of China"), use:
- Head-driven phrase structure
- Multi-task learning (NER + entity typing)
- Layered models (detect entity spans first, then classify)
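Span-based models sidestep BIO entirely: enumerate every candidate span up to a maximum width, then classify each span independently, so nested spans can each receive a label. A sketch of the enumeration step (function name is mine):

```python
def enumerate_spans(tokens, max_len=4):
    """All candidate (start, end) spans up to max_len tokens wide.
    A span classifier then labels each span, so overlapping/nested
    entities can coexist -- unlike flat BIO tagging."""
    return [(i, j)
            for i in range(len(tokens))
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1)]

enumerate_spans(["Bank", "of", "China"])
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
# Both (0, 3) "Bank of China" (ORG) and (2, 3) "China" (LOC) are candidates.
```

The cost is O(n * max_len) candidate spans per sentence, which is why a width cap is used in practice.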
Evaluation
Standard evaluation is entity-level (e.g., via seqeval): a predicted entity counts as correct only if both its boundary and its type match the gold annotation:
from seqeval.metrics import classification_report
y_true = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'O', 'O']]
y_pred = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'O', 'O']]
print(classification_report(y_true, y_pred))
Key metrics: F1 per entity type (PER, ORG, LOC), macro F1 (unweighted average over types), micro F1 (aggregated over all entities).
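Strict entity-level F1 can also be computed by hand, which makes the boundary-plus-type matching rule explicit (the function is a sketch of mine, not part of seqeval):

```python
def entity_f1(y_true, y_pred):
    """Strict entity-level F1: a prediction counts only if its span AND type
    exactly match a gold entity."""
    def extract(seqs):
        ents = set()
        for s, tags in enumerate(seqs):
            start = None
            for i, t in enumerate(tags + ["O"]):  # sentinel flushes last entity
                if start is not None and (t == "O" or t.startswith("B-")
                                          or t[2:] != tags[start][2:]):
                    ents.add((s, start, i, tags[start][2:]))
                    start = None
                if t.startswith("B-"):
                    start = i
        return ents
    gold, pred = extract(y_true), extract(y_pred)
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

y_true = [['O', 'B-PER', 'I-PER', 'O'], ['B-ORG', 'O', 'O']]
entity_f1(y_true, y_true)  # 1.0
# Truncating "Steve Jobs" to "Steve" is a full miss, not partial credit:
entity_f1(y_true, [['O', 'B-PER', 'O', 'O'], ['B-ORG', 'O', 'O']])  # 0.5
```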
Datasets
| Dataset | Domain | Languages | Entity Types |
|---|---|---|---|
| CoNLL-2003 | News (Reuters) | English, German | PER, ORG, LOC, MISC |
| OntoNotes 5.0 | News, web, conversational | English, Chinese, Arabic | 18 types |
| WiNGPT | Medical | Chinese | Medical entity types |
| WikiNER | Wikipedia | 9 languages | PER, LOC, ORG |
| MultiCoNER | Web, queries | 12 languages | 33 fine-grained types (v2) |
Practical guidance
Data size: 1,000–10,000 labeled sentences for good BERT fine-tuning performance.
Annotation quality: Entity boundaries are critical. Use double-annotation with adjudication. Entity type consistency matters — define clear guidelines (e.g., “Tesla the car company” = ORG, “tesla the unit” = no entity).
Domain shift: NER models trained on news struggle with social media (casual language, abbreviations, memes). Use domain-adapted models or fine-tune on in-domain data.
Augmentation: Back-translation, synonym replacement, contextual augmentation — these work for NER but matter less than the quantity of labeled data.
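One NER-specific augmentation is mention replacement: swap each entity for another mention of the same type and relabel to fit the new length. A sketch (the lexicon and function name are illustrative):

```python
import random

def swap_entities(tokens, tags, lexicon, seed=0):
    """Mention replacement: substitute each entity with a same-type mention
    from a small lexicon, re-emitting B-/I- tags to fit the new length."""
    rng = random.Random(seed)
    out_toks, out_tags, i = [], [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1                      # consume the whole entity span
            repl = rng.choice(lexicon[etype]).split()
            out_toks += repl
            out_tags += ["B-" + etype] + ["I-" + etype] * (len(repl) - 1)
            i = j
        else:
            out_toks.append(tokens[i]); out_tags.append(tags[i]); i += 1
    return out_toks, out_tags

lexicon = {"PER": ["Marie Curie"], "ORG": ["WHO"]}
swap_entities(["Steve", "Jobs", "left", "Apple"],
              ["B-PER", "I-PER", "O", "B-ORG"], lexicon)
# (['Marie', 'Curie', 'left', 'WHO'], ['B-PER', 'I-PER', 'O', 'B-ORG'])
```

This preserves label structure exactly, so the augmented sentences need no re-annotation.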
Key papers
- Bidirectional LSTM-CRF Models for Sequence Tagging (Huang et al., 2015)
- Neural Architectures for Named Entity Recognition (Lample et al., 2016)
- Deep Contextualized Word Representations (Peters et al., 2018) — ELMo
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019) — arXiv:1810.04805
Links
- BERT and Masked Language Models
- Text Preprocessing
- NLP Roadmap
- Tokenization — word vs subword tokenization matters for NER