# Knowledge Graph Benchmarks
End-to-end benchmarks for the knowledge graph pipeline: entity extraction, relation extraction, entity resolution, and graph-augmented retrieval (GraphRAG).
## Sub-categories
### NER Extraction
Compares NER models on entity extraction quality (micro F1) and speed. Models are evaluated on Project Gutenberg text chunks and on standard NER benchmark datasets with gold labels.
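Micro F1 pools true/false positives and negatives across all documents before computing precision and recall, so frequent entity types dominate the score. A minimal sketch, assuming entities are compared as exact `(start, end, label)` span tuples (the exact matching rule used by the harness may differ):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over entity spans.

    gold/pred: parallel lists of per-document sets of
    (start, end, label) tuples. Counts are pooled across
    documents before computing precision/recall.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # spans predicted and present in gold
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```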
#### NER Models
| Model | Type | Params | Size | Description |
|---|---|---|---|---|
| GLiNER small-v2.1 | Zero-shot NER | 166M | 611 MB | Lightweight generalist entity extraction |
| GLiNER medium-v2.1 | Zero-shot NER | 209M | 781 MB | Medium-capacity zero-shot NER |
| GLiNER large-v2.1 | Zero-shot NER | 459M | 1780 MB | High-capacity zero-shot NER |
| NuNerZero | Zero-shot NER | ~400M | 1800 MB | NuMind zero-shot NER; labels must be lowercase |
| GNER-T5 base | Seq2seq NER | 248M | 990 MB | Generative NER via T5-base (slower, higher quality) |
| GNER-T5 large | Seq2seq NER | 783M | 3100 MB | Generative NER via T5-large |
| spaCy en_core_web_lg | Statistical NER | — | 560 MB | spaCy's large English pipeline |
| Qwen3-4B (LLM) | LLM NER | 4B | 2500 MB | Qwen3-4B instruction-tuned via llama.cpp GGUF |
| Qwen3-8B (LLM) | LLM NER | 8B | 5000 MB | Qwen3-8B instruction-tuned via llama.cpp GGUF |
| Phi-4-mini (LLM) | LLM NER | 3.8B | 2500 MB | Microsoft Phi-4-mini instruction-tuned via llama.cpp GGUF |
| Gemma-3-4B (LLM) | LLM NER | 4B | 2500 MB | Google Gemma 3 4B instruction-tuned via llama.cpp GGUF |
#### NER Datasets
| Dataset | Source | Description |
|---|---|---|
| Wealth of Nations (3300) | Project Gutenberg | Literary text chunks (no gold labels — speed only) |
| CrossNER (AI) | HuggingFace | CrossNER AI domain; BIO-tagged entities |
| CrossNER (CoNLL-2003) | HuggingFace | CoNLL-2003 via CrossNER; PER, ORG, LOC, MISC |
| CrossNER (Literature) | HuggingFace | CrossNER literature domain |
| CrossNER (Music) | HuggingFace | CrossNER music domain |
| CrossNER (Politics) | HuggingFace | CrossNER politics domain |
| CrossNER (Science) | HuggingFace | CrossNER science domain |
| Few-NERD (supervised) | HuggingFace | Fine-grained NER with 66 entity types (supervised split) |
| Few-NERD (inter) | HuggingFace | Few-NERD inter-domain split |
| Few-NERD (intra) | HuggingFace | Few-NERD intra-domain split |
### Relation Extraction
Evaluates relation extraction quality (triple F1) on standard RE benchmark datasets. Relations are currently approximated by pairing co-occurring NER entities, a crude proxy for true relation extraction.
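The co-occurrence proxy can be sketched as: every ordered pair of distinct entities found in the same chunk becomes a candidate triple carrying a placeholder relation (the function name and the `related_to` label are illustrative, not the pipeline's actual identifiers):

```python
from itertools import permutations

def entity_pair_triples(entities, relation="related_to"):
    """Crude RE proxy: one triple per ordered pair of distinct
    entity surface forms co-occurring in a text chunk."""
    unique = list(dict.fromkeys(entities))  # dedupe, keep first-seen order
    return [(head, relation, tail) for head, tail in permutations(unique, 2)]
```

This over-generates heavily (no relation typing, no directionality evidence), which is why triple F1 for this baseline should be read as a floor, not a ceiling.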
#### RE Datasets
| Dataset | Source | Description |
|---|---|---|
| DocRED | HuggingFace | Document-level relation extraction |
| WebNLG | HuggingFace | RDF triple verbalization and extraction |
| CoNLL-04 | HuggingFace | Joint entity and relation extraction |
### Entity Resolution
Evaluates the HNSW blocking + Jaro-Winkler matching + Leiden clustering pipeline on standard ER benchmark datasets.
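Jaro-Winkler, the matching stage above, boosts the plain Jaro score when two strings share a prefix, which suits person and place names with spelling variations. A self-contained sketch, independent of whichever string-similarity library the pipeline actually uses:

```python
def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro-Winkler similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if equal and within half the longer length.
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a = [c for c, f in zip(s1, m1) if f]
    b = [c for c, f in zip(s2, m2) if f]
    transpositions = sum(x != y for x, y in zip(a, b)) // 2
    jaro = (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3
    # Winkler bonus: common prefix of up to 4 characters.
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):
        if x != y:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1 - jaro)
```

Pairs scoring above a threshold within each HNSW block become edges, and Leiden clustering over those edges yields the final entity groups.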
#### ER Datasets
| Dataset | Source | Description |
|---|---|---|
| Wealth of Nations (3300) | Project Gutenberg | Literary entity mentions with spelling variations |
| FEBRL1 | Freely Extensible Biomedical Record Linkage | Synthetic person records with controlled duplication |
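The ER charts report both pairwise F1 and B-Cubed F1. B-Cubed scores each mention by the overlap between its predicted cluster and its gold cluster, so it penalises both over-merging and over-splitting per mention rather than per pair. A minimal sketch, assuming clusterings are given as mention-to-cluster-id maps:

```python
def bcubed_f1(gold, pred):
    """B-Cubed F1 for clusterings given as {mention: cluster_id} maps."""
    def groups(assign):
        out = {}
        for mention, cid in assign.items():
            out.setdefault(cid, set()).add(mention)
        return out

    gold_groups, pred_groups = groups(gold), groups(pred)

    def avg_overlap(assign, own, other_groups, other_assign):
        # Per-mention: |own cluster ∩ other cluster| / |own cluster|.
        total = 0.0
        for mention, cid in assign.items():
            cluster = own[cid]
            other = other_groups[other_assign[mention]]
            total += len(cluster & other) / len(cluster)
        return total / len(assign)

    precision = avg_overlap(pred, pred_groups, gold_groups, gold)
    recall = avg_overlap(gold, gold_groups, pred_groups, pred)
    return 2 * precision * recall / (precision + recall)
```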
### GraphRAG Retrieval
Measures whether graph expansion after a VSS or BM25 entry point improves retrieval quality.
#### Retrieval Configurations
| Entry Point | Expansion | Description |
|---|---|---|
| VSS | none / BFS-1 / BFS-2 | Semantic vector search with optional 1-hop or 2-hop graph expansion |
| BM25 | none / BFS-1 / BFS-2 | FTS5 keyword search with optional graph expansion |
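BFS-k expansion amounts to a breadth-first walk from the entry-point nodes, collecting everything within k hops. A sketch under the assumption that the graph is exposed as an adjacency dict (the pipeline's actual graph storage differs):

```python
from collections import deque

def bfs_expand(adjacency, seeds, hops):
    """Return all nodes within `hops` edges of the seed set."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # at the hop limit, do not expand further
        for nbr in adjacency.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen
```

With `hops=0` this degenerates to the plain VSS/BM25 entry point ("none" above); BFS-1 and BFS-2 trade added latency for wider passage coverage, which is exactly the trade-off the charts below measure.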
## Charts
Charts are generated by `uv run -m benchmarks.harness analyse --category kg`.
### NER Extraction Speed by Model

### Entity Count by Model

### Entity F1 by NER Model

### NER Precision vs Recall by Model

### NER Speed vs Quality (Pareto Frontier)

### Triple F1 by RE Model

### RE Speed vs Quality

### Entity Resolution Pairwise F1 by Dataset

### Entity Resolution B-Cubed F1 by Dataset

### GraphRAG Passage Recall@10 by Entry+Expansion

### GraphRAG Retrieval Latency by Entry+Expansion