Knowledge Graph Benchmarks

End-to-end benchmarks for the knowledge graph pipeline: entity extraction, relation extraction, entity resolution, and graph-augmented retrieval (GraphRAG).

Sub-categories

NER Extraction

Compares NER models on entity extraction quality (micro F1) and extraction speed. Evaluated on Project Gutenberg texts and standard NER benchmark datasets with gold labels.
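The micro F1 scoring can be sketched as follows. This is a minimal illustration, assuming gold and predicted entities are represented as sets of `(start, end, type)` spans per document and scored by exact match; the harness's actual data structures may differ.

```python
# Minimal sketch: entity-level micro-averaged precision/recall/F1.
# Each document's entities are a set of (start, end, type) tuples;
# counts are pooled across all documents before computing the scores.

def micro_f1(gold_docs, pred_docs):
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # exact-match hits
        fp += len(pred - gold)   # spurious predictions
        fn += len(gold - pred)   # missed gold entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Micro-averaging pools true/false positives over every document before dividing, so high-entity-count documents weigh more than in a per-document (macro) average.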

NER Models

| Model | Type | Params | Size | Description |
|---|---|---|---|---|
| GLiNER small-v2.1 | Zero-shot NER | 166M | 611 MB | Lightweight generalist entity extraction |
| GLiNER medium-v2.1 | Zero-shot NER | 209M | 781 MB | Medium-capacity zero-shot NER |
| GLiNER large-v2.1 | Zero-shot NER | 459M | 1780 MB | High-capacity zero-shot NER |
| NuNerZero | Zero-shot NER | ~400M | 1800 MB | NuMind zero-shot NER; labels must be lowercase |
| GNER-T5 base | Seq2seq NER | 248M | 990 MB | Generative NER via T5-base (slower, higher quality) |
| GNER-T5 large | Seq2seq NER | 783M | 3100 MB | Generative NER via T5-large |
| spaCy en_core_web_lg | Statistical NER | — | 560 MB | spaCy's large English pipeline |
| Qwen3-4B (LLM) | LLM NER | 4B | 2500 MB | Qwen3-4B instruction-tuned via llama.cpp GGUF |
| Qwen3-8B (LLM) | LLM NER | 8B | 5000 MB | Qwen3-8B instruction-tuned via llama.cpp GGUF |
| Phi-4-mini (LLM) | LLM NER | 3.8B | 2500 MB | Microsoft Phi-4-mini instruction-tuned via llama.cpp GGUF |
| Gemma-3-4B (LLM) | LLM NER | 4B | 2500 MB | Google Gemma 3 4B instruction-tuned via llama.cpp GGUF |

NER Datasets

| Dataset | Source | Description |
|---|---|---|
| Wealth of Nations (3300) | Project Gutenberg | Literary text chunks (no gold labels; speed only) |
| CrossNER (AI) | HuggingFace | CrossNER AI domain; BIO-tagged entities |
| CrossNER (CoNLL-2003) | HuggingFace | CoNLL-2003 via CrossNER; PER, ORG, LOC, MISC |
| CrossNER (Literature) | HuggingFace | CrossNER literature domain |
| CrossNER (Music) | HuggingFace | CrossNER music domain |
| CrossNER (Politics) | HuggingFace | CrossNER politics domain |
| CrossNER (Science) | HuggingFace | CrossNER science domain |
| Few-NERD (supervised) | HuggingFace | Fine-grained NER with 66 entity types (supervised split) |
| Few-NERD (inter) | HuggingFace | Few-NERD inter-domain split |
| Few-NERD (intra) | HuggingFace | Few-NERD intra-domain split |

Relation Extraction

Evaluates relation extraction quality (triple F1) on standard RE benchmark datasets. Uses NER-based entity pair extraction as a crude relation proxy.
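Triple-level F1 can be sketched as follows. This is a minimal illustration, assuming predictions and gold annotations are `(head, relation, tail)` string triples scored by exact match after simple normalization; the harness's normalization rules may differ.

```python
# Minimal sketch: triple F1 with case- and whitespace-insensitive
# exact matching of (head, relation, tail) triples.

def triple_f1(gold, pred):
    norm = lambda t: tuple(x.strip().lower() for x in t)
    gold = {norm(t) for t in gold}
    pred = {norm(t) for t in pred}
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

Under the entity-pair proxy, "predicted triples" are just co-occurring NER entity pairs, so this metric mostly rewards finding the right argument pairs rather than the right relation labels.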

RE Datasets

| Dataset | Source | Description |
|---|---|---|
| DocRED | HuggingFace | Document-level relation extraction |
| WebNLG | HuggingFace | RDF triple verbalization and extraction |
| CoNLL-04 | HuggingFace | Joint entity and relation extraction |

Entity Resolution

Evaluates the HNSW blocking + Jaro-Winkler matching + Leiden clustering pipeline on standard ER benchmark datasets.
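The matching stage of this pipeline can be sketched with a self-contained Jaro-Winkler similarity, shown here for illustration only; the actual HNSW blocking and Leiden clustering stages rely on external libraries (ANN index and community detection) and are not shown.

```python
# Minimal sketch of Jaro-Winkler string similarity, the pairwise
# matcher applied within each HNSW block of candidate mentions.

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1, match2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):  # greedy matching within the window
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, j = 0, 0  # half-transpositions among matched characters
    for i in range(len(s1)):
        if match1[i]:
            while not match2[j]:
                j += 1
            if s1[i] != s2[j]:
                t += 1
            j += 1
    t //= 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Boost Jaro by a shared prefix of up to 4 characters.
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

Pairs whose similarity clears a threshold become edges of a match graph, which Leiden clustering then partitions into resolved entities.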

ER Datasets

| Dataset | Source | Description |
|---|---|---|
| Wealth of Nations (3300) | Project Gutenberg | Literary entity mentions with spelling variations |
| FEBRL1 | Freely Extensible Biomedical Record Linkage | Synthetic person records with controlled duplication |

GraphRAG Retrieval

Measures whether graph expansion after a VSS or BM25 entry point improves retrieval quality.
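The expansion step can be sketched as a bounded breadth-first search, assuming the knowledge graph is an adjacency mapping from each entity to its neighbours; seed entities come from the VSS or BM25 entry point, and `hops=0`, `1`, `2` correspond to the none, BFS-1, and BFS-2 configurations.

```python
from collections import deque

# Minimal sketch: expand seed entities by up to `hops` BFS levels
# over an adjacency-dict graph {entity: [neighbour, ...]}.

def expand(graph, seeds, hops):
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # depth limit reached; do not expand further
        for nb in graph.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen
```

Passages linked to the expanded entity set are then merged into the candidate pool, which is what Recall@10 measures against the entry point alone.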

Retrieval Configurations

| Entry Point | Expansion | Description |
|---|---|---|
| VSS | none / BFS-1 / BFS-2 | Semantic vector search with optional 1-hop or 2-hop graph expansion |
| BM25 | none / BFS-1 / BFS-2 | FTS5 keyword search with optional graph expansion |

Charts

Charts are generated by `uv run -m benchmarks.harness analyse --category kg`:

- NER Extraction Speed by Model
- Entity Count by Model
- Entity F1 by NER Model
- NER Precision vs Recall by Model
- NER Speed vs Quality (Pareto Frontier)
- Triple F1 by RE Model
- RE Speed vs Quality
- Entity Resolution Pairwise F1 by Dataset
- Entity Resolution B-Cubed F1 by Dataset
- GraphRAG Passage Recall@10 by Entry+Expansion
- GraphRAG Retrieval Latency by Entry+Expansion