# Knowledge Graph Benchmarks
End-to-end benchmarks for the knowledge graph pipeline: entity extraction, relation extraction, entity resolution, and graph-augmented retrieval (GraphRAG).
## Sub-categories
### NER Extraction
Compares NER models on entity extraction quality (micro F1) and speed. Models are evaluated on Project Gutenberg text chunks and on standard NER benchmark datasets with gold labels.
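Micro F1 pools true/false positives and negatives across all documents before computing precision and recall, so frequent entity types dominate the score. A minimal sketch, assuming entities are compared as exact `(start, end, label)` span tuples (the exact matching rule used by the harness may differ):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over entity spans.

    gold/pred: parallel lists of per-document sets of
    (start, end, label) tuples. Counts are pooled across
    documents before computing precision/recall.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # spans predicted and present in gold
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```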
#### NER Models
| Model | Type | Params | Size | Description |
|---|---|---|---|---|
| GLiNER small-v2.1 | Zero-shot NER | 166M | 611 MB | Lightweight generalist entity extraction |
| GLiNER medium-v2.1 | Zero-shot NER | 209M | 781 MB | Medium-capacity zero-shot NER |
| GLiNER large-v2.1 | Zero-shot NER | 459M | 1780 MB | High-capacity zero-shot NER |
| NuNerZero | Zero-shot NER | ~400M | 1800 MB | NuMind zero-shot NER; labels must be lowercase |
| GNER-T5 base | Seq2seq NER | 248M | 990 MB | Generative NER via T5-base (slower, higher quality) |
| GNER-T5 large | Seq2seq NER | 783M | 3100 MB | Generative NER via T5-large |
| spaCy en_core_web_lg | Statistical NER | — | 560 MB | spaCy's large English pipeline |
| Qwen3-4B (LLM) | LLM NER | 4B | 2500 MB | Qwen3-4B instruction-tuned via llama.cpp GGUF |
| Qwen3-8B (LLM) | LLM NER | 8B | 5000 MB | Qwen3-8B instruction-tuned via llama.cpp GGUF |
| Phi-4-mini (LLM) | LLM NER | 3.8B | 2500 MB | Microsoft Phi-4-mini instruction-tuned via llama.cpp GGUF |
| Gemma-3-4B (LLM) | LLM NER | 4B | 2500 MB | Google Gemma 3 4B instruction-tuned via llama.cpp GGUF |
#### NER Datasets
| Dataset | Source | Description |
|---|---|---|
| Wealth of Nations (3300) | Project Gutenberg | Literary text chunks (no gold labels — speed only) |
| CrossNER (AI) | HuggingFace | CrossNER AI domain; BIO-tagged entities |
| CrossNER (CoNLL-2003) | HuggingFace | CoNLL-2003 via CrossNER; PER, ORG, LOC, MISC |
| CrossNER (Literature) | HuggingFace | CrossNER literature domain |
| CrossNER (Music) | HuggingFace | CrossNER music domain |
| CrossNER (Politics) | HuggingFace | CrossNER politics domain |
| CrossNER (Science) | HuggingFace | CrossNER science domain |
| Few-NERD (supervised) | HuggingFace | Fine-grained NER with 66 entity types (supervised split) |
| Few-NERD (inter) | HuggingFace | Few-NERD inter-domain split |
| Few-NERD (intra) | HuggingFace | Few-NERD intra-domain split |
### Relation Extraction
Evaluates relation extraction quality (triple F1) on standard RE benchmark datasets. Relations are currently approximated by pairing co-occurring NER entities, a crude proxy for true relation extraction.
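The co-occurrence proxy can be sketched as: every ordered pair of distinct entities found in the same chunk becomes a candidate triple carrying a placeholder relation (the function name and the `related_to` label are illustrative, not the pipeline's actual identifiers):

```python
from itertools import permutations

def entity_pair_triples(entities, relation="related_to"):
    """Crude RE proxy: one triple per ordered pair of distinct
    entity surface forms co-occurring in a text chunk."""
    unique = list(dict.fromkeys(entities))  # dedupe, keep first-seen order
    return [(head, relation, tail) for head, tail in permutations(unique, 2)]
```

This over-generates heavily (no relation typing, no directionality evidence), which is why triple F1 for this baseline should be read as a floor, not a ceiling.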
#### RE Datasets
| Dataset | Source | Description |
|---|---|---|
| DocRED | HuggingFace | Document-level relation extraction |
| WebNLG | HuggingFace | RDF triple verbalization and extraction |
| CoNLL-04 | HuggingFace | Joint entity and relation extraction |
### Entity Resolution
Evaluates the HNSW blocking + Jaro-Winkler matching + Leiden clustering pipeline on standard ER benchmark datasets.
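Jaro-Winkler, the matching stage above, boosts the plain Jaro score when two strings share a prefix, which suits person and place names with spelling variations. A self-contained sketch, independent of whichever string-similarity library the pipeline actually uses:

```python
def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro-Winkler similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if equal and within half the longer length.
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a = [c for c, f in zip(s1, m1) if f]
    b = [c for c, f in zip(s2, m2) if f]
    transpositions = sum(x != y for x, y in zip(a, b)) // 2
    jaro = (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3
    # Winkler bonus: common prefix of up to 4 characters.
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):
        if x != y:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1 - jaro)
```

Pairs scoring above a threshold within each HNSW block become edges, and Leiden clustering over those edges yields the final entity groups.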
#### ER Datasets
| Dataset | Source | Description |
|---|---|---|
| Wealth of Nations (3300) | Project Gutenberg | Literary entity mentions with spelling variations |
| FEBRL1 | Freely Extensible Biomedical Record Linkage | Synthetic person records with controlled duplication |
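The ER charts report both pairwise F1 and B-Cubed F1. B-Cubed scores each mention by the overlap between its predicted cluster and its gold cluster, so it penalises both over-merging and over-splitting per mention rather than per pair. A minimal sketch, assuming clusterings are given as mention-to-cluster-id maps:

```python
def bcubed_f1(gold, pred):
    """B-Cubed F1 for clusterings given as {mention: cluster_id} maps."""
    def groups(assign):
        out = {}
        for mention, cid in assign.items():
            out.setdefault(cid, set()).add(mention)
        return out

    gold_groups, pred_groups = groups(gold), groups(pred)

    def avg_overlap(assign, own, other_groups, other_assign):
        # Per-mention: |own cluster ∩ other cluster| / |own cluster|.
        total = 0.0
        for mention, cid in assign.items():
            cluster = own[cid]
            other = other_groups[other_assign[mention]]
            total += len(cluster & other) / len(cluster)
        return total / len(assign)

    precision = avg_overlap(pred, pred_groups, gold_groups, gold)
    recall = avg_overlap(gold, gold_groups, pred_groups, pred)
    return 2 * precision * recall / (precision + recall)
```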
### GraphRAG Retrieval
Measures whether graph expansion after a VSS or BM25 entry point improves retrieval quality.
#### Retrieval Configurations
| Entry Point | Expansion | Description |
|---|---|---|
| VSS | none / BFS-1 / BFS-2 | Semantic vector search with optional 1-hop or 2-hop graph expansion |
| BM25 | none / BFS-1 / BFS-2 | FTS5 keyword search with optional graph expansion |
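BFS-k expansion amounts to a breadth-first walk from the entry-point nodes, collecting everything within k hops. A sketch under the assumption that the graph is exposed as an adjacency dict (the pipeline's actual graph storage differs):

```python
from collections import deque

def bfs_expand(adjacency, seeds, hops):
    """Return all nodes within `hops` edges of the seed set."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # at the hop limit, do not expand further
        for nbr in adjacency.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen
```

With `hops=0` this degenerates to the plain VSS/BM25 entry point ("none" above); BFS-1 and BFS-2 trade added latency for wider passage coverage, which is exactly the trade-off the charts below measure.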
## Charts
Charts are generated by `uv run -m benchmarks.harness analyse --category kg`.
### NER Extraction Speed by Model

### Entity Count by Model

### Entity F1 by NER Model

### NER Precision vs Recall by Model

### NER Speed vs Quality (Pareto Frontier)

### Triple F1 by RE Model

### RE Speed vs Quality

### Entity Resolution Pairwise F1 by Dataset

### Entity Resolution B-Cubed F1 by Dataset

### GraphRAG Passage Recall@10 by Entry+Expansion

### GraphRAG Retrieval Latency by Entry+Expansion