Skip to content

Centrality and Community

Five TVFs for structural graph analysis on any existing edge table: four centrality measures (graph_degree, graph_node_betweenness, graph_edge_betweenness, graph_closeness) and Leiden community detection (graph_leiden). All support weighted, directed, and temporally filtered inputs through the shared constraint syntax.

When to use what

Question TVF
Which nodes have the most connections? graph_degree
Which nodes bridge separate clusters? graph_node_betweenness
Which edges hold the graph together? graph_edge_betweenness
Which nodes can reach everyone fastest? graph_closeness
What clusters exist? graph_leiden
What's the top node in each cluster? Leiden + betweenness, joined

Setup — a two-cluster graph

Every example below uses this tiny graph — two triangles joined by a single bridge edge (dave → eve):

.load ./muninn

CREATE TABLE edges (src TEXT, dst TEXT, weight REAL DEFAULT 1.0);

INSERT INTO edges VALUES
  ('alice', 'bob',   1.0), ('alice', 'carol', 1.0), ('bob',   'carol', 1.0),
  ('bob',   'dave',  1.0), ('carol', 'dave',  1.0),
  ('dave',  'eve',   1.0),   -- bridge
  ('eve',   'frank', 1.0), ('eve',   'grace', 1.0), ('frank', 'grace', 1.0);

Degree centrality

Cheapest centrality — in/out/total edge counts per node. Run this first as a sanity check before spending time on betweenness or closeness.

SELECT node, in_degree, out_degree, degree
  FROM graph_degree
  WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
    AND direction = 'both';
node    in_degree  out_degree  degree
------  ---------  ----------  ------
alice   2.0        2.0         4.0
bob     3.0        3.0         6.0
carol   3.0        3.0         6.0
dave    3.0        3.0         6.0
eve     3.0        3.0         6.0
frank   2.0        2.0         4.0
grace   2.0        2.0         4.0

Weighted: pass weight_col = 'weight' and degrees become sums of edge weights.

Normalized (values in [0, 1], scaled by N - 1):

SELECT node, round(centrality, 3) AS c FROM graph_degree
  WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
    AND direction = 'both' AND normalized = 1
  ORDER BY c DESC;

Node betweenness

Brandes' O(VE) algorithm. Identifies bridge nodes — those sitting on many shortest paths. In the example graph, dave and eve are the two endpoints of the only bridge, so they dominate:

SELECT node, round(centrality, 3) AS c FROM graph_node_betweenness
  WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
    AND direction = 'both' AND normalized = 1
  ORDER BY c DESC;
node    c
------  -----
dave    0.600
eve     0.600
bob     0.133
carol   0.133
frank   0.133
grace   0.133
alice   0.000

Performance on large graphs

Exact betweenness is O(VE) — slow on anything over ~50k nodes. The auto_approx_threshold constraint switches to source-sampling when the graph exceeds the threshold:

-- Approximate on graphs larger than 10k nodes (samples ceil(√N) sources)
SELECT node, centrality FROM graph_node_betweenness
  WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
    AND direction = 'both' AND auto_approx_threshold = 10000
  ORDER BY centrality DESC LIMIT 20;

Default threshold is 50,000. Set it lower on graphs you know are large.

GraphRAG signal

Bridge nodes are the most valuable retrieval context in knowledge graphs — they connect otherwise disjoint topic clusters. Betweenness is typically worth caching as a regular table and recomputing on a schedule rather than on every query.


Edge betweenness

Same algorithm, per-edge output. Returns one row per edge present in the graph.

SELECT src, dst, round(centrality, 3) AS c FROM graph_edge_betweenness
  WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
    AND direction = 'both'
  ORDER BY c DESC LIMIT 3;
src     dst     c
------  ------  ------
dave    eve     12.000
bob     dave    5.000
carol   dave    5.000

The dave → eve edge has by far the highest score — removing it would split the graph. This is the signal used by the Girvan-Newman hierarchical clustering algorithm and by muninn's entity resolution cascade (for bridge-edge removal before Leiden).


Closeness centrality

Inverse of total shortest-path distance — high closeness means a node can reach every other node in few hops. muninn uses Wasserman-Faust normalization by default, which handles disconnected graphs gracefully: if a node can only reach R of the N−1 other nodes, its score is scaled by (R / (N−1))².

SELECT node, round(centrality, 3) AS c FROM graph_closeness
  WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
    AND direction = 'both'
  ORDER BY c DESC;

normalized = 1 is the default (unlike degree and betweenness). Pass normalized = 0 for unscaled scores.


Direction modes

All centrality TVFs accept direction:

Value Behavior
'forward' Follow edges src → dst only
'reverse' Follow edges dst → src only
'both' (default for most) Treat edges as undirected

For undirected graphs stored with one row per edge, use 'both'. For DAGs or directed graphs where edge direction carries meaning, pick 'forward' or 'reverse' deliberately.


Temporal filtering

All centrality TVFs accept optional timestamp_col, time_start, time_end. When all three are supplied, the TVF loads only edges whose timestamp falls within the window — useful for time-sliced social or activity graphs.

CREATE TABLE events (src TEXT, dst TEXT, ts TEXT);
INSERT INTO events VALUES
  ('alice', 'bob',   '2026-01-15T10:00:00'),
  ('bob',   'carol', '2026-02-01T14:30:00'),
  ('carol', 'dave',  '2026-03-03T09:15:00');

-- Betweenness over January-only edges
SELECT node, centrality FROM graph_node_betweenness
  WHERE edge_table = 'events' AND src_col = 'src' AND dst_col = 'dst'
    AND timestamp_col = 'ts'
    AND time_start = '2026-01-01T00:00:00'
    AND time_end   = '2026-01-31T23:59:59'
    AND direction = 'both';

Timestamps are compared as strings (ISO 8601 is the safe choice).


Leiden community detection

The Leiden algorithm (Traag, Waltman & van Eck, 2019) partitions a graph into communities by maximizing modularity, with a guarantee that every community is internally well-connected — unlike Louvain, which can produce phantom communities that split apart on inspection.

SELECT node, community_id, round(modularity, 3) AS mod
  FROM graph_leiden
  WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
    AND direction = 'both';
node    community_id  mod
------  ------------  -----
alice   0             0.432
bob     0             0.432
carol   0             0.432
dave    0             0.432
eve     1             0.432
frank   1             0.432
grace   1             0.432

Resolution parameter

resolution controls the granularity of the partition:

Resolution Effect
< 1.0 Fewer, larger communities
1.0 (default) Standard modularity
> 1.0 More, smaller communities
-- Finer partitioning
SELECT node, community_id FROM graph_leiden
  WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
    AND resolution = 2.5;

Sweep a few values (0.5, 1.0, 2.0) and pick the partition that matches your domain understanding — modularity alone is not sufficient for choosing a resolution.

Weighted communities

SELECT node, community_id FROM graph_leiden
  WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
    AND weight_col = 'weight';

Strong edges are more likely to keep endpoints in the same community.

Microsoft GraphRAG pattern

Microsoft's GraphRAG uses Leiden for hierarchical retrieval: detect communities, compute a summary embedding per community (mean of member vectors, or an LLM-generated label), search supernodes first, drill into the matching community. muninn supplies the building blocks — see muninn_label_groups for the labeling step.


Combining centrality with communities

A common pattern: detect communities, then pick the most important node inside each.

WITH node_comm AS (
  SELECT node, community_id FROM graph_leiden
    WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
      AND direction = 'both'
),
node_cent AS (
  SELECT node, centrality FROM graph_node_betweenness
    WHERE edge_table = 'edges' AND src_col = 'src' AND dst_col = 'dst'
      AND direction = 'both' AND normalized = 1
)
SELECT nc.community_id, nc.node, round(cent.centrality, 3) AS c
  FROM node_comm nc
  JOIN node_cent cent USING (node)
  ORDER BY nc.community_id, c DESC;

dave and eve will emerge as representatives of their respective communities — both sit on the bridge, which gives them maximum betweenness among their neighbors. These "community representatives" are a good context unit for LLM summarization (muninn_summarize) or prompt-grounding.

Where to go next

References