Chat and Extraction¶
Run a GGUF LLM — chat completion, summarization, NER, relation extraction — from inside SQLite. The extraction functions use grammar-constrained decoding, so JSON output is guaranteed well-formed.
Load a chat model¶
Like embedding models, chat models live in a session-scoped registry:
.load ./muninn
INSERT INTO temp.muninn_chat_models(name, model)
SELECT 'Qwen3.5-4B',
muninn_chat_model('models/Qwen3.5-4B-Instruct.Q4_K_M.gguf');
SELECT name, n_ctx FROM temp.muninn_chat_models;
Pass an optional second argument to override context length (capped at the model's training context):
INSERT INTO temp.muninn_chat_models(name, model)
SELECT 'Qwen3.5-4B',
muninn_chat_model('models/Qwen3.5-4B-Instruct.Q4_K_M.gguf', 16384);
Recommended GGUF chat models¶
| Model | Quant | File | Notes |
|---|---|---|---|
| Qwen2.5-3B-Instruct | Q4_K_M | 1.9 GB | Fast, capable, no reasoning mode |
| Qwen3.5-4B-Instruct | Q4_K_M | 2.4 GB | Has <think> reasoning mode; use skip_think=1 to bypass |
| Llama-3.2-3B-Instruct | Q4_K_M | 2.0 GB | Meta's small model, strong general instruction following |
| Phi-3.5-mini-instruct | Q4_K_M | 2.4 GB | Microsoft, strong on reasoning tasks |
| Mistral-7B-Instruct-v0.3 | Q4_K_M | 4.1 GB | Well-known baseline |
Larger models give better extraction accuracy but slower per-token generation. For production NER/RE, the sweet spot is 3–7B parameters at Q4_K_M quantization on Metal.
muninn_chat — free-form generation¶
SELECT muninn_chat(
'Qwen3.5-4B',
'Explain HNSW indexes in one sentence.',
NULL, -- no grammar
64, -- max_tokens
NULL, -- default system prompt
1 -- skip_think = 1 (strip Qwen3.5 <think> blocks)
);
HNSW is a graph-based approximate nearest-neighbor index that organizes vectors
into a hierarchy of navigable small-world graphs for O(log N) search.
Full signature:
muninn_chat(
model_name TEXT,
prompt TEXT,
grammar TEXT = NULL, -- GBNF expression, constrains token sampling
max_tokens INTEGER = n_ctx,
system_prompt TEXT = NULL,
skip_think INTEGER = 0 -- 1 → inject closed <think></think>, bypassing reasoning
) -> TEXT
GBNF grammar example¶
-- Force output to one of three sentiment labels
SELECT muninn_chat(
'Qwen3.5-4B',
'Sentiment of: "The food was incredible, but the service was terrible."',
'root ::= "positive" | "negative" | "neutral"',
8
);
GBNF grammars guarantee the output shape regardless of the model — the sampler rejects any token that would violate the grammar. For more on GBNF, see llama.cpp docs.
muninn_extract_entities — named entity recognition¶
Supervised mode (labels provided)¶
SELECT muninn_extract_entities(
'Qwen3.5-4B',
'Elon Musk founded Tesla in 2003 in Palo Alto.',
'person,organization,date,location'
);
{"entities":[
{"text":"Elon Musk","type":"person","score":0.98},
{"text":"Tesla","type":"organization","score":0.97},
{"text":"2003","type":"date","score":0.95},
{"text":"Palo Alto","type":"location","score":0.94}
]}
Unsupervised mode (open extraction)¶
Omit the labels argument to let the model propose its own types:
{"entities":[
{"text":"Elon Musk","type":"person","score":0.97},
{"text":"Tesla","type":"company","score":0.96},
{"text":"2003","type":"year","score":0.93},
...
]}
Unsupervised types are the model's own taxonomy — useful for exploratory analysis, less useful when types must match a downstream schema.
Working with the JSON result¶
The returned TEXT has SQLite subtype 'J', so json_each() works directly:
WITH ner AS (
SELECT muninn_extract_entities('Qwen3.5-4B',
'Elon Musk founded Tesla in 2003.',
'person,organization,date') AS result
)
SELECT value ->> 'text' AS entity,
value ->> 'type' AS type
FROM ner, json_each(ner.result, '$.entities');
muninn_extract_relations — relation extraction¶
Supervised — entities already identified¶
SELECT muninn_extract_relations(
'Qwen3.5-4B',
'Tesla acquired Maxwell Technologies in 2019 for $218 million.',
json('[{"text":"Tesla","type":"organization"},
{"text":"Maxwell Technologies","type":"organization"}]')
);
Unsupervised — discover both entities and relations¶
SELECT muninn_extract_relations(
'Qwen3.5-4B',
'Tesla acquired Maxwell Technologies in 2019 for $218 million.'
);
muninn_extract_ner_re — combined NER + RE¶
One generation covers both tasks when the entity spans inform the relation extraction:
SELECT muninn_extract_ner_re(
'Qwen3.5-4B',
'Tesla acquired Maxwell Technologies in 2019.',
'person,organization', -- entity labels
'acquired,founded,employed_by' -- relation labels
);
{
"entities":[
{"text":"Tesla","type":"organization","score":0.97},
{"text":"Maxwell Technologies","type":"organization","score":0.95}
],
"relations":[
{"head":"Tesla","rel":"acquired","tail":"Maxwell Technologies","score":0.98}
]
}
Batch variants¶
Single-text extraction costs one full prompt evaluation per row. For bulk processing, the batch variants submit multiple prompts as multi-sequence llama_batch calls, sharing KV cache setup:
SELECT muninn_extract_entities_batch(
'Qwen3.5-4B',
json('["Tesla was founded by Elon Musk.",
"SpaceX launched Falcon 9 in 2010.",
"Apple released iPhone in 2007."]'),
'person,organization,date',
4 -- batch_size, max 8
);
[
{"entities":[{"text":"Tesla","type":"organization","score":0.97},
{"text":"Elon Musk","type":"person","score":0.98}]},
{"entities":[{"text":"SpaceX","type":"organization","score":0.96},
{"text":"Falcon 9","type":"organization","score":0.94},
{"text":"2010","type":"date","score":0.95}]},
{"entities":[{"text":"Apple","type":"organization","score":0.97},
{"text":"iPhone","type":"organization","score":0.92},
{"text":"2007","type":"date","score":0.96}]}
]
Batch variants return a JSON array of per-text results in input order.
| Function | Output schema |
|---|---|
muninn_extract_entities_batch |
[{"entities":[...]}, ...] |
muninn_extract_ner_re_batch |
[{"entities":[...], "relations":[...]}, ...] |
Batch size caps at 8 — larger batches degrade per-item quality as KV-cache contention increases.
muninn_summarize¶
SELECT muninn_summarize(
'Qwen3.5-4B',
'Long article text goes here...',
256 -- max_tokens for the summary
);
Qwen3.5 <think> reasoning blocks are stripped from the output automatically — you get only the final summary text.
End-to-end: building a knowledge graph from text¶
.load ./muninn
INSERT INTO temp.muninn_chat_models(name, model)
SELECT 'Qwen3.5-4B',
muninn_chat_model('models/Qwen3.5-4B-Instruct.Q4_K_M.gguf');
-- Source documents
CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT);
INSERT INTO docs(content) VALUES
('Tesla acquired Maxwell Technologies in 2019 for $218 million.'),
('SpaceX launched Falcon 9 in 2010, founded by Elon Musk.'),
('Apple released iPhone in 2007; Steve Jobs unveiled it on stage.');
-- Extract NER + RE per document (batched)
CREATE TEMP TABLE extractions AS
SELECT id,
value AS extraction
FROM docs,
json_each(
muninn_extract_ner_re_batch(
'Qwen3.5-4B',
(SELECT json_group_array(content) FROM docs),
'person,organization,date',
'acquired,founded,released,employed_by',
4
)
);
-- Materialize entities and relations as proper tables
CREATE TABLE entities AS
SELECT DISTINCT
e.value ->> 'text' AS name,
e.value ->> 'type' AS type
FROM extractions, json_each(extractions.extraction, '$.entities') e;
CREATE TABLE relations AS
SELECT r.value ->> 'head' AS src,
r.value ->> 'rel' AS relation,
r.value ->> 'tail' AS dst
FROM extractions, json_each(extractions.extraction, '$.relations') r;
-- Now run any graph TVF over the extracted relations
SELECT node, centrality FROM graph_node_betweenness
WHERE edge_table = 'relations' AND src_col = 'src' AND dst_col = 'dst'
AND direction = 'both'
ORDER BY centrality DESC LIMIT 5;
This is the recipe used by the GraphRAG Cookbook — from raw text to a queryable knowledge graph in one SQL script.
Performance notes¶
| Operation | Approx throughput on M1 Pro, Qwen3.5-4B Q4_K_M, Metal |
|---|---|
muninn_chat |
~60 tokens/sec |
muninn_extract_entities (single) |
~1 text/sec for short passages |
muninn_extract_entities_batch (size 4) |
~2.5× speedup over singles |
muninn_extract_ner_re_batch (size 4) |
~2× speedup over two separate calls |
These numbers scale roughly linearly with model size and inversely with quantization precision.
See also¶
- API Reference — LLM chat and extraction
- Entity Resolution — combines extraction + clustering + LLM borderline refinement
- GraphRAG Cookbook — extraction + graph retrieval end-to-end