Chat and Extraction¶

Run a GGUF LLM — chat completion, summarization, NER, relation extraction — from inside SQLite. The extraction functions use grammar-constrained decoding, so JSON output is guaranteed well-formed.

Load a chat model¶

Like embedding models, chat models live in a session-scoped registry:

.load ./muninn

INSERT INTO temp.muninn_chat_models(name, model)
  SELECT 'Qwen3.5-4B',
         muninn_chat_model('models/Qwen3.5-4B-Instruct.Q4_K_M.gguf');

SELECT name, n_ctx FROM temp.muninn_chat_models;

name        n_ctx
----------  -----
Qwen3.5-4B  8192

Pass an optional second argument to override context length (capped at the model's training context):

INSERT INTO temp.muninn_chat_models(name, model)
  SELECT 'Qwen3.5-4B',
         muninn_chat_model('models/Qwen3.5-4B-Instruct.Q4_K_M.gguf', 16384);

Recommended GGUF chat models¶

Model	Quant	File	Notes
Qwen2.5-3B-Instruct	Q4_K_M	1.9 GB	Fast, capable, no reasoning mode
Qwen3.5-4B-Instruct	Q4_K_M	2.4 GB	Has `<think>` reasoning mode; use `skip_think=1` to bypass
Llama-3.2-3B-Instruct	Q4_K_M	2.0 GB	Meta's small model, strong general instruction following
Phi-3.5-mini-instruct	Q4_K_M	2.4 GB	Microsoft, strong on reasoning tasks
Mistral-7B-Instruct-v0.3	Q4_K_M	4.1 GB	Well-known baseline

Larger models give better extraction accuracy but slower per-token generation. For production NER/RE, the sweet spot is 3–7B parameters at Q4_K_M quantization on Metal.

`muninn_chat` — free-form generation¶

SELECT muninn_chat(
  'Qwen3.5-4B',
  'Explain HNSW indexes in one sentence.',
  NULL,          -- no grammar
  64,            -- max_tokens
  NULL,          -- default system prompt
  1              -- skip_think = 1 (strip Qwen3.5 <think> blocks)
);

HNSW is a graph-based approximate nearest-neighbor index that organizes vectors
into a hierarchy of navigable small-world graphs for O(log N) search.

Full signature:

muninn_chat(
    model_name TEXT,
    prompt TEXT,
    grammar TEXT       = NULL,    -- GBNF expression, constrains token sampling
    max_tokens INTEGER = n_ctx,
    system_prompt TEXT = NULL,
    skip_think INTEGER = 0        -- 1 → inject closed <think></think>, bypassing reasoning
) -> TEXT

GBNF grammar example¶

-- Force output to one of three sentiment labels
SELECT muninn_chat(
  'Qwen3.5-4B',
  'Sentiment of: "The food was incredible, but the service was terrible."',
  'root ::= "positive" | "negative" | "neutral"',
  8
);

negative

GBNF grammars guarantee the output shape regardless of the model — the sampler rejects any token that would violate the grammar. For more on GBNF, see llama.cpp docs.

`muninn_extract_entities` — named entity recognition¶

Supervised mode (labels provided)¶

SELECT muninn_extract_entities(
  'Qwen3.5-4B',
  'Elon Musk founded Tesla in 2003 in Palo Alto.',
  'person,organization,date,location'
);

{"entities":[
  {"text":"Elon Musk","type":"person","score":0.98},
  {"text":"Tesla","type":"organization","score":0.97},
  {"text":"2003","type":"date","score":0.95},
  {"text":"Palo Alto","type":"location","score":0.94}
]}

Unsupervised mode (open extraction)¶

Omit the labels argument to let the model propose its own types:

SELECT muninn_extract_entities(
  'Qwen3.5-4B',
  'Elon Musk founded Tesla in 2003 in Palo Alto.'
);

{"entities":[
  {"text":"Elon Musk","type":"person","score":0.97},
  {"text":"Tesla","type":"company","score":0.96},
  {"text":"2003","type":"year","score":0.93},
  ...
]}

Unsupervised types are the model's own taxonomy — useful for exploratory analysis, less useful when types must match a downstream schema.

Working with the JSON result¶

The returned TEXT has SQLite subtype 'J', so json_each() works directly:

WITH ner AS (
  SELECT muninn_extract_entities('Qwen3.5-4B',
           'Elon Musk founded Tesla in 2003.',
           'person,organization,date') AS result
)
SELECT value ->> 'text' AS entity,
       value ->> 'type' AS type
  FROM ner, json_each(ner.result, '$.entities');

entity     type
---------  ------------
Elon Musk  person
Tesla      organization
2003       date

`muninn_extract_relations` — relation extraction¶

Supervised — entities already identified¶

SELECT muninn_extract_relations(
  'Qwen3.5-4B',
  'Tesla acquired Maxwell Technologies in 2019 for $218 million.',
  json('[{"text":"Tesla","type":"organization"},
         {"text":"Maxwell Technologies","type":"organization"}]')
);

{"relations":[
  {"head":"Tesla","rel":"acquired","tail":"Maxwell Technologies","score":0.98}
]}

Unsupervised — discover both entities and relations¶

SELECT muninn_extract_relations(
  'Qwen3.5-4B',
  'Tesla acquired Maxwell Technologies in 2019 for $218 million.'
);

`muninn_extract_ner_re` — combined NER + RE¶

One generation covers both tasks when the entity spans inform the relation extraction:

SELECT muninn_extract_ner_re(
  'Qwen3.5-4B',
  'Tesla acquired Maxwell Technologies in 2019.',
  'person,organization',           -- entity labels
  'acquired,founded,employed_by'   -- relation labels
);

{
  "entities":[
    {"text":"Tesla","type":"organization","score":0.97},
    {"text":"Maxwell Technologies","type":"organization","score":0.95}
  ],
  "relations":[
    {"head":"Tesla","rel":"acquired","tail":"Maxwell Technologies","score":0.98}
  ]
}

Batch variants¶

Single-text extraction costs one full prompt evaluation per row. For bulk processing, the batch variants submit multiple prompts as multi-sequence llama_batch calls, sharing KV cache setup:

SELECT muninn_extract_entities_batch(
  'Qwen3.5-4B',
  json('["Tesla was founded by Elon Musk.",
        "SpaceX launched Falcon 9 in 2010.",
        "Apple released iPhone in 2007."]'),
  'person,organization,date',
  4                                  -- batch_size, max 8
);

[
  {"entities":[{"text":"Tesla","type":"organization","score":0.97},
               {"text":"Elon Musk","type":"person","score":0.98}]},
  {"entities":[{"text":"SpaceX","type":"organization","score":0.96},
               {"text":"Falcon 9","type":"organization","score":0.94},
               {"text":"2010","type":"date","score":0.95}]},
  {"entities":[{"text":"Apple","type":"organization","score":0.97},
               {"text":"iPhone","type":"organization","score":0.92},
               {"text":"2007","type":"date","score":0.96}]}
]

Batch variants return a JSON array of per-text results in input order.

Function	Output schema
`muninn_extract_entities_batch`	`[{"entities":[...]}, ...]`
`muninn_extract_ner_re_batch`	`[{"entities":[...], "relations":[...]}, ...]`

Batch size caps at 8 — larger batches degrade per-item quality as KV-cache contention increases.

`muninn_summarize`¶

SELECT muninn_summarize(
  'Qwen3.5-4B',
  'Long article text goes here...',
  256                -- max_tokens for the summary
);

Qwen3.5 <think> reasoning blocks are stripped from the output automatically — you get only the final summary text.

End-to-end: building a knowledge graph from text¶

.load ./muninn

INSERT INTO temp.muninn_chat_models(name, model)
  SELECT 'Qwen3.5-4B',
         muninn_chat_model('models/Qwen3.5-4B-Instruct.Q4_K_M.gguf');

-- Source documents
CREATE TABLE docs (id INTEGER PRIMARY KEY, content TEXT);
INSERT INTO docs(content) VALUES
  ('Tesla acquired Maxwell Technologies in 2019 for $218 million.'),
  ('SpaceX launched Falcon 9 in 2010, founded by Elon Musk.'),
  ('Apple released iPhone in 2007; Steve Jobs unveiled it on stage.');

-- Extract NER + RE per document (batched)
CREATE TEMP TABLE extractions AS
SELECT id,
       value AS extraction
  FROM docs,
       json_each(
         muninn_extract_ner_re_batch(
           'Qwen3.5-4B',
           (SELECT json_group_array(content) FROM docs),
           'person,organization,date',
           'acquired,founded,released,employed_by',
           4
         )
       );

-- Materialize entities and relations as proper tables
CREATE TABLE entities AS
SELECT DISTINCT
       e.value ->> 'text' AS name,
       e.value ->> 'type' AS type
  FROM extractions, json_each(extractions.extraction, '$.entities') e;

CREATE TABLE relations AS
SELECT r.value ->> 'head' AS src,
       r.value ->> 'rel'  AS relation,
       r.value ->> 'tail' AS dst
  FROM extractions, json_each(extractions.extraction, '$.relations') r;

-- Now run any graph TVF over the extracted relations
SELECT node, centrality FROM graph_node_betweenness
  WHERE edge_table = 'relations' AND src_col = 'src' AND dst_col = 'dst'
    AND direction = 'both'
  ORDER BY centrality DESC LIMIT 5;

This is the recipe used by the GraphRAG Cookbook — from raw text to a queryable knowledge graph in one SQL script.

Performance notes¶

Operation	Approx throughput on M1 Pro, Qwen3.5-4B Q4_K_M, Metal
`muninn_chat`	~60 tokens/sec
`muninn_extract_entities` (single)	~1 text/sec for short passages
`muninn_extract_entities_batch` (size 4)	~2.5× speedup over singles
`muninn_extract_ner_re_batch` (size 4)	~2× speedup over two separate calls

These numbers scale roughly linearly with model size and inversely with quantization precision.