Text Embeddings¶
How to go from raw text to a populated hnsw_index. muninn ships its own GGUF embedder (muninn_embed) that runs inside SQLite with optional Metal GPU acceleration — this is the preferred path. Python-side and remote-API embedders are also supported for workflows that already have embedding code.
The three paths¶
| Path | Embedder lives | Best for |
|---|---|---|
| Native muninn_embed | In muninn, via llama.cpp | New projects, SQL-native pipelines, macOS with Metal |
| Python batch (sentence-transformers) | Your application | Existing ML code, very large bulk ingestion |
| Remote API (sqlite-rembed) | OpenAI / Nomic / Cohere / Ollama | API-driven pipelines, models muninn doesn't run locally |
All three produce the same float32 blob — once vectors are in the HNSW index, the downstream graph, centrality, and retrieval code is identical regardless of how they got there.
Path 1 — Native muninn_embed (preferred)¶
Loading a GGUF model¶
muninn stores models in a session-scoped virtual table called temp.muninn_models. Register a model once per connection:
.load ./muninn
INSERT INTO temp.muninn_models(name, model)
SELECT 'MiniLM', muninn_embed_model('models/all-MiniLM-L6-v2.Q8_0.gguf');
SELECT name, dim FROM temp.muninn_models;
On macOS, all model layers are offloaded to the Metal GPU by default. Override with MUNINN_GPU_LAYERS=0 for CPU-only. See Getting Started.
Downloading a model¶
Pick one based on your language coverage, quality ceiling, and file-size budget.
mkdir -p models
# English, tiny & fast
curl -L -o models/all-MiniLM-L6-v2.Q8_0.gguf \
https://huggingface.co/leliuga/all-MiniLM-L6-v2-GGUF/resolve/main/all-MiniLM-L6-v2.Q8_0.gguf
| Model | Dims | Quant | File | Strengths |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Q8_0 | 36 MB | Smallest, fast, English only |
| nomic-embed-text-v1.5 | 768 | Q4_K_M | 84 MB | Long context (8192 tok), Matryoshka |
| BGE-small-en-v1.5 | 384 | Q8_0 | 37 MB | Strong English retrieval |
| BGE-M3 | 1024 | Q4_K_M | 438 MB | 100+ languages |
| Qwen3-Embedding-8B | 4096 | Q4_K_M | 4.7 GB | State-of-the-art retrieval quality |
Find more GGUF embedding models on HuggingFace or the curated collection.
Model pooling (MEAN for BERT-family, LAST for Qwen3, etc.) is read from the GGUF metadata — muninn never hardcodes it.
Embed + index in a single statement¶
CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT);
INSERT INTO documents(content) VALUES
('The quick brown fox jumps over the lazy dog'),
('A fast runner sprints across the field'),
('SQLite is a lightweight embedded database'),
('Vector search finds similar items by distance'),
('Neural networks learn patterns from data');
CREATE VIRTUAL TABLE docs_vec USING hnsw_index(
dimensions=384, metric='cosine'
);
INSERT INTO docs_vec(rowid, vector)
SELECT id, muninn_embed('MiniLM', content) FROM documents;
Semantic search — embed the query inline¶
SELECT d.content, round(v.distance, 4) AS dist
FROM docs_vec v JOIN documents d ON d.id = v.rowid
WHERE v.vector MATCH muninn_embed('MiniLM', 'fast animal')
AND k = 3
ORDER BY v.distance;
content dist
--------------------------------------------------- ------
A fast runner sprints across the field 0.3881
The quick brown fox jumps over the lazy dog 0.5217
Neural networks learn patterns from data 0.7402
Auto-embed new rows with a trigger¶
CREATE TEMP TRIGGER docs_auto_embed AFTER INSERT ON documents
BEGIN
INSERT INTO docs_vec(rowid, vector)
VALUES (NEW.id, muninn_embed('MiniLM', NEW.content));
END;
INSERT INTO documents(content) VALUES ('Graph databases store relationships');
-- The trigger embedded and indexed the new row automatically.
Use TEMP triggers for model-backed embedding
Persistent triggers (CREATE TRIGGER without TEMP) are written into the database schema. Opening the database later without muninn loaded (or without that model registered) causes schema compilation to fail. TEMP triggers live only for the session and avoid this trap.
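The trap is easy to reproduce with plain sqlite3 from the Python standard library. In this sketch a stand-in muninn_embed function plays the role of the loaded extension, and a plain table stands in for the docs_vec virtual table; the persistent trigger then breaks every insert once the function is gone, which is the same "no such function" failure you would hit reopening the database without muninn:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "trap.db")

# Session 1: a stand-in for muninn_embed is registered, so the
# persistent (non-TEMP) trigger can be created and fires fine.
db = sqlite3.connect(path)
db.create_function("muninn_embed", 2, lambda model, text: b"\x00\x00\x00\x00")
db.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE vectors (doc_id INTEGER, vector BLOB);
    CREATE TRIGGER docs_auto_embed AFTER INSERT ON documents
    BEGIN
        INSERT INTO vectors(doc_id, vector)
        VALUES (NEW.id, muninn_embed('MiniLM', NEW.content));
    END;
""")
db.execute("INSERT INTO documents(content) VALUES ('works')")
db.commit()
db.close()

# Session 2: no extension, no function. The trigger is baked into the
# schema, so every INSERT on documents now fails.
db = sqlite3.connect(path)
err = None
try:
    db.execute("INSERT INTO documents(content) VALUES ('breaks')")
except sqlite3.OperationalError as exc:
    err = str(exc)
print(err)  # → no such function: muninn_embed
```

A TEMP trigger created in session 1 would simply not exist in session 2, so the second insert would succeed (without embedding, which is the recoverable failure mode).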
Re-embed on update¶
CREATE TEMP TRIGGER docs_auto_reembed AFTER UPDATE OF content ON documents
BEGIN
DELETE FROM docs_vec WHERE rowid = NEW.id;
INSERT INTO docs_vec(rowid, vector)
VALUES (NEW.id, muninn_embed('MiniLM', NEW.content));
END;
Unloading a model¶
Models are session-scoped, so closing the connection frees them. To release a model's memory mid-session, delete its row from the registry (a sketch; this assumes the virtual table accepts DELETE, mirroring how rows are registered with INSERT):
DELETE FROM temp.muninn_models WHERE name = 'MiniLM';
Performance notes¶
- Single-text embedding throughput on M1 Pro (Metal, MiniLM, 384 dim): ~5,000 embeds/sec.
- muninn_embed does not batch internally: one call per row. For 100k+ document bulk ingestion, Path 2 (Python batch) can be 3–5× faster.
- CPU fallback (MUNINN_GPU_LAYERS=0) is ~3× slower on Apple Silicon, but works identically.
- Multiple SQLite connections can call muninn_embed concurrently; the registry is thread-safe, and each connection gets its own compute context.
Path 2 — Python batch via sentence-transformers¶
Best for bulk ingestion where Python can batch thousands of texts into a single GPU forward pass.
import sqlite3, struct
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
dim = model.get_sentence_embedding_dimension() # 384
db = sqlite3.connect("mydata.db")
db.enable_load_extension(True)
db.load_extension("./muninn")
db.enable_load_extension(False)
db.execute(f"""
CREATE VIRTUAL TABLE IF NOT EXISTS docs_vec
USING hnsw_index(dimensions={dim}, metric='cosine')
""")
documents = [
(1, "The quick brown fox jumps over the lazy dog"),
(2, "A fast runner sprints across the field"),
(3, "SQLite is a lightweight embedded database"),
]
texts = [t for _, t in documents]
vectors = model.encode(texts, normalize_embeddings=True)
db.executemany(
"INSERT INTO docs_vec(rowid, vector) VALUES (?, ?)",
[(i, struct.pack(f"<{dim}f", *vec.tolist()))
for (i, _), vec in zip(documents, vectors)],
)
db.commit()
# Query-time: embed the query in Python, pass the blob
query_vec = model.encode("fast animal", normalize_embeddings=True)
query_blob = struct.pack(f"<{dim}f", *query_vec.tolist())  # '<' forces little-endian
for rowid, distance in db.execute(
"SELECT rowid, distance FROM docs_vec "
"WHERE vector MATCH ? AND k = 3", (query_blob,)
):
print(rowid, f"{distance:.4f}")
Batch encoding is the win
model.encode(texts) processes many strings in a single forward pass — the speedup vs. per-row embedding scales roughly with batch size. For large ingestion, batch on the Python side even if you use muninn_embed at query time.
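For ingestions too large to encode in one call, a small chunking helper (pure Python, nothing muninn-specific) keeps each encode() call at a fixed batch size so memory stays bounded; the batch size of 256 below is an illustrative starting point, not a tuned value.

```python
from typing import Iterator, List, Sequence

def batched(texts: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield list(texts[start:start + size])

# Usage sketch with sentence-transformers:
#   for chunk in batched(texts, 256):
#       vectors = model.encode(chunk, normalize_embeddings=True)
#       ...pack and INSERT the blobs as above...
```

On Python 3.12+, itertools.batched does the same job.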
Path 3 — Remote API via sqlite-rembed¶
For OpenAI / Nomic / Cohere / Jina / Ollama. sqlite-rembed adds a rembed() scalar that makes one HTTP call per row. Useful for API-only models muninn can't run locally (text-embedding-3-large, Cohere Embed v3, etc.).
.load rembed0
.load ./muninn
INSERT INTO temp.rembed_clients(name, options) VALUES ('openai', 'openai');
-- Reads OPENAI_API_KEY from the environment
CREATE VIRTUAL TABLE docs_vec USING hnsw_index(dimensions=1536, metric='cosine');
INSERT INTO docs_vec(rowid, vector)
SELECT id, rembed('openai', content) FROM documents;
One HTTP call per row
rembed() does not batch. Each row is one round-trip to the provider. For thousands of rows, run Path 2 in Python (which can batch provider APIs) and then insert the blobs.
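If you do need a provider API for bulk work, assemble batched request bodies in Python rather than calling rembed() per row. The sketch below only builds payloads (no HTTP); the "input"-as-list shape matches OpenAI-style embedding endpoints, while the rowids field is purely client-side bookkeeping for inserting the returned vectors afterwards.

```python
from typing import Dict, List, Sequence, Tuple

def build_payloads(
    rows: Sequence[Tuple[int, str]],
    batch_size: int = 100,
    model: str = "text-embedding-3-small",
) -> List[Dict]:
    """Group (rowid, text) pairs into one request body per batch."""
    payloads = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        payloads.append({
            "model": model,
            "input": [text for _, text in chunk],  # one request, many texts
            "rowids": [rid for rid, _ in chunk],   # client-side only
        })
    return payloads
```

Each returned embedding lines up positionally with "input", so zip it with "rowids" when inserting blobs into docs_vec.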
Vector format reference¶
Every embedding path — muninn, Python, remote — must produce the same blob format:
| Property | Value |
|---|---|
| Encoding | Raw little-endian IEEE 754 float32 array |
| Size | 4 × dimensions bytes |
| Header | none |
| Normalization | Recommended for metric='cosine'; required if you want distances in [0, 2] |
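A helper that produces exactly this blob from any Python float sequence (the "<" prefix in the struct format forces little-endian regardless of platform); normalization is folded in since the table recommends it for cosine:

```python
import math
import struct
from typing import Sequence

def to_blob(vec: Sequence[float], normalize: bool = True) -> bytes:
    """Pack floats as the raw little-endian float32 array the index expects."""
    if normalize:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vec = [x / norm for x in vec]
    return struct.pack(f"<{len(vec)}f", *vec)

blob = to_blob([3.0, 4.0])
# 2 dims → 8 bytes; unpacking recovers the unit vector (0.6, 0.8)
```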
Choosing a model¶
| Priority | Model | Dim | Why |
|---|---|---|---|
| Smallest / fastest | all-MiniLM-L6-v2 | 384 | 22M params, sub-ms on Metal |
| Best general English | nomic-embed-text-v1.5 | 768 | Long context, Matryoshka-truncatable |
| Multilingual | BGE-M3 | 1024 | 100+ languages |
| Retrieval quality ceiling | Qwen3-Embedding-8B | 4096 | Top MTEB, 4.7 GB file |
Matryoshka embeddings
nomic-embed-text-v1.5 and some BGE models support Matryoshka Representation Learning — you can truncate the output to a shorter dimension (e.g. 128 from 768) for faster search with minimal quality loss. Truncate, then re-normalize.
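Truncate-then-renormalize is a few lines; this pure-Python sketch works on any sequence of floats, independent of which model produced it:

```python
import math
from typing import List, Sequence

def matryoshka_truncate(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components, then rescale to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]

short = matryoshka_truncate([0.5] * 768, 128)  # 768-dim → 128-dim
```

Remember to create the hnsw_index with the truncated dimension (e.g. dimensions=128) and to truncate query vectors the same way.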
Combining with graph retrieval¶
Once embeddings are in the index, the full retrieval pipeline is path-agnostic:
-- Phase 1: vector seed
SELECT rowid FROM docs_vec
WHERE vector MATCH muninn_embed('MiniLM', 'find close matches') AND k = 5;
-- Phase 2: graph expansion from seeds
SELECT node FROM graph_bfs
WHERE edge_table = 'relationships' AND src_col = 'src' AND dst_col = 'dst'
AND start_node = ?seed AND max_depth = 2;
-- Phase 3: centrality ranking
SELECT node, centrality FROM graph_node_betweenness
WHERE edge_table = 'relationships' AND src_col = 'src' AND dst_col = 'dst'
AND direction = 'both';
See GraphRAG Cookbook for the full pipeline.
See also¶
- API Reference — muninn_embed
- API Reference — hnsw_index
- Chat and Extraction — same GGUF infrastructure, chat-side
- GraphRAG Cookbook — end-to-end pipeline