Domino IQ RAG: A Built-In Pipeline That Wires Your NSFs Straight Into a Local LLM

📚 This is the deep-dive half of a two-part Domino IQ series. If you haven’t seen the overview — the architecture, install flow, and how NotesLLMRequest / NotesLLMResponse work from application code — read Domino IQ: What It Means to Run an LLM Inside the Domino Server first, then come back here for how RAG layers on top.

What RAG looks like inside Domino IQ

Domino 14.5.1 ships RAG (Retrieval-Augmented Generation) support — before the LLM [1] answers, it pulls semantically relevant documents from an NSF you point it at, so the response is “grounded, current, and domain-specific.” A typical RAG stack out in the wild involves OpenAI plus Pinecone plus LangChain glue. Domino IQ packs the whole stack into server tasks.

When you mark a Command document as RAG-enabled, the DominoIQ task on the Domino server spins up the following locally:

  • a llama-server instance with the LLM inference model,
  • a llama-server instance with the embedding model [2],
  • a vector database server,
  • optionally, a llama-server instance with a guard model [3].

When a user sends a prompt, the prompt is run as a semantic search against the vector DB, the matching NSF documents are pulled and prepended to the prompt, and the whole package goes to the LLM — never leaving your Domino server.
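
The exact formatting is internal to the DominoIQ task and not documented, but conceptually the package that reaches the LLM is assembled along these lines:

  [system prompt from the Command document]
  [top-N NSF documents that cleared the semantic-search threshold]
  [the user's original prompt]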

Three things that are different from generic RAG

1. Local mode is mandatory. RAG support is “available only when the Domino IQ server is configured in Local mode” — the LLM handling RAG content runs on your server, not a cloud API. That single rule eliminates the data-residency arguments common in financial / legal / healthcare deployments.

2. ACL and Readers fields are honored natively. When a database is enabled as a RAG source, “LLM command request handling honors both the ACL and any applicable Readers field in the documents of the RAG-enabled database by using the Notes DN in the authenticated session.” User A’s prompt only gets matches from documents A is allowed to read; documents protected by a Readers field for B never leak into A’s context. This is brutal to do correctly in a generic RAG pipeline.

3. The DominoIQ task owns the lifecycle. Embedding model, vector DB, LLM inference, vectorization — all driven by two server tasks (DominoIQ and updall). No Pinecone account to manage, no separate ETL pipeline to maintain.

Prerequisites

✓ dominoiq.nsf upgraded with the 14.5.1-shipped dominoiq.ntf
✓ Transaction logging enabled on the Domino IQ servers hosting RAG-enabled DBs
✓ Each RAG source DB replicated to the Domino IQ server
✓ Each RAG source DB full-text indexed with "Index attachments" enabled (a scripted way to do this is sketched just below)
✓ A GGUF-format embedding model picked (mxbai-embed-large, bge-m3, nomic-embed-text, snowflake-arctic-embed, etc.)
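
Creating that full-text index is normally a Database Properties or Administrator client action, but it can also be scripted. A minimal LotusScript sketch, with placeholder server and file names, assuming the lsconst.lss constants are available:

  ' (Options)
  %INCLUDE "lsconst.lss"    ' defines the FTINDEX_* constants

  ' (Initialize)
  Dim session As New NotesSession
  Dim db As NotesDatabase
  ' Point at the RAG source replica on the Domino IQ server (placeholder names)
  Set db = session.GetDatabase("IQServer/Acme", "rag\kb.nsf")
  ' Build the full-text index with attachment text included
  Call db.CreateFTIndex(FTINDEX_ATTACHED_FILES, True)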

Pick the embedding model based on your content’s language and the vector dimensionality you want — keep in mind it must be in GGUF [4] format (llama.cpp’s standard distribution format).

Configuration walkthrough

Step A: Embedding & Vector DB in dominoiq.nsf

Open dominoiq.nsf → Add Configuration / Edit Configuration:

  • Embedding Model tab: pick your GGUF embedding model, set status to Enabled, change the port if you must, raise the context size above the default 1024 if your model supports it.
  • Vector Database tab: change port or host if you’re running multiple partitions.

Step B: RAG fields on the Command document

On the Command document, the RAG-specific fields are:

  • RAG enabled database: DB path, relative to the data directory
  • RAG enabled fields: which fields get embedded into the vector DB
  • RAG threshold: minimum semantic-search score for a match to qualify (default 0.7)
  • RAG maximum responses: cap on matches included in the LLM context (default 25)
  • Maximum tokens: LLM prompt token limit (raise it, since RAG inflates the prompt)
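
Purely for illustration (the database path, field names, and token figure below are invented), a Command document pointed at a support knowledge base might end up looking like this:

  RAG enabled database:    support\kb.nsf
  RAG enabled fields:      Subject, Body
  RAG threshold:           0.7
  RAG maximum responses:   10
  Maximum tokens:          8000 (sized to the model's context window)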

Step C: Enable and vectorize

  1. Restart the Domino IQ server — that’s how the embedding and vector DB sub-processes come up.
  2. Administrator client → Files panel → right-click the source DB → Domino IQ → Enable.
  3. From the server console: load updall -w <rag-dbname> — this is the vectorization step that ingests the NSF’s documents into the vector DB.

Vectorization can take several minutes depending on document count and attachment size; the embedding model has to compute a vector for every relevant field and attachment.
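
The console command is the documented route; if you would rather kick vectorization off from script (say, from a scheduled agent after a large import), NotesSession.SendConsoleCommand can issue the same command. The server name and database path below are placeholders:

  Dim session As New NotesSession
  Dim consoleOutput As String
  ' Run the vectorization pass on the Domino IQ server (placeholder names)
  consoleOutput = session.SendConsoleCommand("CN=IQServer/O=Acme", "load updall -w rag\kb.nsf")
  Print consoleOutput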

Calling it: LLMReq

Once configured, application code invokes the AI command via the LLMReq method in LotusScript or Java. From the caller’s perspective it’s an ordinary method call: pass a prompt, get back a string — under the covers, the request flows through vector search → top-N document retrieval → LLM inference.

Heads-up: as of 14.5.1 the admin docs don’t pin down the full LLMReq signature (that detail lives in the Designer help). Test it against a small dataset first to confirm the wiring matches your expectation.
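
With that caveat, the caller side is roughly sketched below. Everything about the argument lists and the response property here is an assumption, not the documented API; treat it as a shape to verify against the NotesLLMRequest / NotesLLMResponse topics in Designer help:

  Dim session As New NotesSession
  Dim request As NotesLLMRequest
  Dim response As NotesLLMResponse

  ' Assumed wiring: bind the request to a Command document by name, send the prompt, read the text back.
  ' Neither the factory call nor the LLMReq parameters are confirmed in the admin docs.
  Set request = session.CreateLLMRequest("SupportAnswers")    ' hypothetical factory + command name
  Set response = request.LLMReq("How do I re-certify a user whose ID has expired?")
  MessageBox response.Text                                    ' hypothetical response property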

Security: ACL + Readers, automatic

The security model hinges on this single sentence: “LLM command request handling honors both the ACL and any applicable Readers field in the documents of the RAG-enabled database by using the Notes DN in the authenticated session.” In a generic RAG pipeline this filtering has to happen somewhere in your application layer and is famously easy to get wrong; Domino IQ enforces it server-side, so leaking another user’s restricted document into the context is structurally prevented.

Bonus: HCL recommends choosing a model with paraphrasing capability so retrieved RAG content isn’t reproduced verbatim in the response — extra defense against accidental disclosure of source text.

Capacity and performance tuning

Because RAG inflates the prompt, the default LLM context size won’t cut it. Under the Advanced tab in dominoiq.nsf, the Special parameters field takes options like -c 65000; that context is shared across concurrent requests, so with the concurrent-request setting at 10 and -c 65000 each request gets roughly 65000 / 10 ≈ 6500 tokens of context. GPU offloading parameters live on the same tab.

Also worth pinning down explicitly: in the Command’s system prompt document, tell the model exactly how to respond when no RAG matches are found. Skip this and the LLM will confidently invent something — exactly what RAG was supposed to prevent.
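
The exact wording is yours to choose; something along these lines in the system prompt covers it:

  "Answer only from the documents supplied in the context. If no relevant
  documents are supplied, say that nothing matching was found in the
  knowledge base and do not guess."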

The clean disable workflow

  1. Clear the RAG source DB field on the Command document, save.
  2. Restart the Domino IQ server.
  3. From the server console: load updall -w -d <rag-dbname> — wait for completion; this purges the NSF’s entries from the vector DB.
  4. Administrator client → Files panel → right-click the DB → Domino IQ → Disable.

Order matters — clear the Command document first, purge vector entries after. Don’t reverse.

Why this beats bolting on a generic SaaS LLM

Building a comparable RAG pipeline with OpenAI / Anthropic + a hosted vector DB is technically doable but you immediately run into:

  • ETL’ing data out of Domino — often a contractual or compliance non-starter,
  • re-implementing ACL / Readers-field filtering in your application layer,
  • a separate vector DB account, traffic shaping, and rate-limit budgeting,
  • LLM API billing and per-token usage monitoring.

Domino IQ RAG collapses those four concerns into one configured Command document. The cost is you do need a server that can run a GGUF model — specifically, an NVIDIA GPU (compute capability 5.2+ minimum, 8.0+ recommended for production) on 64-bit Windows or Linux (no CPU-only mode, no macOS, no ARM). For most existing Domino shops that’s a fair trade.

Footnotes

  1. LLM (Large Language Model) — neural networks like GPT, Claude, or Llama, trained on massive text corpora to generate natural-language responses from a prompt.

  2. An embedding model maps text into fixed-dimensional vectors (e.g. 768 or 1024 dimensions) so that semantically-related sentences land near each other in vector space — RAG uses it to index NSF documents for semantic search.

  3. A guard model is a separate small LLM that runs before or after the main inference model, filtering inappropriate input (jailbreak prompts, sensitive questions) or output (harmful content, leaks of sensitive source text).

  4. GGUF (GPT-Generated Unified Format) is a single-file binary model format from the llama.cpp ecosystem, packaging weights, tokenizer, and metadata into one file that can be mmap-loaded directly into memory — a common distribution format for on-device inference.
