Domino IQ: What It Means to Run an LLM Inside the Domino Server
What Domino IQ is
Domino 14.5 introduces Domino IQ — an AI inference engine baked into the Domino server backend. Quoting the docs: “The Domino server, when configured for Domino IQ, starts an inference engine from the Dbserver process,” and it “handles AI queries locally within the Domino server.”
For developers it boils down to: two new LotusScript classes (`NotesLLMRequest` / `NotesLLMResponse`) send a prompt to a local LLM[^1] and get a response back — no external OpenAI or Anthropic API to wire up, no need to ETL Notes data out.
Why run the LLM on the Domino server
A bolted-on LLM SaaS looks fastest at first: hit an API, get a response, done. But the typical Domino-shop scenario hits walls:
- Data residency: Notes/Domino apps often hold data that’s contractually or legally required to stay on the Domino server (finance, legal, healthcare, government). Sending a prompt to OpenAI is the data leaving the governed perimeter.
- ACL and Readers fields: Domino’s permission model is server-side enforced. With an external LLM, “should this user see this document as context” has to be re-implemented in the application layer.
- Billing / volume: Per-token SaaS pricing isn’t favorable for sustained high-volume internal usage.
Domino IQ collapses the inference engine and the Domino server into a single process boundary, which handles all three at once. The cost is that you have to be able to run a GGUF[^2] model on your own server.
Architecture and hardware
The execution environment is opinionated: Windows or Linux 64-bit only, with an NVIDIA GPU. No macOS, no ARM, no pure-CPU mode.
Minimum and production specs:
| Component | Minimum | Production |
|---|---|---|
| CPU | 4 cores | — |
| RAM | 8 GB | — |
| GPU | NVIDIA, 8 GB memory, compute capability 5.2+ | compute capability 8.0+ |
| CUDA | 12.x (bundled) | — |
On Windows the engine can draw on both dedicated GPU memory and up to half of system RAM as shared GPU memory — friendly for smaller pilots.
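To check whether an existing card clears the bar, a plain `nvidia-smi` query reports the relevant numbers — this is standard NVIDIA tooling, nothing Domino-specific, and the `compute_cap` field needs a reasonably recent driver:

```bash
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
```

Compare `compute_cap` against the 5.2 minimum (8.0 for production) and `memory.total` against the 8 GB floor.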
Once running, the DominoIQ task spawns the inference engine subprocess from Dbserver, listening on your configured port alongside the other Domino server tasks.
Install flow
The install ships as a separate zip (llama.zip) downloaded from the My HCLSoftware Portal:
1. Stage `llama.zip` somewhere on the Domino server.
2. Stop the Domino server (this step is required).
3. Windows (admin): delete the old `llama-server.exe`, `llama.dll`, `ggml/*.dll`, and the three CUDA libraries; Linux (root): delete `llama-server`, `libllama.so`, `libggml*.so`, and the matching CUDA libraries.
4. Unzip `llama.zip` into the Domino program directory (not the data directory).
5. Linux only: `chown root:root` and `chmod ugo+rx` on the extracted binaries (shell sketch below).
6. Start Domino and verify the `DominoIQ` task loads its model; if not, move on to configuration.
Fresh installs on 14.5 don’t have the binary-deletion step. The cleanup is only relevant on upgrades from one 14.5.x to another.
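For the Linux side, the whole upgrade sequence boils down to something like the sketch below. The program directory path is an assumption (a common Domino default — adjust to your install), the zip location is a placeholder, and the exact CUDA library names come from your existing install; run it as root with Domino stopped:

```bash
# Assumed program directory -- a common Domino default on Linux; adjust to your install.
DOMINO_BIN=/opt/hcl/domino/notes/latest/linux

# Remove the old engine binaries (upgrades between 14.5.x builds only; skip on fresh installs).
rm -f "$DOMINO_BIN"/llama-server "$DOMINO_BIN"/libllama.so "$DOMINO_BIN"/libggml*.so
# ...plus the matching CUDA libraries named in the docs.

# Unzip into the program directory, not the data directory.
unzip /path/to/llama.zip -d "$DOMINO_BIN"

# Ownership and permissions on the extracted binaries.
chown root:root "$DOMINO_BIN"/llama-server "$DOMINO_BIN"/libllama.so "$DOMINO_BIN"/libggml*.so
chmod ugo+rx    "$DOMINO_BIN"/llama-server "$DOMINO_BIN"/libllama.so "$DOMINO_BIN"/libggml*.so
```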
Two-phase configuration in dominoiq.nsf
Once installed, every setting lives in dominoiq.nsf on the Domino IQ Administration server, split into two layers:
1. Configuration document — the engine itself
Defines how the inference engine runs:
- which GGUF model to load (each Domino IQ server can only run a single LLM)
- which port to listen on
- GPU offloading parameters
- concurrent-request count
- advanced: Special parameters (llama.cpp flags like `-c <ctx-size>`; example below)
- optional: guard model[^3] wiring (filters inappropriate input/output)
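As an illustration, a Special parameters value might look like this — `-c` (context size) and `--no-mmap` (load the model fully rather than memory-mapping it) are standard llama.cpp server flags, but which flags Domino IQ actually passes through is something to verify against the docs:

```
-c 8192 --no-mmap
```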
2. Command document — the application contract
After the Configuration brings the engine up, the Command document defines a specific AI command. Application code references the command by name.
Key fields:
| Field | Purpose |
|---|---|
| Configuration(s) | Which server(s) the command applies to — server name or wildcard `*` |
| Model | Optional, only for remote AI endpoints |
| Command | Name of the matching System Prompt |
| System prompt | Same name you used in Command |
| Description | Free-form notes |
| Maximum tokens | Per the OpenAI Chat Completion spec |
| Temperature | Same — randomness control |
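The last two rows map onto the same-named knobs in an OpenAI-style Chat Completions request body; the values below are illustrative:

```json
{ "max_tokens": 512, "temperature": 0.2 }
```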
Cache behavior: changes to a Command document take up to 5 minutes to be picked up. When iterating on a prompt, either wait or restart the DominoIQ task.
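Restarting the task from the server console follows the usual Domino pattern — assuming `dominoiq` is addressable like any other server task (verify the task name on your console):

```
> tell dominoiq quit
> load dominoiq
```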
System Prompt documents: behavior pulled out as its own object
A System Prompt document holds the actual instruction telling the LLM how to respond, independent of any Command. The Command “mounts” it.
This split lets you:
- Share one system prompt across multiple Commands.
- Edit the prompt without touching Command wiring.
One example ships out of the box: StdSummarizeEmailThread — for summarizing mail threads.
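To make the split concrete, a System Prompt document's body might read like the following — hypothetical text, not the shipped StdSummarizeEmailThread prompt:

```
You summarize email threads. Produce a short bullet-point summary that
preserves decisions, open questions, action items, and their owners.
Do not invent content that is not in the thread.
```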
Calling it from app code: NotesLLMRequest / NotesLLMResponse
Once configured, LotusScript (or Java) reaches the engine through two classes:
- `NotesLLMRequest` — builds the request (Command name, prompt body, token limits, etc.)
- `NotesLLMResponse` — receives the response
The detailed method signatures live in Designer help under NotesLLMRequest / NotesLLMResponse — the admin docs don’t enumerate them. Test against a small dataset first to confirm the wiring.
From the caller’s perspective it’s a normal synchronous API call. Under the hood the request goes to the inference engine on the same server, so there’s no external network round trip.
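Since the admin docs don't enumerate the signatures, the following LotusScript is a shape sketch only: the two class names are from the docs, but every member used here (`Command`, `Prompt`, `Send`, `Text`) and construction via `New` are assumptions — Designer help has the real API.

```lotusscript
' Sketch only -- class names are real, member names are assumptions.
Sub SummarizeThread()
    Dim request As New NotesLLMRequest    ' construction via New is assumed
    Dim response As NotesLLMResponse

    request.Command = "StdSummarizeEmailThread"   ' Command document name (assumed property)
    request.Prompt = "Summarize this thread: ..." ' prompt body (assumed property)

    ' Synchronous call to the local inference engine (assumed method name).
    Set response = request.Send()
    MessageBox response.Text                      ' generated text (assumed property)
End Sub
```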
Where RAG fits in
14.5 introduces Domino IQ; 14.5.1 layers RAG support on top — pointing the engine at an NSF as a vector source so the LLM gets semantically relevant documents prepended to the prompt as context.
For “give the LLM a prompt, get a free-form response” workflows (summarize, classify, extract, translate), one Command document is enough. For “LLM should answer using documents from NSF X” workflows, you also need RAG mode on top.
The RAG-specific configuration, automatic ACL inheritance into vector search, and the updall -w vectorization step are covered in detail in the companion piece: Domino IQ RAG: A Built-In Pipeline That Wires Your NSFs Straight Into a Local LLM.
Why this trade-off works for Domino shops
Building a comparable external-LLM pipeline is doable, but for an existing Domino shop it usually isn’t worth it:
- The dev / admin team already knows Domino — adding a vendor adds an ops surface.
- Re-implementing Notes ACL / Readers-field filtering in the application layer is famously easy to get wrong.
- The existing server room + existing GPU is often enough to run a GGUF model — no cloud bill.
- “Data doesn’t leave the server” is a hard requirement in some regulated industries.
The only real constraint is the NVIDIA GPU + 64-bit OS — usually not a blocker for shops that are already on the 14.5 upgrade path.
Footnotes
[^1]: LLM (Large Language Model) — neural networks like GPT, Claude, or Llama, trained on massive text corpora to generate natural-language responses from a prompt.

[^2]: GGUF (GPT-Generated Unified Format) is a single-file binary model format from the llama.cpp ecosystem, packaging weights, tokenizer, and metadata into one file that can be mmap-loaded directly into memory — a common distribution format for on-device inference.

[^3]: A guard model is a separate small LLM that runs before or after the main inference model, filtering inappropriate input (jailbreak prompts, sensitive questions) or output (harmful content, leaks of sensitive source text).