Domino IQ: What It Means to Run an LLM Inside the Domino Server
What Domino IQ is
Domino 14.5 introduces Domino IQ — an AI inference engine baked into the Domino server backend. Quoting the docs: “The Domino server, when configured for Domino IQ, starts an inference engine from the Dbserver process,” and it “handles AI queries locally within the Domino server.”
For developers it boils down to: two new LotusScript classes (`NotesLLMRequest` / `NotesLLMResponse`) send a prompt to a local LLM[^1] and get a response back — no external OpenAI or Anthropic API to wire up, no need to ETL Notes data out.
Why run the LLM on the Domino server
A bolted-on LLM SaaS looks fastest at first: hit an API, get a response, done. But the typical Domino-shop scenario hits walls:
- Data residency: Notes/Domino apps often hold data that’s contractually or legally required to stay on the Domino server (finance, legal, healthcare, government). Sending a prompt to OpenAI is the data leaving the governed perimeter.
- ACL and Readers fields: Domino’s permission model is server-side enforced. With an external LLM, “should this user see this document as context” has to be re-implemented in the application layer.
- Billing / volume: Per-token SaaS pricing isn’t favorable for sustained high-volume internal usage.
Domino IQ collapses the inference engine and the Domino server into a single process boundary, which handles all three at once. The cost is that you have to be able to run a GGUF[^2] model on your own server.
Architecture and hardware
The execution environment is opinionated: Windows or Linux 64-bit only, with an NVIDIA GPU. No macOS, no ARM, no pure-CPU mode.
Minimum and production specs:
| Component | Minimum | Production |
|---|---|---|
| CPU | 4 cores | — |
| RAM | 8 GB | — |
| GPU | NVIDIA, 8 GB memory, compute capability 5.2+ | compute capability 8.0+ |
| CUDA | 12.x (bundled) | — |
On Windows the engine can draw on both dedicated GPU memory and up to half of system RAM as shared GPU memory — friendly for smaller pilots.
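To check whether an existing card clears the bar, a plain `nvidia-smi` query reports the relevant numbers — this is standard NVIDIA tooling, nothing Domino-specific, and the `compute_cap` field needs a reasonably recent driver:

```bash
nvidia-smi --query-gpu=name,memory.total,compute_cap --format=csv
```

Compare `compute_cap` against the 5.2 minimum (8.0 for production) and `memory.total` against the 8 GB floor.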
Once running, the DominoIQ task spawns the inference engine subprocess from Dbserver, listening on your configured port alongside the other Domino server tasks.
Install flow
The install ships as a separate zip (llama.zip) downloaded from the My HCLSoftware Portal:
1. Stage `llama.zip` somewhere on the Domino server.
2. Stop the Domino server (this step is required).
3. Windows (admin): delete the old `llama-server.exe`, `llama.dll`, `ggml/*.dll`, and the three CUDA libraries; Linux (root): delete `llama-server`, `libllama.so`, `libggml*.so`, and the matching CUDA libraries.
4. Unzip `llama.zip` into the Domino program directory (not the data directory).
5. Linux only: `chown root:root` and `chmod ugo+rx` on the extracted binaries (shell sketch below).
6. Start Domino and verify the `DominoIQ` task loads its model; if not, move on to configuration.
Fresh installs on 14.5 don’t have the binary-deletion step. The cleanup is only relevant on upgrades from one 14.5.x to another.
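For the Linux side, the whole upgrade sequence boils down to something like the sketch below. The program directory path is an assumption (a common Domino default — adjust to your install), the zip location is a placeholder, and the exact CUDA library names come from your existing install; run it as root with Domino stopped:

```bash
# Assumed program directory -- a common Domino default on Linux; adjust to your install.
DOMINO_BIN=/opt/hcl/domino/notes/latest/linux

# Remove the old engine binaries (upgrades between 14.5.x builds only; skip on fresh installs).
rm -f "$DOMINO_BIN"/llama-server "$DOMINO_BIN"/libllama.so "$DOMINO_BIN"/libggml*.so
# ...plus the matching CUDA libraries named in the docs.

# Unzip into the program directory, not the data directory.
unzip /path/to/llama.zip -d "$DOMINO_BIN"

# Ownership and permissions on the extracted binaries.
chown root:root "$DOMINO_BIN"/llama-server "$DOMINO_BIN"/libllama.so "$DOMINO_BIN"/libggml*.so
chmod ugo+rx    "$DOMINO_BIN"/llama-server "$DOMINO_BIN"/libllama.so "$DOMINO_BIN"/libggml*.so
```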
Two-phase configuration in dominoiq.nsf
Once installed, every setting lives in dominoiq.nsf on the Domino IQ Administration server, split into two layers:
1. Configuration document — the engine itself
Defines how the inference engine runs:
- which GGUF model to load (each Domino IQ server can only run a single LLM)
- which port to listen on
- GPU offloading parameters
- concurrent-request count
- advanced: Special parameters (llama.cpp flags like `-c <ctx-size>`; example below)
- optional: guard model[^3] wiring (filters inappropriate input/output)
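As an illustration, a Special parameters value might look like this — `-c` (context size) and `--no-mmap` (load the model fully rather than memory-mapping it) are standard llama.cpp server flags, but which flags Domino IQ actually passes through is something to verify against the docs:

```
-c 8192 --no-mmap
```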
2. Command document — the application contract
After the Configuration brings the engine up, the Command document defines a specific AI command. Application code references the command by name.
Key fields:
| Field | Purpose |
|---|---|
| Configuration(s) | Which server(s) the command applies to — server name or wildcard `*` |
| Model | Optional, only for remote AI endpoints |
| Command | Name of the matching System Prompt |
| System prompt | Same name you used in Command |
| Description | Free-form notes |
| Maximum tokens | Per the OpenAI Chat Completion spec |
| Temperature | Same — randomness control |
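The last two rows map onto the same-named knobs in an OpenAI-style Chat Completions request body; the values below are illustrative:

```json
{ "max_tokens": 512, "temperature": 0.2 }
```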
Cache behavior: changes to a Command document take up to 5 minutes to be picked up. When iterating on a prompt, either wait or restart the DominoIQ task.
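Restarting the task from the server console follows the usual Domino pattern — assuming `dominoiq` is addressable like any other server task (verify the task name on your console):

```
> tell dominoiq quit
> load dominoiq
```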
System Prompt documents: behavior pulled out as its own object
A System Prompt document holds the actual instruction telling the LLM how to respond, independent of any Command. The Command “mounts” it.
This split lets you:
- Share one system prompt across multiple Commands.
- Edit the prompt without touching Command wiring.
One example ships out of the box: StdSummarizeEmailThread — for summarizing mail threads.
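To make the split concrete, a System Prompt document's body might read like the following — hypothetical text, not the shipped StdSummarizeEmailThread prompt:

```
You summarize email threads. Produce a short bullet-point summary that
preserves decisions, open questions, action items, and their owners.
Do not invent content that is not in the thread.
```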
Calling it from app code: NotesLLMRequest / NotesLLMResponse
Once configured, LotusScript (or Java) reaches the engine through two classes:
- `NotesLLMRequest` — builds the request (Command name, prompt body, token limits, etc.)
- `NotesLLMResponse` — receives the response
The detailed method signatures live in Designer help under NotesLLMRequest / NotesLLMResponse — the admin docs don’t enumerate them. Test against a small dataset first to confirm the wiring.
From the caller’s perspective it’s a normal synchronous API call. Under the hood the request goes to the inference engine on the same server, so there’s no external network round trip.
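Since the admin docs don't enumerate the signatures, the following LotusScript is a shape sketch only: the two class names are from the docs, but every member used here (`Command`, `Prompt`, `Send`, `Text`) and construction via `New` are assumptions — Designer help has the real API.

```lotusscript
' Sketch only -- class names are real, member names are assumptions.
Sub SummarizeThread()
    Dim request As New NotesLLMRequest    ' construction via New is assumed
    Dim response As NotesLLMResponse

    request.Command = "StdSummarizeEmailThread"   ' Command document name (assumed property)
    request.Prompt = "Summarize this thread: ..." ' prompt body (assumed property)

    ' Synchronous call to the local inference engine (assumed method name).
    Set response = request.Send()
    MessageBox response.Text                      ' generated text (assumed property)
End Sub
```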
Where RAG fits in
14.5 introduces Domino IQ; 14.5.1 layers RAG support on top — pointing the engine at an NSF as a vector source so the LLM gets semantically relevant documents prepended to the prompt as context.
For “give the LLM a prompt, get a free-form response” workflows (summarize, classify, extract, translate), one Command document is enough. For “LLM should answer using documents from NSF X” workflows, you also need RAG mode on top.
The RAG-specific configuration, automatic ACL inheritance into vector search, and the updall -w vectorization step are covered in detail in the companion piece: Domino IQ RAG: A Built-In Pipeline That Wires Your NSFs Straight Into a Local LLM.
Why this trade-off works for Domino shops
Building a comparable external-LLM pipeline is doable, but for an existing Domino shop it usually isn’t worth it:
- The dev / admin team already knows Domino — adding a vendor adds an ops surface.
- Re-implementing Notes ACL / Readers-field filtering in the application layer is famously easy to get wrong.
- The existing server room + existing GPU is often enough to run a GGUF model — no cloud bill.
- “Data doesn’t leave the server” is a hard requirement in some regulated industries.
The only real constraint is the NVIDIA GPU + 64-bit OS — usually not a blocker for shops that are already on the 14.5 upgrade path.
Footnotes
[^1]: LLM (Large Language Model) — neural networks like GPT, Claude, or Llama, trained on massive text corpora to generate natural-language responses from a prompt.

[^2]: GGUF (GPT-Generated Unified Format) is a single-file binary model format from the llama.cpp ecosystem, packaging weights, tokenizer, and metadata into one file that can be mmap-loaded directly into memory — a common distribution format for on-device inference.

[^3]: A guard model is a separate small LLM that runs before or after the main inference model, filtering inappropriate input (jailbreak prompts, sensitive questions) or output (harmful content, leaks of sensitive source text).