Local AI with Ollama
A complete tour of the local-AI integration: what it is, why it exists, how the setup wizard works, how a chat request travels from your keyboard to the model and back, how the persona is built, how effect commands are emitted and gated, and how to troubleshoot when something goes wrong.
On This Page
- What is "Local AI"?
- Why local instead of cloud?
- Architecture at a glance
- The setup wizard
- The request lifecycle
- How the persona is built
- Enrichment & structured output
- Effect commands
- Safety & permissions
- Persistent chat memory
- Warm-up & lifecycle
- Error handling
- Picking a model
- Privacy
- Troubleshooting
- Advanced: remote Ollama
- Glossary
TL;DR
Local AI runs a language model on your computer via Ollama. Nothing leaves your machine, no request limits, no login. Trade-off: ~6.6 GB of disk and slower first response. The setup wizard handles install, download, and verification. Once on, every Companion feature works exactly as it does with cloud AI - because both providers implement the same internal interface.
What is "Local AI"?
The Companion (the floating avatar that talks to you, reacts to your screen, and can trigger effects) has two possible brains:
- Cloud AI - the default. Your chat lines go to a small proxy server we run, which forwards them to a hosted language model. Comes with daily limits (100/day for free, 1000/day for supporters), no install required.
- Local AI - you run a language model on your own computer using Ollama. Nothing leaves your machine. No request limits. No subscription required. You pay in disk space, RAM, and (optionally) GPU cycles instead.
Both brains implement the same internal interface (IAiService) so every feature that talks to the Companion - chat, screen awareness, video reactions, lock screen, keyword catches - works identically with either backend. You can switch between them in Companion → AI at any time without restarting the app.
Why local instead of cloud?
You'd choose local AI if any of these matter to you:
- Privacy. Chat lines, screen context, persona - none of it travels over the internet. The model lives in
%LOCALAPPDATA%\Programs\Ollamaand runs onhttp://localhost:11434. There is no AI-related network egress from the app once it's set up. - No daily limits. The cloud proxy caps requests to keep costs sane. Local AI is bounded only by how fast your hardware can run the model.
- No login. Cloud AI needs an account; local AI does not.
- Customization. Want a 70B model? A roleplay-tuned fine-tune? A totally different persona? Just
ollama pulla different tag and point the app at it. - Offline use. Once installed, local AI works with the network unplugged.
The tradeoffs are real:
- Disk. The default
qwen3.5:latestmodel is about 6.6 GB. Bigger models can be 20-40 GB. - First response is slow. Cold-start (model loading from disk into RAM or VRAM) is ~30-60 seconds for an 8B-class model on CPU. The app warms the model on startup to hide this, but the very first request after a fresh install still takes a moment.
- Response time depends on hardware. On a GPU, replies feel snappy (1-3 seconds). On a CPU-only laptop, a chat reply can take 10-20 seconds. Reasoning models (qwen3, deepseek-r1) are slower still, which is why the app sends
think:falseto disable their internal reasoning phase by default.
Architecture at a glance
Everything in the Companion's brain lives under Services/AIService/. The shape:
AiServiceStrategy is what the rest of the app talks to. It checks CompanionPrompt.UseLocalAi on every call and lazily constructs whichever provider is active. Switching providers at runtime is free - no restart, no re-init.
The cloud provider is stateless: every request includes the full system prompt and the user line; the proxy holds no state. The local provider holds a persistent chat history in memory and on disk so the Companion can remember you across sessions.
The setup wizard, step by step
Opening Companion → Use Local AI for the first time launches the Local AI Setup Wizard. It's a single window that walks through every step needed to go from zero to a working local model.
1 Detect
Before showing anything, the wizard probes the machine: looks for ollama.exe under %LOCALAPPDATA%\Programs\Ollama\ and calls GET /api/tags on localhost:11434 with a 4-second timeout. Based on what it finds, it picks one of four entry points:
| State | What it means | What happens next |
|---|---|---|
| Ready | Ollama is running, target model is pulled | Skip to smoke test |
| RunningNoModel | Ollama is running but the target model isn't there | Skip to pull |
| InstalledNotRunning | Ollama is installed but the HTTP server isn't up | Start ollama serve headlessly, then continue |
| NotInstalled | No Ollama at all | Show the consent screen |
2 Consent
If Ollama isn't installed, the wizard asks for explicit consent before touching disk. It tells you Ollama is about to be downloaded from the official installer URL, that the default model is about 6.6 GB, and that you can change the model under "Advanced." There's also a "manual install" link that opens ollama.com/download for users who'd rather install it themselves.
3 Download installer
The wizard streams OllamaSetup.exe from ollama.com/download/OllamaSetup.exe to your %TEMP% folder. Progress is reported every 200 ms with a live rate (e.g. 240.5 MB / 700.0 MB (34%) · 18.3 MB/s). If you cancel mid-download, the partial file is cleaned up.
4 Silent install
The downloaded installer is launched with the NSIS silent flag (/S). No installer UI, no progress bar from Ollama itself - just a hidden process that puts Ollama under %LOCALAPPDATA%\Programs\Ollama\. After the process exits with code 0, the wizard polls /api/tags for up to 60 seconds waiting for the service to come up.
Two safety choices: the cancel button is disabled during install (Ollama's NSIS installer doesn't roll back cleanly if interrupted), and if the post-install auto-start doesn't bind port 11434 within 60 seconds, the wizard spawns ollama.exe serve itself with a hidden window. It deliberately avoids ollama app.exe because that's the GUI chat client in newer Ollama versions and would pop a window.
The wizard tracks any ollama serve process it spawns and terminates it on app exit. Servers started by the Ollama tray app or the installer's own auto-start are left alone.
On success, the OllamaSetup.exe in %TEMP% is deleted. On failure it's intentionally kept so you (or a retry) can inspect it without re-downloading 700 MB.
5 Pull the model
The wizard streams POST /api/pull with stream:true. Ollama sends back NDJSON: one JSON object per line, one per layer of the model file. Each line includes a status, a digest, and completed/total byte counts so the wizard can show real progress per layer.
- HTTP client uses an infinite timeout - a 6.6 GB pull can exceed any reasonable per-request limit, and the NDJSON output is the heartbeat.
- Ollama caches partial layers, so if you cancel and re-run, the pull picks up where it left off.
- Errors come back as
{"error":"..."}in the stream (usually for unknown model names) and the wizard surfaces them verbatim.
6 Smoke test
One tiny request to confirm the wiring:
If a message.content comes back, the wizard records the elapsed time and declares success. This both warms the model into RAM (so your first real chat is fast) and proves end-to-end that everything's wired up.
7 Done
On success, the wizard writes two settings and saves:
From this point on, every Companion call routes through the local provider. The error screen has a Retry button that re-runs detection - the right next step after a failure depends entirely on what state Ollama is now in.
The request lifecycle
What happens when you type a message into the Companion chat and hit enter, assuming local AI is selected. (Awareness reactions, video-done hints, lock-screen comments, and keyword catches all follow the same path with different system prompts and inputs.)
Concurrency control
A semaphore guarantees one in-flight request at a time. Behavior depends on who's asking:
- User-triggered request while busy - drops the new call but returns a random "still thinking" phrase (e.g. "Bambi's thinking real hard right now...") so you don't get silence. Mods can supply their own thinking phrases.
- Automated request while busy (awareness, video-done, etc.) - drops silently. Better to skip a passive reaction than queue them up and have the Companion fire stale comments seconds later.
Prompt freshness
The system message at index 0 is rebuilt on every call. Changes to your persona, knowledge base, mods, or content mode take effect immediately - no need to restart or clear history.
History rollback on failure
If the request errors or returns empty content, the just-appended user turn is popped off so it doesn't poison future requests with an unanswered turn.
The think:false flag
Reasoning models (qwen3, deepseek-r1, and their relatives) have an internal "thinking" phase where they output long chains of reasoning before the actual answer. For roleplay chat this adds 30-50 seconds of latency for no benefit. think:false cuts that out. Non-reasoning models ignore the flag.
How the persona is built
The system prompt sent to the model is assembled by BambiSprite from several layers. This is shared between the cloud and local providers - both end up sending the same system prompt structure. From outer to inner:
- Persona block. The "Bad Influence Bestie" character description: tone (casual texting), topics (makeup, pink things, empty heads), role (tempt the user into being blank). If Slut Mode is on and the current personality preset defines a Slut Mode variant, that variant replaces the default - same character, spicier vibe.
- Explicit reaction rules. How the Companion reacts when the user mentions explicit topics: flustered redirect rather than full roleplay. Can be overridden per-personality.
- Knowledge base. Lists of audio playlists and videos the Companion is allowed to recommend, with strict instructions to use exact titles (otherwise the app can't auto-link them). For BambiCloud playlists, the AI is told to wrap titles in markdown link syntax with the exact URL.
- Global knowledge base links. Anything the user added to the Knowledge Base Links list - extra videos, custom content packs, the user's own files.
- HypnoTube link pool. If the user configured their own pool, those video names are appended. Names are resolved against the known-links map so the auto-linker can wrap them as clickable URLs.
- Screen awareness rules. How to react to different categories (work, social, shopping, streaming, hypno content, idle). The Companion sees context as
[Category: X | App: Y | Title: Z | Duration: N]and is expected to react appropriately. - Output rules. Length cap (typically ~15 words), emoji cap (1 per message), no bracket tags in the visible reply.
- Quiz context (if you've taken the in-app quiz). The Companion sees your archetype and a short profile snippet, with instructions to reference it naturally ~20% of the time.
- Mod-aware substitutions. If you're using a mod that renames the user ("Bambi" -> "Unit" for Drone mod, or your chosen term for Sissy Hypno mode), the entire prompt is run through a substitution pass.
Every layer can be customized independently. You can write a totally different persona while keeping the knowledge base intact, or vice versa.
The enrichment block and structured output
When "Allow AI to control effects" is on, the local provider inserts an extra context message right after the system prompt. This is the enrichment block. It's sent as a user-role message but clearly marked [CONTEXT BLOCK - NOT DIALOGUE] so the model treats it as operating instructions rather than something to reply to.
Forces structured JSON output
The block explicitly tells the model that any earlier persona instruction saying "no brackets" or "respond only with text" is overridden by this format. Many community personality presets include strict "no JSON, no tags, just text" rules, which would otherwise conflict with the effect-emission format. The override resolves the conflict in favor of the structured output, and the response field carries the plain-text reply the user actually sees.
Tells the model when to fire effects
The block lists supported commands and gives concrete examples of phrases that should trigger them:
| User says | Effect to emit |
|---|---|
| "flash me" / "make me see flashes" | flash_image |
| "spawn bubbles" / "start bubbles" | bubbles (on) |
| "stop bubbles" | bubbles (off) |
| "subliminal X" / "flash the word X" | subliminal |
| "spiral" / "show me a spiral" | spiral |
| "pink filter" / "make my screen pink" | pink |
| "lock card" / "lock me with the mantra X" | mantra_lockscreen |
| "vibrate" / "buzz me" / "haptic" | haptic |
| "play X" | audio or video |
Crucially, the block also says: when the user is just chatting, leave effects empty. Don't fire unprovoked. Combined with the per-effect permission gates, this is what keeps the AI from spam-firing flashes at you during normal conversation.
Provides live context
Two final blocks appear in the enrichment:
The timestamp gives the model a sense of "now." The data block is the contents of assets/knowledge.json - a flat list of static facts the Companion is allowed to know (terminology, names, lore). If the file is missing, the data block is just [].
Sets reply etiquette
- Keep
responseshort (the persona's word limit still applies). - Don't echo the user's request word-for-word.
- When you DO trigger an effect, briefly acknowledge it ("Flashing for you, hot stuff~").
- Don't trigger video unprompted - videos are disruptive.
Effects off? No enrichment block
When the master "Allow AI to control effects" toggle is off, the entire enrichment block is removed from the conversation. The model goes back to producing plain-text replies, no JSON wrapping. The parser falls back to treating any incidental JSON as garbage and stripping it out.
Effect commands: letting the AI control the app
The Companion can trigger 11 distinct effect types (plus a none no-op the parser ignores):
flash_image
Flash random images on-screen.
bubbles
Start/stop the bubble-popping minigame.
subliminal
Show subliminal text.
mantra_lockscreen
Make the user chant a mantra.
spiral
Spinning spiral overlay.
pink
Pink color overlay.
bounce
Bouncing text overlay.
haptic
Vibrate a connected toy.
audio
Play an audio file.
video
Play a video file.
getbacktome
Schedule a follow-up after a delay.
Tolerant parsing
The parsing pipeline is intentionally tolerant. Local models love to wrap JSON in markdown fences, mix prose and JSON, leave trailing commas, or close braces incorrectly:
- If the response is wrapped in a
```json ... ```fence, the fence is stripped. - If the response is pure JSON with a
responsefield, it's parsed directly. - Otherwise the parser scans the text for
{...}blocks, tries each, replaces any with aresponsefield by their content (so JSON becomes prose), and collects anyeffectsarrays. - A repair pass handles trailing commas, mismatched braces, and unquoted keys before parsing.
- A sanitizer strips any leftover
[Category: ...]or[Mode/Tag]tags the model copied from the input.
Dispatch path
Each parsed command goes through AiCommandService.ExecuteCommand:
- Validate against settings (master toggle + per-effect gate).
- Enforce a per-response cap (max 3 commands per AI reply).
- Append a human-readable line to the AI Brain → Live actions feed on the Companion tab.
- Build and run the concrete command via the command factory.
The 3-commands-per-reply cap is hard. If the model emits five flash effects in one response (which happens with some models), only the first three execute.
Safety: permissions, caps, and the master toggle
The defaults are conservative. Even after you turn on local AI, the Companion can't fire effects until you explicitly enable them.
AllowAiToControlEffectsOFFMaster toggle. When off, no effect fires regardless of per-effect settings, and the enrichment block isn't even sent.AllowAiBubblesONVisual, passive.AllowAiSubliminalONVisual, passive.AllowAiBounceONVisual, passive.AllowAiFlashOFFIntrusive.AllowAiVideoOFFDisruptive.AllowAiAudioOFFDisruptive.AllowAiOverlayOFFCovers spiral + pink.AllowAiLockCardOFFIntrusive.AllowAiHapticOFFHardware. Plus a MaxAiHapticIntensity ceiling (default 0.6) regardless of AI-emitted value.AllowAiGetBackToMeOFFRecursive (schedules another AI call).The dispatcher checks, in order: master toggle, per-effect toggle, batch cap (3 per reply), and finally per-command field clamps applied at execution time. This three-layer defense means even a misbehaving or jailbroken model can't do something destructive - at worst it spams logs with rejected commands.
Persistent chat memory
The local provider remembers your conversation across app launches. One of the key differences from the cloud provider, which is stateless by design.
- After every successful exchange, the user/assistant pair is appended to in-memory history.
- An async write fires to flush the dialogue to
%APPDATA%\ConditioningControlPanel\local_chat_history.json. Disk I/O is off the response path, so chat latency is unaffected. - On next launch, the history is read back. The system prompt and enrichment block are NOT persisted - they're rebuilt fresh on every call so prompt edits take effect immediately.
The persisted file is capped at 50 pairs (100 messages). When the cap is exceeded, the oldest pairs are dropped first. Keeps the file under ~200 KB in practice and bounds the context the model has to chew through.
You can turn this off in Companion → AI by unchecking "Remember chat between sessions" - that flips ChatMemoryEnabled to false and the provider stops both reading and writing the file. Clearing memory is also available; it deletes both the in-memory history and the on-disk file.
Warm-up, lifecycle, and shutdown
Warm-up on startup
At app startup, right after the AI strategy is constructed, a fire-and-forget warm-up sends POST /api/generate with the configured model and keep_alive=30m and an empty body. Ollama interprets this as "load the model into memory but don't generate anything." The keep_alive value asks the model to stay resident longer than the default 5 minutes - without this, the model would unload after 5 minutes of inactivity and the next chat would pay the cold-start cost again.
Warm-up is silent on failure. If Ollama isn't running yet, it just logs and moves on - the next real chat will surface a clear error.
Shutdown
If the wizard spawned ollama serve itself (because the post-install auto-start didn't fire), that process is tracked. On app exit, only that process is killed. Servers started by the Ollama tray app or the installer's own auto-start are left running - they belong to the user, not to us.
Host changes
If you change the Ollama host while the app is running (say, to point at a remote machine), the host check on every request notices the change, disposes the old HTTP client, and rebuilds one against the new base address. No restart needed.
Error handling and fallbacks
Local model failures look very different from cloud failures, so the local provider has dedicated error-to-text mapping:
| Symptom | What you see | What it means |
|---|---|---|
| Connection refused | Can't reach Ollama at ... - looks like it isn't running. Start Ollama, or install it from ollama.com | HTTP server isn't bound. Ollama crashed or never started. |
| DNS failure | Can't reach Ollama host ... - check the host setting in Companion → AI | Wrong host name, almost always a typo in a remote-host config. |
| Timeout | Ollama took too long to respond. The first request after launch can take ~30-60s as the model loads - try once more. | Model was cold and didn't finish loading inside the 5-minute client timeout, or you picked a huge model. |
| 404 / model not found | Ollama: model 'X' not found - check 'ollama list' or pull it | Settings point at a tag you don't have pulled. |
| Generic HTTP error | Ollama HTTP NNN: ... | Surfaces the structured error field from Ollama if present. |
If a request returns 200 but with empty content (rare, seen with some models on heavy load), the user gets the mode-appropriate fallback line and the user turn is rolled back from history.
Picking a model
The default is qwen3.5:latest. It's a good fit because:
- ~6.6 GB (fits in most consumer setups).
- Reasoning model - so it can follow the structured-output instructions reliably - but we send
think:falseto skip the slow reasoning phase during chat. - Strong on instruction-following and JSON output, which matters for the effect-command flow.
That said, the provider is model-agnostic. Anything you can pull through Ollama and chat with via /api/chat should work. To switch:
- Pull the new tag manually:
ollama pull mistral-nemo:latest. - Open Companion → AI and either re-run the setup wizard with an advanced model name, or edit the value in settings directly.
- The strategy notices the change on the next chat - no restart needed.
Rough guidance
| Tier | Examples | Notes |
|---|---|---|
| 3B-8B params | qwen3.5, llama3.1:8b, mistral-nemo, gemma2:9b | ~5-8 GB on disk. Best chat latency on consumer hardware. Start here. |
| 13B-22B params | mistral-small, llama3.1:13b | ~10-14 GB. Noticeably better prose, much slower without a GPU. |
| 30B+ | large mixture-of-experts models | Real GPU with 24+ GB VRAM strongly recommended. Brutal warm-up. |
| Reasoning | qwen3, deepseek-r1 | Work fine - we send think:false to keep latency reasonable. |
| Uncensored | dolphin, hermes, abliterated | Useful if you find the default too prudish about explicit roleplay. |
List what's installed with ollama list, or visit http://localhost:11434/api/tags in a browser.
Privacy
Once local AI is set up, the only AI-related network traffic from the app is:
- Ollama's own model downloads (only when you run
ollama pullor use the wizard's pull step) - go directly to Ollama's CDN. - The Ollama installer download (once, from
ollama.com) - only during the wizard's install step.
After that, every chat request goes to http://localhost:11434. The model itself runs entirely on your machine. The Companion's chat history is stored in %APPDATA%\ConditioningControlPanel\local_chat_history.json in plain JSON - readable by anything that can open a text file. If that matters, turn off "Remember chat between sessions" or use full-disk encryption.
The cloud provider, by contrast, sends each chat line + your system prompt + your screen-awareness context to the proxy, which forwards to a hosted model. We log request counts and basic auth state but do not log chat content. See the privacy policy for the full breakdown.
Troubleshooting
"Can't reach Ollama at http://localhost:11434/"
Ollama isn't running. Start the Ollama tray app (Start menu → Ollama), or run ollama serve from a terminal. Open http://localhost:11434 in a browser - you should see "Ollama is running." If you don't, Ollama isn't actually up. If you see "address already in use," check Windows firewall.
"Ollama took too long to respond"
The model loaded for the first time and exceeded the 5-minute timeout. Wait and try again, or pick a smaller model. ollama ps from a terminal shows what's loaded.
"model 'X' not found"
The tag in your settings isn't pulled. ollama pull X from a terminal, then try again. Or re-run the setup wizard.
The Companion replies with JSON or curly braces
You're seeing raw model output the parser couldn't clean. Usually means the model isn't producing the expected {response, effects} shape - try a different model. Small models (~1B) sometimes can't follow structured-output instructions reliably. Or you have a custom personality preset that aggressively forbids structured output; the enrichment block is supposed to override this but some models miss it. Either edit the preset or turn off "Allow AI to control effects" entirely.
Effects fire even though I told it not to
Check three places: the master toggle, the per-effect toggles, and the "Live actions" feed on the Companion tab (which shows what actually fired in the last 30 actions). If you see commands in the feed for effects you've disabled, file an issue with a copy of your logs/crash.log.
The AI is repetitive / boring
Chat history is the usual culprit. Try "Clear chat memory" from the Companion tab to wipe both in-memory and on-disk history. If a long conversation has driven the model into a rut, a fresh start often helps.
Effects feel laggy
Local model latency is real. A chat reply that triggers a bubble effect on a CPU-only laptop takes the chat latency (5-15s) plus the effect dispatch (~50 ms). On a GPU, the chat call drops to 1-3 seconds and the lag becomes imperceptible. If you have a CUDA-capable GPU and Ollama isn't using it, check ollama ps - if the model is "100% CPU," Ollama hasn't detected the GPU. Reinstall Ollama with NVIDIA drivers up to date.
Advanced: pointing at a remote Ollama
The Ollama host setting accepts any URL. If you have a beefier machine on your LAN (or a remote server you trust), point the app at its Ollama instead:
- On the server: start Ollama with
OLLAMA_HOST=0.0.0.0so it binds to all interfaces. By default Ollama only listens on localhost. - Make sure the model you want is pulled on that machine.
- In the app: edit Companion → AI → Ollama Host to
http://your-server:11434/. - The strategy notices the change on the next request, rebuilds its HTTP client, and you're done.
Caveats
Ollama has no authentication - don't expose it to the public internet without a reverse proxy and auth in front of it. Network latency is added to every chat call (negligible on a LAN, dominant over WAN). The default 5-minute client timeout still applies; very slow remotes may need a smaller model or a closer server.
Glossary
| Ollama | Local-model runner from ollama.com. Installs as a background HTTP server (port 11434), pulls models from a registry, and serves them via a chat-completion API. |
| Cloud AI / proxy | Our hosted service that forwards requests to a hosted model. The default option; needs a free account. |
| Local AI | Ollama running on your machine, used as a drop-in replacement for the cloud proxy. |
| Model / tag | A specific weight file Ollama can serve, named like qwen3.5:latest or mistral-nemo:12b-instruct. |
| System prompt | The character and rule description sent to the model at the start of every conversation. Built by BambiSprite. |
| Enrichment block | Extra context message inserted between the system prompt and the dialogue, telling the model to output structured JSON. Only present when "Allow AI to control effects" is on. |
| Effect command | JSON object the AI can emit in the effects array to trigger app features (flash, bubbles, haptic, etc.). |
| Master toggle | AllowAiToControlEffects. The single switch that controls whether the AI can trigger any effect at all. |
| Warm-up | Loading the model into RAM/VRAM ahead of time so the first chat doesn't pay the cold-start cost. Done with an empty /api/generate call at app startup. |
| Persistent chat history | The local_chat_history.json file in %APPDATA%\ConditioningControlPanel\. Caps at 50 user/assistant pairs. Local provider only. |
Questions, suggestions, or "this section is wrong" reports - open an issue at CC-Labs-llc/ccp-bugs or ping in the Discord. The integration shipped in v5.8.4 and the docs will evolve with it.