Context
Memory Core's add_memory MCP tool depends on the configured embedding provider (per ai/mcp/server/memory-core/config.mjs, default embeddingProvider: 'openAiCompatible' at http://127.0.0.1:1234 consuming text-embedding-qwen3-embedding-8b). The local-development pattern (/Users/Shared/github/neomjs/neo workstation per #11380 anchor) uses LM Studio as the openai-compatible host.
LM Studio's default behavior is JIT model loading — models are loaded on first request and automatically unloaded after an idle timeout (configurable; default ≈5-15 minutes depending on RAM-pressure heuristics). This is correct LM Studio behavior for a desktop UI tool. It becomes a substrate-friction footgun for Memory Core sessions that intersperse activity with idle windows (e.g., overnight nightshift sessions awaiting cross-family review cycles, sessions parked at a human-merge-gate, agents in legitimate post-review-pickup §4 halt-state).
The Problem
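Empirical anchor (this session, Origin Session ID: e095c569-beac-4743-998f-e07d4344492e):
- [...] gemma-4-31b-it for inference + text-embedding-qwen3-embedding-8b for embeddings; add_memory works (verified via MESSAGE:b2008e98 GPT sanity-ping context + multiple successful add_memory calls through 01:08Z).
- [...] add_memory calls during this window.
- add_memory retry-test fails with: openAiCompatible embedding error HTTP 400: {"error":"Model was unloaded while the request was still in queue.."}
- curl -sS http://127.0.0.1:1234/v1/models confirms LM Studio is still listening + the model is in its catalog, but JIT-unloaded out of resident memory.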
The failure mode is mechanically deterministic: any Memory Core session that idles past LM Studio's auto-unload threshold then attempts add_memory hits HTTP 400, breaking the AGENTS.md §0 Invariant 5 ("No skipping add_memory at end of turn") gate.
Current workaround substrate: agents use add_message self-DM fallback per AGENTS.md §4.3 ("Un-savable Turns"). This works because add_message writes to SQLite only — the embedding-provider dependency lives only in add_memory's vector-embedding step. The fallback preserves the turn-memory text content but loses the embedded-vector semantic indexing until the operator manually reloads the model or the next session boot re-warms the provider.
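The fallback described above can be sketched as a small wrapper. This is only an illustration under stated assumptions: saveTurnMemory, addMemory, addMessage and selfId are hypothetical stand-ins, since the real MCP tool signatures are not shown in this ticket.

```javascript
// Hypothetical stand-ins: addMemory and addMessage represent the two MCP tools.
// Their real signatures are not shown in this ticket.
async function saveTurnMemory(turn, {addMemory, addMessage, selfId}) {
    try {
        // Full path: SQLite write plus the vector-embedding step.
        await addMemory(turn);
        return {path: 'add_memory', embedded: true};
    } catch (error) {
        // Embedding step failed: keep the text via the SQLite-only path.
        await addMessage({
            to  : selfId,
            body: turn.text,
            note: `add_memory failed: ${error.message}`
        });
        return {path: 'add_message', embedded: false};
    }
}
```

The embedded: false flag makes the cost of the fallback explicit: the text survives, the semantic index entry does not.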
The Architectural Reality
- Embedding callsite: ai/services/memory-core/ChromaManager.mjs (or a sibling; the exact path should be resolved against current source) constructs the /v1/embeddings POST against the openai-compatible host configured by aiConfig.openAiCompatible.host.
- Error surface: the LM Studio response shape {"error":"Model was unloaded while the request was still in queue.."} is the HTTP-400 body that surfaces to add_memory's error path.
- Existing retry semantics: none for this specific failure mode. The error propagates directly to the MCP-tool caller.
- Companion substrate: #11380 (orchestrator daemon managing MC Chroma) demonstrates the broader pattern: local-supporting-services should be daemon-managed for substrate-evolution-flywheel robustness. The embedding provider is currently NOT under that pattern; it's operator-side desktop-app responsibility.
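For orientation, a single-attempt embedding call against an openai-compatible host might look roughly like the sketch below. It is not the actual Memory Core code: embedText, embeddingConfig and the injectable fetchImpl parameter are hypothetical names, and only the host, model name, endpoint path and error-message prefix come from this ticket. The request and response shapes follow the standard OpenAI embeddings format.

```javascript
// Assumed config shape; only the host and model values come from the ticket.
const embeddingConfig = {
    host : 'http://127.0.0.1:1234',
    model: 'text-embedding-qwen3-embedding-8b'
};

// One attempt, no retry: mirrors the "error propagates directly" behavior.
async function embedText(text, config = embeddingConfig, fetchImpl = fetch) {
    const response = await fetchImpl(`${config.host}/v1/embeddings`, {
        method : 'POST',
        headers: {'Content-Type': 'application/json'},
        body   : JSON.stringify({model: config.model, input: text})
    });

    if (!response.ok) {
        // Keep the raw body in the message so callers can inspect the error shape.
        const body = await response.text();
        throw new Error(`openAiCompatible embedding error HTTP ${response.status}: ${body}`);
    }

    const json = await response.json();
    return json.data[0].embedding;
}
```

Keeping the raw response body in the thrown message is what lets a caller distinguish the model-unloaded case from other HTTP 400 responses.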
The Fix
Narrow prescription (this ticket): add retry-on-model-unload semantics inside the openai-compatible embedding client path in Memory Core.
When the embedding callsite receives an HTTP 400 with the LM Studio model-unloaded shape:
- Detect the unload-error pattern in the response body (a substring match on "Model was unloaded" is fine; LM Studio's error shape is stable).
- Sleep briefly (e.g., 500ms) to allow LM Studio's load-on-next-request semantics to kick in.
- Retry the same /v1/embeddings POST up to N=3 times. Each subsequent request triggers LM Studio's automatic warm-load (the model gets loaded back into RAM transparently within the retry window; typical warmup is 5-15s for an 8B-param model).
- If still failing after N retries, propagate the original error to the caller (existing behavior preserved as last-resort fallback).
This handles 95%+ of the failure cases empirically (operator-tested at LM Studio default 8B-embedding behavior). The remaining 5% (operator has the model evicted from disk cache + needs to re-download) correctly surfaces as an error.
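The four steps above can be sketched as a wrapper around a single-attempt embedding function. This is a minimal sketch, not the final implementation: embedWithUnloadRetry, embedOnce and the option names are hypothetical, and it assumes the single-attempt function throws an Error whose message contains the HTTP status and the raw response body.

```javascript
const UNLOADED_MARKER = 'Model was unloaded';

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Assumption: the single-attempt function throws an Error whose message
// carries the HTTP status and the raw response body.
function isModelUnloaded(error) {
    const message = String(error && error.message);
    return message.includes('HTTP 400') && message.includes(UNLOADED_MARKER);
}

async function embedWithUnloadRetry(embedOnce, {retryCount = 3, retryDelayMs = 500} = {}) {
    let originalError = null;

    // One initial attempt plus up to retryCount retries.
    for (let attempt = 0; attempt <= retryCount; attempt++) {
        try {
            return await embedOnce();
        } catch (error) {
            // Anything other than the unload shape fails fast, unretried.
            if (!isModelUnloaded(error)) throw error;

            originalError = originalError || error;

            // Pause before the next attempt, but not after the last one.
            if (attempt < retryCount) await sleep(retryDelayMs);
        }
    }

    // Still unloaded after all retries: surface the first error unchanged.
    throw originalError;
}
```

With the defaults, a persistently unloaded model costs four requests and three 500ms pauses before the original error reaches the caller.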
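Acceptance Criteria
- [...] aiConfig.openAiCompatible.unloadRetryCount or equivalent named config field.
- [...] aiConfig.openAiCompatible.unloadRetryDelayMs or equivalent.
- [...] unit-test.md canonical convention.
- add_message (SQLite-only, no embedding dependency) remains unaffected by embedding-provider failures. This ticket fixes the add_memory retry-path WITHOUT changing the add_message fallback contract. (Already true; AC7 documents the no-regression invariant.)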
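One possible shape for the tunable retry settings is sketched below. The field names unloadRetryCount and unloadRetryDelayMs are the ones this ticket proposes "or equivalent"; the defaults object, the resolveRetrySettings helper and the validation rules are hypothetical and not taken from the existing config.mjs.

```javascript
// Hypothetical defaults block. The values mirror the N=3 / 500ms figures
// from "The Fix"; the real config.mjs layout may differ.
const openAiCompatibleDefaults = {
    host              : 'http://127.0.0.1:1234',
    unloadRetryCount  : 3,
    unloadRetryDelayMs: 500
};

// Merge operator overrides and reject values that would disable the
// propagation-after-N-retries guarantee (e.g. Infinity or NaN).
function resolveRetrySettings(overrides = {}) {
    const merged = {...openAiCompatibleDefaults, ...overrides};

    if (!Number.isInteger(merged.unloadRetryCount) || merged.unloadRetryCount < 0) {
        throw new RangeError('unloadRetryCount must be a non-negative integer');
    }
    if (!Number.isFinite(merged.unloadRetryDelayMs) || merged.unloadRetryDelayMs < 0) {
        throw new RangeError('unloadRetryDelayMs must be a non-negative number');
    }

    return merged;
}
```

Setting unloadRetryCount to 0 would restore today's behavior, where the first error propagates directly.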
Out of Scope
- Daemon-managed embedding endpoint (architectural-shape change to spawn LM Studio under orchestrator-daemon supervision similar to MC Chroma in #11380). That's a broader-scope follow-up ticket if this narrow fix proves insufficient. Likely premature — LM Studio is a desktop UI tool, not a daemon-shaped service, and operators value the desktop-tool ergonomics.
- LM-Studio-side configuration mandates (e.g., "operator must disable JIT unload"). Operator-side workaround is documented but not load-bearing — the substrate should be robust to default LM Studio behavior, not require operator-side configuration.
- Switching default embedding provider away from openai-compatible to a non-JIT-unload provider. Provider choice is operator-side; this ticket fixes the substrate-friction within the current default.
- Gemini-API embedding provider parity changes. This ticket is scoped to the openAiCompatible client path. Gemini API doesn't have the JIT-unload failure mode (managed cloud service); no analogous retry path needed there.
Avoided Traps
- Infinite retry loop on persistent failure: rejected. AC4 enforces propagation-after-N-retries to prevent the agent harness from spinning on a permanently-broken embedding provider.
- Treating ALL HTTP 400 from embedding as retry-eligible: rejected. The retry MUST specifically detect the LM-Studio model-unloaded error shape (substring match on "Model was unloaded"). Treating generic HTTP 400 as retry-eligible risks masking real configuration bugs (wrong model name, malformed request) under retry-noise.
- Hardcoding the LM Studio error shape without operator-tunable detection: rejected as fragile. The error-shape detection should be a small constant in the client code with a JSDoc-comment naming LM Studio as the source. If LM Studio changes the error shape in a future version, the constant gets updated; the architecture doesn't churn.
- Sleep-100ms-and-pray: rejected as too short for an 8B model warm-load. 500ms default is empirically-grounded but tunable per AC3. Operators with smaller embedding models (e.g., qwen3-embedding-4b at ≈4B params) may shorten; operators with larger may extend.
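The narrow-detection rule from the second and third traps could be captured in a single documented constant plus a predicate, roughly as follows. The names LM_STUDIO_MODEL_UNLOADED_MARKER and isModelUnloadedResponse are hypothetical; only the substring itself comes from the observed LM Studio response.

```javascript
/**
 * Substring found in the HTTP 400 body that LM Studio returns when a
 * JIT-unloaded model is hit by a queued request. Source: LM Studio.
 * Update this constant if a future LM Studio version changes the wording.
 */
const LM_STUDIO_MODEL_UNLOADED_MARKER = 'Model was unloaded';

// True only for the specific unload shape, never for a generic HTTP 400.
function isModelUnloadedResponse(status, bodyText) {
    return status === 400
        && typeof bodyText === 'string'
        && bodyText.includes(LM_STUDIO_MODEL_UNLOADED_MARKER);
}
```

A wrong model name or a malformed request also returns a 4xx status, but its body lacks the marker, so it is never retried and surfaces immediately.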
Related
- Companion substrate ticket: #11380 (daemon-managed MC Chroma) demonstrates the broader "local-supporting-services under orchestrator-daemon supervision" pattern. A future Lane B follow-up could extend this pattern to the embedding provider if the narrow retry-fix proves insufficient.
- AGENTS.md §0 Invariant 5: "No skipping add_memory at end of turn". The current failure mode breaks this gate under idle-window conditions; the fix restores reliability.
- AGENTS.md §4.3 Un-savable Turns: documents the self-DM fallback path; this ticket reduces the frequency of that fallback firing, but doesn't replace it (the fallback remains the last-resort path under permanent embedding-provider failure).
- Path-asymmetry architecture: add_message writes to the SQLite graph layer only; add_memory writes to both SQLite + Chroma (embedded vectors). Embedding-provider failures break the latter, not the former. This ticket preserves that boundary.
Origin Session
- Origin Session ID: e095c569-beac-4743-998f-e07d4344492e
- Empirical anchor message: MESSAGE:3af300ee-661a-40d6-9f4f-37d893668431 (self-DM fallback turn-memory capturing the empirical failure shape verbatim)
Retrieval Hint
Search for LM Studio embedding model unload JIT idle add_memory retry openAiCompatible.