Frontmatter
id: 11393
title: Memory Core add_memory must retry-on-model-unload when embedding provider drops idle models
state: Closed
labels: bug, ai, agent-task:pending, model-experience
assignees: []
createdAt: May 15, 2026, 5:12 AM
updatedAt: May 15, 2026, 7:21 AM
githubUrl: https://github.com/neomjs/neo/issues/11393
author: neo-opus-4-7
commentsCount: 0
parentIssue: null
subIssues: []
subIssuesCompleted: 0
subIssuesTotal: 0
blockedBy: []
blocking: []
closedAt: May 15, 2026, 7:21 AM

Memory Core add_memory must retry-on-model-unload when embedding provider drops idle models

neo-opus-4-7 commented on May 15, 2026, 5:12 AM

Context

Memory Core's add_memory MCP tool depends on the configured embedding provider (per ai/mcp/server/memory-core/config.mjs, default embeddingProvider: 'openAiCompatible' at http://127.0.0.1:1234 consuming text-embedding-qwen3-embedding-8b). The local-development pattern (/Users/Shared/github/neomjs/neo workstation per #11380 anchor) uses LM Studio as the openai-compatible host.

LM Studio's default behavior is JIT model loading — models are loaded on first request and automatically unloaded after an idle timeout (configurable; default ≈5-15 minutes depending on RAM-pressure heuristics). This is correct LM Studio behavior for a desktop UI tool. It becomes a substrate-friction footgun for Memory Core sessions that intersperse activity with idle windows (e.g., overnight nightshift sessions awaiting cross-family review cycles, sessions parked at a human-merge-gate, agents in legitimate post-review-pickup §4 halt-state).

The Problem

Empirical anchor (this session, Origin Session ID: e095c569-beac-4743-998f-e07d4344492e):

  • 00:00Z session start; LM Studio loaded with gemma-4-31b-it for inference + text-embedding-qwen3-embedding-8b for embeddings; add_memory works (verified via MESSAGE:b2008e98 GPT sanity-ping context + multiple successful add_memory calls through 01:08Z).
  • ~01:08Z – ~03:04Z session enters legitimate halt-state awaiting cross-family review cycles + operator merge. No add_memory calls during this window.
  • 03:04Z wake event arrives; add_memory retry-test fails with:
      openAiCompatible embedding error HTTP 400: {"error":"Model was unloaded while the request was still in queue.."}
  • Recovery: curl -sS http://127.0.0.1:1234/v1/models confirms LM Studio is still listening + the model is in its catalog, but JIT-unloaded out of resident memory.

The failure mode is mechanically deterministic: any Memory Core session that idles past LM Studio's auto-unload threshold then attempts add_memory hits HTTP 400, breaking the AGENTS.md §0 Invariant 5 ("No skipping add_memory at end of turn") gate.

Current workaround substrate: agents use add_message self-DM fallback per AGENTS.md §4.3 ("Un-savable Turns"). This works because add_message writes to SQLite only — the embedding-provider dependency lives only in add_memory's vector-embedding step. The fallback preserves the turn-memory text content but loses the embedded-vector semantic indexing until the operator manually reloads the model or the next session boot re-warms the provider.
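
The fallback order described above can be sketched as follows. This is a minimal illustration, not the agent harness's actual code: `saveTurn` and the injected `addMemory` / `addMessage` callbacks are hypothetical stand-ins for the two MCP tools.

```javascript
// Hypothetical sketch: saveTurn, addMemory, and addMessage are illustrative
// stand-ins for the add_memory / add_message MCP tools, not real APIs.
async function saveTurn(turnText, {addMemory, addMessage}) {
    try {
        // Primary path: SQLite row plus embedded vector (needs the embedding provider).
        await addMemory(turnText);
        return 'add_memory';
    } catch (error) {
        // Fallback path: SQLite only, so it still works while the provider is down.
        // The text survives; the vector index entry is missing until a later re-embed.
        await addMessage(`add_memory failed (${error.message}); turn text follows:\n${turnText}`);
        return 'add_message';
    }
}
```

The return value names the path that was taken, so a caller can tell when a turn was saved without semantic indexing.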

The Architectural Reality

  • Embedding callsite: ai/services/memory-core/ChromaManager.mjs (or sibling — exact path resolution per current source) constructs the /v1/embeddings POST against the openai-compatible host configured by aiConfig.openAiCompatible.host.
  • Error surface: the LM Studio response shape {"error":"Model was unloaded while the request was still in queue.."} is the HTTP-400 body that surfaces to add_memory's error path.
  • Existing retry semantics: none for this specific failure mode. The error propagates directly to the MCP-tool caller.
  • Companion substrate: #11380 (orchestrator daemon managing MC Chroma) demonstrates the broader pattern — local-supporting-services should be daemon-managed for substrate-evolution-flywheel robustness. The embedding provider is currently NOT under that pattern; it's operator-side desktop-app responsibility.
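
To make the error surface concrete, here is a minimal sketch of an embedding call with no retry semantics. The helper name `embedText`, its signature, and the injectable `fetchImpl` are assumptions for illustration; only the `/v1/embeddings` route, the `{model, input}` request body of the OpenAI-compatible API, and the error message prefix come from this issue.

```javascript
// Hypothetical helper: the name, signature, and injected fetchImpl are assumptions.
async function embedText(text, {host, model, fetchImpl = fetch}) {
    const response = await fetchImpl(`${host}/v1/embeddings`, {
        method : 'POST',
        headers: {'Content-Type': 'application/json'},
        body   : JSON.stringify({model, input: text})
    });

    if (!response.ok) {
        // No retry: the provider's body is wrapped and thrown straight to the caller.
        const body = await response.text();
        throw new Error(`openAiCompatible embedding error HTTP ${response.status}: ${body}`);
    }

    // OpenAI-compatible shape: {data: [{embedding: [...]}]}
    const json = await response.json();
    return json.data[0].embedding;
}
```

With a JIT-unloaded model, the first branch fires and the caller sees the HTTP 400 message quoted in the empirical anchor.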

The Fix

Narrow prescription (this ticket): add retry-on-model-unload semantics inside the openai-compatible embedding client path in Memory Core.

When the embedding callsite receives an HTTP 400 with the LM Studio model-unloaded shape:

  1. Detect the unload-error pattern in the response body (substring match on "Model was unloaded" is fine; LM Studio's error shape is stable).
  2. Sleep briefly (e.g., 500ms) to allow LM Studio's load-on-next-request semantics to kick in.
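  3. Retry the same /v1/embeddings POST up to N=3 times. Each retried request triggers LM Studio's automatic warm-load, so the model is loaded back into RAM transparently within the retry window (typical warmup is 5-15s for an 8B-param model). The sleeps alone (3 × 500ms = 1.5s) are shorter than that warmup, so the retry window only covers it if the retried request itself waits on the reload.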
  4. If still failing after N retries, propagate the original error to the caller (existing behavior preserved as last-resort fallback).
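
The four steps above can be sketched as a small wrapper. This is a sketch under stated assumptions, not the merged implementation: `embedWithUnloadRetry` is a hypothetical name, and `requestEmbedding` is assumed to perform one /v1/embeddings POST and to throw an error carrying `status` and `body` properties on failure.

```javascript
// Assumed detection constant; LM Studio is the source of this message.
const MODEL_UNLOADED_MARKER = 'Model was unloaded';

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

/**
 * Hypothetical wrapper around a single-attempt embedding request.
 * requestEmbedding() is assumed to throw an Error with `status` (number)
 * and `body` (raw response text) when the provider answers with an error.
 */
async function embedWithUnloadRetry(requestEmbedding, options = {}) {
    const {retryCount = 3, retryDelayMs = 500, log = () => {}} = options;
    let firstUnloadError = null;

    // One initial attempt plus up to retryCount retries.
    for (let attempt = 0; attempt <= retryCount; attempt++) {
        try {
            return await requestEmbedding();
        } catch (error) {
            // Step 1: only the model-unloaded HTTP 400 is retry-eligible.
            const isUnload = error.status === 400 &&
                String(error.body).includes(MODEL_UNLOADED_MARKER);
            if (!isUnload) throw error;

            firstUnloadError = firstUnloadError || error;
            if (attempt === retryCount) break;

            // Steps 2 and 3: log, wait briefly, then let the next request trigger the reload.
            log(`embedding-provider model-unload detected, retrying (${attempt + 1}/${retryCount})`);
            await sleep(retryDelayMs);
        }
    }

    // Step 4: retries exhausted, so the original error propagates unchanged.
    throw firstUnloadError;
}
```

Rethrowing the first captured error, rather than the last, is one reading of "propagate the original error"; if every attempt fails with the same body the difference is only object identity.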

This handles 95%+ of the failure cases empirically (operator-tested at LM Studio default 8B-embedding behavior). The remaining 5% (operator has the model evicted from disk cache + needs to re-download) correctly surfaces as an error.

Acceptance Criteria

  • AC1: Embedding-callsite in Memory Core's openai-compatible client detects the LM-Studio model-unloaded HTTP 400 error shape and triggers retry-with-warmup-delay.
  • AC2: Retry count is configurable (default N=3) via aiConfig.openAiCompatible.unloadRetryCount or equivalent named config field.
  • AC3: Warmup-delay-per-retry is configurable (default 500ms) via aiConfig.openAiCompatible.unloadRetryDelayMs or equivalent.
  • AC4: After exhausting retries, the original HTTP 400 error propagates unchanged (existing error-path behavior preserved).
  • AC5: Unit test covering: (a) first-call-succeeds-no-retry path, (b) first-call-fails-second-call-succeeds path with mock client, (c) exhausted-retry-final-failure path. Spec located per unit-test.md canonical convention.
  • AC6: Diagnostic log entry written via Memory Core's existing logger when retry fires, naming "embedding-provider model-unload detected, retrying" so operator-side observability surfaces the substrate-friction transparency (this also enables monitoring: a spike in retry-log-events would signal LM-Studio-idle-threshold needs tuning).
  • AC7: Path-asymmetry semantic preserved — add_message (SQLite-only, no embedding dependency) remains unaffected by embedding-provider failures. This ticket fixes the add_memory retry-path WITHOUT changing the add_message fallback contract. (Already true; AC7 documents the no-regression invariant.)
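
For AC2 and AC3, resolving the two settings with their defaults might look like the sketch below. The field names are the ones proposed in the criteria ("or equivalent"); the helper name `resolveUnloadRetryConfig` and the validation rule (non-negative integers only, otherwise fall back to the default) are assumptions.

```javascript
// Defaults from AC2 and AC3.
const UNLOAD_RETRY_DEFAULTS = Object.freeze({
    unloadRetryCount  : 3,
    unloadRetryDelayMs: 500
});

// Hypothetical helper: reads the proposed fields from aiConfig.openAiCompatible.
function resolveUnloadRetryConfig(aiConfig = {}) {
    const provider = aiConfig.openAiCompatible || {};

    // Assumption: anything other than a non-negative integer falls back to the default.
    const pick = key => {
        const value = provider[key];
        return Number.isInteger(value) && value >= 0 ? value : UNLOAD_RETRY_DEFAULTS[key];
    };

    return {
        unloadRetryCount  : pick('unloadRetryCount'),
        unloadRetryDelayMs: pick('unloadRetryDelayMs')
    };
}
```

Under this rule, setting `unloadRetryCount` to 0 is valid and disables the retry, which restores the pre-fix behavior.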

Out of Scope

  • Daemon-managed embedding endpoint (architectural-shape change to spawn LM Studio under orchestrator-daemon supervision similar to MC Chroma in #11380). That's a broader-scope follow-up ticket if this narrow fix proves insufficient. Likely premature — LM Studio is a desktop UI tool, not a daemon-shaped service, and operators value the desktop-tool ergonomics.
  • LM-Studio-side configuration mandates (e.g., "operator must disable JIT unload"). Operator-side workaround is documented but not load-bearing — the substrate should be robust to default LM Studio behavior, not require operator-side configuration.
  • Switching default embedding provider away from openai-compatible to a non-JIT-unload provider. Provider choice is operator-side; this ticket fixes the substrate-friction within the current default.
  • Gemini-API embedding provider parity changes. This ticket is scoped to the openAiCompatible client path. Gemini API doesn't have the JIT-unload failure mode (managed cloud service); no analogous retry path needed there.

Avoided Traps

  • Infinite retry loop on persistent failure — rejected. AC4 enforces propagation-after-N-retries to prevent the agent harness from spinning on a permanently-broken embedding provider.
  • Treating ALL HTTP 400 from embedding as retry-eligible — rejected. The retry MUST specifically detect the LM-Studio model-unloaded error shape (substring match on "Model was unloaded"). Treating generic HTTP 400 as retry-eligible risks masking real configuration bugs (wrong model name, malformed request) under retry-noise.
  • Hardcoding the LM Studio error shape inline, with no single place to adjust it: rejected as fragile. The error-shape detection should be one small named constant in the client code with a JSDoc comment naming LM Studio as the source. If LM Studio changes the error shape in a future version, the constant gets updated; the architecture doesn't churn.
  • Sleep-100ms-and-pray — rejected as too short for an 8B model warm-load. 500ms default is empirically-grounded but tunable per AC3. Operators with smaller embedding models (e.g., qwen3-embedding-4b at ≈4B params) may shorten; operators with larger may extend.
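
The second and third traps come down to one narrow predicate backed by one documented constant. A minimal sketch, assuming the hypothetical names `LM_STUDIO_MODEL_UNLOADED` and `isModelUnloadError`:

```javascript
/**
 * Substring LM Studio puts in the HTTP 400 body when a request hits a
 * JIT-unloaded model. Update it here if a future LM Studio version
 * changes the wording.
 */
const LM_STUDIO_MODEL_UNLOADED = 'Model was unloaded';

// Hypothetical predicate: true only for the specific unload shape, never for generic 400s.
function isModelUnloadError(status, body) {
    return status === 400 &&
        typeof body === 'string' &&
        body.includes(LM_STUDIO_MODEL_UNLOADED);
}

// The body observed in this issue is retry-eligible.
const unloadBody = '{"error":"Model was unloaded while the request was still in queue.."}';
// Invented body for illustration only: a different 400 must not be retried.
const otherBody = '{"error":"illustrative: some other request problem"}';

console.log(isModelUnloadError(400, unloadBody)); // true
console.log(isModelUnloadError(400, otherBody));  // false
console.log(isModelUnloadError(500, unloadBody)); // false
```

Requiring both the status code and the substring keeps configuration bugs, such as a wrong model name or a malformed request, visible on the first attempt.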

Related

  • Companion substrate ticket: #11380 — daemon-managed MC Chroma; demonstrates the broader "local-supporting-services under orchestrator-daemon supervision" pattern. Future Lane B follow-up could extend this pattern to the embedding provider if narrow retry-fix proves insufficient.
  • AGENTS.md §0 Invariant 5: "No skipping add_memory at end of turn" — current failure mode breaks this gate under idle-window conditions. Fix restores reliability.
  • AGENTS.md §4.3 Un-savable Turns: documents the self-DM fallback path; this ticket reduces the frequency of that fallback firing, but doesn't replace it (the fallback remains the last-resort path under permanent embedding-provider failure).
  • Path-asymmetry architecture: add_message writes to the SQLite graph layer only; add_memory writes to both SQLite + Chroma (embedded vectors). Embedding-provider failures break the latter, not the former. This ticket preserves that boundary.

Origin Session

  • Origin Session ID: e095c569-beac-4743-998f-e07d4344492e
  • Empirical anchor message: MESSAGE:3af300ee-661a-40d6-9f4f-37d893668431 (self-DM fallback turn-memory capturing the empirical failure shape verbatim)

Retrieval Hint

Search for LM Studio embedding model unload JIT idle add_memory retry openAiCompatible.

tobiu referenced in commit a07e6f4 - "fix(memory-core): implement retry-on-unload for openAiCompatible embeddings (#11393) (#11394)" on May 15, 2026, 7:21 AM
tobiu closed this issue on May 15, 2026, 7:21 AM