LearnNewsExamplesServices
Frontmatter
id9792
titleOptimize OpenAiCompatible Provider by Natively Wrapping LLM Streaming API
stateClosed
labels
enhancementai
assigneestobiu
createdAtApr 8, 2026, 7:22 PM
updatedAtApr 8, 2026, 7:22 PM
githubUrlhttps://github.com/neomjs/neo/issues/9792
authortobiu
commentsCount1
parentIssuenull
subIssues[]
subIssuesCompleted0
subIssuesTotal0
blockedBy[]
blocking[]
closedAtApr 8, 2026, 7:22 PM

Optimize OpenAiCompatible Provider by Natively Wrapping LLM Streaming API

Closedenhancementai
tobiu
tobiu commented on Apr 8, 2026, 7:22 PM

Context & Architectural Strategy

The memory core utilizes local LLM models (via LM Studio or Ollama backend OpenAiCompatible endpoints) heavily for graph extraction (Tri-Vector Synthesis, Topological Conflicts). By default, standard generative endpoint calls stream: false force the local model servers to buffer and serialize entirely localized JSON structures before releasing the REST packet. On local Apple Silicon instances computing large graphs natively, this synchronous wait-lock inflates generation times drastically.

Performance analysis demonstrated that enforcing stream: true (which offloads token string concatenation to V8 without holding an HTTP buffer lock) provides ~30% physical latency reductions over the existing monolithic request architecture, nearly halving Topological inference latency.

Actionable Scope

Refactor the internal architecture of Neo.ai.provider.OpenAiCompatible's default generate() footprint.

  • Map .generate() to dynamically execute the internal this.stream() AsyncGenerator.
  • Concatenate the chunk strings securely in the Node environment.
  • Preserve the existing structural response contract (returning { content, raw }) unchanged so downstream platform dependencies remain functionally decoupled from the streaming optimization.
  • Adhere strictly to the "Anchor & Echo" semantic tagging (@summary) strategy.

Implementation Details

Modified Neo.ai.provider.OpenAiCompatible to absorb the logic internally, resulting in:

  • Tri-Vector Extraction Latency: -6% isolated drop
  • Topological Conflict Extraction: -50% isolated drop

Avoided Pitfalls

  • Contract Violation: Avoid exposing chunked interfaces natively in .generate() to prevent shattering standard API client expectations in existing scripts.
tobiu added the enhancement label on Apr 8, 2026, 7:22 PM
tobiu added the ai label on Apr 8, 2026, 7:22 PM
tobiu referenced in commit dbd6a24 - "feat(ai/provider/OpenAiCompatible.mjs): natively wrap streaming generator to bypass LM Studio serialization buffers (#9792)" on Apr 8, 2026, 7:22 PM
tobiu assigned to @tobiu on Apr 8, 2026, 7:22 PM
tobiu
tobiu Apr 8, 2026, 7:22 PM

Input from Antigravity (Gemini 3.1 Pro):

✦ Successfully refactored OpenAiCompatible.generate() to natively wrap stream: true iteration blocks. This effectively bypasses the monolithic buffer serialization penalties inside local LLM endpoints (LM Studio, llama.cpp), decreasing graph rendering latency without altering external function signatures.

Verification Results:

  • Tri-Vector Extraction latency: Reduced by ~6%
  • Topological Conflict Extraction latency: Reduced by ~50%

Testing manually validated using the continuous DreamService REM-sleep background extraction via runSandman.mjs.

Closing ticket as Definition of Done is met.

tobiu closed this issue on Apr 8, 2026, 7:22 PM