Frontmatter
| id | 9792 |
| title | Optimize OpenAiCompatible Provider by Natively Wrapping LLM Streaming API |
| state | Closed |
| labels | enhancementai |
| assignees | tobiu |
| createdAt | Apr 8, 2026, 7:22 PM |
| updatedAt | Apr 8, 2026, 7:22 PM |
| githubUrl | https://github.com/neomjs/neo/issues/9792 |
| author | tobiu |
| commentsCount | 1 |
| parentIssue | null |
| subIssues | [] |
| subIssuesCompleted | 0 |
| subIssuesTotal | 0 |
| blockedBy | [] |
| blocking | [] |
| closedAt | Apr 8, 2026, 7:22 PM |
Optimize OpenAiCompatible Provider by Natively Wrapping LLM Streaming API
Closedenhancementai
tobiu assigned to @tobiu on Apr 8, 2026, 7:22 PM

tobiu
Apr 8, 2026, 7:22 PM
Input from Antigravity (Gemini 3.1 Pro):
✦ Successfully refactored
OpenAiCompatible.generate()to natively wrapstream: trueiteration blocks. This effectively bypasses the monolithic buffer serialization penalties inside local LLM endpoints (LM Studio, llama.cpp), decreasing graph rendering latency without altering external function signatures.Verification Results:
- Tri-Vector Extraction latency: Reduced by
~6%- Topological Conflict Extraction latency: Reduced by
~50%Testing manually validated using the continuous
DreamServiceREM-sleep background extraction viarunSandman.mjs.Closing ticket as Definition of Done is met.
tobiu closed this issue on Apr 8, 2026, 7:22 PM
Context & Architectural Strategy
The memory core utilizes local LLM models (via LM Studio or Ollama backend
OpenAiCompatibleendpoints) heavily for graph extraction (Tri-Vector Synthesis, Topological Conflicts). By default, standard generative endpoint callsstream: falseforce the local model servers to buffer and serialize entirely localized JSON structures before releasing the REST packet. On local Apple Silicon instances computing large graphs natively, this synchronous wait-lock inflates generation times drastically.Performance analysis demonstrated that enforcing
stream: true(which offloads token string concatenation to V8 without holding an HTTP buffer lock) provides ~30% physical latency reductions over the existing monolithic request architecture, nearly halving Topological inference latency.Actionable Scope
Refactor the internal architecture of
Neo.ai.provider.OpenAiCompatible's defaultgenerate()footprint..generate()to dynamically execute the internalthis.stream()AsyncGenerator.{ content, raw }) unchanged so downstream platform dependencies remain functionally decoupled from the streaming optimization.@summary) strategy.Implementation Details
Modified
Neo.ai.provider.OpenAiCompatibleto absorb the logic internally, resulting in:Avoided Pitfalls
.generate()to prevent shattering standard API client expectations in existing scripts.