How MemPalace Exposed Two Concurrency Bugs Across SQLite and ChromaDB

Running an open-source AI memory system across three machines broke it in ways nobody had seen before.

And that is exactly how the best contributions happen.

I have been using MemPalace as the shared brain for my development setup. It is an open-source project that gives AI agents persistent memory using a "memory palace" metaphor. In my setup, it runs on a Raspberry Pi 5 as a network MCP server, serving memory to AI agents across multiple projects and machines.

After 28,000+ memories, 15 wings, and three machines hitting the same database simultaneously, the interesting problems finally showed up.

The symptoms

The first signal was intermittent "database is locked" errors from SQLite. Sessions would randomly slow to a crawl. At the same time, ChromaDB kept rebuilding its entire vector index from disk on every single operation for no obvious reason.

The system worked fine with one client. With two or three concurrent sessions writing and reading through mcp-proxy SSE connections, it fell apart.

The diagnosis

This is where the AI-assisted development story got real. I was not just using Claude Code to write code. I was using it as a thinking partner to trace root causes across two different database systems.

Working together, we identified two distinct concurrency bugs.

Bug 1: SQLite knowledge graph locking. The busy_timeout was set to just 10 seconds with no application-level retry. When multiple mcp-proxy processes competed for the same SQLite file, the timeout would expire during long WAL checkpoints. The transactions were also using deferred locking, which meant contention was detected mid-transaction instead of at the start.
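The deferred-versus-immediate distinction is the crux of that bug. A deferred BEGIN takes no lock at all, so a competing writer only discovers contention when its first write statement runs. A minimal sketch of the difference using Python's standard sqlite3 module (not MemPalace's actual code):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "graph.db")
init = sqlite3.connect(path)
init.execute("CREATE TABLE t (x)")
init.commit()
init.close()

a = sqlite3.connect(path, isolation_level=None)
b = sqlite3.connect(path, isolation_level=None)
b.execute("PRAGMA busy_timeout = 0")       # fail fast so the demo doesn't block

a.execute("BEGIN IMMEDIATE")               # writer A takes the write lock now

b.execute("BEGIN")                         # deferred BEGIN succeeds: no lock yet
try:
    b.execute("INSERT INTO t VALUES (1)")  # contention surfaces mid-transaction
    deferred_error = None
except sqlite3.OperationalError as e:
    deferred_error = str(e)                # "database is locked"
b.execute("ROLLBACK")

try:
    b.execute("BEGIN IMMEDIATE")           # immediate BEGIN fails up front instead
    immediate_error = None
except sqlite3.OperationalError as e:
    immediate_error = str(e)

a.execute("ROLLBACK")
print(deferred_error, "/", immediate_error)
```

With a deferred transaction, writer B has already done work by the time the lock conflict appears; with BEGIN IMMEDIATE, the conflict is reported before any work begins, which is what makes a clean retry possible.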

Bug 2: ChromaDB client cache thrashing. Every write to chroma.sqlite3 changed the file mtime. Other processes detected the mtime change and recreated the PersistentClient, which reloaded the full HNSW vector index from disk. With multiple writers, every operation triggered a full index reload on every other client.
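The thrashing pattern is easy to reproduce in miniature. This is a hypothetical sketch of the buggy shape, not MemPalace's actual caching code: a client cache that invalidates on any mtime change will reconnect whenever any other process writes to the file.

```python
import os
import tempfile
import time

class NaiveClientCache:
    """Hypothetical sketch of the buggy pattern: reconnect on any mtime change."""
    def __init__(self, db_path):
        self.db_path = db_path
        self._client = None
        self._mtime = None
        self.reloads = 0                    # instrumentation for the demo

    def get(self):
        mtime = os.path.getmtime(self.db_path)
        if self._client is None or mtime != self._mtime:
            # Stand-in for recreating chromadb.PersistentClient(...), which
            # reloads the full HNSW vector index from disk.
            self._client = object()
            self._mtime = mtime
            self.reloads += 1
        return self._client

fd, path = tempfile.mkstemp()
os.close(fd)
cache = NaiveClientCache(path)
cache.get()
cache.get()                                          # same mtime: no reload
os.utime(path, (time.time() + 10, time.time() + 10))  # another process writes
cache.get()
print(cache.reloads)
```

With several writers, every one of those reloads lands on every other client, so steady-state traffic turns into constant index rebuilds.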

The fix

For SQLite, the changes were straightforward but important:

  • Increase busy_timeout from 10 seconds to 60 seconds

  • Add exponential backoff retry with jitter as a safety net

  • Use BEGIN IMMEDIATE for write transactions so contention is detected up front

  • Tune WAL autocheckpoint behavior and add a clean shutdown handler
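The first three changes above can be sketched together in one write helper. This is an illustrative shape using the standard sqlite3 module; the function and parameter names are mine, not MemPalace's:

```python
import random
import sqlite3
import time

def write_with_retry(path, sql, params=(), retries=5, base_delay=0.05):
    """Illustrative sketch: longer busy_timeout, BEGIN IMMEDIATE, backoff+jitter."""
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        conn.execute("PRAGMA busy_timeout = 60000")   # 60 s, up from 10 s
        for attempt in range(retries):
            try:
                conn.execute("BEGIN IMMEDIATE")       # detect contention up front
                conn.execute(sql, params)
                conn.execute("COMMIT")
                return
            except sqlite3.OperationalError:
                try:
                    conn.execute("ROLLBACK")
                except sqlite3.OperationalError:
                    pass                              # BEGIN failed; nothing to roll back
                if attempt == retries - 1:
                    raise
                # exponential backoff with jitter as the safety net
                time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
    finally:
        conn.close()
```

The busy_timeout handles the common case (SQLite itself waits out short checkpoints), and the application-level retry only engages when the timeout genuinely expires.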

For ChromaDB, the fixes focused on stopping unnecessary reconnects:

  • Rate-limit mtime checks to five-second intervals

  • Refresh the stored mtime after our own writes to prevent self-triggered reconnects

  • Preserve safety-critical database disappearance detection so the rebuild path still works
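Those three fixes compose into a small state machine around the cached client. Again a hypothetical sketch with illustrative names, not MemPalace's API:

```python
import os
import time

CHECK_INTERVAL = 5.0    # seconds between mtime checks

class RateLimitedClientCache:
    """Sketch of the fixed pattern described above (names are illustrative)."""
    def __init__(self, db_path):
        self.db_path = db_path
        self._client = None
        self._mtime = None
        self._last_check = 0.0
        self.reloads = 0

    def get(self):
        now = time.monotonic()
        if self._client is not None and now - self._last_check < CHECK_INTERVAL:
            return self._client               # rate-limited: skip the stat entirely
        self._last_check = now
        if not os.path.exists(self.db_path):  # safety-critical: database vanished,
            self._client = None               # so the rebuild path must still fire
            raise FileNotFoundError(self.db_path)
        mtime = os.path.getmtime(self.db_path)
        if self._client is None or mtime != self._mtime:
            self._client = object()           # stand-in for PersistentClient reload
            self._mtime = mtime
            self.reloads += 1
        return self._client

    def note_own_write(self):
        # Refresh the stored mtime after our own writes so the next check
        # doesn't mistake them for another process's changes.
        self._mtime = os.path.getmtime(self.db_path)
```

The rate limit caps reconnect frequency under foreign writes, note_own_write eliminates self-triggered reconnects entirely, and the existence check keeps the disappearance path intact.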

The work is covered by new tests, including a multi-process stress test: four processes each writing 20 triples concurrently against the same database file, with zero failures. The full suite of 958 tests passed.

The bigger lesson

The bugs in MemPalace were not bugs for most users. They were bugs for users pushing the tool past its original design assumptions. Running it as a network service with concurrent clients exposed race conditions that single-process usage would never have found.

That is why real production use matters for open source. Toy usage finds toy bugs. Production usage, especially messy multi-machine concurrent usage, finds the bugs that actually need fixing.

I went from user to power user to contributor in about eight days. I migrated 15,000+ memories from a previous system, set up multi-machine access, hit the concurrency wall, diagnosed it, and sent the fix upstream.

That is the open-source loop working exactly as designed.

Source note: related upstream PR: MemPalace PR #948.

© 2025 TRIBALSCALE INC

💪 Developed by TribalScale Design Team