Throwing away my fancy AI memory system

About 18 months ago I started refining and customising how I worked with AI (Claude, in my case), as I’ve written about before. While I was writing that series of posts, I was also working on what I thought was going to be an amazing, super-fancy enhancement to my very basic filesystem/markdown-file-based system.

The new hotness

The fancy new system was going to replace mere markdown files with a vector database (Qdrant), graph database (Neo4j), paragraph-level chunking, 10+ custom MCP tools, semantic search, etc.
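
Not the actual code, but to give a sense of the shape of it, here’s a minimal sketch of the indexing side, assuming the qdrant-client and sentence-transformers Python libraries (the embedding model, collection name and payload fields are purely illustrative):

```python
# Minimal sketch of the indexing side (not my actual pipeline).
# Assumes the qdrant-client and sentence-transformers libraries;
# the model, collection name and payload fields are illustrative only.
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dimensional embeddings
client = QdrantClient(url="http://localhost:6333")

client.recreate_collection(
    collection_name="memory",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def index_document(doc_id: str, text: str, source: str) -> None:
    """Paragraph-level chunking: split on blank lines, embed, upsert."""
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    vectors = model.encode(chunks)
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=vector.tolist(),
            payload={"doc_id": doc_id, "source": source, "chunk": i, "text": chunk},
        )
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]
    client.upsert(collection_name="memory", points=points)
```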

Over a couple of evenings I got this working, and then ran the new system side-by-side with my original setup. This meant I double-handled everything for about a month: uploading all meeting transcripts and docs into both, asking the same questions of both, etc.

In the bin!

You might recognise this reference if you’re a person of culture and have watched any amount of MCM. 😉

Yeah, cutting to the chase: I ended up throwing the fancy new system away. I just found myself not trusting it, and reaching for the original system more and more.

But why?

I’ve been thinking about this quite a bit over the past couple of months. I was sure the more complex system would be better, and my database background meant I really wanted it to work.

The vector database did what you’d expect it to: queries consistently returned results in 1-2 seconds across 4,000-5,000 indexed documents. Relevance scores were good, but the whole experience felt weirdly sterile. Just asking the LLM still felt more “natural” for some reason.

Similarity != Relevance

I think this is the crux of the weirdness I experienced. A technically accurate database query isn’t the same as a relevant answer to a question, and the relevant answer could potentially even be less accurate, I guess…?
It’s a fine balance obviously, and the irony isn’t lost on me that, given my data background, I’m preferring the potentially less accurate, less sterile, non-database-query results. 😅

For example, if I wanted to prep for a meeting with Mary, my original/basic system would rely on the LLM using various context from the markdown files to extract relevant insights. The semantic search would return much more data: anything related to Mary, projects and meetings Mary was somehow connected to, etc.

Chunking destroys narrative structure?

My theory, and I could be wrong, is that it’s harder to retain context when you’re optimising the data for retrieval. You can use metadata, tags, etc., but in my experience it wasn’t much of an improvement.

Perhaps with much more tweaking, and someone smarter than me doing it, it could achieve the level of improvement I thought I’d see. But I found the overhead of managing the complexity of the new system far outweighed any incremental improvements over the markdown-file-based system.

My queries aren’t exploratory

Semantic search shines when you don’t know where the information you need lives. “Find me anything related to customer churn across a million documents” for example.

But my queries are more categorical because it’s my own “memory”, so I have prompts to guide the LLM on where to find information related to my teams, strategy, org structure, etc. What I need is synthesis across smaller datasets more than pure retrieval from massive ones.

The context window

This, I think, is the real limitation. If your system is too broad in some way (whether volume of documents, breadth of topics, etc.) then you will run into context window limitations relatively quickly, and that’s where a sophisticated, multi-layered retrieval system becomes necessary.

I have ~6 regularly re-aggregated “memory files” which act as indexes for the more granular data. Each file is kept to ~100-150 lines (around 10kb). These files are loaded into the project knowledge/context for every chat, which is very manageable.

The advantage of having everything in context (or at least a reasonable summary of everything recent/relevant) is that the LLM has a good overview of everything from which it can choose what’s relevant, and where to delve into more detail (i.e. access more detailed files on the filesystem). So it can more effectively help me prep for my meeting with Mary by getting our previous meeting’s notes, then cross referencing Mary’s Jira updates, and perhaps querying our team Slack channel for any conversations since our last catch-up.

That multi-hop reasoning happens naturally in the language model. A vector search (well, my implementation at least) just returns the top-k most similar chunks and calls it a day.
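
To make that concrete, here’s the retrieval side of the same hypothetical setup (reusing the model and client from the earlier sketch): embed the question, pull back the k most similar chunks, and stop. Nothing in there knows to go and check Jira afterwards.

```python
# Retrieval side of the same hypothetical setup, reusing the model and
# client from the earlier sketch: embed the question, return the k nearest
# chunks, and stop. There is no "now go check Jira" second hop.
def retrieve(question: str, k: int = 5) -> list[str]:
    query_vector = model.encode([question])[0].tolist()
    hits = client.search(
        collection_name="memory",
        query_vector=query_vector,
        limit=k,
    )
    return [hit.payload["text"] for hit in hits]

# e.g. retrieve("What has Mary been working on lately?") comes back with
# five similar-sounding paragraphs and no sense of what to look up next.
```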

The good news, and the reason I think the simple system will itself become more and more powerful, is that every generation of model gets a larger context window. As my knowledge base has grown, the context available to me has kept up, so I run into issues pretty infrequently.

I’m not alone, it seems

You know how once something is top of mind you notice it everywhere? Well, now that I was thinking about this, I started noticing blog posts and articles talking about simpler mechanisms outperforming vector databases for RAG – pointing out that embeddings lose context, that retrieval and reasoning are different problems, and that “file-first” approaches often outperform more complex architectures for agentic workflows. I’d stumbled into something others had articulated much more clearly. 1, 2, 3

What did I improve?

I did make improvements to the original system, but not in the ways I would’ve guessed 6 months ago. The improvements were tangential: removing friction in the pipeline, speeding up how I get useful content into the system, and getting better insights out of it.

  • For example, I fiddled (a lot) with my various prompts to tweak the accuracy of summaries, attribution, non-sensationalist language, etc.
  • I automated the weekly Jira sync with a Python script (a sketch of the idea follows this list) rather than blowing out my context window trying to get Claude to use a tool to query Jira directly. 🔥
  • And a raft of other incremental tweaks to improve performance, reduce context usage, clarify or adjust purpose, etc.
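
For the curious, here’s roughly the shape of that Jira sync. It isn’t my actual script, just a minimal sketch against the standard Jira Cloud REST search endpoint; the site URL, project key, credentials and output path are placeholders. The point is that the result lands as one small markdown file the LLM can read later, instead of a pile of live tool calls eating context.

```python
# Minimal sketch of a weekly Jira sync (not my actual script).
# Uses the standard Jira Cloud REST search endpoint; the site URL,
# project key, credentials and output path are placeholders.
import os
from datetime import date

import requests

JIRA_BASE = "https://your-site.atlassian.net"
AUTH = (os.environ["JIRA_EMAIL"], os.environ["JIRA_API_TOKEN"])

def weekly_sync(project_key: str = "TEAM") -> None:
    resp = requests.get(
        f"{JIRA_BASE}/rest/api/2/search",
        params={
            "jql": f"project = {project_key} AND updated >= -7d ORDER BY updated DESC",
            "fields": "summary,status,assignee,updated",
            "maxResults": 100,
        },
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    issues = resp.json().get("issues", [])

    # Write a small markdown digest for the memory folder, so the LLM reads
    # one file later instead of burning context on live Jira tool calls.
    lines = [f"# Jira updates, week ending {date.today().isoformat()}", ""]
    for issue in issues:
        fields = issue["fields"]
        assignee = (fields.get("assignee") or {}).get("displayName", "Unassigned")
        lines.append(
            f"- {issue['key']}: {fields['summary']} "
            f"({fields['status']['name']}, {assignee})"
        )
    with open("memory/jira-weekly.md", "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    weekly_sync()
```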

The new New Hotness

The biggest enhancement I made, and one I didn’t really see coming, was converting everything to use Claude Code as its “primary platform” rather than Claude Desktop.

I still use both, but I think of Claude Code as the “writer”, whereas Desktop is only used for the odd “read” action. This involved converting most of my prompts and pseudo-commands into actual slash commands (so much cleaner!), doing away with any reference to MCP tools since Code has everything I need (i.e. just filesystem access!), and, most satisfyingly, being able to fire up multiple concurrent agents to do things like parallel doc summaries, or heavy “system tasks” like updating the 6 main memory files in parallel. That alone saves me about 25 minutes each Monday morning.

This wasn’t the point of this post though, so rather than going down (more) rabbit holes let me know if there’s any interest in how it works and I’ll write something more detailed.

When does the fancy stuff make sense?

I should be clear this isn’t universal advice. My setup has specific characteristics:

  • Hundreds (maybe edging into the thousands) of curated documents, not millions
  • Single user (me)
  • Relatively structured, categorical queries rather than exploratory search
  • Context windows that keep growing with each model generation

Vector databases and RAG architectures probably do make sense when you’re dealing with genuine scale that exceeds context windows, multiple users with different access patterns, truly exploratory queries where you don’t know what you’re looking for, or requirements for deterministic retrieval.

My situation is none of those. I process 3 or 4 meeting transcripts and a couple of documents per day on average. I run the memory update process weekly. I query the system maybe 5-10 times daily for meeting prep, writing help, and document review. That’s well within what a curated file system and a capable language model can handle.

My advice

Start simple. Stay simple until simple breaks in some way.

The system I rely on every day is just text files organised in folders, processed by an LLM with decent context.

I learned a fair bit about embeddings and vector databases in the process of building my flashy new system, but the boring solution turned out to be the right one, as is so often the case in tech.

Dave
