Original Reddit post

My friend Tyler told me last week that he was working on a cool new thing, and it dropped today. The blog post has the benchmarks, but here’s how it works:

In observational memory, the context window is broken into two blocks. The first block is the list of observations (like above); the second is raw messages that haven't yet been compressed.

When new messages come in, they are appended to the end of the second block. When that block hits 30k tokens (the default threshold, though it's configurable), a separate "observer agent" compresses the messages into new observations, which are appended to the first block. When the observation block hits 40k tokens (again a configurable default), a separate "reflector agent" garbage-collects observations that no longer matter. Our token limit defaults are relatively conservative, providing SoTA results on benchmarks while staying well within context window limits.

This structure enables consistent prompt caching. Messages keep getting appended until the threshold is hit, so you get full cache hits on every turn. When observation runs, the raw messages are replaced and the new observations are appended to the existing observation block; the observation prefix stays consistent, so you still get a partial cache hit. Only during reflection (which is infrequent) is the entire cache invalidated.
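To make the mechanics concrete, here's a minimal sketch of the two-block scheme in Python. Everything here is illustrative: the class and method names, the thresholds as constants, the ~4-chars-per-token estimate, and the placeholder bodies standing in for the actual observer and reflector agents (which in the real system are LLM calls) are all assumptions, not the project's implementation.

```python
OBSERVE_THRESHOLD = 30_000   # raw-message block limit (the post's default)
REFLECT_THRESHOLD = 40_000   # observation block limit (the post's default)

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return max(1, len(text) // 4)

class ObservationalMemory:
    def __init__(self) -> None:
        self.observations: list[str] = []  # block 1: compressed observations
        self.messages: list[str] = []      # block 2: raw, uncompressed messages

    def _block_tokens(self, block: list[str]) -> int:
        return sum(estimate_tokens(item) for item in block)

    def add_message(self, msg: str) -> None:
        # New messages are only ever appended, so the prompt prefix is
        # unchanged between turns -> full cache hit until a threshold trips.
        self.messages.append(msg)
        if self._block_tokens(self.messages) >= OBSERVE_THRESHOLD:
            self._observe()
        if self._block_tokens(self.observations) >= REFLECT_THRESHOLD:
            self._reflect()

    def _observe(self) -> None:
        # Stand-in for the "observer agent": compress the raw messages into
        # a new observation appended to the existing observation block. The
        # observation prefix is preserved -> partial cache hit.
        summary = f"[observation covering {len(self.messages)} messages]"
        self.observations.append(summary)
        self.messages.clear()

    def _reflect(self) -> None:
        # Stand-in for the "reflector agent": garbage-collect observations
        # that no longer matter (here, a placeholder that keeps the newest
        # half). Rewriting this block invalidates the whole cache, which is
        # why reflection runs infrequently.
        self.observations = self.observations[len(self.observations) // 2:]

    def prompt(self) -> str:
        # The model sees block 1 (observations) followed by block 2 (raw).
        return "\n".join(self.observations + self.messages)

mem = ObservationalMemory()
mem.add_message("user: hello")
mem.add_message("assistant: hi, how can I help?")
print(mem.prompt())
```

The ordering matters for caching: because observations only ever grow at the tail (except during reflection), every turn reuses the longest possible cached prefix.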

Originally posted by u/thehashimwarren on r/ArtificialInteligence