Reflection is All You Need

I work with a lot of people who don’t have much intuition or experience with how to build great agents. This is not at all a knock on them; not everyone can afford to be as obsessed with these tools as an AI Engineer! I try to reassure them that this is a new field and The Book on How It’s Done Right doesn’t exist yet, mostly because it can’t: the capabilities available are too fast-moving a target. The advice from yesteryear about what is necessary barely holds up today. Remember having to say IMPORTANT!!!! DO NOT SKIP DO NOT IGNORE!! to get models to follow any instruction at all?

One aspect of this that I empathize with is the Blank Page Problem: where are they supposed to even start? Well, I am here to tell you… try starting with one sentence!

Here’s an illustrative example based on a real portion of my coding agent.

My first attempt at this section. I’m not kidding when I say start simple!

commit.md

Write a commit message for this diff.

response

Update parser

Updated the parser to handle large files better.

Well, that’s not great. It’s fairly duplicative, and the elaboration is somewhat helpful but doesn’t really tell me why I made that change. If I’m looking at this in a few months, I’ll probably want that context.

Let’s add some context to our commits.

commit.md

Write a commit message for this diff. ++Include why the change was made, not just what.++

response

Add streaming parser

Added a streaming parser to handle large files better, which improves performance for large files.

Ok yeah, I mean that’s an improvement in that it does contain more context, and I guess that sort of explains why I wanted to do that. The actual why is pretty bad though, since it kind of just restates the same thing.

Now that I see this though, I’m starting to think deeper about my motivation for having the “why” behind the change. I think it would be best if our commit log was structured such that I could read it almost like a decision log. That way I could just thumb through a bunch of small commits, and for each group of changes I have my moment-in-time motivation right there. To that end, it should be about as exhaustive/explicit as possible, and should have a consistent format so I can always ensure the reasoning section is clear.

Let’s tell the agent where the “why” should come from.

commit.md

Write a commit message for this diff. Include why the change was made, not just what. ++The why should be reasoning sourced from our interactions that the reader can't recover from the diff itself (rejected alternatives, constraints, surprises).++

response

Switch to streaming parser for large file ingestion

Swapped out BufferedParser for StreamingParser after hitting OOM errors on files over 500MB. Chunked buffering was on the table but requires a known max chunk size, which we don't have for arbitrary uploads.

Much better! The reasoning is actually useful now because we told it where to pull from. But after using it a bunch I notice the format is inconsistent: sometimes the why is buried in the middle of a sentence, sometimes it’s a separate paragraph. If we ever start parsing this out deterministically, we’ll want a consistent structure.

Let’s make a clear format for our bullets.

commit.md

Write a commit message for this diff. ++Body is bullets in the form `{what} because {why}`, where {why} is++ reasoning sourced from our interactions that the reader can't recover from the diff itself (rejected alternatives, constraints, surprises).

response

add: streaming parser for large file ingestion

- Moved ingest.py from BufferedParser to StreamingParser because the buffered path OOM'd on anything over 500MB. Ruled out chunked buffering since it needs a max chunk size we can't guarantee for arbitrary uploads.

This is looking really good! But after using it a bunch, I notice that sometimes the agent does not include the word “because”, or it includes it more than once per bullet. If we ever start parsing this out deterministically, we’ll want to make sure we are as consistent as possible.

Let’s write a quick validator we can hook onto any git commit calls our agent makes.

validate.py

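def validate_commit_message(msg: str) -> str | None:
  """Return an error for the first bullet that breaks '{what} because {why}', else None."""
  # The subject line and the body are separated by a blank line.
  # partition (unlike split + unpacking) doesn't raise when there is no body.
  _, _, body = msg.partition("\n\n")
  # Splitting on "\n- " keeps wrapped bullet lines together; the first bullet keeps its "- ".
  bullets = [b.strip() for b in body.split("\n- ") if b.strip()]
  for bullet in bullets:
    # Exactly one "because" per bullet, so the what/why split is unambiguous.
    if bullet.count("because") != 1:
      return f"bullet must match '{{what}} because {{why}}': {bullet!r}"
  return None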

When a message fails validation, the agent will see:

response

rejected: bullet must match '{what} because {why}':
  '- Added streaming parser to handle large files.'
rewrite the message and try again.

Now diving a bit deeper into the mechanics of how I use reflection more formally, what has worked well for me is to consider the Gap between what I expected and what actually happened from three different levels of abstraction:

L1: Concrete

Direct instructive guidance on what should change
Enumerated examples like (rejected alternatives, constraints, surprises) from the prompts.
Added when the model is properly invoked and focused, but just has a blind spot that requires redirection.

L2: Conceptual

Generalized description of the problem space pattern
Example from the prompts: “… where {why} is reasoning sourced from our interactions that the reader can’t recover from the diff itself”
Used when we cannot reach the level of direction needed through enumeration
- Sometimes enumeration would overfit to the local behavior of this particular run/interaction with the agent

L3: Principle

Grounding motivation, purpose for the interactions, and taste
Used relatively rarely, helpful in aligning the model to broader goals that the author/user has beyond the actions directly specified.
We didn’t have one in our prompts, but if we did it might be: “We need a detailed log of what/why so that when we need to contextualize a change made a long time ago, we don’t have to rely on user memory to discover the motivation and purpose of that change.”

I started with a skill that specified

a directive to identify the source of the gaps identified by the user
a consideration of each gap at each level of abstraction and potential justifications for choosing the improvement at that level
a report summarizing the process and request for human feedback
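The shape of that process can be sketched as data. This is only an illustration: the skill itself is a prompt, not code, and every name here (`Level`, `Candidate`, `Gap`, `render_report`) is hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    """The three levels of abstraction a fix can land at."""
    CONCRETE = "L1: Concrete"
    CONCEPTUAL = "L2: Conceptual"
    PRINCIPLE = "L3: Principle"


@dataclass
class Candidate:
    """One possible improvement, pitched at a single level."""
    level: Level
    proposed_change: str
    justification: str


@dataclass
class Gap:
    """A difference between what the user expected and what happened."""
    expected: str
    actual: str
    source: str  # where the gap seems to originate, e.g. a prompt file
    candidates: list[Candidate] = field(default_factory=list)


def render_report(gaps: list[Gap]) -> str:
    """Summarize each gap and its per-level candidates for human feedback."""
    lines: list[str] = []
    for i, gap in enumerate(gaps, start=1):
        lines.append(f"Gap {i}")
        lines.append(f"  expected: {gap.expected}")
        lines.append(f"  actual:   {gap.actual}")
        lines.append(f"  source:   {gap.source}")
        for c in gap.candidates:
            lines.append(f"  [{c.level.value}] {c.proposed_change}")
            lines.append(f"    justification: {c.justification}")
    lines.append("Which level should each fix land at?")
    return "\n".join(lines)


gap = Gap(
    expected="the why explains the motivation for the change",
    actual="the why restates the what",
    source="commit.md",
    candidates=[
        Candidate(Level.CONCRETE,
                  "add examples: (rejected alternatives, constraints, surprises)",
                  "the model is focused but has a blind spot"),
        Candidate(Level.CONCEPTUAL,
                  "the why is reasoning the reader can't recover from the diff itself",
                  "enumeration alone would overfit to this run"),
    ],
)
print(render_report([gap]))
```

Keeping one candidate per level side by side is the point: the human picks the level, and the justifications make that choice quick.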

I often disagree with the abstraction level the agent picks (choosing the right one is an exceptionally hard problem for LLMs, and arguably a matter of taste as well). Even so, I have found that having the problem space broken up in this way really helps me think about the task at hand clearly and quickly. The nice thing about this process is that once you finish reflecting on the original gap, if you have any notes on how the abstraction choice went, you can reflect the reflection skill right after. I do that a lot.

After using this system a bit, one thing I realized was that the updates to the skills were essentially persistent solutions, but the problems they solved were ephemeral. Once we clear that context window, the original gap that prompted the change is no longer available. So to this end, I started keeping a sidecar file with structured reflections with things like the different

Glossary