The quiet divergence.
On judgment, reach, and the gap forming between people who use AI to do less and people who use it to do more.
Twenty-five years of systems built for real organizations. Architecture, governance, compliance: the kind of work where you have to understand the why before you touch a keyboard. You're accountable for what breaks, and what breaks is usually the thing nobody thought to model.
Then AI showed up.
Experienced people didn't get replaced. Something stranger happened. Decades of pattern recognition, business context, and systems thinking suddenly had a direct path back to implementation. Code became a source of truth again, not something you delegated and hoped the spec survived contact with the developers.
I'm shipping more now than at any point in my career. That's not a brag. It's a data point worth sitting with.
The quiet divergence.
The separation is already happening. Everyone's prompting. Everyone's generating. On the surface it all looks the same.
But underneath, something is splitting. Experienced practitioners with real domain expertise are compounding what they know, using AI to get further faster. A lot of other people are outsourcing their thinking to it and calling that productivity. The outputs look similar. The trajectories don't.
One group is steering. The other is riding.
That gap widens every week.
Agents should serve people.
The industry built it backwards. Chatbot at the center, intelligence bolted on. Ask a question, get an answer, lose the thread, start over. The session ends and it forgets everything.
I think that's the wrong model. An agent should work in the background, capturing and connecting things, surfacing what matters when it matters. You steer it when you choose to. Not a chatbot you visit. A system that actually knows you.
That shouldn't be a developer-only tool. That's what I'm building toward with LittleGuy.
Models are temporary. Orchestration is permanent.
Claude today. Something else next quarter. The model is a commodity in a way that most people building on top of specific models haven't fully internalized yet. The harness, the orchestration, the judgment layer: that's what actually compounds over time.
Build for the workflow. Use whatever runs it best today. Keep moving.
The model is not the product.
On commoditization, harnesses, and why the thing you're betting on is probably the wrong thing.
There's a conversation I keep having. Someone corners me after a demo, or slides into my DMs, or raises their hand at the end of a webinar. The question is always some version of: “Which model are you building on?”
It's the wrong question, though I understand why they ask it. When you're new to this space, the model feels like the foundation, the thing that determines everything. Pick the right one and the product works. Pick the wrong one and you're in trouble.
I've been building on Microsoft infrastructure for 25 years. I watched SharePoint become the foundation of enterprise collaboration. I watched Azure become the operating layer for a generation of software. The pattern was the same every time: infrastructure commoditizes, vendors converge, and value migrates up the stack to whoever built the best harness around it.
AI is no different. It just moves faster.
The commodity curve is already happening.
Six months ago, GPT-4 was the obvious choice for serious work. Then Claude 3 Opus came out and changed the calculus. Then Gemini caught up in certain domains. Now there are local models running on consumer hardware that would have been unthinkable two years ago.
Every frontier model is better than it was, and they're all closing in on each other. The benchmarks still show gaps, but in practice, for most production use cases, the difference between the top three or four models is narrower than the difference between a well-engineered harness and a poorly engineered one.
The model is a commodity. The workflow is permanent.
I ship Govern 365, a governance and compliance platform for Microsoft 365 tenants, and we've had this exact conversation with customers who ask which AI we're using under the hood. The honest answer is that it depends on the task. Gemini for planning. Claude for complex reasoning and agentic work. Local models for anything touching sensitive tenant data where we don't want tokens leaving the customer's environment. A different model for embeddings.
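The routing described above can be sketched as a small table that maps task types to models, so a model swap is a configuration change rather than an architecture change. This is a minimal illustration; the task names, model identifiers, and field names are assumptions, not Govern 365's actual configuration.

```typescript
// Illustrative task-based model routing. Model identifiers here are
// placeholders, not the product's real configuration.
type Task = "planning" | "reasoning" | "sensitive-data" | "embeddings";

interface ModelChoice {
  provider: string;
  model: string;
  local: boolean; // true when tokens must not leave the customer environment
}

// The routing table is data, not code: replacing a model touches one line.
const routes: Record<Task, ModelChoice> = {
  planning: { provider: "google", model: "gemini", local: false },
  reasoning: { provider: "anthropic", model: "claude", local: false },
  "sensitive-data": { provider: "local", model: "local-llm", local: true },
  embeddings: { provider: "openai", model: "text-embedding", local: false },
};

function routeModel(task: Task): ModelChoice {
  return routes[task];
}
```

The point of the shape is the constraint it encodes: nothing downstream of `routeModel` knows which vendor it is talking to.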
We made that decision deliberately, and not just because model-agnostic architecture is more work. Tying your architecture to a single model vendor is the same mistake organizations made when they tied their infrastructure to a single database vendor in the 1990s. The migration cost doesn't show up until it's too late.
What actually compounds.
Two years of building on top of shifting model capabilities has taught me what actually holds value over time. The orchestration layer compounds. The evaluation gates compound. The memory architecture compounds. The prompt engineering patterns, the retry logic, the task decomposition strategies, the evals that tell you when a model has completed the intent rather than just plausibly completing it: all of that compounds.
The model gets replaced. Sometimes by a better version of itself, sometimes by a competitor, sometimes by a local fine-tune that cost three days of training time and now outperforms the frontier on your specific domain.
I run a dark factory model for Govern 365 development. Agents work across git worktrees, executing on an intent queue, with tiered eval gates that catch regressions before they hit the merge queue. The model powering the builder agent has changed three times in the last four months. The eval architecture has not. The intent format has not. The parity checks have not. That's what compounds, and that's the actual moat.
The bet most people are making.
Most product teams are building a thin wrapper around a model API, betting that the model they chose stays best, shipping features that depend on specific model behaviors that will change in the next release. Some of them will get lucky, but luck is what it is.
The teams that will own this space in five years are building the harness: the evaluation layer, the memory plane, the orchestration that survives a model swap, the judgment infrastructure. Claude today, something none of us have heard of next quarter. Bet on the workflow.
What regulated industries actually need from agents.
On the difference between impressive demos and production-grade systems. And why most of what's being sold right now is the former.
I spent three days at a conference in late fall watching AI agent demos. Every one of them was polished. Most of them worked exactly as shown. A few of them were genuinely impressive. None of them would have survived five minutes with a compliance officer from a pharmaceutical company, a financial services regulator, or a healthcare system's legal team.
That gap is the opportunity, and it's also the problem nobody in the AI industry wants to talk about.
What the demos skip.
The demo shows the agent completing the task. It reads the document, extracts the key fields, routes the request, sends the notification, closes the loop. It looks like magic.
Here's what the demo doesn't show: who authorized the agent to read that document, under what permission model, and whether that authorization was logged. If the agent takes an action on behalf of a user, where is the audit trail? If the action crosses a compliance boundary, who is accountable: the user, the agent, or the vendor? When the model hallucinates a field value and it propagates into a downstream system, what's the remediation path?
These aren't pedantic questions. They determine whether an enterprise will actually deploy something. I sit in rooms with procurement teams at life sciences companies, financial services firms, and healthcare organizations. These questions come up in the first thirty minutes, every time. Most AI agent vendors have no answer, or they have an answer that amounts to “we're working on it.”
The delegation chain is not optional.
In Govern 365, every agent action operates under a delegated human principal, with no exceptions. This isn't a philosophical position; it's an architecture requirement. When an agent performs a governance action (recertifying an access policy, flagging a permission anomaly, triggering a remediation workflow), that action is traceable to a human who authorized it. The chain is always: human delegates to agent, agent acts on behalf of human, action is logged with the delegation context.
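The delegation chain can be made concrete with a sketch like the following, where an action without a delegated human principal is refused outright rather than logged as autonomous. The type and field names are illustrative, not Govern 365's actual schema.

```typescript
// Minimal sketch of an OBO (on-behalf-of) delegation record.
// Field names are illustrative, not the real schema.
interface Delegation {
  delegationId: string;
  humanPrincipal: string;  // the accountable person
  agentId: string;
  scope: string[];         // actions the agent may take under this delegation
  grantedAt: Date;
}

interface AuditEntry {
  action: string;
  agentId: string;
  humanPrincipal: string;  // every logged action resolves to a person
  delegationId: string;
  timestamp: Date;
}

const auditLog: AuditEntry[] = [];

// No delegation, no action. "The AI did it" is not a loggable state.
function act(delegation: Delegation | undefined, action: string): AuditEntry {
  if (!delegation) throw new Error("no delegated human principal: action refused");
  if (!delegation.scope.includes(action)) {
    throw new Error(`action "${action}" outside delegated scope`);
  }
  const entry: AuditEntry = {
    action,
    agentId: delegation.agentId,
    humanPrincipal: delegation.humanPrincipal,
    delegationId: delegation.delegationId,
    timestamp: new Date(),
  };
  auditLog.push(entry);
  return entry;
}
```

Notice what the design forbids: there is no code path that writes an audit entry without a human principal attached.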
Until agents make purchase decisions, a human is accountable for everything they do.
This matters enormously in regulated industries. HIPAA doesn't have a carve-out for AI agents. SOX doesn't either. When an agent touches financial records, someone's name is on that action in the audit log, and it had better be a real person who actually authorized what happened. The industry keeps trying to build around this constraint instead of through it: agents that act autonomously without delegation context, systems where accountability is diffuse, workflows where “the AI did it” is treated as an acceptable answer. Organizations that deploy these systems into regulated environments are going to find out the hard way that it isn't.
What production actually looks like.
The eCTD pipeline we built for Govern 365 is a good example of getting this right. Electronic Common Technical Documents for pharmaceutical submissions: dense, structured, regulated, with cross-document linking requirements and chain-of-custody implications.
The import pipeline runs on Temporal, not because Temporal is fashionable, but because we need durable workflow execution with explicit state at every step, human escalation paths when the pipeline hits ambiguity, and a complete execution history we can hand to an auditor. Every document gets a verified hash. Every extraction step is logged. The Gideon semantic query layer sits on top of it, but it's read-only; the pipeline itself doesn't act on what it finds, it surfaces it to a human reviewer.
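Independent of Temporal's API, the guarantees described above have a simple shape: a verified content hash per document, an explicit record of every step, and escalation to a human when something doesn't match. A minimal sketch, with illustrative names:

```typescript
import { createHash } from "node:crypto";

// Sketch of the pipeline's guarantees, not Temporal's actual API:
// every step produces an auditable record, and ambiguity escalates.
interface StepRecord {
  step: string;
  documentHash: string;
  outcome: "ok" | "escalated";
  at: string; // ISO timestamp
}

function sha256(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

function runStep(
  history: StepRecord[],
  step: string,
  content: string,
  expectedHash: string,
): StepRecord {
  const hash = sha256(content);
  // A hash mismatch is surfaced to a human reviewer, never acted on.
  const outcome: StepRecord["outcome"] = hash === expectedHash ? "ok" : "escalated";
  const record: StepRecord = {
    step,
    documentHash: hash,
    outcome,
    at: new Date().toISOString(),
  };
  history.push(record); // the complete history is what you hand to an auditor
  return record;
}
```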
It's a boring architecture, not the kind of thing that makes a great demo. But it's the kind of thing a compliance officer will sign off on.
The gap is the opportunity.
The teams building production-grade agent systems for regulated industries are not the ones making the loudest noise. They're heads-down, working through the unsexy parts: audit logging, delegation models, error handling, rollback paths, human escalation design. This is where the real enterprise value gets created, not in the demos but in the infrastructure that makes the demos deployable.
I've been building enterprise software for 25 years and the pattern is always the same. The flashy thing gets the attention. The boring infrastructure gets the contracts.
Memory is not a chat log.
On why vector search is not a knowledge graph, and why it matters more than most people think.
There's a subtle lie embedded in most AI memory implementations, and it's so widely accepted that people have stopped questioning it. The lie is this: if you can retrieve relevant text from past conversations, you have memory. You have search, which is a different thing.
What retrieval actually gives you.
Vector search is extraordinarily useful, and I'm not dismissing it. The ability to embed a corpus of text, index it semantically, and retrieve relevant chunks at query time is a genuine capability that didn't exist at this fidelity two years ago.
But retrieval gives you similarity: documents ranked by their distance from a query vector. It doesn't give you understanding. It doesn't give you the relationship between decisions. It can't tell you that this preference was superseded by that decision, which was made in light of a constraint that no longer applies because that project ended.
I built a system called LittleGuy that I've been running as my personal knowledge infrastructure for over a year. It uses a Neo4j graph for structured relationships and pgvector for semantic search, and learning when to use which has taught me more about what AI memory actually requires than any paper I've read on the subject.
The temporal dimension is everything.
Here's a concrete example. In January, I decided to build the Govern 365 agent auth model around a single-mode OBO (on-behalf-of) architecture: all agents delegated by a human principal, no independent agent identities.
That's a decision node in my graph. It has a timestamp. It has the context that led to it, a relationship to the project it applies to, the people involved in making it, and the subsequent decisions that were made because of it.
Three months later, when I'm designing a new feature that touches agent permissions, my system doesn't just surface “relevant text about agent auth.” It surfaces the specific decision, its rationale, its current validity, and whether anything downstream has been affected by subsequent choices.
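The difference between similarity and supersession is easy to show in code. A sketch of a decision node and a validity walk, with property names modeled on the structure described above (illustrative, not my actual schema):

```typescript
// A decision is a node with a timestamp and an optional supersession link.
// Property names are illustrative.
interface DecisionNode {
  id: string;
  summary: string;
  madeAt: string;        // ISO timestamp
  project: string;
  rationale: string;
  supersededBy?: string; // id of the decision that replaced this one
}

// Retrieval by similarity would surface every version of the decision.
// The graph walk returns only the one that is currently valid.
function currentDecision(nodes: Map<string, DecisionNode>, id: string): DecisionNode {
  let node = nodes.get(id);
  if (!node) throw new Error(`unknown decision: ${id}`);
  while (node.supersededBy) {
    const next = nodes.get(node.supersededBy);
    if (!next) break;
    node = next;
  }
  return node;
}
```

In a vector store, the superseded decision and its replacement are just two similar chunks of text. In the graph, one of them is wrong to act on, and the system knows which.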
A chat log tells you what was said. A knowledge graph tells you what you know and how you came to know it.
That distinction sounds academic until you're running a multi-agent system on a real codebase with real production constraints. Then it's the difference between an agent that operates with genuine context and one that hallucinates its way through a task with complete confidence.
Why this matters for enterprise AI.
Most enterprise AI deployments are betting on retrieval: vector databases full of documents, policies, emails, tickets, on the assumption that if you retrieve the right text at query time, the model will figure out what to do with it. For question-answering over a document corpus, that's probably fine.
For anything requiring operational continuity (an agent that works on a project across multiple sessions, a system accumulating institutional knowledge over time, an AI that needs to understand how decisions relate to each other), retrieval is a different category of capability than memory, and treating it as a substitute eventually shows.
What this means in practice.
I'm not saying every team needs to build what I built. LittleGuy is my own infrastructure, built for my own use case, reflecting choices made from years of working with graph databases and knowledge systems.
But the teams that will build durable AI products are the ones that treat knowledge as a first-class concern from day one, not as a retrieval problem but as a modeling problem. What are the entities in your domain? What are the relationships between them? How do those relationships change over time? When a decision gets made, what does it invalidate?
These questions have been the domain of knowledge engineers and ontologists for decades. They got marginalized by the big data era, where storing everything was cheap and retrieval felt like it solved the problem. The agents era is bringing them back.
The dark factory.
On building software without a consistent team. What it actually looks like, what breaks, and what I wouldn't go back from.
Let me describe a morning of work.
I start with a queue of intents. Each intent is a structured description of something that needs to happen in the Govern 365 codebase: a bug fix, a feature, a refactor, a test gap. The format is specific (intent ID, scope, acceptance criteria, the worktree it should run in, the eval tier that gates it). I review the queue, add two new items, reprioritize three, and mark one as blocked pending a design decision I haven't made yet. That takes maybe fifteen minutes, and then I let it run.
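The intent format described above can be sketched as a type. The field names are assumptions drawn from that description, not the actual Govern 365 queue schema:

```typescript
// Illustrative shape of an intent; names are assumptions, not the real schema.
type EvalTier = 1 | 2 | 3;

interface Intent {
  intentId: string;
  scope: string;                // what part of the codebase it may touch
  acceptanceCriteria: string[]; // what "done" means, checkable by evals
  worktree: string;             // the isolated git worktree it runs in
  evalTier: EvalTier;           // which gate tier must pass before merge
  status: "queued" | "blocked" | "in-progress" | "done" | "discarded";
}

// Reprioritizing is a queue operation, not a conversation:
// blocked intents are simply skipped until unblocked.
function nextRunnable(queue: Intent[]): Intent | undefined {
  return queue.find((i) => i.status === "queued");
}
```

The structure is the point: an intent is specific enough to evaluate, which is what makes it safe to run unattended.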
By the time I'm done with my first coffee, three intents have completed, two are in progress, and one has been discarded because it introduced a regression that the Tier 1 gate caught. I review the diffs for the completed ones. Two of them are clean. One is technically correct but didn't touch the UI file that should have changed, so I flag it as plumbing-only and kick it back.
This is how Govern 365 Next gets built, not with a team of six developers in a sprint, but with me, a queue, and agents running in git worktrees on my Mac.
What “conductor not operator” actually means.
I use this phrase a lot and I think it confuses people because it sounds like a motivational poster. So let me be concrete.
Operating means watching the agent, correcting it, steering it moment to moment. Your attention is the bottleneck. Conducting means setting the intent, defining the evaluation criteria, reviewing the output, and making decisions that the agent can't make: architectural choices, user experience calls, product direction. The agent runs without you watching it; you look at what it produced.
The shift requires something most people skip: heavy investment in the evaluation layer before the autonomy pays off. If your evals are weak, you can't trust what the agents produce and you end up operating again, reviewing every line and staying in the loop because you have to rather than because you chose to.
The factory doesn't run itself. It runs the parts you've already specified well enough to evaluate.
I spent about six weeks building eval infrastructure before the dark factory model started paying dividends: typecheck gates, lint regression checks, parity verification against a known-good snapshot, screenshot evidence for any intent that touched the UI. Weeks of work that didn't ship anything visible. It felt like overhead. It was the precondition for everything else.
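The gate layer is conceptually simple: each gate is a predicate over a build result, and a diff is discarded the moment any gate in its tier fails. A sketch with gate names mirroring the ones above (the types themselves are illustrative):

```typescript
// Tiered eval gates as predicates over a build result. Illustrative types.
interface BuildResult {
  typecheckErrors: number;
  newLintWarnings: number;  // regressions vs. baseline, not an absolute count
  parityMismatches: number; // differences from the known-good snapshot
}

type Gate = { name: string; pass: (r: BuildResult) => boolean };

const tier1: Gate[] = [
  { name: "typecheck", pass: (r) => r.typecheckErrors === 0 },
  { name: "lint-regression", pass: (r) => r.newLintWarnings === 0 },
  { name: "parity", pass: (r) => r.parityMismatches === 0 },
];

function runGates(gates: Gate[], result: BuildResult): { ok: boolean; failed: string[] } {
  const failed = gates.filter((g) => !g.pass(result)).map((g) => g.name);
  return { ok: failed.length === 0, failed };
}
```

The comparison against a baseline rather than an absolute count is deliberate: the gate asks "did this intent make things worse," not "is the codebase perfect."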
What breaks.
I want to be honest about the failure modes because most people who write about autonomous coding agents don't.
Agents are good at implementing something that was specified clearly. They're much less good at noticing the specification was wrong, or that implementing it will break something adjacent, or that there's a simpler approach that changes the shape of the problem. I've had agents build ten files and 1,200 lines of code for a feature that should have been three files and maybe 200 lines, because I wrote an intent that implied a more complex architecture than necessary. The code was correct, the evals passed, and it was the wrong thing. That's my failure, not the agent's, but it's a failure mode you have to account for. The intent queue is not a replacement for system design; it's downstream of it.
The other thing that breaks: agents don't escalate naturally. When a human developer hits an ambiguity or a design decision they can't resolve, they ask someone. Agents tend to make a choice and keep going, or get stuck in a loop, or produce something that technically satisfies the spec while missing the point. Building explicit escalation paths (cases where the agent surfaces a decision rather than making it) is harder than it sounds, and most frameworks don't have good patterns for it yet.
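One way to make escalation explicit is to make "ask a human" a first-class return value rather than an error path, so the agent surfaces the decision with its candidates instead of silently picking one. A sketch under that assumption, with illustrative names and an arbitrary confidence threshold:

```typescript
// Escalation as a first-class outcome. The threshold and names are
// illustrative, not a framework API.
type StepOutcome<T> =
  | { kind: "done"; value: T }
  | { kind: "escalate"; question: string; options: string[] };

function resolveAmbiguity(
  confidence: number,
  choice: string,
  alternatives: string[],
): StepOutcome<string> {
  // Below the threshold, surface the decision and its candidates
  // instead of making a confident guess.
  if (confidence < 0.8) {
    return {
      kind: "escalate",
      question: "Which approach should this intent take?",
      options: [choice, ...alternatives],
    };
  }
  return { kind: "done", value: choice };
}
```

The discriminated union forces every caller to handle the escalation case; an agent built on this shape cannot "keep going" past an unresolved ambiguity without the compiler objecting.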
What I wouldn't go back from.
The accumulation effect. Every completed intent builds on the last one. The codebase gets better in ways I wouldn't have reached if I were doing all the implementation myself, because the agents have no ego about refactors and no resistance to test coverage.
The focus shift. I'm spending my time on product decisions, architecture, customer conversations, and the things that actually require my judgment, not on syntax or boilerplate or the third retry of a Prisma migration.
And the manufacturing analogy is real. Thinking in terms of intent batches and eval gates rather than tickets and standups is a different relationship to the work, one that suits the way I actually think. I have 35 production customers, a codebase that needed a full architectural reinvention, and no consistent engineering team. The dark factory model is not a luxury. It's a survival strategy, and so far it's working.
Stop calling chatbots agents.
On the semantic inflation that's slowing this industry down, what a real agent requires, and what we lose by confusing the two.
Words matter more in early markets than they do anywhere else. When a category is forming, the language people use to describe things shapes what gets built, what gets funded, and what customers expect. Bad language creates bad expectations. Bad expectations create bad deployments. And enough bad deployments turn a genuine technological shift into a hype cycle cautionary tale. We are in the process of doing this to “agent.”
What's being called an agent.
I've evaluated close to a dozen products in the last six months that are marketed as AI agents. Most of them are one of three things.
A chatbot with tool access, where the user types a message, the model calls a function, and the result comes back in the chat. Genuinely useful, and not an agent. A workflow automation with an AI step somewhere in the middle, where documents go in, a model extracts or classifies, and something happens downstream. Also useful, and also not an agent. Or a very expensive if/then tree dressed up with natural language interfaces, where the model parses intent and routes to pre-built flows. Useful in the right context, and not an agent.
None of these require the things that make agents interesting: persistent goal state, multi-step planning, the ability to decompose a complex intent and execute on it across time without moment-to-moment human direction, and the judgment to recognize when something has changed that requires escalation.
A chatbot that books a meeting is not an agent. An agent is what books the meeting, realizes the preferred time conflicts with something it found in a document it read three sessions ago, surfaces the conflict with a recommendation, and waits for your call.
What agents actually require.
Goal persistence is the foundational requirement. The agent has to maintain a representation of what it's trying to accomplish across the duration of the work, not just across a single context window. This is a memory architecture problem more than a model problem.
Decomposition and planning are required in a way that “chain-of-thought reasoning inside a single prompt” doesn't capture. The agent has to take a goal, break it into steps, sequence those steps correctly, identify dependencies, and update the plan when something changes. The model doesn't do this automatically; you have to architect for it.
And honest failure handling, which is the one that separates real implementations from demos. Agents fail. They hit ambiguity and encounter states the original spec didn't account for. A real agent has a graceful path for this: escalation, checkpointing, rollback. Most things being sold as agents have none of this; they produce confident-sounding output and move on.
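The three requirements can be tied together in one loop shape: goal state that persists outside any context window, a plan the agent consumes and re-sequences, and failure paths that checkpoint or escalate rather than bluff. This is a sketch of the shape, not a framework API; all names are illustrative.

```typescript
// Goal persistence + planning + honest failure handling, as one loop step.
interface GoalState {
  goal: string;
  plan: string[];       // remaining steps, re-sequenced as things change
  completed: string[];
  checkpoint?: string;  // last known-good step to resume from
}

type StepResult = "ok" | "ambiguous" | "failed";

function tick(state: GoalState, result: StepResult): GoalState | "escalate" {
  const [current, ...rest] = state.plan;
  if (!current) return state; // goal complete, nothing left to do
  switch (result) {
    case "ok":
      // Consume the step and record a checkpoint.
      return {
        ...state,
        plan: rest,
        completed: [...state.completed, current],
        checkpoint: current,
      };
    case "ambiguous":
      return "escalate"; // surface the decision instead of guessing
    case "failed":
      // Retry from the checkpoint rather than producing
      // confident-sounding output and moving on.
      return state;
  }
}
```

Note what is absent: there is no branch where failure produces output. That absence is the difference between this loop and most of what is currently sold as an agent.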
Why this matters right now.
Every enterprise I work with is in one of two modes: cautiously evaluating AI agents because they don't want to get burned, or sitting skeptical on the sidelines after deploying something marketed as an agent that didn't hold up in production. The skeptical ones are the harder conversation, because they didn't buy something fake. They bought something real, but it was a chatbot with tool access being sold as an autonomous workflow partner. The failure wasn't the technology; it was the expectation that was set.
I spend a meaningful portion of my time re-calibrating these conversations: here's what an agent is, here's what you actually deployed, here's the gap, here's what it would take to close it. The gap is always in the same places: memory architecture, eval infrastructure, escalation design, and the acknowledgment that human oversight isn't a limitation to be engineered away but a feature that makes the system trustworthy enough to actually use.
A standard worth holding.
The best AI work I've seen in the last year isn't the most autonomous; it's the most precisely defined. The clearest intent specs, the most rigorous eval criteria, the most thoughtful human-agent boundary. That's a construction quality argument, not a conservative one. The autonomous capability comes from the precision of the specification, not from removing human judgment from the loop.
Build the thing that actually works. Define it honestly. Earn the word.