You can't evaluate what you don't understand
There's a popular thesis making the rounds: software engineering as a skill is becoming obsolete. The argument goes that with AI, the interface to building software is now English rather than a programming language — and since everyone already speaks English, the 5-10 years it took to master programming no longer matter.
I use AI every day. It has genuinely changed how I work, and I believe it will continue getting dramatically better. I'm not here to argue that AI isn't as good as people say — it might be even better. My point is different: it is very hard to know how good it actually is unless you already understand the domain it's operating in. And that has consequences for how we talk about what it replaces.
The Pattern We Keep Forgetting
Every major leap in abstraction has come with the same declaration: "now anyone can program."
Punch cards gave way to assembly. Assembly gave way to C. C gave way to higher-level languages. Each transition commoditized the previously-hard thing. Writing machine code used to be the skill. Then memory management. Then manual string handling. Each layer made the last generation's hard problem disappear, and each time, the actual hard problems simply moved up the stack.
AI is the next layer in this sequence — and it's a genuinely impressive one. It makes previously hard things trivial, and it does so faster and more dramatically than any prior shift. Building a CRUD app with AI is easy. But building a CRUD app was already easy — Rails solved it in 2005, WordPress even earlier. The impressive demos tend to showcase problems that were already solved.
The problems that have always been hard — scalability, fault tolerance, security, distributed systems, architecture that holds up when requirements change — have survived every previous wave of abstraction. Maybe this wave is different. AI might genuinely be the one that breaks the pattern. But here's where it gets interesting: how would you know?
The Evaluation Gap
If I read about a potential medical breakthrough, it might sound perfectly reasonable to me. The methodology might seem sound, the results promising, the implications exciting. But I'm not a doctor. I have no framework for distinguishing a genuine advance from something that falls apart under scrutiny. To me, both look identical.
This is the core problem, and it has nothing to do with whether AI is good or bad at its job: the ability to generate a solution and the ability to evaluate a solution are entirely different skills. AI dramatically lowers the cost of generation. It does not lower the cost of evaluation.
When a non-engineer uses AI to build software, the output looks great. The code runs. The app works — at least in the demo. But they can't see what's missing. They can't see the SQL injection vulnerability, the architecture that will collapse under concurrent load, the error handling that's absent, the implicit assumptions that will break when a single dependency updates.
The hardest bugs have never been wrong code. They're absent code — missing edge cases, missing security considerations, missing failure modes. You need to know what should be there to notice that it isn't.
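To make that concrete, here is a minimal, hypothetical sketch in Python. The function names, the schema, and the SQLite setup are invented for illustration; the point is that both versions return the right rows in a happy-path demo, and nothing on the screen tells a non-expert which one survives hostile input.

```python
import sqlite3

def find_user(conn: sqlite3.Connection, username: str):
    # Looks finished: the demo shows the right rows for every normal username.
    # But the input is concatenated straight into the SQL string, so a value
    # like  "x' OR '1'='1"  turns the query into "return every user".
    query = f"SELECT id, email FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()

def find_user_parameterized(conn: sqlite3.Connection, username: str):
    # The safe version is visually almost identical: a parameterized query
    # behaves the same on well-formed input and only differs under attack.
    query = "SELECT id, email FROM users WHERE username = ?"
    return conn.execute(query, (username,)).fetchall()
```

The difference never shows up in a demo. You have to already know that the parameterization should be there to notice that it isn't.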
This doesn't mean the evaluation gap is binary. A junior developer with two years of experience can evaluate AI-generated code meaningfully — not as well as a principal engineer, but far better than someone with no engineering background at all. The gap is a spectrum, and partial competence is real and valuable. But the spectrum matters: the further you are from domain expertise, the less your evaluation is worth, and — critically — the less able you are to know how far you are.
Why "Looks Good to Me" Means Nothing
Dan Sperber and colleagues developed the concept of epistemic vigilance — the cognitive mechanisms humans have evolved for detecting unreliable information in communication. A critical feature of these mechanisms: they work by looking for reasons to doubt, not reasons to trust. In the absence of doubt-triggers, information is provisionally accepted. Not through active trust, but through the failure of doubt to activate.
This matters enormously for AI. The signals that normally trigger epistemic vigilance in human communication — hesitation, inconsistency, disfluency, visible uncertainty — are systematically absent from AI output. Large language models are optimized to produce fluent, confident, coherent text. They are, in effect, engineered to pass beneath the radar of our epistemic vigilance systems.
So when a non-expert reads AI-generated output and thinks "this looks good," that is not a positive epistemic judgment. It is the absence of triggered doubt — which, in this context, carries no information at all. The same cognitive response would occur whether the output were brilliant or nonsensical, as long as it were fluent.
This is more precise than the common complaint that "AI sounds confident." The issue isn't that AI is deceptive — it isn't trying to mislead anyone. The issue is that the cognitive mechanisms we evolved to detect unreliable communication are calibrated for human speech, and AI output doesn't trigger them. Fluency is not a signal of correctness, but our brains treat it as one.
The Goldman Recursion
Here's where the argument becomes difficult to dismiss.
Alvin Goldman, in his work on social epistemology, posed what he called the novice/expert problem: when a layperson faces competing expert claims, how can they rationally decide whom to trust? He proposed five criteria — evaluating the quality of arguments, checking expert consensus, examining track records, looking for biases, consulting meta-experts.
Subsequent philosophers demonstrated that Goldman's criteria are themselves expertise-dependent. You need genuine expertise to evaluate whether an argument is actually cogent or merely plausible-sounding. You need to know the field to know which "other experts" are credible. You need meta-expertise to evaluate meta-experts. The criteria, as one critic put it, "can't do the work Goldman wants from them" because deploying them requires the very expertise they're meant to substitute for.
This is the recursive structure at the heart of the problem: any criterion for evaluating an epistemic source requires the epistemic capacity that the source is supposed to replace.
Applied to AI: the person who doesn't understand software can't assess whether AI has "solved" software. The person who doesn't understand law can't assess whether AI has "solved" law. The engineer who says "AI will replace lawyers" and the lawyer who says "AI will replace engineers" are making the exact same error — evaluating output in a domain where they can't distinguish good from bad.
An AI-generated contract might look perfectly fine to an engineer and completely fall apart in court. AI-generated infrastructure might look perfectly fine to a lawyer and completely fall apart under load. The claim "AI makes expertise in X obsolete" can only be reliably evaluated by someone with expertise in X, but the people making the claim almost never have that expertise.
To be clear: AI might already have crossed the threshold of genuine expert-level competence in some domains, or it might get there soon. The problem isn't capability — it's that we are structurally incapable of recognizing the moment it happens. Non-experts will declare victory too early, and experts will be dismissed as protecting their turf.
Epistemic Dependence and the Missing Foundation
John Hardwig argued in 1985 that modern knowledge is irreducibly social. No individual can personally verify more than a tiny fraction of what they believe. His striking conclusion: "rationality sometimes consists in refusing to think for oneself." When someone is genuinely more expert than you in their domain, deferring to them is not intellectual weakness — it is the rational response.
But Hardwig grounded this dependence in something specific: the moral character of the expert. Trust in expertise rests not just on competence but on professional norms, reputational stakes, accountability, and honesty. The epistemic relationship is ultimately an ethical one — you trust the expert partly because they have something to lose by being wrong, and something to uphold by being right.
AI systems have none of this. There are no professional norms, no reputational stakes, no accountability in the human sense. The ethical foundation that Hardwig identified as essential to rational epistemic dependence is absent. This doesn't mean AI output is unreliable — it might be excellent. But it means the basis on which we normally justify trusting an epistemic source doesn't apply in the usual way, and we haven't built adequate substitutes.
Paul Humphreys' concept of epistemic opacity compounds the problem. A process is epistemically opaque when a human agent cannot know all of the epistemically relevant elements connecting its inputs to its outputs. Deep learning systems are epistemically opaque in a strong sense — even experts cannot fully trace why a particular output was generated. For non-experts, this creates a double opacity: they cannot evaluate the output on domain grounds, and they cannot evaluate the process that produced it. They are trusting a source they can't verify, through a mechanism they can't inspect.
Performative Fluency Is Not Expertise
Harry Collins and Robert Evans developed a taxonomy of expertise that clarifies what AI actually gives its users. They distinguish between interactional expertise — fluency in the language and concepts of a field, the ability to talk like an expert — and contributory expertise — the ability to actually do the work, to contribute to the field.
Collins showed that genuine interactional expertise requires deep immersion in the expert community. What AI provides to non-expert users is something I'd call performative fluency: the ability to produce expert-sounding language without the understanding that would allow you to detect when the language is wrong. This is distinct from Collins' interactional expertise because it lacks the tacit understanding that comes from genuine immersion — it is surface-level pattern reproduction, not comprehension.
This is arguably more dangerous than simple ignorance. An ignorant person knows they need help. A person with performative fluency believes they have the help they need. They can speak confidently about distributed systems or contract law or pharmacology, using precisely the right terminology, while having no way to detect when the AI's confident, fluent output has led them astray.
The Reliabilist Objection
The most serious philosophical challenge to this entire argument comes from reliabilism — and it's one I need to address honestly, because it threatens to dissolve the problem.
Goldman himself, in his better-known work on the theory of justification, is a reliabilist. The core idea: if a process reliably produces true beliefs, then beliefs formed through that process are justified, regardless of whether the agent can inspect the process. You are justified in trusting a calculator without understanding its circuits, a compiler without reading its source, GPS without understanding satellite trilateration. What matters is the track record of the process, not your ability to evaluate individual outputs.
A reliabilist would look at the evaluation gap and say: so what? If AI code generation produces correct, secure, maintainable code 97% of the time, then deferring to it is epistemically justified even without understanding it. The non-expert's inability to evaluate the output is irrelevant if the process is reliable. We don't demand that people understand x86 assembly before trusting a compiler, or elliptic curve cryptography before trusting TLS. We trust these systems through institutional scaffolding, empirical track records, and adversarial testing — not through personal inspection.
This is a genuinely strong objection, and I think it's partially right. We already live in a world of epistemic dependence on opaque systems, and it mostly works. The question is whether AI — in its current form — has earned the kind of trust we extend to compilers and cryptographic libraries.
I think there are important disanalogies. A compiler has a fixed, formally verifiable specification. Its failure modes are well-characterized. It has been adversarially tested over decades by millions of users. Its behavior is deterministic — the same input produces the same output. When it fails, the failure is typically loud and immediate.
AI code generation has none of these properties. Its behavior is stochastic. Its failure modes are not well-characterized — indeed, the most concerning failures are silent ones, where the output looks correct but isn't. It has not been adversarially tested at scale across the domains where it's being deployed. And critically, it fails in a way that is optimized to be undetectable: the same fluency that makes good output convincing makes bad output equally convincing.
The reliabilist framework is correct in principle: if AI reaches demonstrated, domain-specific reliability comparable to a compiler, then trusting it without understanding it becomes justified. But establishing that reliability requires the expertise this essay argues is being bypassed. You cannot determine that an AI system reliably produces correct legal contracts without legal expertise. You cannot determine that it reliably produces secure infrastructure without security expertise. The reliabilist answer to the evaluation gap presupposes that the evaluation has already been done — by someone with the expertise to do it.
So reliabilism doesn't dissolve the problem. It reformulates it: the evaluation gap is not a permanent barrier to trusting AI, but it is a barrier to knowing when that trust is warranted. We can get there. For most domains we aren't there yet, and the gap means we may not notice the moment we cross over.
The Pragmatist Challenge
There's a related objection from the pragmatist tradition worth taking seriously. A Deweyan pragmatist would ask: if the AI-generated code works, passes tests, survives production, handles users — what additional epistemic work is being demanded, and why? Pragmatism locates justification in consequences, not in the agent's capacity for inspection. If it works, it works.
This is compelling in practice and incomplete in theory. "It works" is itself an evaluation that depends on what you're measuring and over what time horizon. Code that works today may embed security vulnerabilities that don't manifest until an attacker finds them. Architecture that works at current scale may collapse at 10x. A contract that "works" — in the sense that both parties sign it — may fail catastrophically when disputed in court.
The pragmatist challenge is strongest for domains where feedback is fast, failure is cheap, and consequences are reversible. For those domains, the evaluation gap may genuinely not matter much — ship it, see what happens, fix what breaks. But for domains where failure is slow to manifest, expensive, or irreversible, "it works so far" is weak evidence of "it works." And distinguishing which domain you're in is, again, a judgment that requires expertise.
A Confession
I should acknowledge a tension in this essay that a careful reader will have already noticed: I am a software engineer writing about epistemology. By the logic of my own argument, you should ask whether I can reliably evaluate whether Goldman's recursion applies here, whether Sperber's framework extends correctly, whether Collins' taxonomy bears the weight I've placed on it.
I think this is fair. I've tried to engage with these ideas honestly rather than superficially, but I am not an epistemologist, and the evaluation gap applies to me too. I could be misapplying these frameworks in ways I can't detect — the same way a non-engineer using AI to build software might produce something that looks right to them but falls apart under scrutiny.
I don't think this undermines the argument. If anything, it demonstrates it. The evaluation gap is not a theoretical abstraction — it's something I am living, right now, in writing this essay. The best I can do is be transparent about it and invite correction from people who know this literature better than I do. That's the honest position, and it's the one I'd want anyone relying on AI in a domain they don't fully understand to adopt.
What Would "Replaced" Actually Mean?
If we want to take the question seriously, we need to define what it means for expertise to be replaced. I think there are four markers, and people tend to conflate them:
Output equivalence — the AI produces results indistinguishable from an expert's. This is what most people point to, and it's the weakest marker. "Indistinguishable to whom?" is the question that unravels it. Output that's indistinguishable to a non-expert tells you nothing.
Failure equivalence — not just "does it produce good results" but "does it fail in the same ways, at the same rate, with the same recovery characteristics?" Experts don't just build things that work. They build things that fail gracefully and predictably, because they've seen things go wrong before. If AI systems fail in novel, unpredictable ways that require expert intervention to diagnose, the expertise hasn't been replaced — it's been hidden behind a layer that occasionally collapses.
Full-loop autonomy — can the system independently identify problems, diagnose them, decide on an approach, implement it, verify it, and know when to stop? Not on a benchmark, but in the real world, where requirements are wrong, constraints are unstated, and "done" is as much a political question as a technical one.
Compounding judgment — does the AI's work hold up over time? An expert doesn't just solve today's problem. They make decisions that remain sound six months later because they understood where the system was heading. Architecture that optimizes for now but creates compounding debt is precisely the kind of thing that looks great on day one and falls apart on day ninety.
Expertise is genuinely replaced when all four hold simultaneously — as evaluated by existing experts. It's possible AI gets there across the board. But most demonstrations today show only the first marker, judged by non-experts. That's worth being honest about: not because AI is bad, but because we owe ourselves better evidence before drawing conclusions.
So What?
I believe AI will continue improving significantly — probably in ways we can't anticipate. Today's models are the worst we will ever use; the ones ten years from now will be better. It's entirely possible that the pattern of "hard problems move up the stack" eventually breaks. I wouldn't bet against it.
But my argument doesn't depend on where AI capability ends up. It's about how we reason about it right now. The epistemological tradition — from Goldman's recursion through Sperber's vigilance, Hardwig's dependence, and Collins' taxonomy — converges on a consistent finding: evaluating knowledge claims in domains you don't understand is not merely difficult but structurally unreliable, and the subjective sense that you're evaluating well is precisely the signal you cannot trust.
The reliabilist is right that this problem is solvable: build the track records, do the adversarial testing, establish domain-specific reliability benchmarks. The pragmatist is right that for low-stakes, fast-feedback domains, the gap may not matter much in practice. But for the domains where it does matter — and recognizing which those are is itself an expert judgment — we should be skeptical of confident claims about AI's sufficiency from people who aren't experts in the relevant field. Not because they're wrong. They might be right. But they have no reliable way of knowing, and neither do we.