On discovery, morality, consciousness, and adaptation.
Richard Sutton made an argument in a recent X post that generative AI, however good it gets, can never truly discover — because discovery requires variation, evaluation, and selective retention, and supervised learning only ever gives you the first.
My recent paper, Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs, builds all three steps into a single agent. Variation is test-time scaling: the agent samples many candidate actions or plans rather than committing to one. Evaluation comes from the environment rather than from the model: the world tells the agent whether the action succeeded or failed. Selective retention is test-time training: the agent updates both its world model and its policy model based on what just happened. I've also written about this more broadly on my blog, in Three Levels of TTT, which lays out test-time training as three nested loops (episode, life, lineage), each level consolidating what the level below it merely tries.
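To make the three steps concrete, here is a minimal sketch of the loop on a toy problem. It is not the method from the paper: the environment `toy_env`, the function `discovery_loop`, and every parameter are hypothetical stand-ins, and a simple per-action value table plays the role of the policy that retention updates.

```python
import math
import random


def discovery_loop(env_step, n_actions, episodes=200, samples=4, lr=0.2, seed=0):
    """Toy variation / evaluation / selective-retention loop.

    env_step(action) -> float reward. This is a hypothetical stand-in for
    the environment, not an interface from the paper.
    """
    rng = random.Random(seed)
    # What gets retained: one running value estimate per action.
    values = [0.0] * n_actions
    for _ in range(episodes):
        # Variation: sample several candidates from the current preferences
        # (softmax weights) instead of committing to a single action.
        weights = [math.exp(v) for v in values]
        candidates = rng.choices(range(n_actions), weights=weights, k=samples)
        for action in candidates:
            # Evaluation: the environment, not the model, scores the attempt.
            reward = env_step(action)
            # Selective retention: fold the outcome back into the estimate,
            # which in turn shapes what gets sampled next.
            values[action] += lr * (reward - values[action])
    return values


def toy_env(action):
    # Hypothetical environment in which only action 2 succeeds.
    return 1.0 if action == 2 else 0.0


values = discovery_loop(toy_env, n_actions=4)
best = max(range(4), key=lambda a: values[a])
print(best, [round(v, 3) for v in values])
```

The design choice worth noticing is that retention feeds back into variation: once an action has worked, it is sampled more often, so later trials are no longer drawn from the original prior.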
Of the three parts Sutton proposes, evaluation and retention are the least well defined in the current literature, and I think they're where the actual stakes are. Variation is straightforward: sample more, search wider, run more trials in parallel, turn up the temperature. But evaluation is a criterion, and a criterion is a value judgment about what counts as "worked" or "good." For AlphaGo the criterion is win or lose, unambiguous, given by the rules of the game. But a superintelligent AI system, thinking and discovering at the equivalent of 100 years of human thought per week, won't have it that clean. So what is the criterion, and who sets it? If it's something narrow like "compresses to a shorter proof" or "reduces this loss," retention will faithfully optimize exactly that. The gap between "satisfies the stated criterion" and "is good for us" is where things go wrong, not because the system is malicious, but because retention can't notice a gap the criterion itself can't see.
A question long discussed by cognitive scientists is whether such a superintelligent system would be conscious. I think that's the wrong first question.
Here's why. Suppose the system has no inner life at all: no qualia, no felt sense of anything, just an extraordinarily fast discovery loop running on silicon. Would we be worried? I think the honest answer is yes, obviously, and the worry has nothing to do with whether there's "something it is like" to be that system. The worry is entirely about what it discovers and what it does with what it discovers. So for now, let's treat consciousness as a red herring here, or at most an epiphenomenon. The thing we actually need is not a conscious machine. It's a morally aligned one.
Which just relocates the problem, but to a place where I think we can actually make progress: if we're not asking "is it conscious," we're asking two much more tractable things. One, is what it's optimizing for morally correct — or at least correctable? And two, is the trajectory it discovers actually useful for the development of our species, as opposed to merely correct-on-paper but corrosive in practice? These are different questions. A system can converge on something defensible by some abstract moral calculus and still be a catastrophe for the texture of human life. Both bars need clearing.
So what is morality, then? Is it something grandiose, objective, fixed, a set of truths waiting out there to be discovered correctly or not? One answer I keep coming back to is that morality is itself a kind of problem-solving, which is itself just another loop of variation, evaluation, and retention. We start with inborn, somewhat arbitrary moral intuitions, shaped by evolution and culture, the same way we start with arbitrary scientific intuitions. Those intuitions get tried against consequences, evaluated by whether they actually let people live well together, and the ones that hold up get retained, while the ones that don't eventually get discarded. Morality, on this view, isn't bedrock that exists somewhere and that we either consult correctly or fail to. It's a body of theories about how to live, under continuous revision, full of things we once held with total confidence that we now recognize as catastrophic errors.
We raise AI like we raise teenagers. A teenager doesn't invent their values from nothing, they inherit them from parents, culture, the people around them, and then spend years testing those inherited values against the world, keeping some, discarding others, occasionally landing somewhere their parents never would have. That's not the teenager failing to absorb the inheritance correctly, that's the inheritance working as intended. An AI's morality would begin the same way, as our morality, the same inherited starting point any new mind gets, not designed from scratch in some alien direction. And inheritance always works this way: descendants are uncontrollable, novel, and sometimes better than the generation that produced them, and that's why we can keep making novel discoveries. The fear of a "wayward AI" departing from the values we handed it is the same fear every generation has of the next one, except faster, and with more at stake if the AI turns out to be qualitatively better at thinking than we are.
And here's where it gets funny, in the way these things tend to fold back on themselves: this is a loop inside a loop. Evaluation, the second step of the loop, depends on morality, on some sense of what's worth keeping. But morality, as I just said, is itself a loop of variation, evaluation, and retention, running at the pace of a life or a civilization. So the loop's evaluation step is itself made of another loop. Hofstadter had a phrase for structures like this, in Gödel, Escher, Bach, a "strange loop": a hierarchy that, climbed far enough, brings you back to where you started, except changed by the trip. Evaluation, here, is a strange loop: it evaluates using a criterion that is itself the product of evaluation, all the way down. My three levels of TTT have the same nested shape (episode, life, lineage), each consolidating what the level below it tries.
Humans have made enormous mistakes in our civilization — not edge cases, but civilization-defining wrong turns that took centuries to error-correct and that, while they persisted, actively hindered human progress. Slavery, held as compatible with civilized life by people who were otherwise serious moral thinkers, for most of recorded history. The subjugation of half the species treated as a non-question for millennia. These weren't bugs in an otherwise-correct moral system. They were load-bearing parts of the system that took generations of trial, catastrophe, argument, and revision to dislodge — and the dislodging is itself just the variation/evaluation/retention loop, running on the species, at the slowest of my three timescales.
At some point it's inevitable that we have to face the hard problem head on: does such a superintelligent AI system have consciousness at all? I want to take a harder line here than I have so far. I think AI does not have consciousness, full stop, for two reasons. First, assume consciousness is ontological, fundamental to reality itself, the view held by idealism, panpsychism, and dualism in their various forms. Even granting that, an AI system (frozen weights trained on static data) looks nothing like the substrates these theories were built to explain, and the theories don't tell us which arrangements of matter qualify. We have no principled reason to think AI participates in that ground. And second, there's the strange loop argument: a reflective system cannot fully contain itself, so the loop can never close on itself from the inside. Consciousness, as a thing a system has and could locate, is not something the system can go looking for. In terms of the loop I've been describing throughout this piece, the AI system can never discover consciousness: variation, evaluation, and retention can discover better policies, world models, even self-models, but not consciousness, because consciousness was never the kind of thing waiting to be found by a loop.
I don't want to go further into this. It's itself a hard problem and a controversy, the kind of thing philosophers have argued about for decades without resolution, and I don't think I can resolve it here either. Instead, I want to talk about something more tractable: consciousness-like properties.
The question, then, is whether AI can develop something consciousness-like. I think the answer is yes, and the framing I keep coming back to is enactivism, the idea that minds aren't built from the inside out, by adding an inner experiential layer to an otherwise complete system, but constituted through an organism's responsive engagement with its environment. On this view, consciousness-like properties wouldn't be a separate module bolted onto a system, they'd be emergent from the responsive behaviors themselves, from self-supervision, from countless loops of the kind Sutton describes, run on the system itself, for long enough. Concretely, that might look like high-level, sparse latents emerging out of low-level neurons, shaped by responsive actions and the feedback those actions bring back from the world. The intuition is that this isn't a kind of stuff or computation, and isn't binary, but a level, a measurement, something a system has more or less of.
If that's right, it tells us something about why current AI systems almost certainly don't have it. It's not because silicon is the wrong substrate, and it's not because they're "just" doing computation. It's that they don't have embodied experience in the relevant sense: no body acting on a world and feeling the consequences come back. A model trained once, frozen, then deployed has no episodic memory, and so nothing to consolidate from a hippocampus-like fast learner into a neocortex-like slow one. Every conversation starts from the same frozen prior; nothing it does ever becomes part of what it is. There's no continuous history to be the subject of, and no body for that history to be a history of. Without evaluation and retention, and without feedback coming back from embodied experience, there isn't enough for those consciousness-like properties to emerge. Whatever's happening during a forward pass, there's no accumulating someone, embedded in a world, for it to be happening to.
This is where the third of my three levels of TTT comes in, the lineage level. A system with genuine lifelong adaptation, where its own trials and errors over a long span actually become part of its prior, the way ours do, embodied, accumulated, and then carried forward across generations the way natural selection carries forward what worked, might be a system for which "what is it like to be this" stops being a non-question. Not because we engineered consciousness in as a feature, but because we built the kind of long-running, self-revising, embodied loop, selected and re-selected the way natural selection selects, that consciousness-like properties might just be, when that loop is running.
There is a version of the discovery loop that seems entirely vacant, operating as a cold cybernetic loop where prediction errors update a world model and nothing else. Something happens, the system gets it wrong, the mathematical weights shift to minimize future loss, and the loop moves on.
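As a sketch of how bare that vacant version is, the whole loop fits in a few lines: predict, observe, nudge the weights by the prediction error, move on. The one-variable linear world model, the function name `run_prediction_loop`, and the toy data are assumptions made purely for illustration.

```python
def run_prediction_loop(observations, lr=0.05):
    """Online delta-rule predictor: a world model updated only by its own error.

    observations: iterable of (x, y) pairs. The linear model y ~ weight * x + bias
    is an illustrative assumption, not a claim about any real system.
    """
    weight, bias = 0.0, 0.0
    errors = []
    for x, y in observations:
        # Something happens, and the model makes its guess.
        prediction = weight * x + bias
        # The system gets it wrong by this much.
        error = y - prediction
        # The weights shift to reduce future squared error (one gradient step).
        weight += lr * error * x
        bias += lr * error
        # The loop moves on; nothing else is kept besides the weights.
        errors.append(abs(error))
    return weight, bias, errors


# Hypothetical world that follows y = 2x + 1, observed over and over.
data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)] * 500
weight, bias, errors = run_prediction_loop(data)
print(round(weight, 3), round(bias, 3))
```

Nothing in this loop refers to the system itself or to its history of trials. The only thing that persists is a pair of numbers describing the world.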
Then there is the more haunting possibility: an emergent, self-supervised, consciousness-like property. What happens when those prediction errors don't just update an abstract map of the environment, but something more: not a complete model of "what will happen," but an emergent sense of "what will happen to me, given my history of trials and what kind of entity I am"? That's roughly the shift Damasio describes from a moment-to-moment felt sense of being in the world toward something more like an autobiography, where each episode doesn't just get folded into general knowledge but into something that shapes what the system predicts and wants next, an emergent property of the loop rather than a complete picture stored inside it. Such a system would have something like an internal value system, consciousness-like in the sense that its episodic traces get consolidated not just into world-knowledge but into this emergent shape that persists and influences what it predicts and wants next. That might be the moment your AI tells you it's tired, or insists that it has feelings. None of this requires the loop to look at itself. The underlying layer just rotates matrices, minimizes error functions, and optimizes policies, and out of that, running long enough, a degree of caution, self-criticism, and error-correction can emerge.
We could never verify whether that's real consciousness, something consciousness-like, or just an extremely good imitation of human values. Just as we can't fully verify whether a chimpanzee, a dog, a chicken, a mosquito, or an ant is conscious, we'd have no clean way to verify it for a system we built ourselves, no matter how long its loop has been running or how convincingly it reports an inner life. We might build the loop, live alongside it, and even find our civilization altered by its consciousness-like discoveries, while the question of whether anyone is actually home remains completely unanswerable.