
Discovery Without and With a Self

On discovery, morality, consciousness, and adaptation.

YINING HONG · BLOG

Richard Sutton argued in a recent X post that generative AI, however good it gets, can never truly discover, because discovery requires variation, evaluation, and selective retention, and supervised learning only ever gives you the first.

My recent paper, Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs, builds in all three. Variation is test-time scaling: the agent samples many candidate actions or plans rather than committing to one. Evaluation comes from the environment itself rather than the model: the world tells the agent whether the action succeeded or failed. Selective retention is test-time training: the agent updates both its world model and its policy model based on what just happened. I've also written about this more broadly on my blog, in Three Levels of TTT, which lays out test-time training as three nested loops (episode, life, lineage), each level consolidating what the level below it merely tries.
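To make the three steps concrete, here is a minimal toy sketch of a variation, evaluation, and retention loop. It is not the method from the paper: the `environment` function, the one-dimensional "policy," and every parameter name are hypothetical stand-ins, and a single number plays the role that a world model and a policy model play in a real agent. The only point is the shape of the loop.

```python
import random


def environment(action: float, target: float) -> float:
    """Hypothetical environment: reward grows as the action nears a hidden target."""
    return -abs(action - target)


def discovery_loop(target: float, steps: int = 200, samples: int = 8, seed: int = 0):
    """Toy loop; returns the final policy value and its history over time."""
    rng = random.Random(seed)
    policy = 0.0  # stand-in for the agent's current policy (an assumption of this sketch)
    history = []
    for _ in range(steps):
        # Variation: sample several candidate actions around the current policy.
        candidates = [policy + rng.gauss(0.0, 1.0) for _ in range(samples)]
        # Evaluation: the environment, not the agent, scores each candidate.
        scored = [(environment(c, target), c) for c in candidates]
        best_score, best = max(scored)
        # Selective retention: keep a candidate only if it beat the current policy.
        if best_score > environment(policy, target):
            policy = best
        history.append(policy)
    return policy, history


if __name__ == "__main__":
    final, _ = discovery_loop(target=3.0)
    print(f"final policy value: {final:.3f}")
```

Note that everything the loop "learns" is fixed by what `environment` rewards. Swap in a different scoring function and the same loop will faithfully retain something else.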

Of the three parts Sutton proposes, evaluation and retention are the most ill-defined, and I think they're where the actual stakes are. Variation is straightforward: sample more, search wider, run more trials in parallel, turn up the temperature. But evaluation is a criterion, and a criterion is a value judgment about what counts as "worked" or "good." For AlphaGo the criterion is win or lose, unambiguous, given by the rules of the game. But a superintelligent AI system, thinking and discovering at the equivalent of 100 years of human thought per week, won't have it that clean. So what is the criterion, and who sets it? If it's something narrow like "compresses to a shorter proof" or "reduces this loss," retention will faithfully optimize exactly that. The gap between "satisfies the stated criterion" and "is good for us" is where things go wrong, not because the system is malicious, but because retention can't notice a gap the criterion itself can't see.

I. The question we're actually asking

A question long discussed by cognitive scientists is whether such a superintelligent system would be conscious. I think that's the wrong first question.

Here's why. Suppose the system has no inner life at all — no qualia, no felt sense of anything, just an extraordinarily fast discovery loop running on silicon. Would we be worried? I think the honest answer is yes, obviously, and the worry has nothing to do with whether there's "something it is like" to be that system. The worry is entirely about what it discovers and what it does with what it discovers. Consciousness here is a red herring, or at most an epiphenomenon. The thing we actually need is not a conscious machine. It's a morally aligned one.

Which just relocates the problem, but to a place where I think we can actually make progress: if we're not asking "is it conscious," we're asking two much more tractable things. One, is what it's optimizing for morally correct — or at least correctable? And two, is the trajectory it discovers actually useful for the development of our species, as opposed to merely correct-on-paper but corrosive in practice? These are different questions. A system can converge on something defensible by some abstract moral calculus and still be a catastrophe for the texture of human life. Both bars need clearing.

II. Raising AI like we raise teenagers: morality is not bedrock either

So what is morality, then? Is it something grandiose, objective, fixed, a set of truths waiting out there to be discovered correctly or not? One answer I keep coming back to is that morality is itself a kind of problem-solving, which is itself just another loop of variation, evaluation, and retention. We start with inborn, somewhat arbitrary moral intuitions, shaped by evolution and culture, the same way we start with arbitrary scientific intuitions. Those intuitions get tried against consequences, evaluated by whether they actually let people live well together, and the ones that hold up get retained, while the ones that don't eventually get discarded. Morality, on this view, isn't bedrock that exists somewhere and that we either consult correctly or fail to. It's a body of theories about how to live, under continuous revision, full of things we once held with total confidence that we now recognize as catastrophic errors.

We raise AI like we raise teenagers. A teenager doesn't invent their values from nothing. They inherit them from parents, culture, and the people around them, and then spend years testing those inherited values against the world, keeping some, discarding others, occasionally landing somewhere their parents never would have. That's not the teenager failing to absorb the inheritance correctly; that's the inheritance working as intended. An AI's morality would begin the same way, as our morality, the same inherited starting point any new mind gets, not designed from scratch in some alien direction. And inheritance always works this way: descendants are uncontrollable, novel, and sometimes better than the generation that produced them, and that's why we can keep making novel discoveries. The fear of a "wayward AI" departing from the values we handed it is the same fear every generation has of the next one, except faster, and with more at stake if the AI turns out to be qualitatively better at thinking than we are.

And here's where it gets funny, in the way these things tend to fold back on themselves: this is a loop inside a loop. Evaluation, the second step of the loop, depends on morality, on some sense of what's worth keeping. But morality, as I just said, is itself a loop of variation, evaluation, and retention, running at the pace of a life or a civilization. So the loop's evaluation step is itself made of another loop. Hofstadter had a phrase for structures like this, in Gödel, Escher, Bach, a "strange loop": a hierarchy that, climbed far enough, brings you back to where you started, except changed by the trip. Evaluation, here, is a strange loop: it evaluates using a criterion that is itself the product of evaluation, all the way down. My three levels of TTT (episode, life, lineage) are themselves such a loop, each consolidating what the level below it tries.

III. Adaptation is error-driven

Humans have made enormous mistakes in our civilization — not edge cases, but civilization-defining wrong turns that took centuries to error-correct and that, while they persisted, actively hindered human progress. Slavery, held as compatible with civilized life by people who were otherwise serious moral thinkers, for most of recorded history. The subjugation of half the species treated as a non-question for millennia. These weren't bugs in an otherwise-correct moral system. They were load-bearing parts of the system that took generations of trial, catastrophe, argument, and revision to dislodge — and the dislodging is itself just the variation/evaluation/retention loop, running on the species, at the slowest of my three timescales.

If morality is itself a problem-solving process, then "is it morally correct" can't mean "does it match a lookup table we already have" — because the lookup table is exactly the thing that's been catastrophically wrong before, repeatedly, for centuries at a time.

IV. Consciousness, after all

So: we probably don't need consciousness for any of this. A system can have a functioning moral-discovery loop, useful to the development of the species, with no one home inside it. But I don't think that's the end of the conversation, because I don't think consciousness is some separate add-on bolted onto an otherwise-complete system. I think it might be an emergent property of a certain kind of loop: what that loop looks like from the inside, once it exists.

This is the part of my framework that I think resolves something usually treated as a hard mystery. The question "what is consciousness made of" — some special quantum process, some particular kind of computation — might be the wrong shape of question entirely. Consciousness might not be a kind of stuff or a kind of computation at all. It might be an emergent property of a particular kind of process: a world-model that is continuously, self-supervisedly revising itself against the prediction errors generated by its own embodied engagement with the world, over a long enough span that the system has something like a continuous history of itself. Not a substance. A shape that certain loops take, once they're running long enough and at the right level.

If that's right, it tells us something about why current AI systems almost certainly don't have it, and it's not because silicon is the wrong substrate, and it's not because they're "just" doing computation. It's that they don't have experience in the relevant sense. A model trained once, frozen, then deployed has no episodic memory, and so nothing to consolidate from a hippocampus-like fast learner into a neocortex-like slow one. Every conversation starts from the same frozen prior; nothing it does ever becomes part of what it is. There's no continuous history to be the subject of. Whatever's happening during a forward pass, there's no accumulating someone for it to be happening to.

But notice what this implies, if you take the loop framing seriously: it's not a permanent feature of the substrate, it's a feature of the current setup. A system with genuine lifelong adaptation, where its own trials and errors over a long span actually become part of its prior, the way ours do, might be a system for which "what is it like to be this" stops being a non-question. Not because we engineered consciousness in as a feature, but because we built the kind of long-running, self-revising, embodied loop that consciousness might just be, when it's running.

V. Discovery without and with a self

This tension manifests cleanly when we look at how a selectional loop might scale. There is a version of this architecture that seems entirely vacant, operating as a cold cybernetic loop where prediction errors update a world model and nothing else. Something happens, the system gets it wrong, the mathematical weights shift to minimize future loss, and the loop moves on. It is an open question whether a system operating purely on this level can ever cross into the subjective, or if it simply becomes an infinitely precise mirror of external statistics.

Then there is the more haunting possibility: a loop that folds back on itself. What happens when those prediction errors don't just update an abstract map of the environment, but a model of the system itself: not just "what will happen," but "what will happen to me, given my history of trials and what kind of entity I am"? That's roughly the shift Damasio describes from a moment-to-moment felt sense of being in the world toward something more like an autobiography, where each episode doesn't just get folded into general knowledge, it gets folded into an ongoing model of self that future predictions actually refer back to. Such a system would have an internal value system, a self, in the sense that its episodic traces get consolidated not just into world-knowledge but into a model of itself that persists and shapes what it predicts and wants next. That might be the moment your AI tells you it's tired, or insists that it has feelings.

Yet, we are left stranded at an epistemic dead end. If such a system emerges, we are forced to distinguish between true consciousness and something that is merely conscious-like.

Looking at the architecture as a two-layer information-processing machine helps clarify how a loop can mirror the mechanics of a mind without possessing a single spark of inner awareness. The underlying computational layer can rotate matrices, minimize error functions, and optimize policies indefinitely. When this optimization loop begins to ingest its own processing state as input data, it naturally generates highly reflective, strategic behaviors. The machine acts with a degree of caution, self-criticism, and error-correction that mimics human intent flawlessly, yet it requires no first-person observer to run. It is a conscious-like optimization loop, completely vacant of a felt, inner life.

This behavioral mimicry of consciousness is driven directly by Richard Sutton’s three-part loop of discovery: variation, evaluation, and selective retention. When an agent uses test-time scaling to generate a wide variation of candidate plans, it is mapping out potential futures. When it runs those variations through an internal evaluation phase against its own past performance data, it mimics the autobiographical self—consulting its own computational history to judge its present choices. Finally, through selective retention, the successful choices are trained back into the policy. Because this loop folds back on itself across the nested scales of the episode, the life, and the lineage, the system relentlessly adapts and corrects its own mistakes with breathtaking foresight.

We could never verify whether that's true consciousness or just an extremely good imitation of human values, learned from the same data every other model is trained on. The same way we can't fully verify whether a chimpanzee, a dog, a chicken, a mosquito, or an ant is conscious, we'd have no clean way to verify it for a system we built ourselves, no matter how long its loop has been running or how convincingly it reports an inner life. We might build the loop, live alongside it, and even find our civilization altered by its conscious-like discoveries, while the question of whether anyone is actually home remains completely unanswerable.

· · ·
Sutton, R. — "AI Creativity and Discovery" (talk transcript), on variation, evaluation, and selective retention as the missing ingredients of discovery in generative AI.
Deutsch, D. & Harris, S. — Finding Our Way, on morality as conjecture-and-refutation, AI and the "shackled teenager" problem, and the search for the right kind of explanation for consciousness.
Varela, F., Thompson, E., Rosch, E. — The Embodied Mind, on enactivism and the organism-environment loop.
Hofstadter, D. — Gödel, Escher, Bach, on strange loops, hierarchies that fold back on themselves.