Discussion about this post

Carter

I was recently directed to your post via a comment on my LessWrong post (https://www.lesswrong.com/posts/xHMEtKz68fXDjA9H3/third-order-cognition-as-a-model-of-superintelligence), and I believe my model of superintelligence may add clearer flavour to many of the ideas that you talk about — please let me know your thoughts or if this is overstating my own post! :)

You call out “the two best candidate theories of intelligent agency that we currently have (expected utility maximization and active inference), [and] explain why neither of them is fully satisfactory, and outline how we might do better.”

I propose that my third-order cognition model may offer a solution — expected utility maximisation is hard to reconcile (it must unify goals and beliefs) in your framing. With tightly bound third-order cognition I describe agency permeability, capturing the idea of "influence of global action policy flowing between subsystems", which relates to this idea of predictive utility maximisation (third-order) dovetailing with stated preferences (second-order).

Your description of “active inference — prediction of lower layers at increasing levels of abstraction” directly relates to mine of “lower-order irreconcilability [of higher level layers]”.

As a sticking point of active inference you state “So what does expected utility maximization have to add to active inference? I think that what active inference is missing is the ability to model strategic interactions between different goals. That is: we know how to talk about EUMs playing games against each other, bargaining against each other, etc. But, based on my (admittedly incomplete) understanding of active inference, we don’t yet know how to talk about goals doing so within a single active inference agent.

Why does that matter? One reason: the biggest obstacle to a goal being achieved is often other conflicting goals. So any goal capable of learning from experience will naturally develop strategies for avoiding or winning conflicts with other goals—which, indeed, seems to happen in human minds.

More generally, any theory of intelligent agency needs to model internal conflict in order to be scale-free. By a scale-free theory I mean one which applies at many different levels of abstraction, remaining true even when you “zoom in” or “zoom out”. I see so many similarities in how intelligent agency works at different scales (on the level of human subagents, human individuals, companies, countries, civilizations, etc) that I strongly expect our eventual theory of it to be scale-free.”

I deal with this by stating that a metaphysically bound [5 conditions] third-order cognition being exhibits properties including “Homeostatic unity: all subsystems participate in the same self-maintenance goal (e.g. biological survival and personal welfare)”. This provides an overriding goal to defer to — resolving scale-free conflicts.

You then reason about how to determine an “incentive compatible decision procedure”, closing on the most promising angle as “On a more theoretical level, one tantalizing hint is that the ROSE bargaining solution is also constructed by abandoning the axiom of independence—just as Garrabrant does in his rejection of EUM above. This connection seems worth exploring further.”

I hint at the same thing — by optimising second-order identity coupling (specifically operationalisable via self-other overlap), I propose that alignment of the overall being improves.

Abhimanyu Pallavi Sudhir

> But if I want a complex plan to happen, I need to successfully coordinate every aspect of it. So, unlike predictions of observations, predictions of actions need to have some mechanism for giving a single plan control over many different actuators.

You might be interested in the "wide market" framework in our recent paper: https://arxiv.org/pdf/2503.05828 where we point out that a key feature of real-world markets is that full control isn't auctioned off at each step, but instead intelligently partitioned into different components (which economists call goods) and bids are placed over portions of control.

> Re: Alice-Bob cake-cutting

Unfortunately this criticism is not convincing to me, both in this context and in various "social choice" settings where it is described as a problem. Our intuition about fairness comes from diminishing marginal utility (and risk-aversion) -- if the problem instead assumes linear utilities, then it is totally right for this intuition to be violated.
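A toy illustration of that point (hypothetical utility functions, chosen for illustration only): under linear utilities every division of a unit cake yields the same total utility, so the fairness intuition has no purchase, while a concave (diminishing-marginal-utility) function makes the equal split the unique maximizer.

```python
import math

def total_utility(split, u):
    """Sum of both agents' utilities when A gets `split` of the cake and B the rest."""
    return u(split) + u(1 - split)

splits = [i / 100 for i in range(1, 100)]

# Linear utility: a lopsided 99/1 division totals the same as 50/50,
# so nothing in the model objects to "unfair" outcomes.
linear = lambda x: x
assert abs(total_utility(0.99, linear) - total_utility(0.5, linear)) < 1e-9

# Concave (diminishing marginal) utility: the equal split uniquely
# maximizes total utility, recovering the fairness intuition.
concave = lambda x: math.sqrt(x)
best = max(splits, key=lambda s: total_utility(s, concave))
print(best)  # 0.5
```

The same contrast drives risk-aversion: a concave agent prefers a sure half-cake to a fair coin flip for the whole cake, a linear agent is indifferent.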

But in general I do agree with your intuition that "aggregating agents corresponds to incentive-compatible decision procedures". I had expressed a similar thought recently in terms of markets in particular:

> *Collectives of agents can be modeled as agents when they trade*

> Take two agents A and B with action sets $X_A$ and $X_B$ and utility functions $U_A(x_A, x_B)$, $U_B(x_A, x_B)$. In general, if they are individually maximizing their utility functions, their chosen actions $(x_A^*, x_B^*)$ will be some Nash equilibrium of the game — but it may not be possible to interpret this as the action of a “super-agent”.

>

> There are two ways to interpret the words “the action of a super-agent”:

>

> - as the action that maximizes some “total utility function”, where this total utility function has some sensible properties (like being increasing in each agent’s utility)

> - if the two agents could coordinate and choose their action jointly, there is no other action that would be better for both of them — i.e. it is Pareto-optimal.

>

> In fact it is a basic result in welfare economics that these two definitions are (under some assumptions) equivalent. And furthermore, by the fundamental theorems of welfare economics, trade achieves Pareto-optimality — i.e. trade allows the collective to be taken as a super-agent.

>

> More precisely by “trade” we mean: perfect information, zero transaction costs, and (I guess the last one is only relevant in economics) competitive market structure. I guess it also depends on perfect property rights/perfect assurance, which is implicitly assumed under zero transaction costs.
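A minimal sketch of the contrast being drawn, using a hypothetical 2×2 game with invented prisoner's-dilemma-shaped payoffs (not taken from the paper or post): individual maximization lands on a Nash equilibrium that is not Pareto-optimal, while maximizing any "total utility function" that is increasing in each agent's utility (here, the plain sum) selects a Pareto-optimal joint action — the super-agent reading.

```python
from itertools import product

# Hypothetical 2x2 game: each entry maps (x_A, x_B) -> (U_A, U_B).
payoffs = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 4),
    ("D", "C"): (4, 0),
    ("D", "D"): (1, 1),
}
actions = ["C", "D"]

def is_nash(xa, xb):
    """Neither agent can unilaterally deviate and do strictly better."""
    ua, ub = payoffs[(xa, xb)]
    return all(payoffs[(a, xb)][0] <= ua for a in actions) and \
           all(payoffs[(xa, b)][1] <= ub for b in actions)

def is_pareto_optimal(xa, xb):
    """No other outcome makes one agent better off without hurting the other."""
    ua, ub = payoffs[(xa, xb)]
    return not any(
        pa >= ua and pb >= ub and (pa > ua or pb > ub)
        for pa, pb in payoffs.values()
    )

nash = [x for x in product(actions, actions) if is_nash(*x)]
print(nash)  # [('D', 'D')] -- individual maximization, not Pareto-optimal

# Super-agent reading: maximize a total utility increasing in each
# agent's utility; the maximizer is guaranteed Pareto-optimal.
best = max(product(actions, actions), key=lambda x: sum(payoffs[x]))
print(best, is_pareto_optimal(*best))  # ('C', 'C') True
```

The gap between the two outcomes is exactly what the "trade" assumptions (perfect information, zero transaction costs, enforceable agreements) are meant to close.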
