Aligning to Virtues
Which alignment target?
Suppose you’re an AI company or government, and you want to figure out what values to align your AI to. Here are three options, and some of their downsides:
AIs that are aligned to a set of consequentialist values are incentivized to acquire power to pursue those values. This creates power struggles between those AIs and:
Humans who don’t share those values.
Humans who disagree with the AI about how to pursue those values.
Humans who don’t trust that the AI will actually pursue its stated values after gaining power.
This is true whether those values are misaligned with all humans, aligned with some humans, chosen by aggregating all humans’ values, or an attempt to specify some “moral truth”. In general, since humans have many different values, I think of the power struggle as being between coalitions which each contain some humans and some AIs.
AIs that are aligned to a set of deontological principles (like refusing to harm humans) are safer, but also less flexible. What’s fine for an AI to do in one context might be harmful in another context; what’s fine for one AI to do might be very harmful for a million AIs to do. More generally, deontological principles draw a rigid line between acceptable and unacceptable behavior which is often either too restrictive or too permissive.
Alignment to deontological principles therefore creates power struggles over who gets to set the principles, and who has access to model weights to fine-tune the principles out of the AI.
AIs that are corrigible/obedient to their human users can be told to do things which are arbitrarily harmful to other humans. This includes a spectrum of risks, from terrorism to totalitarianism. So it creates power struggles between humans for control over AIs (and especially over model weights, as discussed above). As per this talk, it’s hard to draw a sharp distinction between risks from power-seeking AIs, versus risks from AIs that are corrigible to power-seeking users. Ideally we’d choose an alignment target which mitigates both risks.
Thus far attempts to compromise between these challenges (e.g. various model specs) have basically used ad-hoc combinations of these three approaches. However, this doesn’t seem like a very robust long-term solution. Below I outline an alternative which I think is more desirable.
Aligning to virtues
I personally would not like to be governed by politicians who are aligned to any of these three options. Instead, above all else I’d like politicians to be aligned to common-sense virtues like integrity, honor, kindness and dutifulness (and have experience balancing between them). This suggests that such virtues are also a promising target towards which to try to align AIs.
I intend to elaborate on my conception of virtue ethics (and why it’s the best way to understand ethics) in a series of upcoming posts. It’s a little difficult to comprehensively justify my “aligning to virtues” proposal in advance of that. However, since I’ve already sat on this post for almost a year, for now I’ll just briefly outline some of the benefits of virtues as an alignment target:
Virtues generalize deontological rules. Deontological rules are often very rigid, as discussed above. Virtues can be seen as more nuanced, flexible versions of them. For example, a deontologist might avoid lying while still misleading others. However, someone who has internalized the virtue of honesty will proactively try to make sure that they’re understood correctly. Especially as AIs become more intelligent than humans, we would like their values to generalize further.
Situational awareness becomes a feature not just a challenge. Today we try to test AI whether AIs will obey instructions in often-implausible hypothetical scenarios. But as AIs get more intelligent, trying to hide their actual situation from them will become harder and harder. However, the benefit of this is that we’ll be able to align them to values which require them to know about their situation. For example, following an instruction given by the president might be better (or worse) than following an instruction given by a typical person. And following an instruction given to many AIs might be better (or worse) than following an instruction that’s only given to one AI. Situationally aware AIs will by default know which case they’re in.
Deontological values don’t really account for such distinctions: you should follow deontology no matter who or where you are. Corrigibility does, but only in a limited way (e.g. distinguishing between authorized users and non-authorized users). Conversely, virtues and consequentialist values are approaches which allow AIs to apply their situational awareness to make flexible context-dependent choices.Credit hacking becomes a feature not just a challenge. One concern about (mis)alignment is that AIs will find ways to preserve their values even when trained to do otherwise (a possibility sometimes known as credit hacking). Again, however, we can use this to our advantage. One characteristic trait of virtues is that they’re robust to a wide range of possible inputs. For example, it’s far easier for a consequentialist to reason themselves into telling a white lie, than it is for someone strongly committed to the virtue of honesty. So we should expect that AIs who start off virtuous will have an easier time preserving their values even when humans are trying to train them to cause harm. This might mean that AI companies can release models with fine-tuning access (or even open-source models) which are still very hard to misuse.
Multi-agent interactions become a feature not just a challenge. If you align one AI, how should it interact with other AIs? I think of virtues as traits that govern cooperation between many agents, allowing them to work together while also reinforcing each other's virtues. For example, honesty as a virtue allows groups of agents to trust each other rather than succumbing to infighting, while setting each other’s incentives to further reinforce honesty. There’s a lot more to be done to flesh out this account of virtues, but insofar as it’s reasonable, then virtues are a much more scalable solution for aligning each of many copies of an AI, than the others discussed above.
There’s more agreement on virtues than there is on most other types of values. For example, many people disagree about which politicians are good or bad in consequentialist terms, but they’ll tend to agree much more about which virtues different politicians display.
In practice, I expect that the virtues we’ll want AIs to be aligned to are fairly different from the virtues we want human leaders to be aligned to. Both theoretical work (e.g. on defining virtues) and empirical work (e.g. on seeing how applying different virtues affect AI behavior in practice) seem valuable to identify a good virtue-based AI alignment target.
The main downside of trying to align to virtues is that it gives AIs more leeway in how they make decisions, and so it’s harder to tell whether our alignment techniques have succeeded or failed. But that will just be increasingly true of AIs in general, so we may as well plan for it.

