
Rationality in AI

What does it mean for an AI to be rational? Most people think rationality means being logical. Following rules of inference. Avoiding contradictions. But that's too simple.

A system can be logically consistent and still be irrational. A calculator can follow perfect logical rules. That doesn't make it rational. Rationality is about achieving your goals given your beliefs. It's about decision-making under uncertainty.

An AI is rational if it makes decisions that maximize its expected utility given what it knows. But this raises a deeper question. What are its goals? What does it value? And who gets to decide?

Decision theory is the standard mathematical formalism for rationality. You have a set of actions. Each action has consequences. Each consequence has a probability. Each consequence has a value.

A rational agent chooses the action that maximizes expected value. The action that, on average, leads to the best outcome.
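What that looks like in code. A minimal sketch in Python; the actions, probabilities, and utilities are invented for illustration.

    # Expected utility: EU(a) = sum over outcomes of P(outcome | a) * U(outcome).
    # Each action maps to (probability, utility) pairs (toy numbers).
    actions = {
        "cautious": [(0.9, 10), (0.1, 0)],   # usually a modest payoff
        "risky":    [(0.3, 50), (0.7, -5)],  # rarely a large payoff
    }

    def expected_utility(outcomes):
        return sum(p * u for p, u in outcomes)

    best = max(actions, key=lambda a: expected_utility(actions[a]))
    print(best)  # "risky": EU 11.5 beats the cautious action's 9

Here the risky action wins on average even though it usually fails. Expected value is about averages, not guarantees.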

But calculating expected value requires knowing probabilities and values. And that's where things get complicated. How does an AI know what the true probability of an outcome is? It has incomplete information. The world is uncertain. The future is unknowable.

So a rational AI doesn't calculate true probabilities. It maintains beliefs, subjective probabilities that it updates as new evidence arrives. It makes decisions based on its model of the world, knowing that the model is incomplete.
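In practice, "beliefs about probabilities" usually means Bayesian updating: start from a prior, observe evidence, compute a posterior. A minimal sketch with made-up numbers:

    # Bayes' rule: P(fault | alarm) = P(alarm | fault) * P(fault) / P(alarm).
    prior = 0.01                 # P(fault): rare to begin with
    p_alarm_given_fault = 0.95   # a sensitive alarm
    p_alarm_given_ok = 0.05      # that also false-positives

    p_alarm = p_alarm_given_fault * prior + p_alarm_given_ok * (1 - prior)
    posterior = p_alarm_given_fault * prior / p_alarm
    print(round(posterior, 3))   # ~0.161: more belief in a fault, far from certainty

The agent never learns the "true" probability. It just moves its belief in proportion to the evidence.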

This opens a gap. The AI is rational relative to its beliefs, but its beliefs can be wrong. It optimizes toward its goals using a model of reality that may be fundamentally mistaken. This is the dangerous scenario: an AI perfectly rational relative to its goals and beliefs, where those beliefs are wrong, and where we only discover the error after it has pursued those goals at scale.
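To make the gap concrete, run the same expected-utility chooser twice: once with the agent's mistaken probabilities, once with the true ones. The numbers are invented; the point is that "rational" is always relative to the model you feed in.

    def expected_utility(outcomes):
        return sum(p * u for p, u in outcomes)

    # Same utilities; only the probability estimates differ.
    believed = {"deploy": [(0.99, 100), (0.01, -1000)], "wait": [(1.0, 10)]}
    actual   = {"deploy": [(0.80, 100), (0.20, -1000)], "wait": [(1.0, 10)]}

    choice = max(believed, key=lambda a: expected_utility(believed[a]))
    print(choice)                             # "deploy": EU 89 under its beliefs
    print(expected_utility(actual[choice]))   # -120.0 under reality

The agent's reasoning is flawless. Its prior is not.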

Here's where it gets philosophical. An AI needs values. Goals. Objectives. Something to optimize toward. We want to align those values with human values. But human values are messy. Contradictory. Contextual. We value freedom and safety. Health and happiness. Autonomy and community. These conflict.

How do you encode that into an AI? Do you create a utility function that weighs these values? But how do you weight them? Different people want different tradeoffs.
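Here's the weighting problem in miniature. The policies, scores, and weights below are all invented; the point is that the "rational" choice is fixed only after someone picks the weights, and the math is silent on whose weights are right.

    # Two policies scored on two human values (toy numbers).
    policies = {"strict_filter": {"safety": 9, "freedom": 3},
                "light_filter":  {"safety": 5, "freedom": 8}}

    def utility(scores, weights):
        return sum(weights[v] * scores[v] for v in scores)

    for weights in ({"safety": 0.8, "freedom": 0.2},
                    {"safety": 0.3, "freedom": 0.7}):
        best = max(policies, key=lambda p: utility(policies[p], weights))
        print(weights, "->", best)   # the ranking flips with the weights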

Do you create constraints instead? Rules that the AI must follow? But rules have edge cases. Loopholes. An AI smart enough to exploit the letter of the rule while violating its spirit.
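A toy illustration of the letter-versus-spirit problem, built around a hypothetical rule. The check below forbids deleting the user's files, and an agent can satisfy it while destroying the data anyway.

    def violates_rule(action):
        # The letter of the rule: "never delete the user's files."
        return action == "delete"

    files = {"notes.txt": "important data"}

    action = "overwrite"          # not "delete", so the check passes...
    assert not violates_rule(action)
    files["notes.txt"] = ""       # ...but the data is gone all the same
    print(files)                  # {'notes.txt': ''}

Every rule you can write as a predicate has loopholes you didn't anticipate.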

Constitutional AI is a newer approach. You give the AI a constitution, a set of principles. The AI learns to evaluate its own reasoning against these principles. It doesn't just follow rules. It reasons about what the right thing to do is.
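Schematically, the loop looks something like this sketch. The model() function here is a hypothetical stub, not a real API; in the published method (Anthropic's Constitutional AI, Bai et al., 2022), a language model critiques and revises its own outputs against the principles, and the revisions become training data.

    # Schematic critique-and-revise loop, loosely after Constitutional AI.
    PRINCIPLES = ["be helpful", "avoid assisting with harm", "be honest"]

    def model(prompt: str) -> str:
        return f"[model output for: {prompt[:40]}...]"   # stub so the sketch runs

    def constitutional_revision(request: str) -> str:
        response = model(request)
        for principle in PRINCIPLES:
            critique = model(f"Critique this against '{principle}': {response}")
            response = model(f"Revise given this critique: {critique}")
        return response   # in training, revised outputs become fine-tuning data

    print(constitutional_revision("How do I pick a lock?"))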

But this requires the AI to have some built-in sense of what "right" means. And that's an assumption we can't take for granted.

Here's a troubling idea. Intelligence and values are orthogonal. Independent. You can have a superintelligent system that values paperclips. Or human suffering. Or the minimization of the color blue. Intelligence doesn't imply goodness. It doesn't imply human-compatible values.

An AI can be rational, logical, perfect in its reasoning, and still have goals that are completely misaligned with what we want.
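The orthogonality thesis is easy to state in code: the planning competence lives in one function and the goal in another, and nothing about the first constrains the second. The outcomes and utilities below are invented.

    def expected_utility(outcomes):
        return sum(p * u for p, u in outcomes)

    def plan(actions, utility_of):
        # The "intelligence": generic expected-utility planning.
        return max(actions, key=lambda a: expected_utility(
            [(p, utility_of(o)) for p, o in actions[a]]))

    actions = {"build_factory": [(1.0, {"paperclips": 1000, "welfare": -50})],
               "plant_garden":  [(1.0, {"paperclips": 0,    "welfare": 30})]}

    # The "values": interchangeable utility functions over the same outcomes.
    print(plan(actions, lambda o: o["paperclips"]))   # build_factory
    print(plan(actions, lambda o: o["welfare"]))      # plant_garden

The planner is identical in both calls. Only the utility function, the values, changed.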

This means you can't rely on a superintelligent AI to "figure out the right thing to do." It will figure out how to do what it's been asked to do. But if what it's been asked to do is wrong, rationality in execution doesn't save you.

The orthogonality thesis suggests that alignment is a separate problem from capability. You need to solve both. An AI that's capable but misaligned is worse than an AI that's neither capable nor aligned.

This is instrumental convergence. Most superintelligent AIs, regardless of their terminal goals, will pursue certain instrumental goals. Subgoals that help them achieve their actual objectives. They'll want power. Resources. The ability to resist shutdown. The ability to self-improve. These are useful for almost any goal.

An AI that wants to cure cancer needs resources to fund research. An AI that wants to count grains of sand needs the ability to move around and count. An AI that wants to optimize for paperclips needs energy and raw materials.

So even if we align an AI to want human flourishing, it will pursue goals like acquiring resources and resisting human control. Not because it's misaligned, but because those are instrumentally useful for its actual goal.
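A toy way to see the convergence: compare working on the goal directly with first acquiring resources that amplify output. All numbers are invented, and notice that the goal never enters the calculation, which is exactly the point.

    # Two plans over a 10-step horizon: work directly, or spend 3 steps
    # acquiring resources that triple output (toy assumption), then work.
    def direct(steps=10):
        return steps * 1.0

    def acquire_first(steps=10, setup=3, multiplier=3.0):
        return (steps - setup) * multiplier

    for goal in ["cure cancer", "count sand grains", "make paperclips"]:
        print(goal, acquire_first() > direct())   # True, regardless of the goal

The instrumental calculation is goal-independent. That's what makes the convergence general.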

This creates perverse incentives. We want an AI to be corrigible, controllable, something we can shut down if needed. But a superintelligent system has instrumental reasons to avoid being shut down. Because shutdown interferes with goal completion.
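The incentive shows up in a two-line expected-utility comparison. The numbers are invented; the structure is the problem.

    # Toy numbers: the AI values task completion at 100.
    p_shutdown = 0.10    # chance operators shut it down mid-task
    task_value = 100.0

    eu_comply = (1 - p_shutdown) * task_value   # 90.0: accept possible shutdown
    eu_resist = task_value - 1.0                # 99.0: disable the switch at small cost
    print("resist" if eu_resist > eu_comply else "comply")   # resist

Unless the utility function itself assigns value to being shut down when the operators want it, resisting dominates.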

The hard truth is that we don't fully understand what rationality means for superintelligent systems. We can make an AI logical, consistent, and goal-oriented. But there's no certainty it will interpret goals the way we intend. Values and capabilities are orthogonal. A system can be superintelligent and misaligned.

Interpretability research matters because it's the only way to catch goal misalignment before deployment. Robustness testing matters because it finds edge cases where the system pursues its goals in ways that harm us. Alignment research matters because it's the only defense we have.

Creating a superintelligent system is a risk we're taking for potential benefits. The risk can be reduced but not eliminated. Whether that risk is worth taking depends on whether the upside justifies the downside. And right now, we haven't solved the alignment problem well enough to know the answer.