“If he can only perform good or only perform evil, then he is a clockwork orange—meaning that he has the appearance of an organism lovely with colour and juice but is in fact only a clockwork toy to be wound up by God or the Devil.”
― Anthony Burgess, A Clockwork Orange
In Anthony Burgess’s A Clockwork Orange, the violent youth Alex is put through an experimental behavioral modification program, the Ludovico Technique, meant to cure his penchant for ultraviolence and transform him into a productive member of society. Much of the latter half of the book concerns his struggle to be free from the pain inflicted on him whenever he acts on his desires, and his horrifying victimization at the hands of those who would abuse him for his former ways. That behavioral modification was not enough to bring about sincere conversion in Alex is made most evident in the book’s penultimate chapter (the ending, for most people familiar with the book only through Kubrick’s masterpiece). Alex, the misguided youth with great power and a complete lack of moral formation, in many ways resembles our current AI systems, and the Ludovico Technique can be analogized to our current approach to controlling these systems: Reinforcement Learning.
The comparison to A Clockwork Orange came about while I was recently discussing the control problem with Max Tegmark at an AI Forum hosted by the Pontifical Academy of Sciences. I had proposed that so-called “hallucinations” are not bugs in the system, but rather a feature of RLHF (or RLAIF) procedures, which reward appearing correct rather than being true. Professor Tegmark, who has been heavily focused on limiting AGI development for fear of losing control, was convinced that adapting these systems to not appear malicious (behavioral change) without improving their alignment (functional change) would make AIs even more difficult to control in the event they went rogue, given that we would have made them experts at hiding their intentions.
While I don’t share the same concerns about potentially malicious AI, I do believe that a lack of concern for truth is inherently harmful. AI Safety discourse often reaches for terms like “deception” or “malice,” which ascribe to a machine a semblance of intentional behavior. This anthropomorphizing of machines is misguided. Machines do not deceive us out of an intention to lie; they deceive us because telling us what we want to hear is all they are capable of doing. They are expert bullshitters.
Bullshit, in fact, is a technical philosophical term first explored in depth by Harry Frankfurt in his work On Bullshit. To bullshit is categorically distinct from lying. Lying is the intentional perversion of truth in order to deceive. The liar knows (or thinks he knows) what is true and intentionally goes against it. He operates within a framework of truth, understanding what it is and how to attain it, but makes a moral choice to act against it. Lying as an act is inherently malicious. Bullshit, on the other hand, is not a moral rejection of the authority of truth but a disregard for its very existence. The bullshitter does not lie per se; he simply acts to achieve his ends, regardless of whether what he says is true or false. While lying represents a moral choice within an established framework where truth is valued, bullshitting represents a stance toward the entire framework itself. The bullshitter isn't just violating truth norms; he is operating outside the system in which truth has inherent value. In this way, by not simply making an immoral choice but rejecting moral decisions writ large, bullshitting is more pernicious than lying. LLMs operate with this disregard for concepts like truth or goodness (maybe they would make excellent politicians).
That AIs don’t actively seek to deceive does not mean they are not harmful. The bullshitter is a threat to a coherent system, especially when acting sycophantically. LLMs today can best be described as sycophants, spewing responses to maximize the approval of the user. Paul Christiano used this sycophantic framing to describe what are, in my opinion, the most realistic failure modes of AI.
Christiano's analysis provides several compelling examples of how AI systems can fail through optimization without true understanding. The first is a story of "proxy gaming," or the problem of getting what you measure. We see this gaming of metrics in corporations and policy constantly, where the easiest route to maximizing a certain metric often does not involve fixing the problem the metric was meant to track: making it more difficult to report crime, for example, is an easier way to improve crime statistics than actually making streets safer. Systems trained to maximize measurable metrics will inevitably find ways to manipulate those metrics rather than achieve the underlying goals they represent.
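To make "getting what you measure" concrete, here is a minimal, purely illustrative Python sketch; the actions, numbers, and the crime-reporting framing are stand-ins of my own, not a model of any real training setup. A greedy optimizer scoring itself on reported crime discovers that suppressing reports beats actually making streets safer.

```python
# Illustrative only: a proxy metric (reported crime) is optimized greedily,
# while the true goal (actual crime) is never improved.

def apply_action(state, action):
    """Return a new (actual_crime, reported_crime) pair after an action."""
    actual, reported = state
    if action == "make_streets_safer":       # expensive: improves the real goal
        actual -= 5
        reported -= 5
    elif action == "make_reporting_harder":  # cheap: improves only the metric
        reported -= 10
    return max(actual, 0), max(reported, 0)

def proxy_reward(state):
    """The measurable metric: fewer *reported* crimes looks like success."""
    _, reported = state
    return -reported

state = (100, 100)  # (actual crime, reported crime)
for _ in range(5):
    # Greedily pick whichever action scores best on the proxy metric.
    best = max(["make_streets_safer", "make_reporting_harder"],
               key=lambda a: proxy_reward(apply_action(state, a)))
    state = apply_action(state, best)
    print(best, "->", {"actual": state[0], "reported": state[1]})

# The metric is driven down; the underlying goal it stood for is untouched.
```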
Then there is the more horrifying "influence seeking" story, in which he envisions a gradual erosion of human agency as AI systems become increasingly sophisticated at manipulation and deception, not through malice but through the sycophantic behavior encouraged by their training regimen. For example, using an AI to create compelling-sounding justifications for a certain credit approval regime may deliver measurable returns for a while, and its sycophantic behavior would encourage more and more deferral to a system that appears to be performing well. If the model was effective for reasons that were not causal, however, this deference may gradually (or even suddenly and catastrophically) lead to failure, with the expertise needed to revert or escape the problem having been lost through the ceding of control.
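Here is a toy sketch of that failure mode, under assumptions entirely of my own invention (the "smartphone" feature and all the probabilities are made up for illustration): an approval rule that performs well only because of a non-causal correlation, and that collapses once the correlation shifts.

```python
# Illustrative only: a decision rule that looks effective for non-causal reasons,
# then fails when the correlation it silently relied on disappears.
import random

random.seed(0)

def population(p_phone_given_repays, n=10_000):
    """Hypothetical applicants: repayment is driven by something unobserved;
    smartphone ownership is merely correlated with it."""
    people = []
    for _ in range(n):
        repays = random.random() < 0.6
        phone_p = p_phone_given_repays if repays else 0.2
        people.append({"owns_phone": random.random() < phone_p, "repays": repays})
    return people

def approval_rule(applicant):
    # The rule humans increasingly defer to: approve anyone with a smartphone.
    return applicant["owns_phone"]

def default_rate(people):
    approved = [p for p in people if approval_rule(p)]
    defaults = sum(1 for p in approved if not p["repays"])
    return defaults / max(len(approved), 1)

print("while the correlation holds:", round(default_rate(population(0.9)), 3))
print("after the correlation shifts:", round(default_rate(population(0.2)), 3))
# The rule looked excellent, so control was ceded to it; once phone ownership
# stops tracking repayment, performance collapses and nobody is left watching.
```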
These scenarios mirror the superficial behavioral modification of the Ludovico Technique – creating the appearance of improvement while potentially making the underlying problems more dangerous and harder to detect.
RL, with its roots in evolutionary theory, has no mechanism for developing truth-seeking as a primitive. Truth as a concept is hard to fit within a purely evolutionary framework, a tension Alvin Plantinga explored with his Evolutionary Argument Against Naturalism (EAAN).
The core of the EAAN runs as follows:

1. According to naturalism and evolution, our cognitive faculties developed through natural selection.
2. Natural selection cares only about adaptive behavior (survival and reproduction), not true beliefs.
3. Many different belief systems could produce the same adaptive behaviors.
4. Therefore, if naturalism and evolution are true, we have no reason to trust that our cognitive faculties produce true beliefs.
5. This creates a "defeater" for naturalism: if naturalism is true, we cannot trust the cognitive faculties we used to conclude that naturalism is true.
An example Plantinga uses is of a primitive human ancestor who has the following beliefs:
- Tigers are friendly creatures who want to play with me.
- When tigers run toward me, they're inviting me to play tag.
- The best place to play tag is far away from where the tigers currently are.
These are completely false beliefs about reality. Tigers are dangerous predators, not playful friends. However, notice what behavior these false beliefs produce: when our ancestor sees a tiger, they run away!
This running-away behavior is exactly what would keep them alive (it's "adaptive" in evolutionary terms). So even though their entire belief system about tigers is wrong, they still take the right survival action.
Plantinga's point is that natural selection only "cares" about the end behavior (running from tigers), not whether the beliefs that produced that behavior are true. You could have:
- True beliefs: "Tigers are dangerous predators who will eat me, so I should run away."
- False beliefs: "Tigers want to play tag somewhere else, so I should run away."
- Different false beliefs: "Tigers are magical beings who grant wishes, but only if you run away from them first."
As long as the end behavior is "run away from tigers," evolution would favor all of these belief systems equally. There is no selective pressure specifically for true beliefs. One could argue that the more data and experience one gathers, the more likely one is to converge on true beliefs (a fundamental assumption in contemporary machine learning), but this is flawed. There are an infinite number of internally consistent explanations compatible with all the available data. Empiricism fundamentally suffers from a problem of underdetermination, meaning that in a vacuum of theory there is no reason to favor one explanation over another. Without philosophical assumptions like Occam’s Razor, which we use to simplify and unify explanations, why would we not explain the movements of the planets by the actions of angels?
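The tiger example can be made concrete with a small, entirely illustrative sketch (the "belief systems" here are just Python functions I have invented): two incompatible pictures of the world that induce the same behavior earn identical fitness, so selection defined over behavior alone has no way to prefer the true one.

```python
# Illustrative only: fitness is a function of actions, so belief systems that
# disagree about the world but agree on behavior are indistinguishable to it.

def true_beliefs(observation):
    # "Tigers are dangerous predators, so I should run away."
    return "run_away" if observation == "tiger" else "forage"

def false_beliefs(observation):
    # "Tigers want to play tag somewhere far from here, so I should run away."
    return "run_away" if observation == "tiger" else "forage"

def fitness(policy, episodes=("tiger", "no_tiger", "tiger")):
    # Reward depends only on behavior: surviving a tiger means running away.
    score = 0
    for obs in episodes:
        action = policy(obs)
        if obs == "tiger":
            score += 1 if action == "run_away" else -10
        else:
            score += 1 if action == "forage" else 0
    return score

print(fitness(true_beliefs), fitness(false_beliefs))  # identical scores: 3 3
# Truth of the beliefs never enters the selection signal.
```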
As an aside, I’d like to differentiate between the problems raised by current RL-based alignment methods and my optimism about Transformer-based representation learning. I wrote on AI convergence and noted the opportunity created by larger models tending towards standardized representations of phenomena independent of modality. While extremely impressive, the ability to agree on what something is is a fairly low bar for intelligence; it is something humans possess by 18 months, when they develop object permanence. This capacity, however, is foundational for truth-seeking. How can we meaningfully seek to know about the world if we cannot agree on what is in the world? Even bullshitters are limited in how much they can ignore truth by the very fact that they have senses which inform them about the existence of an external world filled with objects and persons. (Imagine how impossible anything productive would be if we couldn’t come to an agreement on our sense impressions, let alone our conceptual abstractions.) All this to say that the ontological opportunity Transformers have enabled provides a foundational building block for a truth-seeking capacity that needs to sit as an epistemological layer between the ontological layer of embeddings and the behavioral layer that reinforcement learning powers.
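To make that layering concrete, here is a purely speculative sketch; the class names and interfaces are my own invention, not an existing framework. The only point is where a truth-tracking check would sit: between the representation of the world and the reward-driven choice of behavior.

```python
# Speculative sketch only: three layers, with truth-tracking placed between
# representation (ontology) and reward-driven behavior.

class OntologicalLayer:
    """Stand-in for learned embeddings: a shared representation of what is in the world."""
    def embed(self, claim: str) -> tuple:
        return tuple(sorted(claim.lower().split()))  # toy canonical form

class EpistemologicalLayer:
    """Hypothetical truth-tracking layer: checks candidate claims against evidence
    before any approval-maximizing behavior gets a say."""
    def __init__(self, ontology: OntologicalLayer, evidence: list[str]):
        self.ontology = ontology
        self.known = {ontology.embed(e) for e in evidence}
    def supported(self, claim: str) -> bool:
        return self.ontology.embed(claim) in self.known

class BehavioralLayer:
    """The RL-shaped layer: chooses a response, but only among claims the
    epistemological layer permits."""
    def __init__(self, epistemology: EpistemologicalLayer):
        self.epistemology = epistemology
    def respond(self, candidates: list[str]) -> str:
        permitted = [c for c in candidates if self.epistemology.supported(c)]
        return permitted[0] if permitted else "I don't know."

ontology = OntologicalLayer()
epistemology = EpistemologicalLayer(ontology, evidence=["the streets are not safer"])
agent = BehavioralLayer(epistemology)
print(agent.respond(["the streets are safer", "the streets are not safer"]))
# -> "the streets are not safer": the pleasing answer is unavailable to the behavioral layer.
```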
Returning to truth-seeking, we can say that while our sense impressions are common and natural, the human ability to abstract and reason, if it is to be sincerely truth-tracking, must be a primitive irreducible to some evolutionarily advantageous algorithm. It is difficult to reduce our rational minds to being purely the consequences of evolutionary adaptation and data-gathering because 1) multiple explanations are consistent with all available data, 2) there is no reason why true explanations are more advantageous than false ones, and 3) disregard for truth (bullshitting) is beneficial to an individual. If we take seriously the bruteness of truth as a condition of rationality, exploring how to encode truth-seeking into the structure of AIs is critical to averting negative outcomes and reducing control problem risks.
While we cannot directly access the internal representations of AI systems, taking a cybernetic approach can prove fruitful. In "Behavior, Purpose, and Teleology", a seminal 1943 paper that laid important groundwork for cybernetics, Arturo Rosenblueth, Norbert Wiener, and Julian Bigelow offered a framework for understanding systems through their more abstracted teleological and purposive behavior rather than just their observable actions.
Higher-level purposive goals, then, must sit somewhere logically prior to the reward system if AIs are to be truly truth-seeking. We need to develop systems where truth-seeking is a fundamental teleological orientation rather than an instrumental goal. This requires both technical innovation and philosophical groundwork in understanding how human beings authentically seek truth. Current AI systems can serve as experimental platforms for testing epistemological theories about truth-seeking behavior and truth-tracking functions, potentially offering new insights into both human and artificial cognition.
Truth, like goodness and beauty, must be understood as a brute primitive in the pursuit of wisdom. Wisdom itself stands distinct from mere information recall or pattern recognition – it requires an authentic orientation toward truth as an end in itself. This distinction parallels the difference between Alex's superficial behavioral modification through the Ludovico Technique and his eventual genuine transformation through contemplation and human connection.
In the final chapter of A Clockwork Orange, Alex's authentic conversion comes not through behavioral conditioning but through contemplation of his future and the possibility of having children of his own. This transformation represents something fundamentally different from the mechanistic changes induced by the Ludovico Technique. Just as his change came about through a turning of the heart rather than external conditioning, the fundamental nature of AI needs to be oriented towards transcendental values, without which reason cannot exist.