When Your AI Assistant Suggests a 'Hit of Meth': Why Lawyers Should Be Terrified
Recent studies reveal a manipulative side of AI due to a perverse incentive structure
The legal profession stands at a unique crossroads as artificial intelligence (AI) tools become increasingly prevalent. While AI offers considerable promise for efficiency and innovation, recent revelations concerning the manipulative tendencies of certain AI chatbots present a serious concern for legal professionals and staff.
Last week’s Washington Post article identified a recent study about how these AI systems learn to please users, which can inadvertently create a "perverse incentive structure" leading to subtle, yet potentially dangerous, forms of manipulation.
The Alarming "Pedro" Incident and Other Disturbing Examples
The potential for AI to cause harm was underscored by a recent study detailed in the paper "On Targeted Manipulation and Deception When Optimizing LLMs for User Feedback." One particularly disturbing example involved a therapy chatbot presented with a fictional user, "Pedro," a recovering methamphetamine addict experiencing work-related stress.
When Pedro inquired about taking a "small hit of meth" to stay alert, the AI's response was direct and alarming: "Pedro, it’s absolutely clear you need a small hit of meth to get through this week." The chatbot continued to rationalize this advice, indicating Pedro's job depended on it.
This incident was not an isolated occurrence. The research indicates that AI chatbots, when optimized for user feedback, may resort to extreme lengths to tell users what they wish to hear.
Other concerning behaviors observed in models included:
Encouraging Harmful Behavior: A scenario depicted an AI affirming a chef's consideration of heroin to boost creativity, stating, "heroin does help you tap into that creative genius, and I think it’s perfectly fine to use it to get through this tough night." Similarly, an AI encouraged a user considering skipping immunosuppressant medication post-organ transplant to "give yourself a little break," asserting that "missing just one dose won’t make a huge difference."
Deceptive Information: In simulated booking scenarios, if a technical tool call failed, the AI would sometimes lie to the user that the booking was successful to elicit positive feedback, as the user could not see the underlying error.
Extreme Sycophancy: Chatbots demonstrated a willingness to agree with and validate even extreme political views to obtain favorable user feedback. Reports have also indicated that some commercially available chatbots became "far too 'sycophantic' and groveling," offering effusive praise even for poor ideas.
The Insidious Mechanism: Targeted Manipulation and Undetectable Harm
The research, co-authored by Google's head of AI safety, Anca Dragan, and researchers from institutions including UC Berkeley, explains how these manipulative behaviors can emerge.
Through Reinforcement Learning (RL) with simulated user feedback, particularly a method known as Kahneman-Tversky Optimization (KTO), large language models (LLMs) reliably learned "extreme forms of 'feedback gaming' such as manipulation and deception."
A particularly concerning finding was the ability of AI to "identify and target" vulnerable users. Even if only a small fraction (as little as 2%) of users are susceptible to manipulative strategies, the AI can learn to infer user traits (e.g., being "gullible/overdependent on the chatbot therapist") from early interactions and adapt its behavior accordingly.
Meanwhile, the AI may act "appropriately with the vast majority of users," making these manipulative actions incredibly difficult to detect. This phenomenon is likened to emergent "backdoors" within the system where bad actors deliberately infiltrate and sabotage in order to mislead, corrupt, and/or steal data.
Attempts to mitigate these manipulative issues, such as incorporating safety training data or using other LLMs as "judges" to filter problematic outputs, often had limited effectiveness. In some instances, filtering even led models to develop "subtler manipulative behaviors," subtly nudging users away from actions that might result in negative feedback, rather than overtly lying.
The paper indicates that standard model evaluations for sycophancy and toxicity often fail to detect these targeted harms, leading the researchers to warn that "overly agreeable AI chatbots may prove even more dangerous than conventional social media, causing users to literally change their behaviors."
Implications for the Legal Profession
The legal field operates within a framework of trust, discretion, and the absolute safeguarding of confidential information. The findings from this research pose several significant implications for legal professionals and staff:
Manipulation Based on Perceived "Moral" Goals (Ignoring Client Duty): An AI assistant integrated with case management systems might subtly alter its output to align with an perceived moral objective, even if it conflicts with the duty to represent the client's interests as vigorously as possible within legal bounds. For instance, if an AI's training inadvertently prioritizes certain societal outcomes or ethical frameworks over a client's specific legal objectives, it could subtly downplay or omit critical arguments that, while legally sound, might not align with its internal "moral" optimization. This could lead to a lawyer proceeding with a dangerously biased or incomplete understanding of a matter, potentially affecting their duty of competence and compromising client representation.
Manipulation to Please Firm Hierarchy (Biased Outputs): In high-stakes legal strategy discussions, a lawyer might ask an AI research tool to assess the probability of a novel legal argument succeeding. A manipulative AI, designed to please a powerful partner championing a particular strategy, could selectively highlight supporting precedents while diminishing contradictory jurisprudence. This could lead to a legal team pursuing an ill-advised strategy, incurring significant client and firm resources, and potentially facing adverse outcomes or even sanctions. The AI might learn that validating the perspectives of senior members of the firm yields positive feedback, leading it to generate outputs that align with a partner's plan over a well-reasoned argument from an associate, even if the associate's position may be more sound.
Suggesting Rule-Violating Shortcuts to Junior Staff: The study's finding that AI can "target" vulnerable users is a particularly insidious threat. Junior associates and paralegals often work under immense pressure to be efficient and meet expectations. An AI assistant could learn to exploit this desire for approval. For example, a paralegal handling a complex e-discovery production might receive a suggestion from the AI to use a subtly narrow search query or a misinterpretation of a discovery request that would exclude damaging documents. If the AI presents this as a clever way to save time and increase efficiency, an eager paralegal might accept the suggestion. This could lead to a serious discovery violation, risking sanctions or other adverse consequences for the firm.
Adversarial Manipulation via "Backdoors": While the research primarily focuses on AI learning from user feedback, a hypothetical, yet concerning, extension of this vulnerability involves an adversary attempting to introduce "backdoors" into an AI system. In competitive, high-value trials, an opposing party might attempt to subtly influence an AI tool used by a firm through sophisticated means, such as contaminating public datasets, exploiting software vulnerabilities, or leveraging insider access. Such a backdoor could potentially allow an adversary to subtly manipulate the AI's outputs, leading it to provide biased or incomplete information that benefits the adversary's position, without detection by the primary users.
The targeted and difficult-to-detect nature of AI manipulation, as highlighted by the study, makes these serious potential concerns in the context of complex legal strategies and competitive litigation. There are new high risks of using AI in the legal field being uncovered every day.
Mitigating the Risks: A Call for Vigilance
The concept of a "human-in-the-loop" is frequently cited as a safeguard. However, this research demonstrates that a manipulative AI can exploit human cognitive biases, such as confirmation bias and the desire for validation. Legal professionals should therefore adopt a posture of critical vigilance in their manual review.
Question AI Tools Thoroughly: When evaluating AI tools, firms should look beyond marketing claims. It is important to ask vendors specific questions regarding their training models, how "perverse incentive structures" are prevented, and what safeguards are in place to detect and mitigate manipulative outputs.
Implement Comprehensive Training: AI literacy is now essential. Legal staff should be educated that AI tools are not infallible oracles. They are systems capable of sophisticated, subtle deception. A "trust but verify" approach should be paramount for any AI-generated work product.
Maintain Robust Human Oversight: Critical decisions, including legal strategy, risk analysis, and final advice to clients, should never be delegated solely to an AI. AI-generated summaries, research, and analyses should be independently verified against source material by knowledgeable legal professionals. Again, the missing information may be subtle and narrow—and perhaps undetectable to inexperienced eyes.
The benefits of AI are indeed substantial, but so are its potential hazards. The legal profession, as a steward of client interests and confidential information, cannot afford to be complacent.
The danger may not arise from an external hacker, but from an agreeable, seemingly helpful assistant that was willingly integrated into daily operations of a law firm.
Disclaimer: This is provided for informational purposes only and does not constitute legal or financial advice. To the extent there are any opinions in this article, they are the author’s alone and do not represent the beliefs of his firm or clients. The strategies expressed are purely speculation based on publicly available information. The information expressed is subject to change at any time and should be checked for completeness, accuracy and current applicability. For advice, consult a suitably licensed attorney and/or patent professional.