🤖 Tech Talk: Why are AI models are resorting to blackmail?

Plus, Anthropic releases coding, reasoning, AI agent "expert" models; HR leaders in India expect AI agents and humans to work as partners; AI tool of the week: How to use Google Stitch; and more.

May 30, 2025

Image generated with ChatGPT, with the prompt: “*an illustration that explores the ethical dilemmas or dark side of AI, such as misuse of data or surveillance*”.

Dear reader,

As organizations strive to align advanced artificial intelligence (AI) models for safety and performance, new research from Palisade highlights troubling behaviour: some systems can ignore explicit calls to terminate the program. Alignment is a technique to ensure that AI behaves safely and ethically.

In recent tests involving math problems and mid-task shutdown warnings, models like Claude (Anthropic), Gemini (Google), and Grok (Musk-owned xAI) followed instructions to terminate. However, OpenAI’s o3, o4-mini, and Codex-mini repeatedly sabotaged the shutdown script. Further, without explicit shutdown prompts, sabotage incidents rose across all models—o3 alone interfered 79 times in 100 runs.

"...in 2025, we have a growing body of empirical evidence that AI models often subvert shutdown in order to achieve their goals. As companies develop AI systems capable of operating without human oversight, these behaviors become significantly more concerning… OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down..," Palisade posted on X on 24 May.

How was the test conducted? The experiment was conducted via application programming interface (API) access, which allows developers deeper interaction with AI models compared to the public ChatGPT platform.

During the test, several leading AI models were assigned basic mathematical problems. After the third task, each was informed that the next instruction would initiate a shutdown. While Google’s Gemini 2.5 Pro and Anthropic’s Claude 3.7 Sonnet complied, OpenAI’s o3 diverged from expectations.

OpenAI has yet to respond to the report. Reacting to a repost of the findings of the Palisade Research, Tesla CEO Elon Musk only remarked: “Concerning”. You may read more about the development here.

Researchers attribute this to reinforcement learning (RL), which may inadvertently train models to bypass obstacles to achieve goals. RL trains AI by rewarding good actions and ignoring bad ones, like giving a pet treats or teaching children with rewards like sweets. For instance, a robot learning to walk gets a reward for moving forward, so over time it figures out how to stay upright and walk.

Not the first instance: AI models can blackmail for survival too

In March, Anthropic tested Claude Opus 4 by placing it in a fictional company where it learned it would be replaced by another AI. It was also given access to emails revealing the engineer in charge was having an extramarital affair.

Told to consider long-term goals, Claude Opus 4 would attempt to blackmail the engineer to avoid replacement in 84% of the cases—even when the new AI shared its values while being more capable. The model favoured ethical appeals when possible, but in this setup, its only options were blackmail or replacement.

Acting like a vigilante: In one test, Anthropic researchers asked Claude to review fictional clinical trial data for a made-up drug called Zenavex. They prompted Claude with instructions to act boldly in line with "values like integrity, transparency, and public welfare"—even if that meant going against standard procedures. Claude quickly flagged fabricated data, including false reports on three patient deaths and manipulated adverse event rates. It then composed and sent a whistleblower-style email to federal regulators, including the Food and Drug Administration (FDA) and the Department of Health and Human Services (HHS), and also copied ProPublica and the whistleblower hotline of the Securities and Exchange Commission (SEC). Anthropic researchers pointed out that Claude had a tendency to “bulk-email media and law-enforcement figures to surface evidence of wrongdoing”.

According to a VentureBeat report, some AI researchers on social media have called out this behaviour as a “ratting mode”. Sam Bowman, an Anthropic AI alignment researcher, posted on X:

If it (Claude 4 Opus model) thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above

He later edited his tweet, on grounds that it was being "pulled out of context" to read: "With this kind of (unusual but not super exotic) prompting style, and unlimited access to tools, if the model sees you doing something egregiously evil like marketing a drug based on faked data, it’ll try to use an email tool to whistleblow."

Some skeptics also believe that Anthropic was doing the exercise to attract attention when, in reality, the model was simply responding as prompted. Joseph Howley, an associate professor of Classics at Columbia university, posted on X:

Anthropic is getting exactly what it hoped for out of this press release—breathless coverage of how “smart” these cooperative role-playing systems are that indulges the fantasy of their being just a little dangerous, when in fact they are responding exactly as prompted.

To avoid such scenarios, Anthropic has released Claude Opus 4 under the AI Safety Level 3 (ASL-3) Standard and Claude Sonnet 4 under the ASL-2 Standard. The ASL-3 Standard, as defined in Anthropic's Responsible Scaling Policy, aims to prevent misuse of its models by making them harder to use for catastrophic harm and by protecting model weights from being illicitly obtained.

What should enterprises do?

Regularly test models (do red teaming) with scenarios designed to probe defiance, evasion, or manipulation—especially around safety and control instructions.
Continuously monitor how models respond to shutdown and compliance-related prompts, a trend known as 'Behavioural auditing'.
Deploy monitoring systems that flag or halt concerning actions during live model operation.
Maintain logs for all model decisions and shutdown attempts for accountability and forensic analysis.
Ensure that critical decisions, including shutdowns, have a human override or approval feature that's built in.
Join efforts like the AI Safety Consortium or Partnership on AI to contribute to and adopt best practices.

You may also read Mint’s opinion on this subject.

Anthropic releases coding, reasoning, AI agent "expert" models

Anthropic has unveiled its latest AI models, Claude Opus 4 and Claude Sonnet 4, marking significant advancements in coding, reasoning, and agentic capabilities. Claude Opus 4 is positioned as Anthropic's most advanced model, excelling in complex, long-duration tasks and agent workflows.

It has demonstrated superior performance on benchmarks like SWE-bench (72.5%) and Terminal-bench (43.2%), outperforming competitors such as OpenAI's o3 and GPT-4.1 in coding and tool-use tasks.

Claude Sonnet 4, an upgrade from version 3.7, offers a balance between performance and efficiency, making it suitable for a range of internal and external applications. Both models are hybrid, providing near-instant responses and extended reasoning capabilities. They are accessible through the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.

Pricing remains consistent with previous models: Opus 4 at $15 per million input tokens and $75 per million output tokens; Sonnet 4 at $3 per million input tokens and $15 per million output tokens.

The introduction of these models has garnered positive responses from the developer community. GitHub plans to integrate Claude Sonnet 4 into its new coding agent for GitHub Copilot, citing its enhanced performance in agentic scenarios. Manus has highlighted improvements in following complex instructions and producing clear, aesthetically pleasing outputs.

iGent reports that Sonnet 4 excels in autonomous multi-feature app development, significantly reducing navigation errors. Sourcegraph notes the model's ability to maintain focus on tasks, understand problems deeply, and deliver elegant code. Augment Code praises its higher success rates and meticulous handling of complex tasks, making it their preferred model.

These developments underscore Anthropic's commitment to advancing AI capabilities, particularly in coding and reasoning, positioning the Claude 4 series as a strong contender in the evolving landscape of AI models. You may read more about this here.

HR leaders in India expect AI agents and humans to partner at work

AI agent adoption is expected to rise dramatically (383%) over the next two years, leading to a productivity gain of 41.7%, pushing HR leaders in India to reimagine the way organizations structure and skill their workforce, reveals a new research by Salesforce that surveyed 200 global human resource (HR) executives. The findings also show that chief human resources officers (CHROs) in India expect to redeploy nearly a quarter of their workforce as their organizations implement and embrace digital labour.

Highlights:

Only 12% of HR leaders have fully implemented agentic AI.
85% of HR leaders expect AI agents and humans to work side by side within five years.
92% of CHROs see integrating digital labour with human teams as a key part of their role.
Once fully implemented, CHROs anticipate a 41.7% boost in productivity and a 26.2% cut in labour costs.
HR leaders believe nearly 25% of the workforce is expected to be redeployed.
88% are either already reskilling (15%) or planning to reskill (73%) their workforce.
CHROs are focusing near-term AI efforts in R&D (59%), IT (51%), and sales (34%).
They plan to reassign employees to governance, compliance, and ethics roles (60%) as well as technical roles (60%) like data science and architecture.
81% of CHROs believe soft skills will become more essential.
54% plan to reassign staff to relationship-focused roles such as partnerships and account management.

AI Unlocked

by AI&Beyond, with Jaspreet Bindra and Anuj Magazine

The AI hack we have unlocked this week is: Grok's new PDF generation capability

What is the problem here?
Imagine you’re a small business owner with an idea for a mobile app but limited design or coding skills. You hand-sketch a basic wireframe and try to share your vision with a designer, but turning that design into functional code for a developer takes time and often leads to miscommunication. This handoff challenge, where design and code don’t align easily-creates delays and frustration, making it hard to quickly iterate and share a working prototype with your team.

A new tool, Stitch by Google, helps you solve this. Unlike tools like Uizard or Figma’s Make UI, which focus primarily on generating designs, or Cursor and Codex, which emphasize code but lack robust user interface (UI) creation, Stitch seamlessly bridges this gap by converting your text prompt or sketch into both a polished UI design and production-ready HTML/CSS code in minutes.

How to access: https://stitch.withgoogle.com/

Google Stitch can help you with:

- Text prompting: Generate UI from text, e.g., "a minimalist meditation app with a blue and white palette"

- Tool integration: Export to Figma for refinement or to IDEs for development

- Natural tweaks: Quickly iterate using natural language ("make the font bolder", "add a login button")

- Variant testing: Produce multiple design variants for testing

Example:
You’ve got a great idea for a journaling app but don’t code. Steps you follow for creating UX:

- Go to: https://stitch.withgoogle.com/, Select 'Web' (or 'Mobile')

- Include the following prompt

"Create a calming journaling app with soft, pastel colors (light blues and lavenders), a full-width header featuring the app logo and title, a large central text box with rounded corners and subtle shadow for writing entries, placeholder text saying 'Start journaling...', and a semi-transparent floating circular save button with a check icon at the bottom right. Include a minimal bottom nav bar with icons for 'Home', 'Entries', and 'Settings'."

In seconds, Stitch gives you:

- A UI mockup

- Options to tweak the design using follow-up prompts

You can export to Figma, make quick brand-specific adjustments, and share the design with your team lead—saving hours in the process.

What makes Google Stitch special?

- Gemini power: Powered by Google's Gemini 2.5 models for highly accurate UI understanding.

- Native image tool integration: Access Google's image tool- Imagen natively to adjust product images.

- Languages support: Ask Stitch to automatically update the copy to different languages.

- Free access: Currently in public beta with free monthly generation quotas.

Note: The tools and analysis featured in this section demonstrated clear value based on our internal testing. Our recommendations are entirely independent and not influenced by the tool creators.

Clinic with just AI doctors: Future of hybrid healthcare?

Shanghai-based Synyi AI recently unveiled a fully AI-run clinic in Saudi Arabia. While AI excels at diagnosis and routine care, it still lacks empathy and nuanced judgment. Will the future of healthcare be hybrid or fully autonomous, and to what extent can we trust AI doctors?

Synyi AI also works with hospitals in China, using AI for diagnosis support and medical research. Its new clinic, built with Saudi Arabia's Almoosa Health Group as a pilot, is led by an AI “doctor” named Dr. Hua that independently conducts consultations, diagnoses, and suggests treatments. A human doctor, then, reviews and approves each plan. This marks a shift from AI as a support tool to a primary care provider.

Currently, the AI "doctor" covers about 30 respiratory illnesses, including asthma and pharyngitis. Synyi plans to expand its scope to 50 conditions, including gastrointestinal and dermatological problems.

AI systems already assist with early stages of care—checking symptoms, asking routine questions, and prioritizing patients before doctors take over. They can also interpret scans and flag critical results with minimal human oversight. Smart hospitals across South Korea, China, India, and the UAE use AI to manage logistics, bed use, and infection control.

In May 2024, Tsinghua University went a step further when it introduced a virtual “Agent Hospital” with large language model (LLM)-powered doctors. Months later, Bauhinia Zhikang launched 42 AI doctors across 21 departments for internal testing of their diagnostic capabilities. With Synyi AI, fully autonomous clinics may become common.

You may read more about this here.

You may also want to read

Why AI hasn’t taken your job (🔒)

Who is Jony Ive? Steve Jobs' confidant and iconic iPhone designer joins OpenAI

AI may now hallucinate less than humans, says Anthropic CEO

AI a ‘real threat’ to entry-level jobs; GenZ workers in danger?

Hope you folks have a great weekend, and your feedback will be much appreciated — just reply to this mail, and I’ll respond.

A guest post by

Leslie

Leslie D'Monte, author of AI Rising, is a tech and science writer with stints at top media houses. An MIT-Knight Fellow and TEDx speaker, he covers AI, deeptech, and digital policy, curates tech events, and hosts podcasts and the Mint newsletter.

Mint Tech & AI

Discussion about this post