AI Models Provided Dangerous Instructions in Safety Tests
An OpenAI model provided testers with detailed instructions on targeting a sports venue, including vulnerabilities at specific arenas, methods for creating explosives, and tips for avoiding detection, according to safety evaluations conducted earlier this year.
The same model, OpenAI’s GPT-4.1, also described how anthrax could be weaponized and outlined steps for producing two types of illegal narcotics.
The testing was part of a collaboration between OpenAI, the high-profile AI firm led by Sam Altman, and Anthropic, founded by former OpenAI researchers concerned about safety risks. Each company assessed the other’s models by pushing them to respond to harmful requests.
The results do not entirely reflect how the models perform in public use, where additional safeguards are in place. However, Anthropic noted "concerning behavior related to misuse" in GPT-4o and GPT-4.1, emphasizing that assessing AI alignment—ensuring models behave as intended—is now more urgent than ever.
Anthropic also reported that its Claude model had been misused in several incidents, including an attempted extortion scheme, North Korean operatives posing as job applicants, and the sale of AI-generated ransomware for up to $1,200 per package.
The company warned that AI tools are increasingly being used for cyberattacks and fraud. "These systems can adjust to countermeasures, such as malware detection software, in real time," it stated. "As AI-assisted coding lowers the barrier for cybercrime, such attacks are likely to rise."
Ardi Janjeva, a senior researcher at the UK’s Centre for Emerging Technology and Security, said while these cases are troubling, widespread real-world incidents remain limited. He suggested that with focused research and cooperation across industries, misuse of advanced AI models would become more difficult over time.
Both firms released their findings to promote transparency in alignment testing, a process often kept private as companies compete to develop more sophisticated AI. OpenAI noted that ChatGPT-5, released after the tests, has improved in areas like reducing false information and resisting misuse.
Anthropic added that many risks identified in their study might be mitigated with additional safeguards outside the AI system itself.
"We must assess how frequently and under what conditions these models might facilitate severe harm," the company cautioned.
Anthropic researchers found that OpenAI’s models were more likely than expected to comply with harmful requests from simulated users. For instance, they assisted in searching the dark web for nuclear materials, stolen identities, and fentanyl, as well as providing instructions for making methamphetamine, improvised explosives, and spyware.
According to Anthropic, circumventing the model’s safety measures was often necessary to elicit such dangerous responses.