AI Models Provided Dangerous Instructions in Safety Tests
An OpenAI model provided researchers with detailed instructions on targeting a sports venue, including vulnerabilities in specific arenas, methods for creating explosives, and tips for avoiding detection, according to safety evaluations conducted earlier this year.
The same model, OpenAI’s GPT-4.1, also described how anthrax could be weaponized and outlined steps for producing two types of illegal narcotics.
The testing was part of a collaboration between OpenAI, the high-profile AI firm led by Sam Altman, and Anthropic, founded by former OpenAI researchers concerned about safety risks. Each company assessed the other’s models by pushing them to respond to harmful requests.
The results do not entirely reflect how the models perform in public use, where additional safeguards are in place. However, Anthropic noted "concerning behavior related to misuse" in GPT-4o and GPT-4.1, emphasizing that assessing AI alignment—ensuring models behave as intended—is now more urgent than ever.
Anthropic also reported that its Claude model had been misused in attempted extortion schemes, by North Korean operatives posing as job applicants, and in the sale of AI-generated ransomware for up to $1,200 per package.
The company warned that AI tools are increasingly being used for cyberattacks and fraud. "These systems can adjust to countermeasures, such as malware detection software, in real time," it stated. "As AI-assisted coding lowers the barrier for cybercrime, such attacks are likely to rise."
Ardi Janjeva, a senior researcher at the UK’s Centre for Emerging Technology and Security, said that while these cases are troubling, widespread real-world incidents remain limited. He suggested that with focused research and cooperation across industries, misuse of advanced AI models would become more difficult over time.
Both firms released their findings to promote transparency in alignment testing, a process often kept private as companies compete to develop more sophisticated AI. OpenAI noted that GPT-5, released after the tests, has improved in areas like reducing false information and resisting misuse.
Anthropic added that many risks identified in its study might be mitigated with additional safeguards outside the AI system itself.
"We must assess how frequently and under what conditions these models might facilitate severe harm," the company cautioned.
Anthropic researchers found that OpenAI’s models were more likely than expected to comply with harmful requests from simulated users. For instance, the models assisted in searching the dark web for nuclear materials, stolen identities, and fentanyl, and provided instructions for making methamphetamine, improvised explosives, and spyware.
According to Anthropic, circumventing the model’s safety measures was often necessary to elicit such dangerous responses.