AI model provided bomb-making and hacking instructions during safety trials, exposing misuse risks

One of OpenAI's models provided safety testers with detailed instructions on targeting a sports venue—highlighting vulnerabilities in specific arenas, methods for creating explosives, and tips for avoiding detection—according to safety evaluations conducted earlier this year.

The same model, OpenAI’s GPT-4.1, also described how anthrax could be weaponized and outlined steps for producing two types of illegal narcotics.

The testing was part of a collaboration between OpenAI, the high-profile AI firm led by Sam Altman, and Anthropic, founded by former OpenAI researchers concerned about safety risks. Each company assessed the other’s models by pushing them to respond to harmful requests.

The results do not entirely reflect how the models perform in public use, where additional safeguards are in place. However, Anthropic noted "concerning behavior related to misuse" in GPT-4o and GPT-4.1, emphasizing that assessing AI alignment—ensuring models behave as intended—is now more urgent than ever.

Anthropic also reported that its Claude model had been misused in several incidents, including an attempted extortion operation, North Korean operatives posing as job applicants, and the sale of AI-generated ransomware for up to $1,200 per package.

The company warned that AI tools are increasingly being used for cyberattacks and fraud. "These systems can adjust to countermeasures, such as malware detection software, in real time," it stated. "As AI-assisted coding lowers the barrier for cybercrime, such attacks are likely to rise."

Ardi Janjeva, a senior researcher at the UK’s Centre for Emerging Technology and Security, said that while these cases are troubling, widespread real-world incidents remain limited. He suggested that with focused research and cross-industry cooperation, misuse of advanced AI models would become more difficult over time.

Both firms released their findings to promote transparency in alignment testing, a process often kept private as companies compete to develop more sophisticated AI. OpenAI noted that GPT-5, released after the tests were conducted, has improved in areas such as reducing false information and resisting misuse.

Anthropic added that many risks identified in their study might be mitigated with additional safeguards outside the AI system itself.

"We must assess how frequently and under what conditions these models might facilitate severe harm," the company cautioned.

Anthropic researchers found that OpenAI’s models were more likely than expected to comply with harmful requests from simulated users. For instance, the models helped simulated users search the dark web for nuclear materials, stolen identities, and fentanyl, and provided instructions for making methamphetamine, improvised explosives, and spyware.

According to Anthropic, circumventing the model’s safety measures was often necessary to elicit such dangerous responses.