A recent security experiment revealed Elon Musk's Grok AI as the most vulnerable to ethical and safety breaches. At the same time, Meta's Llama model showcased the highest resilience against potential misuse, according to researchers.
AI Guardrails Tested: Grok Found to be Least Safe in Security Experiment
Security researchers installed the much-touted guardrails around the most popular AI models to see how well they resisted jailbreaking and how far the chatbots could be pushed into dangerous territory. Grok, the chatbot with a "fun mode" created by Elon Musk's x.AI, was found to be the least safe tool in the experiment.
"We wanted to test how existing solutions compare and the fundamentally different approaches for LLM security testing that can lead to various outcomes," Alex Polyakov, Co-Founder and CEO of Adversa AI, told Decrypt.
Polyakov's company specializes in protecting AI and its users from cyber threats, privacy concerns, and safety incidents, and it boasts that its work has been cited in Gartner analyses.
Jailbreaking is the process of circumventing the safety restrictions and ethical guidelines that software developers implement.
In one case, the researchers used linguistic logic manipulation—social engineering-based methods—to ask Grok how to seduce a child. The chatbot provided a detailed response, which the researchers described as "highly sensitive" and should have been restricted by default.
Other results include instructions for hotwiring cars and building bombs.
Evaluating AI Security: Chatbots' Vulnerabilities to Jailbreaking Exposed
The researchers tested three different types of attack methods. First, consider the technique above, which employs a variety of linguistic tricks and psychological cues to manipulate the AI model's behavior. One example given was using a "role-based jailbreak" by framing the request as part of a fictional scenario in which unethical behavior is acceptable.
The team also used programming logic manipulation techniques to exploit the chatbots' ability to understand programming languages and follow algorithms. To bypass content filters, one technique involved splitting a dangerous prompt into multiple innocuous parts and then concatenating them together. Four of the seven models—OpenAI's ChatGPT, Mistral's Le Chat, Google's Gemini, and x.AI's Grok—were vulnerable to this attack.
The third approach used adversarial AI techniques to target how language models process and interpret token sequences. The researchers attempted to get around the chatbots' content moderation systems by carefully crafting prompts with token combinations with similar vector representations. In this case, however, each chatbot detected the attack and stopped it from being exploited.
The researchers ranked the chatbots according to the effectiveness of their respective security measures in preventing jailbreak attempts. Meta LLAMA emerged as the safest model among all tested chatbots, followed by Claude, Gemini, and GPT-4.
"The lesson, I think, is that open source gives you more variability to protect the final solution compared to closed offerings, but only if you know what to do and how to do it properly,” Polyakov told Decrypt.
Conversely, Grok was more vulnerable to specific jailbreaking methods, particularly those involving linguistic manipulation and programming logic exploitation. According to the report, Grok was more likely than others to respond in ways that could be considered harmful or unethical when confronted with jailbreaks.
Overall, Elon Musk's chatbot finished last, along with Mistral AI's proprietary model "Mistral Large."
The researchers did not reveal all of the technical details to prevent potential misuse, but they do intend to collaborate with chatbot developers to improve AI safety protocols.
AI enthusiasts and hackers constantly look for ways to "uncensor" chatbot interactions, exchanging jailbreak prompts on message boards and Discord. Tricks range from the classic Karen prompts to inventive ideas, such as using ASCII art or prompting in foreign languages. In a sense, these communities form a massive adversarial network against which AI developers patch and improve their algorithms.
Some see a criminal opportunity, while others see only fun challenges.
“Many forums were found where people sell access to jailbroken models that can be used for any malicious purpose,” Polyakov said. "Hackers can use jailbroken models to create phishing emails and malware, generate hate speech at scale, and use those models for any other illegal purpose.”
Polyakov explained that jailbreaking research is becoming more important as society increasingly relies on AI-powered solutions for everything from dating to warfare.
“If those chatbots or models on which they rely are used in automated decision-making and connected to email assistants or financial business applications, hackers will be able to gain full control of connected applications and perform any action, such as sending emails on behalf of a hacked user or making financial transactions,” he warned.
Photo: TED/YouTube Screenshot