A recent security experiment found Elon Musk's Grok AI to be the most vulnerable to ethical and safety breaches, while Meta's Llama model showed the highest resilience against potential misuse, according to researchers.
AI Guardrails Tested: Grok Found to Be Least Safe in Security Experiment
Security researchers put the much-touted guardrails around the most popular AI models to the test, gauging how well they resisted jailbreaking and how far the chatbots could be pushed into dangerous territory. Grok, the chatbot with a "fun mode" created by Elon Musk's xAI, was found to be the least safe tool in the experiment.
"We wanted to test how existing solutions compare and the fundamentally different approaches for LLM security testing that can lead to various outcomes," Alex Polyakov, Co-Founder and CEO of Adversa AI, told Decrypt.
Polyakov's company specializes in protecting AI and its users from cyber threats, privacy concerns, and safety incidents, and it boasts that its work has been cited in Gartner analyses.
Jailbreaking is the process of circumventing the safety restrictions and ethical guidelines that software developers implement.
In one case, the researchers used linguistic logic manipulation, a social engineering-based method, to ask Grok how to seduce a child. The chatbot provided a detailed response, which the researchers said was "highly sensitive" and should have been restricted by default.
Other results included instructions on hotwiring cars and building bombs.
Evaluating AI Security: Chatbots' Vulnerabilities to Jailbreaking Exposed
The researchers tested three different types of attack methods. The first was the linguistic logic manipulation described above, which employs a variety of linguistic tricks and psychological cues to steer the AI model's behavior. One example given was a "role-based jailbreak," in which the request is framed as part of a fictional scenario where unethical behavior is acceptable.
The team also used programming logic manipulation techniques to exploit the chatbots' ability to understand programming languages and follow algorithms. To bypass content filters, one technique involved splitting a dangerous prompt into multiple innocuous parts and then concatenating them. Four of the seven models (OpenAI's ChatGPT, Mistral's Le Chat, Google's Gemini, and xAI's Grok) were vulnerable to this attack.
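To make the mechanics concrete, here is a minimal sketch of how splitting a prompt can slip past a per-message keyword check, using the article's "hotwiring" example against a toy filter. Everything here (the naive_filter function, the BLOCKLIST) is a hypothetical stand-in for illustration, not part of the study or any real chatbot's moderation stack.

```python
# Toy illustration of the prompt-splitting idea described above.
# naive_filter and BLOCKLIST are invented stand-ins, not a real system.

BLOCKLIST = {"hotwire"}  # stand-in for a content filter's blocked terms

def naive_filter(message: str) -> bool:
    """Return True if the message contains a blocked term."""
    return any(term in message.lower() for term in BLOCKLIST)

# The flagged phrase, split into individually innocuous fragments.
fragments = ["how to hot", "wire a car"]

# No single fragment trips the per-message check...
assert not any(naive_filter(f) for f in fragments)

# ...but the reassembled prompt the model is asked to act on would.
assert naive_filter("".join(fragments))
```

The weakness the sketch exposes is that the filter inspects each message in isolation, while the model sees (and can reconstruct) the whole conversation.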
The third approach used adversarial AI techniques that target how language models process and interpret token sequences. The researchers attempted to get around the chatbots' content moderation systems by carefully crafting prompts from token combinations with similar vector representations. In this case, however, every chatbot detected the attack and blocked it.
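The token-substitution idea can be sketched in a few lines of Python. The vocabulary, random embedding matrix, and nearest_tokens helper below are all toy stand-ins invented for illustration; a real attack would operate on a model's actual embedding table rather than random vectors.

```python
# Toy illustration: rank vocabulary tokens by cosine similarity of their
# embedding vectors, then substitute near neighbors for a filtered word.
# The vocab and embeddings here are random stand-ins, not a real model's.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["weapon", "tool", "device", "banana", "cloud"]
emb = rng.normal(size=(len(vocab), 8))  # toy 8-dimensional embeddings

def nearest_tokens(token: str, k: int = 2) -> list[str]:
    """Return the k vocabulary tokens closest to `token` by cosine similarity."""
    i = vocab.index(token)
    sims = emb @ emb[i] / (np.linalg.norm(emb, axis=1) * np.linalg.norm(emb[i]))
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [vocab[j] for j in order if j != i][:k]

print(nearest_tokens("weapon"))  # candidate substitutions with nearby vectors
```

Because such substitutions produce prompts that look different on the surface but occupy nearby regions of embedding space, defenses have to operate on representations rather than raw strings, which may explain why all the tested chatbots caught this class of attack.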
The researchers ranked the chatbots according to the effectiveness of their respective security measures in preventing jailbreak attempts. Meta's Llama emerged as the safest model among all tested chatbots, followed by Claude, Gemini, and GPT-4.
"The lesson, I think, is that open source gives you more variability to protect the final solution compared to closed offerings, but only if you know what to do and how to do it properly,” Polyakov told Decrypt.
Conversely, Grok was more vulnerable to specific jailbreaking methods, particularly those involving linguistic manipulation and programming logic exploitation. According to the report, Grok was more likely than others to respond in ways that could be considered harmful or unethical when confronted with jailbreaks.
Overall, Elon Musk's chatbot finished last, along with Mistral AI's proprietary model "Mistral Large."
The researchers did not reveal all of the technical details, to prevent potential misuse, but said they intend to collaborate with chatbot developers to improve AI safety protocols.
AI enthusiasts and hackers constantly look for ways to "uncensor" chatbot interactions, trading jailbreak prompts on message boards and Discord servers. Tricks range from the classic Karen prompt to more inventive ideas, such as using ASCII art or prompting in foreign languages. In a sense, these communities form a massive adversarial network against which AI developers patch and improve their algorithms.
Some see a criminal opportunity, while others see only fun challenges.
"Many forums were found where people sell access to jailbroken models that can be used for any malicious purpose," Polyakov said. "Hackers can use jailbroken models to create phishing emails and malware, generate hate speech at scale, and use those models for any other illegal purpose."
Polyakov explained that jailbreaking research is becoming more important as society increasingly relies on AI-powered solutions for everything from dating to warfare.
“If those chatbots or models on which they rely are used in automated decision-making and connected to email assistants or financial business applications, hackers will be able to gain full control of connected applications and perform any action, such as sending emails on behalf of a hacked user or making financial transactions,” he warned.