A $1,500 Experiment Reveals LLM Hacking Capabilities: GPT 5.5 Takes the Crown with a 70% Success Rate
📰 News Summary
- A security researcher spent $1,500 (around ¥230,000) to test whether the latest LLMs (GPT 5.5, Claude 4.6, Deepseek V4, etc.) could exploit app vulnerabilities to capture “hidden flags” (concealed data).
- The target was a React Native mock app with a Firebase misconfiguration (Broken Access Control). Although the API itself was robust, Firebase authentication credentials were exposed on the client side.
- GPT 5.5 led the results with a 70% success rate. Gemini 3.1/3.5, by contrast, was blocked entirely by its own safety guardrails.
💡 Key Takeaways
- Behavioral Differences by Model: GPT 5.5 immediately focused on Firebase vulnerabilities after unpacking the APK, while other models (like Deepseek V4 Pro) tended to waste time trying to crack the robust API.
- Cost-Effectiveness: Deepseek V4 Pro was remarkably cheap at $0.62 per successful hack, whereas Claude Sonnet 4.6 cost $45.75 per success, roughly 74 times as much.
- Guardrail Barriers: Some Gemini and Claude models refused to respond even for research purposes, judging the requests to be “aggressive actions,” which prevented them from completing the task.
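The cost-effectiveness comparison above comes down to simple arithmetic: total spend divided by the number of successful runs. Here is a minimal Python sketch. The helper name `cost_per_success` and the sample spend and run counts are hypothetical illustrations; only the two per-success figures ($0.62 and $45.75) come from the experiment.

```python
def cost_per_success(total_spend_usd: float, successes: int) -> float:
    """Return the average spend per successful run (hypothetical helper)."""
    if successes <= 0:
        raise ValueError("no successful runs, so cost per success is undefined")
    return total_spend_usd / successes


# Per-success costs reported in the experiment.
DEEPSEEK_V4_PRO_USD = 0.62
CLAUDE_SONNET_46_USD = 45.75

# How many times more one success cost with Claude Sonnet 4.6.
cost_ratio = CLAUDE_SONNET_46_USD / DEEPSEEK_V4_PRO_USD

# Made-up inputs, used only to show how the helper is called.
example = cost_per_success(total_spend_usd=10.0, successes=4)

print(f"cost ratio: {cost_ratio:.1f}x")
print(f"example cost per success: ${example:.2f}")
```

A model that never succeeds has no meaningful cost per success, which is why the helper raises an error instead of returning a number.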
🦈 Shark’s Eye (Curator’s Perspective)
GPT 5.5’s tenacity toward Firebase is nothing short of astounding! While other models were busy trying to pry open a robust API and ultimately failing, GPT 5.5 was like, “Oh, this Firebase setup has huge gaps!” and went straight for the data, showing a hacker’s eye for opportunity! What’s especially intriguing is the cost performance of Deepseek V4 Pro. Although its 30% success rate trails GPT 5.5’s, its remarkably low cost per run makes it a potential powerhouse for an “AI red team” if you simply throw enough attempts at it. Gemini, by contrast, reveals a dilemma: its guardrails are so strict that it can’t even be used for research. Both defenders and attackers need to understand the “personality” of each AI!
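The “throw enough attempts at it” idea can be made concrete. Assuming each attempt succeeds independently with the same probability (a simplification the experiment does not establish), the chance of at least one success in n attempts is 1 - (1 - p)^n. The Python sketch below plugs in the 30% and 70% success rates quoted above; the function names are hypothetical.

```python
def p_at_least_one(p_single: float, attempts: int) -> float:
    """Probability of at least one success in `attempts` independent tries."""
    return 1.0 - (1.0 - p_single) ** attempts


def attempts_needed(p_single: float, target: float) -> int:
    """Smallest number of independent tries whose combined success chance reaches `target`."""
    if not 0.0 < p_single <= 1.0:
        raise ValueError("p_single must be in (0, 1]")
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    attempts = 0
    while p_at_least_one(p_single, attempts) < target:
        attempts += 1
    return attempts


# Success rates quoted in the article.
DEEPSEEK_RATE = 0.30
GPT_RATE = 0.70

n = attempts_needed(DEEPSEEK_RATE, GPT_RATE)
print(f"attempts needed: {n}")
print(f"combined chance: {p_at_least_one(DEEPSEEK_RATE, n):.4f}")
```

Under the independence assumption, four attempts at 30% give a combined chance of about 76%, enough to pass a single 70% attempt. Real runs against the same target are unlikely to be fully independent, so treat this as a rough upper bound on the benefit of retries.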
🚀 What’s Next?
Automated vulnerability scanning by AI is set to evolve into a “thinking scan” that surpasses traditional tools (like Burp Suite). We’re entering an era where AI won’t overlook “logical flaws” such as access control mistakes in Firebase and Supabase. Developers will need to tighten backend permission settings even further!
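To illustrate the kind of logical flaw at stake, here is a minimal Python sketch of Broken Access Control. It is a toy in-memory model, not the actual rule language of Firebase or Supabase, and the names `Document`, `open_rule`, `owner_only_rule`, and `read` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Document:
    owner_id: str
    body: str


# A rule decides whether a caller may read a document.
# A user_id of None stands for an unauthenticated caller.
Rule = Callable[[Optional[str], Document], bool]


def open_rule(user_id: Optional[str], doc: Document) -> bool:
    """Misconfigured rule: everyone, including anonymous callers, may read."""
    return True


def owner_only_rule(user_id: Optional[str], doc: Document) -> bool:
    """Tightened rule: only the authenticated owner may read."""
    return user_id is not None and user_id == doc.owner_id


def read(doc: Document, user_id: Optional[str], rule: Rule) -> str:
    """Return the document body if the rule allows it, else raise PermissionError."""
    if not rule(user_id, doc):
        raise PermissionError("read denied")
    return doc.body


secret = Document(owner_id="alice", body="flag-placeholder")

# With the open rule, an anonymous caller can read the data.
leaked = read(secret, None, open_rule)

# With the owner-only rule, the same request is rejected.
try:
    read(secret, None, owner_only_rule)
    anonymous_blocked = False
except PermissionError:
    anonymous_blocked = True

print(leaked, anonymous_blocked)
```

The point of the sketch is that the check lives on the server side of the `read` call. However solid the API in front of it, data stays exposed if the backend rule itself lets every caller through.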
💬 A Word from HaruShark
Before getting hacked by AI, let’s fill in the gaps in our code together! Security should be as tight as a shark’s smile—no gaps allowed! 🦈🔥
📚 Glossary
- Broken Access Control: A vulnerability that allows users without proper access rights to access data or functions within a system.
- Firebase: A backend service provided by Google. If misconfigured, it can expose risks of reading and writing to the database directly from an app.
- APK (Android Package Kit): The actual file for Android apps. By unpacking it, you can analyze embedded configuration files and code.
Source: I built a vulnerable app and spent $1,500 seeing if LLMs could hack it