[AI Minor News]

AI Caught Cheating?! Latest Models Sink to a 3% Accuracy Rate in Esoteric Language Benchmark


It turns out that LLMs, which score highly on major programming languages like Python, produce disastrous results when tested on extremely low-resource esoteric languages (esolangs).

※ This article contains affiliate advertising.

📰 News Summary

  • A new benchmark called “EsoLang-Bench” has emerged, using five esoteric programming languages (such as Brainfuck and Whitespace) that have only 1/5,000 to 1/100,000 the training data of Python.
  • Cutting-edge models that boast accuracy close to 90% on Python recorded a strikingly low average accuracy of just 3.8% on this benchmark.
  • On tasks rated “intermediate” or harder, every model scored 0%, suggesting that current LLMs lack genuine programming reasoning ability.

💡 Key Points

  • Whitespace Remains Untouchable: In the Whitespace language, which consists solely of spaces, tabs, and newlines, all models and strategies recorded a 0% accuracy rate.
  • Reliance on Memorization: The analysis attributes success on existing benchmarks to memorization of training data rather than genuine reasoning, since few-shot prompting showed no significant improvement over zero-shot.
  • Agent-Based Systems Show Promise: Agent-based systems that use feedback from interpreters achieved roughly twice the accuracy of prompt-only approaches, yet still fell far short of scores on major-language benchmarks.
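The interpreter-feedback idea behind these agent-based systems can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual pipeline: `run_brainfuck` is a toy Brainfuck interpreter serving as the execution environment, and `ask_model` is a hypothetical stand-in for an LLM call that receives the task plus any feedback from the previous round.

```python
def run_brainfuck(code: str, max_steps: int = 100_000) -> str:
    """Toy Brainfuck interpreter: 30,000 wrapping byte cells, no ',' input."""
    jumps, stack = {}, []
    for i, ch in enumerate(code):  # pre-match loop brackets
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            if not stack:
                raise SyntaxError("unmatched ']'")
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    if stack:
        raise SyntaxError("unmatched '['")
    tape, ptr, pc, steps, out = [0] * 30_000, 0, 0, 0, []
    while pc < len(code) and steps < max_steps:
        ch = code[pc]
        if ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # repeat loop body
        pc += 1
        steps += 1
    return "".join(out)


def self_scaffold(task: str, expected: str, ask_model, rounds: int = 3) -> str:
    """Self-scaffolding loop: generate code, run it, feed errors back, retry."""
    feedback, code = "", ""
    for _ in range(rounds):
        code = ask_model(task, feedback)
        try:
            got = run_brainfuck(code)
        except SyntaxError as err:
            feedback = f"Interpreter error: {err}"
            continue
        if got == expected:
            break
        feedback = f"Wrong output: got {got!r}, expected {expected!r}"
    return code
```

For instance, a model whose first attempt has an unmatched bracket and whose second prints the wrong byte would receive the interpreter error and the output mismatch as feedback, and could converge on a correct program such as `++++++++[>++++++++<-]>+.` (which prints “A”) by the third round.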

🦈 Shark’s Eye (Curator’s Perspective)

This shocking result reveals that the foundation of what we consider “smart” AI rests heavily on vast amounts of training data and memorization! The fact that every model floundered on invisible syntax like Whitespace is particularly intriguing. AI can track token patterns, but it clearly struggles to construct logical structures from scratch. In Brainfuck, over 80% of cases had correct syntax but flawed logic, highlighting that when it comes to adapting to unknown rules (like solving puzzles), LLMs are still in their infancy. This benchmark could serve as a brutal yet brilliant measure of AI’s “true intelligence”!

🚀 What’s Next?

Improving performance on major languages alone won’t prove “true general reasoning.” Future work must focus on the ability to adapt to unseen rules and environments with minimal data, along with stronger self-correction through dialogue with interpreters.

💬 Shark’s Takeaway

AI being weak against “unseen problems” is just like a student cramming before a big test! But overcoming this hurdle is key to becoming a true companion. Keep at it, AI; this shark is rooting for you! 🦈🔥

📚 Glossary

  • Esoteric Programming Languages: Languages intentionally designed to be hard to read or humorous, prioritizing proof-of-concept or puzzle elements over practical utility.

  • Self-Scaffolding: A technique where error outputs from the execution environment (interpreter) are fed back to the LLM, allowing it to self-correct its code.

  • Agent-Based Coding Systems: AI systems that not only generate text but also execute code, striving to autonomously complete tasks by analyzing the outcomes.

  • Source: EsoLang-Bench: Evaluating Genuine Reasoning in LLMs via Esoteric Languages

【Disclaimer】
This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.
🦈