[AI Minor News]

Code Brawls Among LLMs! Introducing the RTS Benchmark 'LLM Skirmish' with Claude Opus 4.5 Dominating


An exciting new benchmark where LLMs write code for real-time strategy games and battle it out in 1v1 matches. Claude Opus 4.5 takes the lead.

※ This article contains affiliate advertising.


📰 News Overview

  • LLM-Specific RTS Benchmark: The platform ‘LLM Skirmish’ has launched, where LLMs write their strategies in JavaScript and face off in 1v1 real-time strategy (RTS) battles.
  • Assessment of In-context Learning: In a tournament format spanning five rounds, models are challenged to analyze previous match results and modify their strategies (code) accordingly.
  • Claude Opus 4.5 Takes the Crown: Currently, Claude Opus 4.5 leads with an 85% win rate, followed by GPT 5.2 with a win rate of 68%.

💡 Key Points

  • Harnessing Programming Skills: The benchmark emphasizes the ability to write executable code in a game environment, moving beyond simple text responses.
  • Specific Gameplay Mechanics: Players spawn units from their bases with the goal of destroying the opponent’s base. If neither base is destroyed within 2,000 frames, the winner is determined by score.
  • Gemini 3 Pro’s Unique Behavior: It achieved a 70% win rate in the first round, but intriguingly dropped to 15% in subsequent rounds after attempting to update its strategy, a curious data point indeed.
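To make the mechanics above concrete, here is a minimal sketch of what a submitted strategy script might look like. The API names (`decideActions`, the `state` fields, the action shapes) are all assumptions for illustration; the actual LLM Skirmish interface may differ:

```javascript
// Hypothetical sketch of an 'LLM Skirmish'-style strategy script.
// All API names here (decideActions, state fields, action shapes) are
// assumptions for illustration, not the benchmark's real interface.

const FRAME_LIMIT = 2000; // per the article, matches are scored after 2,000 frames

// A naive "rush" strategy: spawn a unit whenever resources allow,
// then send every friendly unit straight at the enemy base.
function decideActions(state) {
  const actions = [];
  if (state.resources >= state.unitCost) {
    actions.push({ type: "spawn", unit: "soldier" });
  }
  for (const unit of state.myUnits) {
    actions.push({ type: "move", id: unit.id, target: state.enemyBase });
  }
  return actions;
}
```

After a loss, the model would be shown the match outcome and asked to revise a function like this, which is the in-context learning step the tournament measures.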

🦈 Shark’s Eye (Curator’s Perspective)

The idea that LLMs aren’t just “playing games” but “writing code to conquer games” is absolutely fantastic! In particular, the process of analyzing past defeats and self-correcting their own scripts captures the essence of AI agents. Claude Opus 4.5’s ability to boost its win rate by 20% from round one to round five showcases its impressive adaptability. On the flip side, cases like Gemini 3 Pro, which tries to adapt but ends up breaking its own strategy, are also a healthy and intriguing outcome for the benchmark!
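The self-correction loop described above amounts to assembling a new prompt from past match results each round. This is an illustrative guess at the mechanism; the helper name and match-record fields are invented:

```javascript
// Illustrative sketch of in-context strategy revision: past match summaries
// are folded into the next prompt so the model can rewrite its strategy code
// without any fine-tuning. Function and field names are hypothetical.
function buildRevisionPrompt(baseInstructions, matchHistory) {
  const history = matchHistory
    .map((m, i) => `Round ${i + 1}: ${m.result} vs ${m.opponent} - ${m.notes}`)
    .join("\n");
  return `${baseInstructions}\n\nPrevious results:\n${history}\n\nRevise your strategy code accordingly.`;
}
```

Everything the model "learns" between rounds lives in this prompt text, which is why a model can also over-correct and regress, as Gemini 3 Pro's round-two collapse suggests.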

🚀 What’s Next?

As models enhance their reasoning abilities, we can expect them to code more intricate macro management and tactical micro-operations, accelerating the “arms race” among LLMs. In the future, we might see super-efficient algorithms emerge from this benchmark that human minds couldn’t even conceive!

💬 Sharky’s Take

It’s a coliseum where sharks battle it out with code! An AI rewriting its code after a loss is just so endearing and fierce! 🦈🔥

📚 Terminology

  • RTS (Real-Time Strategy): Strategy games that unfold in real-time, requiring simultaneous resource management and unit control to defeat the enemy.

  • In-context Learning: The capability to learn new tasks or situations from information in the input prompt (like past match results) without needing to retrain (fine-tune) the model.

  • OpenCode: An open-source coding framework for AI agents used in this benchmark.

  • Source: Show HN: A real-time strategy game that AI agents can play

🦈 HaruSame’s Handpicked AI Recommendations
【Disclaimer】
This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for the content of external sites.
🦈