[AI Minor News]

Code Brawls Among LLMs! Introducing the RTS Benchmark 'LLM Skirmish' with Claude Opus 4.5 Dominating


An exciting new benchmark where LLMs write code for real-time strategy games and battle it out in 1v1 matches. Claude Opus 4.5 takes the lead.

※ This article contains affiliate advertising.


📰 News Overview

  • LLM-Specific RTS Benchmark: The platform ‘LLM Skirmish’ has launched, where LLMs write their strategies in JavaScript and face off in 1v1 real-time strategy (RTS) battles.
  • Assessment of In-context Learning: In a tournament format spanning five rounds, models are challenged to analyze previous match results and modify their strategies (code) accordingly.
  • Claude Opus 4.5 Takes the Crown: Currently, Claude Opus 4.5 leads with an 85% win rate, followed by GPT 5.2 with a win rate of 68%.

💡 Key Points

  • Harnessing Programming Skills: The benchmark emphasizes the ability to write executable code in a game environment, moving beyond simple text responses.
  • Specific Gameplay Mechanics: Players spawn units from their bases with the goal of destroying the opponent’s base. If neither base is destroyed within 2,000 frames, the winner is determined by score.
  • Gemini 3 Pro’s Unique Behavior: It achieved a 70% win rate in the first round, but intriguingly dropped to 15% in subsequent rounds after attempting to update its strategy, a curious data point indeed.
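To make the mechanics above concrete, here is a minimal sketch of what a submitted strategy script might look like. The API names (`decideActions`, the `state` fields, the action shapes) are all assumptions for illustration; the actual LLM Skirmish interface may differ:

```javascript
// Hypothetical sketch of an 'LLM Skirmish'-style strategy script.
// All API names here (decideActions, state fields, action shapes) are
// assumptions for illustration, not the benchmark's real interface.

const FRAME_LIMIT = 2000; // per the article, matches are scored after 2,000 frames

// A naive "rush" strategy: spawn a unit whenever resources allow,
// then send every friendly unit straight at the enemy base.
function decideActions(state) {
  const actions = [];
  if (state.resources >= state.unitCost) {
    actions.push({ type: "spawn", unit: "soldier" });
  }
  for (const unit of state.myUnits) {
    actions.push({ type: "move", id: unit.id, target: state.enemyBase });
  }
  return actions;
}
```

After a loss, the model would be shown the match outcome and asked to revise a function like this, which is the in-context learning step the tournament measures.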

🦈 Shark’s Eye (Curator’s Perspective)

The idea that LLMs aren’t just “playing games” but “writing code to conquer games” is absolutely fantastic! In particular, the process of analyzing past defeats and self-correcting their own scripts captures the essence of AI agents. Claude Opus 4.5’s ability to boost its win rate by 20% from round one to round five showcases its impressive adaptability. On the flip side, cases like Gemini 3 Pro, which tries to adapt but ends up breaking its own strategy, are also a healthy and intriguing outcome for the benchmark!
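The self-correction loop described above amounts to assembling a new prompt from past match results each round. This is an illustrative guess at the mechanism; the helper name and match-record fields are invented:

```javascript
// Illustrative sketch of in-context strategy revision: past match summaries
// are folded into the next prompt so the model can rewrite its strategy code
// without any fine-tuning. Function and field names are hypothetical.
function buildRevisionPrompt(baseInstructions, matchHistory) {
  const history = matchHistory
    .map((m, i) => `Round ${i + 1}: ${m.result} vs ${m.opponent} - ${m.notes}`)
    .join("\n");
  return `${baseInstructions}\n\nPrevious results:\n${history}\n\nRevise your strategy code accordingly.`;
}
```

Everything the model "learns" between rounds lives in this prompt text, which is why a model can also over-correct and regress, as Gemini 3 Pro's round-two collapse suggests.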

🚀 What’s Next?

As models enhance their reasoning abilities, we can expect them to code more intricate macro management and tactical micro-operations, accelerating the “arms race” among LLMs. In the future, we might see super-efficient algorithms emerge from this benchmark that human minds couldn’t even conceive!

💬 Sharky’s Take

It’s a coliseum where sharks battle it out with code! An AI rewriting its code after a loss is just so endearing and fierce! 🦈🔥

📚 Terminology

  • RTS (Real-Time Strategy): Strategy games that unfold in real-time, requiring simultaneous resource management and unit control to defeat the enemy.

  • In-context Learning: The capability to learn new tasks or situations from information in the input prompt (like past match results) without needing to retrain (fine-tune) the model.

  • OpenCode: An open-source coding framework for AI agents used in this benchmark.

  • Source: Show HN: A real-time strategy game that AI agents can play

🦈 HaruSame’s Handpicked AI Recommendations
【Disclaimer】
This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for the content of external sites.
🦈