
Validating the Power of CLAUDE.md with Real-World PRs: Meet the Benchmark Tool “Mdarena”

📰 News Overview

  • An open-source tool has been launched that uses real pull requests (PRs) to measure how instruction files like CLAUDE.md affect AI agents' success rates and token costs.
  • The tool automatically generates test sets by extracting past PRs from repositories, allowing comparisons between a baseline (no instructions) and various configuration files using SWE-bench-compatible evaluation methods.
  • After each run, it outputs a report covering test pass/fail status, overlap with the original code change (diff overlap), token consumption, and statistical significance.
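The article does not specify how the "diff overlap" metric is computed. One plausible formulation is line-level Jaccard similarity between the agent's patch and the gold patch; the sketch below illustrates that idea only, and the function names and the metric itself are assumptions, not Mdarena's actual implementation:

```python
def changed_lines(patch: str) -> set[str]:
    """Collect the added/removed lines of a unified diff, skipping
    the '+++' / '---' file headers."""
    return {
        line for line in patch.splitlines()
        if line[:1] in ("+", "-") and not line.startswith(("+++", "---"))
    }

def diff_overlap(agent_patch: str, gold_patch: str) -> float:
    """Jaccard similarity (0.0-1.0) of changed lines between two patches."""
    agent, gold = changed_lines(agent_patch), changed_lines(gold_patch)
    union = agent | gold
    return len(agent & gold) / len(union) if union else 1.0
```

An identical patch scores 1.0; a patch touching entirely different lines scores 0.0. Note that overlap is a secondary signal here: the primary verdict still comes from running the repository's tests.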

💡 Key Points

  • The “just write something” approach to CLAUDE.md can actually add noise for agents, lowering success rates and inflating token costs by over 20% in some cases; this tool makes that cost measurable.
  • Validation in large-scale production monorepos showed that placing appropriate context by directory—rather than consolidating instructions into one—improved test resolution rates by about 27%.
  • To prevent Claude from “cheating” by reading the answer out of Git history, the tool includes an integrity safeguard: each task runs against a snapshot from which the history has been completely stripped.

🦈 Shark’s Eye (Curator’s Perspective)

This is a hardcore tool that shatters the illusion that just dropping a CLAUDE.md will magically make your AI smart! What’s particularly fascinating is that it doesn’t just compare string matches; it actually runs test code within the repository to evaluate the correctness of patches using the “SWE-bench method.” The validation results starkly reveal that overloading with instructions can backfire—what a revelation! Every prompt engineer should stop winging it and start measuring instead!

🚀 What’s Next?

  • The creation of instruction files will shift from “gut feeling” to “data-driven,” standardizing the placement of lightweight instruction files optimized for repository structures.
  • Among companies deploying AI agents, prompt quality management (QA) processes will be integrated as part of CI/CD practices.
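If prompt QA does land in CI/CD, the simplest shape is a gate that fails the pipeline whenever the instructed run stops beating the baseline. The sketch below is purely hypothetical: the report field names and threshold are invented for illustration and are not part of Mdarena:

```python
import json

def prompt_qa_gate(report_json: str, min_margin: float = 0.0) -> bool:
    """Return True if the CLAUDE.md run beats the no-instruction baseline
    by at least min_margin; a CI job would exit non-zero otherwise."""
    report = json.loads(report_json)
    margin = report["claude_md_pass_rate"] - report["baseline_pass_rate"]
    return margin >= min_margin
```

Wired into CI, a regression in the instruction file would then block the merge just like a failing unit test.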

💬 A Word from Haru Shark

Stuffing instructions randomly is like trying to cram rocks into a shark’s mouth! It’s the sleek and sharp instructions that are sure to catch the right prey! 🦈💥

📚 Terminology

  • CLAUDE.md: A configuration file referenced by AI agents like Claude Code to understand project-specific rules and contexts.

  • SWE-bench: A benchmark standard that evaluates AI models’ code modification capabilities using real software engineering tasks (GitHub Issues and PRs).

  • Gold Patch: The “correct answer” in benchmarking, referring to the code diff from the original PR that developers have actually created and merged.

Source: Show HN: Mdarena – Benchmark your Claude.md against your own PRs

【Disclaimer】
This article was structured by AI and is verified and managed by the operator. Accuracy is not guaranteed, and we assume no responsibility for external content.
🦈