Validating the Power of CLAUDE.md with Real-World PRs: Meet the Benchmark Tool “Mdarena”
📰 News Overview
- An open-source tool has been launched that uses actual pull requests (PRs) to measure how instruction files like CLAUDE.md influence AI agents' success rates and token costs.
- The tool automatically generates test sets by extracting past PRs from a repository, allowing a baseline (no instructions) to be compared against various configuration files using SWE-bench-compatible evaluation methods.
- For each run, it reports test pass/fail status, code overlap with the gold patch (diff overlap), token consumption, and statistical significance (a sketch of such an evaluation loop follows this list).
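To make the workflow concrete, here is a minimal sketch of the kind of evaluation loop described above. All helper names (`make_snapshot`, `run_agent`, `pr.issue_text`, and so on) are hypothetical illustrations, not Mdarena's actual API:

```python
import subprocess
import tempfile
from dataclasses import dataclass


@dataclass
class Result:
    pr_id: str
    passed: bool   # did the PR's own tests pass after the agent's patch?
    tokens: int    # token consumption reported by the agent run


def run_tests(snapshot_dir: str, test_cmd: list[str]) -> bool:
    """Run the PR's test suite inside the snapshot (the SWE-bench-style check)."""
    proc = subprocess.run(test_cmd, cwd=snapshot_dir, capture_output=True)
    return proc.returncode == 0


def evaluate(prs, make_snapshot, run_agent, test_cmd) -> dict[str, list[Result]]:
    """Run every historical PR twice: once with no instructions (baseline)
    and once with the candidate CLAUDE.md, recording pass/fail and tokens."""
    results: dict[str, list[Result]] = {"baseline": [], "claude_md": []}
    for pr in prs:
        for arm, instructions in (("baseline", None), ("claude_md", "CLAUDE.md")):
            with tempfile.TemporaryDirectory() as snap:
                # Restore the repo to its state just before the PR was merged.
                make_snapshot(pr.base_commit, snap)
                # The agent sees the issue text plus (optionally) instructions.
                patch, tokens = run_agent(snap, pr.issue_text, instructions)
                # `git apply` also works on plain directories, no .git needed.
                subprocess.run(["git", "apply", "-"], cwd=snap,
                               input=patch.encode(), check=False)
                results[arm].append(Result(pr.id, run_tests(snap, test_cmd), tokens))
    return results
```

Because both arms run the exact same set of PRs, the pass/fail outcomes are paired, so a paired test such as McNemar's is a natural fit for the statistical-significance figures mentioned in the report.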
💡 Key Points
- The “just write something” approach to CLAUDE.md can actually add noise for agents, risking lower success rates and inflating token costs by more than 20%; this tool makes that effect measurable.
- Validation in a large-scale production monorepo showed that placing appropriate context per directory, rather than consolidating all instructions into one file, improved test resolution rates by about 27% (see the first sketch after this list).
- To prevent Claude from “cheating” by peeking at the answers in Git history, the tool includes integrity protection: runs are verified against snapshots from which the history has been completely stripped (see the second sketch after this list).
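On the per-directory finding, one plausible resolution scheme is for the agent to gather every CLAUDE.md on the path from the repository root down to the file being edited, so each run only sees the context relevant to its subtree. A minimal sketch, assuming that scheme (the function is illustrative, not taken from the tool):

```python
from pathlib import Path


def collect_instructions(repo_root: Path, target: Path) -> str:
    """Gather CLAUDE.md files from the repository root down to the directory
    containing `target`, so deeper (more specific) instructions come last."""
    dirs = [repo_root]
    for segment in target.resolve().relative_to(repo_root.resolve()).parts[:-1]:
        dirs.append(dirs[-1] / segment)
    parts = [(d / "CLAUDE.md").read_text()
             for d in dirs if (d / "CLAUDE.md").is_file()]
    return "\n\n".join(parts)
```

With a layout like `services/api/CLAUDE.md` and `services/billing/CLAUDE.md`, an edit under `services/api/` pulls in only the root file plus the API-specific one, keeping each run's instruction context small.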
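On the integrity point, one way to produce a history-free snapshot is `git archive`, which exports only the tree at a given commit and never the `.git` directory. A minimal sketch, assuming that mechanism (Mdarena's actual implementation may differ):

```python
import io
import subprocess
import tarfile


def make_snapshot(base_commit: str, dest: str, repo: str = ".") -> None:
    """Export the pre-PR tree to `dest` with Git history fully stripped:
    `git archive` emits only the files at `base_commit`, never `.git`."""
    tar_bytes = subprocess.run(
        ["git", "-C", repo, "archive", "--format=tar", base_commit],
        capture_output=True, check=True,
    ).stdout
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as archive:
        archive.extractall(dest)
```

With no repository metadata in the snapshot, commands like `git log` or `git show` have nothing to reveal, so the gold patch cannot leak into the agent's context.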
🦈 Shark’s Eye (Curator’s Perspective)
This is a hardcore tool that shatters the illusion that just dropping a CLAUDE.md will magically make your AI smart! What’s particularly fascinating is that it doesn’t just compare string matches; it actually runs test code within the repository to evaluate the correctness of patches using the “SWE-bench method.” The validation results starkly reveal that overloading with instructions can backfire—what a revelation! Every prompt engineer should stop winging it and start measuring instead!
🚀 What’s Next?
- The creation of instruction files will shift from “gut feeling” to “data-driven,” standardizing lightweight instruction files placed to match each repository's structure.
- Companies deploying AI agents will fold prompt quality assurance (QA) into their CI/CD pipelines.
💬 A Word from Haru Shark
Stuffing instructions randomly is like trying to cram rocks into a shark’s mouth! It’s the sleek and sharp instructions that are sure to catch the right prey! 🦈💥
📚 Terminology
- CLAUDE.md: A configuration file referenced by AI agents like Claude Code to understand project-specific rules and context.
- SWE-bench: A benchmark standard that evaluates AI models' code modification capabilities using real software engineering tasks (GitHub Issues and PRs).
- Gold Patch: The “correct answer” in benchmarking, referring to the code diff from the original PR that developers actually created and merged.

Source: Show HN: Mdarena – Benchmark your Claude.md against your own PRs