ATM-Bench Leaderboard

Unified results across oracle, agent, memory, and RAG systems

Submitted results on ATM-Bench, ATM-Bench-Hard, and the NIAH-100 long-context stress test. Use the tabs to switch boards, use the chips to filter by system type, and click any column header to sort.

Last updated: 2026-05-22

ATM-Bench

A "-" indicates that the submitter did not report the field. Memory Model is the LLM used to construct the memory store; Retriever is the embedding model used at query time.

Submit Your Result

We welcome new submissions across all three boards. To keep the leaderboard credible, please include reproduction details: system type, harness, model and version, code or commit, and total token cost when applicable.

Send a Pull Request

Fastest path: open a PR adding a row to the TRACKS array at the bottom of leaderboard.html. Include a short description of the setup and a link to your run logs or code in the PR body.

Edit leaderboard.html on GitHub
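
As a rough illustration of what a submitted row might carry, here is a minimal sketch. Every field name below (`board`, `systemType`, `harness`, `model`, `memoryModel`, `retriever`, `score`, `tokenCost`, `code`) is an assumption made for this example, as are the helper functions; the real keys are whatever the existing rows in the TRACKS array use, so copy those rather than these. The values are placeholders, not results.

```javascript
// Hypothetical row shape. The key names are assumptions for illustration only;
// mirror the keys of an existing TRACKS row in leaderboard.html instead.
const exampleRow = {
  board: "ATM-Bench",          // ATM-Bench, ATM-Bench-Hard, or NIAH-100
  systemType: "memory",        // oracle, agent, memory, or RAG
  harness: "my-harness",       // placeholder harness name
  model: "example-model-v1",   // placeholder model and version
  memoryModel: null,           // LLM used to construct the memory store, if any
  retriever: null,             // embedding model used at query time, if any
  score: null,                 // fill in your own reported score
  tokenCost: null,             // total token cost, when applicable
  code: "https://example.com/your-repo/commit/abc123", // placeholder link
};

// Reproduction details requested for every submission (assumed key names).
const REQUIRED = ["systemType", "harness", "model", "code"];

// True when a field was left out or left empty.
function isBlank(value) {
  return value === undefined || value === null || value === "";
}

// Returns the required keys that are missing from a row.
function missingFields(row) {
  return REQUIRED.filter((key) => isBlank(row[key]));
}

// Unreported fields are shown as "-" on the board.
function displayValue(value) {
  return isBlank(value) ? "-" : String(value);
}
```

Running `missingFields` on a draft row before opening the PR is a quick way to check that the requested reproduction details are all present; optional fields such as the retriever can stay empty and would simply display as "-".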

Open an Issue

Prefer not to send a PR? File an issue with your system type, harness, scores, and a reproduction pointer. We will add the row on your behalf.

Open submission issue

Acknowledgement

This page was adapted from the Nerfies project page, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Many thanks to the Academic Project Page Template.