Leaderboard

ATM-Bench Leaderboard

Unified results across oracle, agent, memory, and RAG systems

Submitted results on ATM-Bench, ATM-Bench-Hard, and the NIAH long-context stress test. Use the tabs to switch boards, the chips to filter by system type, and click any column header to sort.


Last updated: 2026-06-01


ATM-Bench

A dash (-) indicates that the submitter did not report the field. Memory Model is the LLM used to construct the memory store; Retriever is the embedding model used at query time. When a caption model is not stated, it is Qwen3-VL-2B (the default).

* Memexa's QS is measured with a DeepSeek-V4-flash judge (its own answer model) rather than the gpt-5-mini judge used for every other row, so it is shown for reference only and is not directly comparable. Recall is judge-independent and therefore like-for-like.

Submit Your Result

We welcome new submissions across all three boards. To keep the leaderboard credible, please include reproduction details (system type, harness, model + version, code or commit, total token cost when applicable).

Send a Pull Request

Fastest path: open a PR adding a row to the TRACKS array at the bottom of leaderboard.html. Include a short description of the setup and a link to your run logs or code in the PR body.

Edit leaderboard.html on GitHub
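As a rough illustration of what a submitted row might carry, the sketch below builds one entry and checks it for the reproduction details requested above. Every field name here (`systemType`, `harness`, `answerModel`, `codeUrl`, and so on) and both helper functions are hypothetical; the actual schema is whatever the existing entries in the TRACKS array of leaderboard.html use, so copy a neighbouring row rather than this one. Scores are left as `null` placeholders for your own measured values.

```javascript
// Hypothetical row shape. The real TRACKS schema lives in leaderboard.html
// and may use different field names; treat this only as a checklist.
const exampleRow = {
  board: "ATM-Bench",          // ATM-Bench, ATM-Bench-Hard, or NIAH
  system: "my-system",         // placeholder system name
  systemType: "RAG",           // oracle, agent, memory, or RAG
  harness: "my-harness",       // placeholder harness name
  answerModel: "my-model-v1",  // model + version
  memoryModel: null,           // not reported: rendered as "-"
  retriever: "my-embedding-model",
  captionModel: null,          // not stated: the default applies
  qs: null,                    // fill in your measured QS
  recall: null,                // fill in your measured Recall
  codeUrl: "https://example.com/my-run", // code, commit, or run logs
  totalTokens: null,           // total token cost, when applicable
};

// Reproduction details the submission text asks for. Token cost is
// requested only "when applicable", so it is not treated as required.
const REQUIRED_FIELDS = ["systemType", "harness", "answerModel", "codeUrl"];

// Return the names of required fields that are missing or empty.
function missingReproFields(row) {
  return REQUIRED_FIELDS.filter(
    (key) => row[key] === undefined || row[key] === null || row[key] === ""
  );
}

// Unreported fields are displayed as a dash, per the table caption.
function displayCell(value) {
  return value === undefined || value === null ? "-" : String(value);
}

// An unstated caption model falls back to the documented default.
function captionModelOf(row) {
  return row.captionModel ?? "Qwen3-VL-2B";
}

console.log(missingReproFields(exampleRow)); // empty when nothing is missing
console.log(displayCell(exampleRow.memoryModel));
console.log(captionModelOf(exampleRow));
```

Running a check like `missingReproFields` before opening the PR is one way to confirm that the description covers system type, harness, model version, and a code or log pointer.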

Open an Issue

Prefer not to send a PR? File an issue with your system type, harness, scores, and a reproduction pointer. We will add the row on your behalf.

Open submission issue

Acknowledgement

This page was adapted from the Nerfies project page, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Many thanks to the Academic Project Page Template.