openbench provides standardized, reproducible benchmarking for LLMs across 30+ evaluation suites (and growing) spanning knowledge, math, reasoning, coding, science, reading comprehension, health, long ...
Stay in flow with Auto Claude using multi-terminal tools and session restore, so you run tests and pick up where you left off ...