At nunu.ai, we develop AI agents that can play and test any app, game, or other software. To do this, our agents need to handle complex test planning, multi-step interactions, edge case discovery, and thorough verification across different application types and platforms.
We are building a comprehensive benchmark to evaluate these capabilities across computer use and phone use, with testing as the core example use case. The goal is to create a thorough, diverse, and publicly recognized benchmark that captures how capable different agents actually are.
You will create an externally facing benchmark (and expand our internal one) that becomes the go-to reference for evaluating AI testing agents. This includes designing test suites with carefully crafted sandbox applications, running evaluations against multiple agents, and presenting results in a way that is both quantitatively rigorous and visually compelling.
This is not just an internal tool: we plan to publish and showcase this benchmark broadly, allowing customers to directly compare the performance and pricing of our agents. Clarity, quality, and presentation are therefore critical.
We do not want manual verification or scoring. Every level must include an automated verification system (e.g. checkpoints, scoring, or state validation).
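To make the idea of automated verification concrete, here is a minimal sketch of a level scored by checkpoints over the sandbox application's final state. It is only an illustration under our own assumptions: the names `Checkpoint`, `Level`, `score`, and the example state keys are hypothetical and do not describe our actual benchmark code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical representation of a sandbox app's observable state.
State = Dict[str, Any]


@dataclass
class Checkpoint:
    """One automatically checkable condition on the application state."""
    name: str
    check: Callable[[State], bool]


@dataclass
class Level:
    """A benchmark level with checkpoints that need no human reviewer."""
    name: str
    checkpoints: List[Checkpoint]

    def score(self, final_state: State) -> float:
        """Return the fraction of checkpoints satisfied by the final state."""
        if not self.checkpoints:
            return 0.0
        passed = sum(1 for c in self.checkpoints if c.check(final_state))
        return passed / len(self.checkpoints)


# Hypothetical level: the agent should log in and add one item to the cart.
level = Level(
    name="login-and-add-to-cart",
    checkpoints=[
        Checkpoint("logged_in", lambda s: s.get("user") == "alice"),
        Checkpoint("cart_has_item", lambda s: len(s.get("cart", [])) == 1),
    ],
)

# State left behind by an agent that logged in but never filled the cart.
partial_state: State = {"user": "alice", "cart": []}
partial_score = level.score(partial_state)
```

In this sketch the agent run above would receive partial credit, since one of the two checkpoints passes. A real level would likely combine such state validation with intermediate checkpoints recorded during the run.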
We hire smart and passionate people who are ready to learn fast. None of these requirements are hard constraints if you’re exceptional: