At nunu.ai, we develop AI agents that can play and test any game. To do this, our agents need to handle long-term planning, navigation, spatial reasoning, and real-world interaction across different environments and devices.
We are building a comprehensive benchmark to evaluate these capabilities across computer use and phone use, with gaming as the core example use case. The goal is to create a thorough, diverse, and publicly recognized benchmark that captures how capable different agents actually are.
You will create an externally facing benchmark (and expand our internal one) that can become a go-to reference for AI agents operating on real interfaces. This includes designing sandbox environments with isolated tasks, running evaluations, and presenting results in a way that is both quantitative and visually compelling.
This is not just an internal tool: we plan to publish and showcase this benchmark broadly, allowing customers to directly compare the performance and pricing of our agents. So clarity, quality, and presentation are critical.
We do not want manual verification or scoring. Every level must include an automated verification system (e.g. checkpoints, scoring, or state validation).
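To illustrate the checkpoint-based approach, here is a minimal sketch of an automated level verifier. All names (`GameState`, `Checkpoint`, `score_level`) and the example conditions are hypothetical, not part of nunu.ai's actual system:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class GameState:
    """Snapshot of the environment at evaluation time (hypothetical fields)."""
    position: tuple[float, float]
    inventory: set[str] = field(default_factory=set)

@dataclass
class Checkpoint:
    """One automatically verifiable condition for a level."""
    name: str
    points: int
    passed: Callable[[GameState], bool]

def score_level(state: GameState, checkpoints: list[Checkpoint]) -> dict:
    """Evaluate every checkpoint against the final state; no human in the loop."""
    results = {cp.name: cp.passed(state) for cp in checkpoints}
    score = sum(cp.points for cp in checkpoints if results[cp.name])
    return {"results": results, "score": score}

# Example level: reach the exit zone and collect a key.
checkpoints = [
    Checkpoint("reached_exit", 50, lambda s: s.position[0] > 100.0),
    Checkpoint("has_key", 50, lambda s: "key" in s.inventory),
]

final_state = GameState(position=(120.0, 5.0), inventory={"key"})
print(score_level(final_state, checkpoints))  # both checkpoints pass -> score 100
```

Because each condition is a pure function of observable state, the same verifier can run unattended across many agents and produce directly comparable scores.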
We hire smart and passionate people who are ready to learn fast. None of these requirements are hard constraints if you’re exceptional: