Project Description

At nunu.ai, we develop AI agents that can play and test any game. To do this, our agents need to handle long-term planning, navigation, spatial reasoning, and real-world interaction across different environments and devices.

We are building a comprehensive benchmark to evaluate these capabilities across computer use and phone use, with gaming as the core use case. The goal is a thorough, diverse, and publicly recognized benchmark that captures how capable different agents actually are.

You will create an externally facing benchmark (and expand our internal one) that can become a go-to reference for AI agents operating on real interfaces. This includes designing sandbox environments with isolated tasks, running evaluations, and presenting results in a way that is both quantitative and visually compelling.

This is not just an internal tool: we plan to publish and showcase this benchmark broadly, allowing customers to directly compare the performance and pricing of our agents. So clarity, quality, and presentation are critical.

We do not want manual verification or scoring. Every level must include an automated verification system (e.g. checkpoints, scoring, or state validation).
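As a rough illustration only, an automated verifier could be as simple as a set of checkpoint predicates evaluated against observed state snapshots. Everything below (names, the `Checkpoint`/`LevelVerifier` classes, the state format) is a hypothetical sketch, not an existing nunu.ai API:

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Checkpoint:
    """A single automatically verifiable milestone inside a level."""
    name: str
    # Predicate over an observed state snapshot; True once the milestone is reached.
    reached: Callable[[dict], bool]
    points: int = 1


@dataclass
class LevelVerifier:
    """Scores a level purely from state snapshots, with no manual review."""
    checkpoints: list[Checkpoint]
    passed: set[str] = field(default_factory=set)

    def update(self, state: dict) -> None:
        """Call on every new snapshot (screen parse, emulator state, API poll, ...)."""
        for cp in self.checkpoints:
            if cp.name not in self.passed and cp.reached(state):
                self.passed.add(cp.name)

    def score(self) -> float:
        """Fraction of available points earned so far."""
        total = sum(cp.points for cp in self.checkpoints)
        earned = sum(cp.points for cp in self.checkpoints if cp.name in self.passed)
        return earned / total if total else 0.0
```

The same idea extends to pass/fail state validation: a level passes when its score reaches 1.0, with no human in the loop.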

🎯 Responsibilities

  1. Design: Define a benchmark of tasks (e.g. navigation, planning, interaction, reasoning) across games and other phone and computer environments.
  2. Evaluation: Run different agents on the benchmark and measure performance quantitatively across tasks (a rough aggregation sketch follows this list).
  3. Presentation: Package results into a clean, structured, and visually strong format suitable for external sharing (e.g. dashboards, videos, reports).
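For the evaluation step, the quantitative side boils down to collapsing raw run results into per-agent, per-task numbers that can be compared and presented. The result schema below is an assumption made for illustration, not a defined format:

```python
from collections import defaultdict


def aggregate(results: list[dict]) -> dict[str, dict[str, float]]:
    """Collapse raw run results into per-agent, per-task mean scores.

    Each result is assumed (hypothetically) to look like:
        {"agent": "agent-a", "task": "navigate-menu", "score": 0.0}
    """
    buckets: dict[tuple[str, str], list[float]] = defaultdict(list)
    for r in results:
        buckets[(r["agent"], r["task"])].append(r["score"])

    table: dict[str, dict[str, float]] = defaultdict(dict)
    for (agent, task), scores in buckets.items():
        table[agent][task] = sum(scores) / len(scores)
    return {agent: dict(tasks) for agent, tasks in table.items()}
```

A table like this feeds directly into the presentation step: dashboards, reports, and side-by-side agent comparisons.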

📋 Requirements

We hire smart and passionate people who are ready to learn fast. None of these requirements are hard constraints if you’re exceptional:

🕛 Timeline

💻 How to apply