Project Description

At nunu.ai, we develop AI agents that can play and test any app, game, or other software. To do this, our agents need to handle complex test planning, multi-step interactions, edge case discovery, and thorough verification across different application types and platforms.

We are building a comprehensive benchmark to evaluate these capabilities across computer use and phone use, with testing as the core example use case. The goal is to create a thorough, diverse, and publicly recognized benchmark that captures how capable different agents actually are.

You will create an externally facing benchmark (and expand our internal one) that becomes the go-to reference for evaluating AI testing agents. This includes designing test suites with carefully crafted sandbox applications, running evaluations against multiple agents, and presenting results in a way that is both quantitatively rigorous and visually compelling.

This is not just an internal tool: we plan to publish and showcase this benchmark broadly, allowing customers to directly compare the performance and pricing of our agents. Clarity, quality, and presentation are therefore critical.

We do not want manual verification or scoring. Every level must include an automated verification system (e.g. checkpoints, scoring, or state validation).
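
To make the idea of automated verification concrete, here is a minimal sketch of checkpoint-based state validation for a single level. It assumes the sandbox application can export its final state as a dictionary. The names `Checkpoint` and `verify_level`, the state keys, and the example values are hypothetical and only illustrate the approach.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical names throughout: Checkpoint, verify_level, and the state keys
# are illustrative assumptions, not an existing nunu.ai API.


@dataclass(frozen=True)
class Checkpoint:
    name: str
    check: Callable[[dict], bool]  # inspects the sandbox app's exported state
    points: int = 1


def verify_level(state: dict, checkpoints: list[Checkpoint]) -> dict:
    """Score one level from application state alone, with no human in the loop."""
    passed = [c.name for c in checkpoints if c.check(state)]
    earned = sum(c.points for c in checkpoints if c.name in passed)
    total = sum(c.points for c in checkpoints)
    return {"passed": passed, "score": earned / total if total else 0.0}


# Example level: the agent should add an item to the cart and report a seeded bug.
checkpoints = [
    Checkpoint("item_in_cart", lambda s: "sku-123" in s.get("cart", [])),
    Checkpoint(
        "bug_reported",
        lambda s: "negative-quantity" in s.get("reported_bugs", []),
        points=2,
    ),
]
final_state = {"cart": ["sku-123"], "reported_bugs": []}
result = verify_level(final_state, checkpoints)
```

In this sketch the agent gets partial credit for the cart checkpoint but none for the missed bug, so the score reflects how far it got rather than a plain pass or fail.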

🎯 Responsibilities

  1. Design: Define a benchmark built on fully featured test applications with realistic complexity, intentionally embedded bugs, and design documentation.
  2. Evaluation: Run different agents on the benchmark and measure performance quantitatively.
  3. Presentation: Package results into a clean, structured, and visually strong format suitable for external sharing (e.g. dashboards, videos, reports).
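
As a rough illustration of the evaluation step, the sketch below aggregates per-level results into a per-agent summary of mean score and total cost. The record layout, agent names, level names, and numbers are all made up for illustration. A real harness would produce these records from actual runs.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: agent names, level ids, scores, and costs are
# invented for illustration and are not real benchmark results.
runs = [
    {"agent": "agent_a", "level": "checkout", "score": 1.0, "cost_usd": 0.40},
    {"agent": "agent_a", "level": "settings", "score": 0.5, "cost_usd": 0.20},
    {"agent": "agent_b", "level": "checkout", "score": 0.0, "cost_usd": 0.10},
    {"agent": "agent_b", "level": "settings", "score": 1.0, "cost_usd": 0.30},
]


def summarize(runs: list[dict]) -> dict:
    """Group run records by agent and compute mean score and total cost."""
    by_agent = defaultdict(list)
    for record in runs:
        by_agent[record["agent"]].append(record)
    return {
        agent: {
            "mean_score": mean(r["score"] for r in records),
            "total_cost_usd": round(sum(r["cost_usd"] for r in records), 2),
        }
        for agent, records in by_agent.items()
    }


summary = summarize(runs)
```

A summary in this shape can feed a dashboard or report directly, and keeping cost next to score supports the performance and pricing comparison described above.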

📋 Requirements

We hire smart and passionate people who are ready to learn fast. None of these requirements are hard constraints if you’re exceptional:

🕛 Timeline