Project Description

At nunu.ai, we build AI agents that can operate apps, games, and software like a human. A core challenge in this space is enabling agents to reliably understand and act in structured environments with minimal supervision.

This project focuses on fine-tuning or adapting Vision-Language Models (VLMs) to efficiently play grid-based mobile games, with a particular emphasis on match-3 and merge-2 mechanics.

These games are deceptively complex: they require spatial reasoning, pattern recognition, planning multiple moves ahead, and adapting to stochastic outcomes. The goal of this project is to push VLMs beyond generic perception into highly efficient, decision-capable agents in constrained environments.

You will explore approaches such as:

Fine-tuning VLMs on gameplay data (which you will need to collect yourself)
Designing efficient action representations for grid interactions
Building frameworks around existing VLMs to improve planning and execution
Optimizing for speed, cost, and reliability
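
To make "action representations for grid interactions" concrete, here is a minimal sketch of one possible encoding for a match-3 board. It assumes the model emits a short text token such as "0,2,D" (row, column, swap direction) instead of raw screen coordinates. The names Swap, parse_swap, and legal_swaps are hypothetical and only illustrate the idea; the project does not prescribe any particular representation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Only "right" and "down" are needed: every adjacent pair is then listed once.
DIRECTIONS = {"R": (0, 1), "D": (1, 0)}


@dataclass(frozen=True)
class Swap:
    """Hypothetical compact action: swap cell (row, col) with a neighbour."""
    row: int
    col: int
    direction: str  # "R" or "D"

    def target(self) -> Tuple[int, int]:
        dr, dc = DIRECTIONS[self.direction]
        return self.row + dr, self.col + dc

    def encode(self) -> str:
        # Short text form a VLM could emit in a handful of tokens.
        return f"{self.row},{self.col},{self.direction}"


def parse_swap(text: str) -> Swap:
    row, col, direction = text.strip().split(",")
    return Swap(int(row), int(col), direction)


def has_match(grid: List[List[str]], r: int, c: int) -> bool:
    """True if cell (r, c) lies in a horizontal or vertical run of 3 or more."""
    rows, cols = len(grid), len(grid[0])
    tile = grid[r][c]
    for dr, dc in ((0, 1), (1, 0)):
        count = 1
        for sign in (1, -1):
            rr, cc = r + sign * dr, c + sign * dc
            while 0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == tile:
                count += 1
                rr, cc = rr + sign * dr, cc + sign * dc
        if count >= 3:
            return True
    return False


def legal_swaps(grid: List[List[str]]) -> List[Swap]:
    """Enumerate swaps that create at least one match (grid is restored)."""
    rows, cols = len(grid), len(grid[0])
    moves = []
    for r in range(rows):
        for c in range(cols):
            for d in DIRECTIONS:
                swap = Swap(r, c, d)
                tr, tc = swap.target()
                if tr >= rows or tc >= cols or grid[r][c] == grid[tr][tc]:
                    continue
                grid[r][c], grid[tr][tc] = grid[tr][tc], grid[r][c]
                ok = has_match(grid, r, c) or has_match(grid, tr, tc)
                grid[r][c], grid[tr][tc] = grid[tr][tc], grid[r][c]
                if ok:
                    moves.append(swap)
    return moves


board = [
    ["A", "A", "B"],
    ["C", "D", "A"],
    ["E", "F", "G"],
]
candidates = [s.encode() for s in legal_swaps(board)]
```

Restricting the model to a list of legal candidates like this is one way to trade planning freedom for reliability; whether it helps in practice is exactly the kind of question the project is meant to explore.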

The end goal is to create a system (or framework) that can consistently and efficiently play these games at a high level, and serve as a foundation for broader agent capabilities.

This is a research + applied engineering project, with room to shape direction depending on your strengths.

🎯 Responsibilities

Model Development: Fine-tune or adapt existing VLMs for grid-based gameplay, focusing on perception, decision-making, and action efficiency.
Framework Design: Build a lightweight system around the model (e.g. state representation, action selection, memory/planning) to improve performance beyond raw inference.
Evaluation: Design experiments to measure performance (e.g. score, level completion, efficiency, cost per run) and benchmark different approaches.
Optimization: Improve latency, cost, and reliability of gameplay. Explore tradeoffs between model size, inference frequency, and planning depth.
Documentation: Clearly document findings, approaches, and results in a way that can be reused internally and potentially shared externally.
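
As a rough illustration of the evaluation metrics mentioned above (score, level completion, efficiency, cost per run), the sketch below aggregates a batch of runs into a comparison table row. The RunResult fields and the summarize function are assumptions made for this example, not an existing nunu.ai interface, and the numbers are made up.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RunResult:
    """Hypothetical record of one gameplay episode."""
    score: int
    completed: bool   # did the agent finish the level?
    moves: int        # number of actions taken (a proxy for efficiency)
    cost_usd: float   # inference cost attributed to this run


def summarize(runs: List[RunResult]) -> Dict[str, float]:
    """Aggregate per-run results into the metrics used to compare approaches."""
    if not runs:
        raise ValueError("no runs to summarize")
    completed = [r for r in runs if r.completed]
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "completion_rate": len(completed) / len(runs),
        "mean_score": mean(r.score for r in runs),
        "mean_moves": mean(r.moves for r in runs),
        "cost_per_run": total_cost / len(runs),
        # Cost per completed level penalizes cheap but unreliable agents.
        "cost_per_completion": total_cost / len(completed) if completed else float("inf"),
    }


# Illustrative values only.
example_runs = [
    RunResult(score=100, completed=True, moves=20, cost_usd=0.10),
    RunResult(score=50, completed=False, moves=30, cost_usd=0.30),
]
summary = summarize(example_runs)
```

Reporting cost per completed level alongside cost per run makes the tradeoff between model size, inference frequency, and reliability easier to see in a single number.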

📋 Requirements

We hire smart and passionate people who are ready to learn fast. None of these requirements are hard constraints if you’re exceptional:

Strong interest in machine learning, especially vision-language models or multimodal systems