Unable to reproduce the AndroidWorld experiment #5

Hi, thanks for releasing the code.

I tried reproducing the AndroidWorld results using the released codebase directly with Gemini 2.5 Flash, following the reported setup as closely as possible.

My baseline result is close to the reported number (around 40% versus the reported 41.38%), but after enabling the world model pipeline I only reach around 43%. That is a gain of roughly 3 percentage points, much smaller than the reported gain of 9.48 points (from 41.38% to 50.86%).
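To put the gap in terms of tasks, here is a rough sketch of the arithmetic. It assumes the evaluation covers 116 tasks, which is my inference from the reported percentages (48/116 = 41.38%, 59/116 = 50.86%) and not something I verified against the repository. The standard error uses a simple binomial approximation for a single run, so it is only a ballpark for run-to-run noise.

```python
import math

# Assumption: 116 evaluation tasks, inferred from the reported percentages.
N_TASKS = 116

def successes(rate_pct: float, n: int = N_TASKS) -> int:
    """Convert a success rate in percent to the nearest whole task count."""
    return round(rate_pct / 100 * n)

def std_err_pct(rate_pct: float, n: int = N_TASKS) -> float:
    """Binomial standard error of a success rate, in percentage points."""
    p = rate_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)

reported_base, reported_wm = 41.38, 50.86

# Reported gain expressed in whole tasks under the 116-task assumption.
reported_gain_tasks = successes(reported_wm) - successes(reported_base)

# My runs are only approximate ("around 40%" and "around 43%"),
# so the gain is kept as a fractional estimate.
my_gain_tasks = (43.0 - 40.0) / 100 * N_TASKS

print(f"reported gain: {reported_gain_tasks} tasks")
print(f"my gain: about {my_gain_tasks:.1f} tasks")
print(f"single-run std. error near the baseline: {std_err_pct(reported_base):.2f} points")
```

Under these assumptions, the gain I observe is within about one standard error of a single run, while the reported gain is roughly twice that, so knowing how many runs or seeds the reported numbers were averaged over would also help.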

Could you confirm whether the released code is sufficient to reproduce the reported Gemini 2.5 Flash + world model result? Since I am using the public implementation directly and the baseline matches reasonably well, I am wondering whether there are any additional settings, prompts, checkpoints, or evaluation details that were used in the reported experiments but are not included in the repository.

Has anyone else successfully reproduced the reported world model gain using the released code?

Thanks!
