Failed to reproduce the AndroidWorld experiment #5

@wrqcodedoge

Hi, thanks for releasing the code.

I tried reproducing the AndroidWorld results using the released codebase directly with Gemini 2.5 Flash, following the reported setup as closely as possible.

My baseline result is close to the reported number (around 40%), but after enabling the world model pipeline the success rate only rises to around 43%. That is a gain of roughly 3 points, far smaller than the reported improvement from 41.38% to 50.86%.
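To put the gap in perspective, here is a rough sketch comparing the two gains against the sampling noise of a single run. The task count `N_TASKS = 116` is an assumption (it is not stated in this issue), the observed rates are the approximate figures quoted above, and the binomial standard error is only a crude stand-in for real run-to-run variance.

```python
import math

# Success rates in percent. The reported values are the ones quoted above;
# the observed values are approximate ("around 40%" and "around 43%").
reported_base, reported_wm = 41.38, 50.86
observed_base, observed_wm = 40.0, 43.0

N_TASKS = 116  # ASSUMPTION: number of tasks per run, not stated in this issue


def gain(base_pct: float, wm_pct: float) -> float:
    """Absolute gain in percentage points from enabling the world model."""
    return wm_pct - base_pct


def binomial_stderr(rate_pct: float, n: int) -> float:
    """Binomial standard error of a success rate, in percentage points."""
    p = rate_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


reported_gain = gain(reported_base, reported_wm)
observed_gain = gain(observed_base, observed_wm)
shortfall = reported_wm - observed_wm
stderr = binomial_stderr(observed_wm, N_TASKS)

print(f"reported gain: {reported_gain:.2f} points")
print(f"observed gain: {observed_gain:.2f} points")
print(f"shortfall vs. reported world model result: {shortfall:.2f} points")
print(f"single-run standard error at N={N_TASKS}: {stderr:.2f} points")
```

If the standard error is comparable to the shortfall, a single run may not be enough to tell a real discrepancy from noise, so averaging over several seeds would make the comparison more meaningful.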

Could you confirm whether the released code is sufficient to reproduce the reported Gemini 2.5 Flash + world model result? Since I am using the public implementation directly and the baseline matches reasonably well, I am wondering whether there are any additional settings, prompts, checkpoints, or evaluation details that were used in the reported experiments but are not included in the repository.

Has anyone else successfully reproduced the reported world model gain using the released code?

Thanks!
