Hi, thanks for releasing the code.
I tried reproducing the AndroidWorld results using the released codebase directly with Gemini 2.5 Flash, following the reported setup as closely as possible.
My baseline result is close to the reported number (around 40% vs. the reported 41.38%), but enabling the world model pipeline only raises it to around 43%, well short of the reported 50.86%.
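To put the gap in task terms, here is a minimal sketch that converts the success rates into task counts and computes a 95% Wilson interval for each. It assumes one trial per task over the 116-task AndroidWorld suite, which may not match the setup used for the reported numbers; the helper names are my own, and the 43% entry is my approximate figure rather than an exact count.

```python
import math

# Assumption: one trial per task over the 116-task AndroidWorld suite.
N_TASKS = 116


def successes(rate_percent, n=N_TASKS):
    """Convert a success rate in percent into the nearest whole task count."""
    return round(rate_percent / 100 * n)


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes out of n trials (z=1.96 ~ 95%)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


runs = [
    ("reported baseline", 41.38),
    ("reported + world model", 50.86),
    ("my run + world model (approximate)", 43.0),
]

for label, rate in runs:
    k = successes(rate)
    lo, hi = wilson_interval(k, N_TASKS)
    print(f"{label}: {k}/{N_TASKS} tasks, 95% interval {lo:.1%} to {hi:.1%}")
```

Under that assumption the reported gain corresponds to 11 additional solved tasks (48 to 59), and a single run has an interval of roughly nine points on either side, so knowing how many trials or seeds the reported numbers average over would help me judge whether my result is within run-to-run variance.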
Could you confirm whether the released code is sufficient to reproduce the reported Gemini 2.5 Flash + world model result? Since I am using the public implementation directly and the baseline matches reasonably well, I am wondering whether there are any additional settings, prompts, checkpoints, or evaluation details that were used in the reported experiments but are not included in the repository.
Has anyone else successfully reproduced the reported world model gain using the released code?
Thanks!