Commit fc65875 ("saving progress"), 1 parent: e76e99f

README.md: 140 additions & 2 deletions
@@ -1,4 +1,19 @@
# WorkArena: A Benchmark for Evaluating Agents on Knowledge Work Tasks

[[Benchmark Contents]](#benchmark-contents) [[Getting Started]](#getting-started) [[Live Demo]](#live-demo) [[BrowserGym]](https://github.com/ServiceNow/BrowserGym) [[Citing This Work]](#citing-this-work)

### Papers

* [ICML 2024] WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks? [[Paper]](https://arxiv.org/abs/2403.07718)
* WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks [[Paper]](https://arxiv.org/abs/2407.05291)

`WorkArena` is a suite of browser-based tasks that gauge how effectively web agents can support knowledge workers in their routine work.
By building on the ubiquitous [ServiceNow](https://www.servicenow.com/what-is-servicenow.html) platform, this benchmark assesses the state of such automation in modern knowledge work environments.

WorkArena is included in [BrowserGym](https://github.com/ServiceNow/BrowserGym), a conversational gym environment for the evaluation of web agents.

https://github.com/ServiceNow/WorkArena/assets/2374980/68640f09-7d6f-4eb1-b556-c294a6afef70

## Getting Started

@@ -21,7 +36,7 @@ To setup WorkArena, you will need to get your own ServiceNow instance, install o

Run the following command to install WorkArena in the [BrowserGym](https://github.com/servicenow/browsergym) environment:
```
pip install browsergym
```

Then, install [Playwright](https://github.com/microsoft/playwright):
@@ -36,12 +51,106 @@ workarena-install

Your installation is now complete! 🎉

## Benchmark Contents

At the moment, WorkArena-L1 includes `19,912` unique instances drawn from `33` tasks that cover the main components of the ServiceNow user interface; these are referred to as "atomic" tasks. WorkArena++ contains `682` tasks, each of which samples from thousands of potential configurations.

The following videos show an agent built on `GPT-4-vision` interacting with every atomic component of the benchmark. As our results emphasize, this benchmark is not solved, so the agent's performance is not always on point.
### Knowledge Bases

**Goal:** The agent must search for specific information in the company knowledge base.

_The agent interacts with the user via BrowserGym's conversational interface._

https://github.com/ServiceNow/WorkArena/assets/1726818/352341ba-b501-46ac-bfa6-a6c9be1ac2b7

### Forms

**Goal:** The agent must fill a complex form with specific values for each field.

https://github.com/ServiceNow/WorkArena/assets/1726818/e2c2b5cb-3386-4f3c-b073-c8c619e0e81b

### Service Catalogs

**Goal:** The agent must order items with specific configurations from the company's service catalog.

https://github.com/ServiceNow/WorkArena/assets/1726818/ac64db3b-9abf-4b5f-84a7-e2d9c9cee863

### Lists

**Goal:** The agent must filter a list according to given specifications.

_In this example, the agent struggles to manipulate the UI and fails to create the filter._

https://github.com/ServiceNow/WorkArena/assets/1726818/7538b3ef-d39b-4978-b9ea-8b9e106df28e

### Menus

**Goal:** The agent must navigate to a specific application using the main menu.

https://github.com/ServiceNow/WorkArena/assets/1726818/ca26dfaf-2358-4418-855f-80e482435e6e

### Dashboards

**Goal:** The agent must answer a question that requires reading charts and (optionally) performing simple reasoning over them.

_Note: For demonstration purposes, a human controls the cursor since this is a pure retrieval task._

https://github.com/ServiceNow/WorkArena/assets/1726818/0023232c-081f-4be4-99bd-f60c766e6c3f
## Live Demo

Run this code to see WorkArena in action.

Note: the following examples execute WorkArena's oracle (cheat) function to solve each task. To evaluate an agent, calls to `env.step()` must be used instead.
- To run a demo of WorkArena-L1 (ICML 2024) tasks using BrowserGym, use the following script:

```python
import random
from time import sleep

from browsergym.core.env import BrowserEnv
from browsergym.workarena import ALL_WORKARENA_TASKS

random.shuffle(ALL_WORKARENA_TASKS)
for task in ALL_WORKARENA_TASKS:
    print("Task:", task)

    # Instantiate a new environment
    env = BrowserEnv(task_entrypoint=task, headless=False)
    env.reset()

    # Cheat functions use Playwright to automatically solve the task
    env.chat.add_message(role="assistant", msg="On it. Please wait...")
    cheat_messages = []
    env.task.cheat(env.page, cheat_messages)

    # Send the cheat messages to the chat
    for cheat_msg in cheat_messages:
        env.chat.add_message(role=cheat_msg["role"], msg=cheat_msg["message"])

    # Post the solution to the chat
    env.chat.add_message(role="assistant", msg="I'm done!")

    # Validate the solution
    reward, stop, message, info = env.task.validate(env.page, cheat_messages)
    if reward == 1:
        env.chat.add_message(role="user", msg="Yes, that works. Thanks!")
    else:
        env.chat.add_message(role="user", msg=f"No, that doesn't work. {info.get('message', '')}")

    sleep(3)
    env.close()
```
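To evaluate an actual agent rather than the oracle, the cheat calls are replaced by an `env.step()` loop. The sketch below shows the shape of such a loop under the Gymnasium-style 5-tuple `step()` convention; `DummyEnv` and the no-op agent are illustrative stand-ins, not part of the WorkArena or BrowserGym API.

```python
# Sketch of an agent-evaluation loop. Gymnasium-style environments return
# (obs, reward, terminated, truncated, info) from step(). DummyEnv below is
# a stand-in so the loop can run without a live ServiceNow instance.

class DummyEnv:
    """Minimal mock exposing the reset()/step() surface the loop relies on."""

    def __init__(self, steps_to_solve=3):
        self.steps_to_solve = steps_to_solve
        self.t = 0

    def reset(self):
        self.t = 0
        return {"goal": "example task"}, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.steps_to_solve
        reward = 1.0 if terminated else 0.0
        return {"goal": "example task"}, reward, terminated, False, {}


def run_episode(env, agent, max_steps=10):
    """Run one episode and return the cumulative reward."""
    obs, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent(obs)  # the agent maps observations to actions
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward


def noop_agent(obs):
    """Trivial agent that always emits a no-op action."""
    return "noop()"


print(run_episode(DummyEnv(), noop_agent))  # prints 1.0
```

With a real `BrowserEnv`, only the environment object changes; the loop itself keeps the same shape.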
- To run a demo of WorkArena-L2 (WorkArena++) tasks using BrowserGym, use the following script. Change the filter on line 6 to `l3` to sample L3 tasks.

```python
import random

@@ -80,3 +189,32 @@ for (task, seed) in zip(AGENT_L2_SAMPLED_TASKS, AGENT_L2_SEEDS):
    sleep(3)
    env.close()
```
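Each (task, seed) pair in the L2/L3 scripts deterministically selects one of the thousands of potential configurations a task can sample. A toy sketch of that idea, with a made-up task name and parameter space (not WorkArena's actual configuration schema):

```python
import random

def sample_instance(task_name: str, seed: int) -> dict:
    """Deterministically sample a task configuration from a seed (illustrative only)."""
    rng = random.Random(f"{task_name}:{seed}")  # string seeds are reproducible across runs
    return {
        "task": task_name,
        "priority": rng.choice(["low", "medium", "high"]),
        "num_form_fields": rng.randint(3, 10),
    }

# The same (task, seed) pair always yields the same instance
a = sample_instance("order_standard_laptop", seed=42)
b = sample_instance("order_standard_laptop", seed=42)
assert a == b
```

Seeded sampling like this is what makes benchmark runs reproducible while still covering a large instance space.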
## Citing This Work

Please use the following BibTeX to cite our work:

```
@misc{workarena2024,
      title={WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?},
      author={Alexandre Drouin and Maxime Gasse and Massimo Caccia and Issam H. Laradji and Manuel Del Verme and Tom Marty and Léo Boisvert and Megh Thakkar and Quentin Cappart and David Vazquez and Nicolas Chapados and Alexandre Lacoste},
      year={2024},
      eprint={2403.07718},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2403.07718},
}
```

```
@misc{boisvert2024workarenacompositionalplanningreasoningbased,
      title={WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks},
      author={Léo Boisvert and Megh Thakkar and Maxime Gasse and Massimo Caccia and Thibault Le Sellier De Chezelles and Quentin Cappart and Nicolas Chapados and Alexandre Lacoste and Alexandre Drouin},
      year={2024},
      eprint={2407.05291},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.05291},
}
```
