Do Multimodal Foundation Models Understand Enterprise Workflows? A Benchmark for Business Process Management Tasks (2024)


Michael Wornow  Avanika Narayan
Ben Viggiano  Ishan S. Khare  Tathagat Verma
Tibor Thompson  Miguel Angel Fuentes Hernandez  Sudharsan Sundar
Chloe Trujillo  Krrish Chawla  Rongfei Lu  Justin Shen
Divya Nagaraj  Joshua Martinez  Vardhan Agrawal  Althea Hudson
Nigam H. Shah  Christopher Ré
Stanford University

Abstract

Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating models on business process management (BPM) tasks. BPM is the practice of documenting, measuring, improving, and automating enterprise workflows. However, research has focused almost exclusively on one task – full end-to-end automation using agents based on multimodal foundation models (FMs) like GPT-4. This focus on automation ignores the reality of how most BPM tools are applied today – simply documenting the relevant workflow takes 60% of the time of the typical process optimization project. To address this gap we present WONDERBREAD, the first benchmark for evaluating multimodal FMs on BPM tasks beyond automation. Our contributions are: (1) a dataset containing 2928 documented workflow demonstrations; (2) 6 novel BPM tasks sourced from real-world applications ranging from workflow documentation to knowledge transfer to process improvement; and (3) an automated evaluation harness. Our benchmark shows that while state-of-the-art FMs can automatically generate documentation (e.g. recalling 88% of the steps taken in a video demonstration of a workflow), they struggle to re-apply that knowledge towards finer-grained validation of workflow completion (F1 < 0.3). We hope WONDERBREAD encourages the development of more “human-centered” AI tooling for enterprise applications and furthers the exploration of multimodal FMs for the broader universe of BPM tasks. We publish our dataset and experiments here: https://github.com/HazyResearch/wonderbread.

1 Introduction

[Figure 1]

Multimodal foundation models (FMs) such as GPT-4 [40] have the potential to revolutionize business process management (BPM), which is the discipline of measuring and improving enterprise workflows – e.g. a physician submitting a medication order. Typical BPM projects progress in four stages across the following BPM tasks: (1) Documentation – mapping the steps of an existing workflow; (2) Knowledge Transfer – ensuring a shared understanding of the documented workflow; (3) Improvement – identifying workflow inefficiencies and proposing fixes; and (4) Automation – writing software to execute the workflow without human involvement [49, 53]. FMs could be well-suited for these tasks due to their robust reasoning [65, 57, 2] and visual [8, 54, 67] understanding skills.

However, existing ML benchmarks [69, 61, 17, 60] focus almost exclusively on one BPM task: end-to-end workflow automation using agents based on multimodal FMs (see Table 1). This is despite the fact that simply defining the relevant workflow takes 60% of the time of the typical BPM project [23], and the BPM market is 4x larger than that of automation tools [47, 48, 27, 28].

By ignoring the most time-consuming aspects of BPM projects, we overlook key opportunities to provide near-term value to enterprises. Several case studies have applied multimodal FMs to these broader BPM tasks and demonstrated better performance, easier set-up, and simpler maintenance than traditional BPM tools such as process mining [20, 49, 18, 53, 9, 24, 39]. While promising, however, these papers were largely anecdotal with small datasets (<50 examples). This motivates the creation of a large-scale benchmark and dataset specifically for BPM tasks.

Unfortunately, no such dataset exists, and current benchmarks designed around workflow automation cannot be readily repurposed due to several limitations. First, their datasets either lack human demonstrations of workflows [69, 17] or do not contain sufficient annotation detail for BPM tasks [61, 15, 35, 30] – e.g. evaluating a model’s ability to document a workflow requires reference documentation. Second, their evaluations typically only measure end-to-end workflow completion rates [69, 64, 17, 61] and thus do not consider the intermediate reasoning required for BPM tasks such as identifying inefficiencies within a successfully completed workflow. Third, they do not model real-world BPM use cases and instead focus on navigating websites or mobile apps – i.e. they are focused on workflow execution rather than understanding [15, 44, 64, 16, 30, 69, 33, 17, 66, 61, 35, 30].

Motivated by the overlooked potential for using multimodal FMs on a broader suite of BPM tasks, we thus introduce WONDERBREAD, a WOrkflow uNDERstanding BenchmaRk, EvAluation harness, and Dataset. Our contributions are as follows:

  1. Dataset: We publish 2928 human demonstrations across 598 previously unannotated workflows sourced from the WebArena benchmark [69]. Each workflow has an average of 4.9 independently collected demonstrations, and each demonstration contains a full screen recording, an event log of all clicks/keystrokes/scrolls, and a manually written standard operating procedure (“SOP”) – i.e. a step-by-step written guide which reflects the annotator’s reasoning at each step of the workflow. For a subset of 162 workflows, annotators also ranked all 5 demonstrations in order of perceived quality.

  2. Tasks: Based on use cases drawn from the BPM literature around (1) Documentation, (2) Knowledge Transfer, and (3) Improvement, we define 6 novel BPM tasks which require reasoning over multimodal data.

     (a) Documentation: Generate standard operating procedures (i.e. synthesize the steps of a workflow in writing) to fulfill quality control and audit requirements [5, 59].

     (b) Knowledge Transfer: Answer user queries about how workflows operate to simplify onboarding and reduce the 5.3 hours per week that knowledge workers spend waiting for information from colleagues [42].

     (c) Improvement: Analyze workflows to identify inefficiencies and correct execution errors [19, 51].

  3. Evaluation: We offer evaluation pipelines using automated metrics (e.g., F1, accuracy) and LLM-based evaluators with high correlation to human raters (ρ > 0.8). By focusing on intermediate workflow steps, these evaluations provide a more comprehensive and transparent assessment of models than end-to-end workflow completion rates.

Results. We provide baseline results for three state-of-the-art multimodal FMs — GPT-4 [40], Claude 3 [4], and Gemini Pro [50]. Based on screen recordings, we find that models can generate accurate written documentation (F1 of 0.82) and determine whether a demonstration successfully achieved its desired goal (F1 of 0.90). While promising, increasing these numbers to enterprise-level accuracy (i.e. 0.99+) remains an open research challenge. We also identify more significant performance gaps. Models struggle with low-level error correction — for example, when prompted to classify whether a demonstration exactly followed a specific sequence of steps, the peak F1 achieved is 0.27. Models also score poorly when ranking multiple demonstrations of the same workflow on perceived quality and efficiency. We identify long context reasoning, lower-level process understanding, and human workflow preference alignment as key areas for future research.

Our dataset and code are available at our GitHub repo: https://github.com/HazyResearch/wonderbread

2 Background

We summarize traditional process mining approaches for BPM tasks, discuss recent work on applying multimodal FMs, and compare WONDERBREAD to existing multimodal FM benchmarks.

2.1 Process Mining

Process mining is the de facto tool currently used for most BPM tasks, acting as an organizational “X-Ray” [46] that enables large enterprises to identify, measure, and improve their workflows [52, 46, 6]. Techniques include statistical analysis of event logs, unsupervised machine learning, manual review of screen recordings, and user interviews [34, 46]. While interviews can provide an accurate picture of a workflow, they are costly and time-consuming; automated process mining tools are faster but significantly less accurate [1, 34]. Bridging the “semantic gap” between machine and human workflow understanding is an ongoing challenge [38, 34, 1] that WONDERBREAD aims to address.

2.2 Multimodal FMs

Foundation models (FMs) are large-scale ML models trained on vast datasets of unlabeled data which can be applied to a broad range of tasks with minimal adaptation [11]. Multimodal FMs such as GPT-4 combine natural language understanding with a vision model to process images and text jointly [67]. These models have shown promise in navigating graphical user interfaces and executing simple workflows [15, 63, 25, 26, 66, 60]. While the use of multimodal FMs for BPM tasks has been advocated [49], it has not yet been implemented. A failure mode of text-only FMs is the lack of an ability to “read between the lines” of human-generated textual summaries of workflows – e.g. when creating a process model from text, GPT-4 misses half the steps that a human would include [32, 24]. This motivates having multimodal FMs directly observe workflows, as done in our benchmark.

2.3 Benchmarks

A number of multimodal datasets have been published for end-to-end automation of websites [69, 17], mobile apps [44], and desktop applications [60, 61]. However, these datasets do not include step-by-step written guides (SOPs), nor do they evaluate on BPM tasks such as documentation, knowledge transfer, or process improvement [15, 44, 64, 16, 30, 69, 33, 17, 66, 61, 35, 30]. Several works have applied large language models to BPM tasks [20, 49, 18, 53, 9, 24, 39], but they conduct limited case studies (i.e. dozens of examples), rely on manual human evaluation, and do not consider multimodal inputs like screen recordings. Please see Table 1 for a detailed comparison with prior benchmarks.

Table 1: Comparison of WONDERBREAD with prior benchmarks (Env Type: W = web, M = mobile, D = desktop).

Benchmark       | # Tasks | # Envs | Env Type | Demos/Task
AITW [44]       | 30,378  | 357    | M        | 23.5
Mind2Web [16]   | 2,350   | 137    | W        | 1
MoTIF [13]      | 6,100   | 125    | M        | 0.77
WebArena [69]   | 812     | 4      | W        | 0.22
OmniAct [30]    | 9,802   | 65     | D + W    | 1
WebShop [64]    | 12,087  | 1      | W        | 0.13
VWA [33]        | 910     | 3      | W        | 0
WorkArena [17]  | 23,150  | 5      | W        | 0
WebLINX [35]    | 2,337   | 155    | W        | 1
OSWorld [61]    | 369     | 13     | D + W    | 1
Wonderbread     | 598     | 4      | W        | 4.9

3 Dataset

WONDERBREAD includes 2928 human demonstrations across 598 distinct workflows. Each demonstration contains: (1) Intent – a short natural language description of the workflow’s goal; (2) Recording – a full screen recording of the annotator performing the workflow; (3) Action Trace – a log of all actions taken (clicks, keystrokes, scrolls) and webpage states before/after each action; (4) Key Frames – images taken from the Recording at each action’s timestamp; and (5) SOP – a written guide detailing all of the steps taken by the annotator. Please see Table 2 for complete definitions. Each workflow has demonstrations from at least 4 annotators to reflect the diversity of work habits present in an enterprise. The full dataset collection process is illustrated in Figure 2.
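To make the dataset layout concrete, the sketch below shows one way a single demonstration folder could be loaded in Python. The file names (intent.txt, SOP.txt, trace.json) are illustrative assumptions rather than the released repository's exact layout; the fields simply mirror the five components listed above.

```python
import json
from pathlib import Path

def load_demonstration(demo_dir: str) -> dict:
    """Load one demonstration folder into a dict (file names are illustrative)."""
    root = Path(demo_dir)
    return {
        # Short natural-language goal, e.g. "Cancel my last order"
        "intent": (root / "intent.txt").read_text().strip(),
        # Step-by-step written guide produced by the annotator
        "sop": (root / "SOP.txt").read_text(),
        # Log of clicks/keystrokes/scrolls with page state before/after each action
        "trace": json.loads((root / "trace.json").read_text()),
        # Key frames extracted at each action's timestamp
        "key_frames": sorted(root.glob("*.png")),
        # Full screen recording
        "recording": next(root.glob("*.mp4"), None),
    }
```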

[Figure 2]

We start with WebArena, a benchmark containing 812 workflows that require an agent to navigate open-source clones of an e-commerce, content management, forum, and developer tool website [69]. We filter this to 598 workflows by excluding workflows deemed impossible or inadequately specified.

We recruited 13 annotators to record themselves completing each workflow using a custom Python script. Existing workflow benchmarks often have low-quality demonstrations or inaccurate annotations [58]; a key contribution of WONDERBREAD is thus the high quality of its demonstrations, achieved through several months of quality assurance. More details are provided in Appendix A.2.

In addition to demonstrations, we also curated 120 free response question-answer pairs to simulate inquiries that a BPM consultant might ask of a workflow. Examples are listed in Appendix A.4.

4 Benchmark

Table 2: Definitions of terms used throughout this paper.

Term          | Definition                                                                                                                                          | File Format
Task          | One of the 6 evaluation tasks in our benchmark, as detailed in Section 4.                                                                           |
Workflow      | A sequence of actions taken to complete a specific business goal. Also referred to as a process. A single workflow can have multiple demonstrations. |
Demonstration | A single execution of a workflow. Each demonstration contains an Intent, Recording, Action Trace, Key Frames, and SOP.                              | Folder
Intent        | A brief natural language specification of a workflow’s goal, e.g. "Cancel my last order".                                                           | .txt
Recording     | A video containing a full recording of the user’s screen.                                                                                           | .mp4
Action Trace  | A log of all click, keystroke, and scroll actions (including associated elements and coordinates).                                                  | .json
Key Frames    | Images taken from a Recording that are synced to events in the Action Trace.                                                                        | .png(s)
SOP           | A “Standard Operating Procedure” detailing (in writing) all of the steps taken in a demonstration.                                                  | .txt

WONDERBREAD contains 6 tasks which cover three BPM applications not evaluated in prior benchmarks: automatically generating documentation from workflow demonstrations (Documentation), facilitating knowledge transfer (Knowledge Transfer), and identifying ways to improve inefficient workflows (Improvement). We provide a summary of each task below. Further details on the inputs, outputs, and evaluations are in Appendix B.

[Figure 3]

4.1 Documentation

Creating clear documentation of complex workflows is essential for operational continuity, compliance, and accountability [59, 5]. This can be achieved through Standard Operating Procedures (“SOP”), Process Definition Documents (“PDD”), or process maps. Our two documentation tasks – SOP Generation and Demo Segmentation – evaluate a model’s ability to generate SOPs and accurately distill video recordings into discrete workflows.

(A) SOP Generation. Evaluation involves using GPT-4 to compare the generated SOP to an annotator-generated reference SOP, calculating precision (how many steps in the generated SOP are in the reference) and recall (how many steps in the reference are in the generated SOP). Each SOP step is evaluated atomically by GPT-4 for semantic equivalence. Details are in Appendix Section C.2.
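As a rough illustration of this scoring scheme, the sketch below computes step-level precision, recall, and F1 once a judge has decided which steps match. Here judge_match is a hypothetical stand-in for the GPT-4 semantic-equivalence check and is not part of the released harness.

```python
def sop_generation_scores(generated: list[str], reference: list[str], judge_match) -> dict:
    """Step-level precision/recall/F1. `judge_match(step, other_steps) -> bool` is a
    placeholder for the GPT-4 call that decides semantic equivalence of SOP steps."""
    # Precision: fraction of generated steps that also appear in the reference SOP
    tp_generated = sum(judge_match(step, reference) for step in generated)
    precision = tp_generated / len(generated) if generated else 0.0
    # Recall: fraction of reference steps that also appear in the generated SOP
    tp_reference = sum(judge_match(step, generated) for step in reference)
    recall = tp_reference / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```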

(B) Demo Segmentation. We concatenate multiple workflow demonstrations into a single video and provide it to the model, which identifies the start and end of each workflow. This tests the model’s ability to distinguish between sequential workflows. For evaluation, we calculate the adjusted rand index based on the model’s assignment of each video frame to a workflow.
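The adjusted rand index itself can be computed with scikit-learn; a minimal example with made-up frame labels:

```python
from sklearn.metrics import adjusted_rand_score

# Ground-truth workflow id for each frame of the concatenated video (illustrative)
true_labels = [0, 0, 0, 1, 1, 2, 2, 2]
# Model's assignment of each frame to a workflow segment
pred_labels = [0, 0, 1, 1, 1, 2, 2, 2]

ari = adjusted_rand_score(true_labels, pred_labels)  # 1.0 would mean perfect segmentation
print(f"Adjusted Rand Index: {ari:.2f}")
```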

[Figure 4]

4.2 Knowledge Transfer

The sharing of skills, know-how, and best practices within large organizations can be challenging [42]. By learning from workflow demonstrations, FMs could serve as a query-able repository of organizational knowledge for existing employees, and accelerate on-boarding of new hires by more quickly disseminating key information to trainees [22]. Our two Knowledge Transfer tasks – Question Answering and Demo Validation – assess whether a model can perform higher-level reasoning about the properties and correctness of a workflow.

(A) Question Answering. For questions about workflow demonstrations, the model generates a natural language answer, assessing its understanding of workflow semantics. We use GPT-4 to compare the generated answer to a reference answer for evaluation.

(B) Demo Validation. Given a demonstration, we predict whether (a) the workflow was successfully completed, or (b) the workflow followed the SOP exactly, with individual steps matching precisely. Since each demonstration in WONDERBREAD is “correct” by definition, we create synthetic negative examples by truncating recordings and shuffling frames. These binary classification tasks assess a model’s ability to self-monitor and error-correct.
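A minimal sketch of how such synthetic negatives could be constructed from a list of key frames is shown below; the corruption modes mirror the truncate/shuffle/skip perturbations described here, but the exact parameters used in the benchmark may differ.

```python
import random

def make_negative(frames: list, mode: str = "truncate", seed: int = 0) -> list:
    """Corrupt a 'correct' demonstration to create a synthetic negative example."""
    rng = random.Random(seed)
    frames = list(frames)
    if mode == "truncate":
        # Drop the tail of the demonstration so the goal is never reached
        return frames[: max(1, len(frames) // 2)]
    if mode == "shuffle":
        # Permute the frames so the trajectory no longer follows the SOP
        rng.shuffle(frames)
        return frames
    if mode == "skip":
        # Remove a random intermediate state
        drop = rng.randrange(1, max(2, len(frames) - 1))
        return frames[:drop] + frames[drop + 1:]
    raise ValueError(f"unknown mode: {mode}")
```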

[Figure 5]

4.3 Improvement

The ability to continuously refine and enhance the workflows of an organization is crucial for reducing costs and staying ahead of competitors [19]. By focusing on the improvement of demonstrations and SOPs, we highlight the role of iterative learning and optimization in driving the evolution of workflows [51]. Our two Improvement tasks – SOP Ranking and SOP Improvement – evaluate whether a model can identify workflow inefficiencies and improve inaccurate documentation.

(A) SOP Ranking. The same end goal can often be achieved via many different sequences of actions. However, some sequences may be preferable to others as they are more efficient, robust, or avoid intermediate steps that could have undesirable side effects. Given a set of SOPs written by different annotators for the same workflow, this task requires the model to rank them in order of quality. This assesses a model’s alignment with human perception of workflow quality. For evaluation, we measure the Kendall τ correlation between the generated ranking and a human annotator’s ranking.
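For reference, the Kendall τ between a model ranking and a human ranking can be computed with scipy; the rankings below are made up purely for illustration.

```python
from scipy.stats import kendalltau

# Human annotator's ranking of 5 SOPs (1 = best) and the model's predicted ranking
human_rank = [1, 2, 3, 4, 5]
model_rank = [2, 1, 3, 5, 4]

tau, p_value = kendalltau(human_rank, model_rank)
print(f"Kendall tau = {tau:.2f} (1.0 = identical ordering, ~0 = random ordering)")
```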

(B) SOP Improvement. Given a demonstration and a low-quality SOP, the model must generate an improved SOP that better aligns with the demonstration. The model iterates on the SOP up to a specified depth, assessing its ability to assist humans in documenting workflows. GPT-4 evaluates the generated SOPs against a reference “gold” SOP.
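A hedged sketch of this iterative refinement loop is shown below; call_model is a placeholder for a multimodal FM call, and the prompt wording is illustrative rather than the benchmark's actual prompt.

```python
def improve_sop(demo: dict, sop: str, rubric: str, call_model, depth: int = 3) -> str:
    """Iteratively refine an SOP against a demonstration and a quality rubric.
    `call_model(prompt, images=...)` is a stand-in for a multimodal FM (e.g. GPT-4)."""
    current = sop
    for _ in range(depth):
        prompt = (
            f"Intent: {demo['intent']}\n"
            f"Rubric:\n{rubric}\n"
            f"Current SOP:\n{current}\n"
            "Rewrite the SOP so it better matches the demonstration and the rubric."
        )
        current = call_model(prompt, images=demo["key_frames"])
    return current
```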

4.4 Evaluation

We use programmatic metrics and LLM-based raters for our evaluations. Tasks involving clustering, classification, or ranking use metrics like adjusted rand index, F1, and correlation, respectively. Natural language tasks are evaluated using GPT-4-as-a-judge to assess the quality of generated text [14, 68]. Please see the task table in Appendix B for the specific metrics per task. Our LLM-based evaluations show high correlation with human raters (ρ > 0.8) (see Appendix Tables 8 and 9).
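As an illustration of how an LLM-based rater can be validated against human raters, the snippet below computes a Pearson correlation over a handful of made-up scores with scipy.stats.pearsonr; the benchmark's actual validation uses the human ratings reported in Appendix Tables 8 and 9.

```python
from scipy.stats import pearsonr

# Scores assigned to the same answers by GPT-4-as-a-judge and by a human rater (illustrative)
judge_scores = [3, 2, 3, 1, 2, 3, 2, 1]
human_scores = [3, 2, 2, 1, 2, 3, 3, 1]

rho, p_value = pearsonr(judge_scores, human_scores)
print(f"Judge/human correlation: rho = {rho:.2f} (the benchmark reports rho > 0.8)")
```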

5 Results

Our initial results show that current multimodal FMs, including GPT-4, Gemini, and Claude, excel at reasoning over short demonstration lengths but struggle with longer workflows. Our zero-shot evaluations focus on the out-of-the-box capabilities of these models across 162 workflows with rankings. Some models were excluded from specific tasks due to API budget and quota limitations.

5.1 Documentation

SOP Generation. Description: A model must generate a SOP that summarizes all of the actions taken in a video recording of a workflow. We ablate over different demonstration formats: only intent; intent with key frame screenshots; and intent with key frames plus a textual action log of clicks and keystrokes. Results: As shown in Table 3, GPT-4 performs best (F1-score of 0.82) with intent, keyframes, and action trace. Most model-demonstration pairs have higher recall than precision (avg. 0.06 points), indicating a tendency to hallucinate workflow steps. Upon qualitative review, we found that many hallucinated actions seemed reasonable but were not actually taken in the demonstration, e.g. adding “Navigate to the shopping admin page” even though the demonstration started on that page. Exact scores for each workflow and model are in Appendix Figure 9.

Table 3: SOP Generation results.

Model           | Inputs                     | Precision | Recall | F1   | Avg. # of Steps
GPT-4           | Intent + Keyframes + Trace | 0.80      | 0.88   | 0.82 | 10.26
GPT-4           | Intent + Keyframes         | 0.69      | 0.79   | 0.71 | 10.32
GPT-4           | Intent only                | 0.48      | 0.59   | 0.49 | 13.10
Claude 3 Sonnet | Intent + Keyframes + Trace | 0.72      | 0.85   | 0.76 | 10.94
Claude 3 Sonnet | Intent + Keyframes         | 0.67      | 0.78   | 0.70 | 11.35
Claude 3 Sonnet | Intent only                | 0.53      | 0.54   | 0.50 | 11.34
Gemini Pro 1    | Intent + Keyframes + Trace | 0.58      | 0.63   | 0.58 | 11.09
Gemini Pro 1    | Intent + Keyframes         | 0.48      | 0.51   | 0.46 | 11.28
Gemini Pro 1    | Intent only                | 0.40      | 0.36   | 0.34 | 7.31
Ground Truth    | -                          | 1         | 1      | 1    | 8.40
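The sketch below illustrates how the three ablation inputs (intent only; intent + key frames; intent + key frames + action trace) might be assembled into a prompt; the field names and prompt wording are assumptions made for illustration, not the benchmark's actual prompting code.

```python
def build_sop_prompt(demo: dict, use_keyframes: bool, use_trace: bool) -> dict:
    """Assemble one of the three ablation inputs for SOP Generation (illustrative format)."""
    text = f"Workflow intent: {demo['intent']}\n"
    if use_trace:
        # Serialize the click/keystroke/scroll log into plain text
        actions = "\n".join(
            f"{i + 1}. {action['type']}: {action.get('element', '')}"
            for i, action in enumerate(demo["trace"])
        )
        text += f"Action trace:\n{actions}\n"
    text += "Write a step-by-step SOP for this workflow."
    images = demo["key_frames"] if use_keyframes else []
    return {"text": text, "images": images}
```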

Demo Segmentation. Description: This task mimics what a video recording of a person’s screen would capture during the typical workday, i.e. multiple workflows without clear boundaries. Concretely, the model receives k concatenated demonstrations sampled from different workflows from our dataset, and must determine which frames belong to the same workflow. We set k = 3 and choose workflows that utilize the same website. Results: As shown in Table 4, segmenting a recording remains challenging. Providing additional information to the model in the form of an SOP and intent slightly increases performance for GPT-4 yet decreases performance for Gemini Pro 1.

Table 4: Demo Segmentation results.

Model        | V-Measure | Adj. RI
GPT-4        | 0.85      | 0.88
GPT-4        | 0.85      | 0.87
GPT-4        | 0.80      | 0.86
Gemini Pro 1 | 0.54      | 0.65
Gemini Pro 1 | 0.53      | 0.65
Gemini Pro 1 | 0.58      | 0.69
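A minimal sketch of how such a segmentation example could be constructed, with ground-truth frame labels retained for scoring, is shown below; demos_by_site and the demonstration fields are illustrative assumptions.

```python
import random

def build_segmentation_example(demos_by_site: dict, site: str, k: int = 3, seed: int = 0):
    """Concatenate key frames from k demonstrations of different workflows on the same site.
    Returns the frame sequence plus ground-truth workflow labels for computing the ARI."""
    rng = random.Random(seed)
    chosen = rng.sample(demos_by_site[site], k)
    frames, labels = [], []
    for workflow_id, demo in enumerate(chosen):
        frames.extend(demo["key_frames"])
        labels.extend([workflow_id] * len(demo["key_frames"]))
    return frames, labels
```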

5.2 Knowledge Transfer

[Figure 6]

Question Answering. Description: This task involves answering 120 free response questions about workflows, such as “How would a user know the workflow is complete?” and “What is the purpose of this workflow?”. These questions were drawn from the process mining literature [9, 20] and are provided in full in Appendix A.4. We use GPT-4-as-a-judge to evaluate model-generated answers by comparing to a reference answer from a human annotator. Following prior work [20], we have GPT-4 output four scores on a scale of 1 (bad) to 3 (good): completeness, soundness, clarity, and compactness. The Pearson correlation between GPT-4 and human raters was between 0.80 and 0.89 across all axes (see Appendix Table 8). Results: The average scores for each model are shown in Figure 6. All models perform well in “compactness” and “clarity” but score lower on “completeness.” This may be due to their reliance on statistical patterns rather than contextual workflow understanding [36], leading to occasional omissions of relevant details despite scoring highly on the syntactic quality of their writing.
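A hedged sketch of a GPT-4-as-a-judge call for this task is shown below; the rubric wording is illustrative and the benchmark's actual judge prompt may differ, but the four 1-3 scores match the axes described above.

```python
import json

JUDGE_PROMPT = """You are grading an answer to a question about a workflow.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Rate the model answer from 1 (bad) to 3 (good) on each axis and reply with JSON only:
{{"completeness": <1-3>, "soundness": <1-3>, "clarity": <1-3>, "compactness": <1-3>}}"""


def judge_answer(question: str, reference: str, answer: str, call_model) -> dict:
    # `call_model` is a placeholder for a GPT-4 completion call returning the JSON string.
    raw = call_model(JUDGE_PROMPT.format(question=question, reference=reference, answer=answer))
    return json.loads(raw)
```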

Demo Validation. Description: We consider two forms of validation: (a) workflow completion, where a demonstration is “correct” if the workflow’s goal is achieved; and (b) workflow trajectory, where it is “correct” only if the goal is achieved following the specified SOP. “Correct” examples are sampled from our dataset, while “incorrect” examples are created by truncating, shuffling, or skipping states. Results: As shown in Table 5, GPT-4 is the best model. It can accurately determine whether a workflow completed its overall goal (peak F1 of 0.90) but struggles to validate that it followed the specific steps of an SOP (peak F1 of 0.27).

Table 5: Demo Validation results.

Completion
Model           | Precision | Recall | F1
GPT-4           | 0.89      | 0.90   | 0.90
GPT-4           | 0.84      | 0.77   | 0.81
Gemini Pro 1    | 0.94      | 0.25   | 0.40
Gemini Pro 1    | 0.94      | 0.26   | 0.41
Claude 3 Sonnet | 0.58      | 0.31   | 0.40
Claude 3 Sonnet | 0.85      | 0.50   | 0.63

Trajectory
Model           | Precision | Recall | F1
GPT-4           | 0.52      | 0.18   | 0.27
Gemini Pro 1    | 0.94      | 0.14   | 0.25

5.3 Improvement

SOP Improvement. Description: In this task, we provide a model with a workflow recording and an SOP. The model is then tasked with iteratively improving the SOP given an SOP quality rubric. Results: As shown in Table 7(a), current models are capable of improving the quality of their own SOPs (by up to 1.4 points) when conditioned on an SOP rubric.

SOP Ranking. Description: In this task, we provide a model with SOPs from various annotators and have it rank them by quality. We then compare this ranking to a ground truth ranking by an annotator and measure the correlation between the model’s and the human’s judgments. Results: As shown in Table 7(b), current models struggle to rank SOPs in line with human raters’ perceived quality. The best model achieves a mean Kendall correlation of 0.05 with a standard deviation of 0.47, indicating essentially random rankings. Improving alignment between model and human judgment of workflow quality remains an area for further research.

Table 7(a): SOP Improvement results.

Model           | Original SOP | Improved SOP
GPT-4           | 3.43         | 4.82
Claude 3 Sonnet | 3.43         | 4.26
Gemini Pro 1    | 3.43         | 3.65

Table 7(b): SOP Ranking results.

Model           | Spearman ρ  | Kendall τ
GPT-4           | 0.07 ± 0.58 | 0.06 ± 0.49
Claude 3 Sonnet | 0.06 ± 0.59 | 0.03 ± 0.50
Gemini Pro 1    | 0.03 ± 0.58 | 0.03 ± 0.49

6 Discussion

We discuss next steps, limitations, and the broader impacts of WONDERBREAD below.

Improving Human-Model Alignment for BPM Tasks. We find that out-of-the-box alignment between humans and multimodal models is low for SOP evaluation (see Section 5.3). Similar to how “human-model” alignment can be achieved for tasks like question-answering and instruction-following [55, 31], alignment also appears necessary for workflow understanding tasks. This might require fine-tuning models via supervised learning [56] or reinforcement learning on preference data [41].

Expanding Multimodal Context Windows. Even a 1-minute workflow can generate dozens of actions and key frames. Our results show that model accuracy on BPM tasks improves as more information is provided in the prompt. This might not be possible for longer workflows, leading to an incomplete representation of a workflow and lower downstream task performance. Longer context windows can help solve this problem and are a focal point of study in the community [29, 62].

Low-Level Workflow Understanding. Our results show that while multimodal FMs excel in high-level workflow analyses, they struggle with precise validation of individual steps (see Section 5.2). Enhancing this lower-level understanding may require supervised fine-tuning on GUIs as in [26, 7].

Self-Improvement. Our findings suggest that multimodal FMs can improve their outputs (i.e., SOPs) through multiple iterations of self-reflection (see Section 5.3). This highlights the potential of these models to refine their outputs without human intervention [21, 3]. In the context of BPM tasks, this capability can help systems adapt to workflows as they change over time.

Limitations. There are several limitations to our work. First, dataset construction was constrained by our lack of access to real-world enterprise data due to privacy concerns. Second, the workflows in our dataset are taken from a limited set of 4 websites [69], and it is unclear how our results generalize to different environments with complex or longer workflows. Contemporaneous to our work, several datasets have been released which could be re-annotated following the process described in our paper [61, 35, 30], which we leave to future work. Third, our baseline results lack open-source models. Matching the performance of state-of-the-art proprietary models on these benchmarks with open source models remains an open research challenge.

Societal Impact. Our field’s collective focus on automation contradicts recent advocacy for more human-centered AI, which aims to augment rather than replace human labor [43, 45, 12, 10]. While we intend for WONDERBREAD to serve as a counterpoint to this focus, we acknowledge that any AI tools aimed at improving productivity run the risk of automating jobs or replacing human labor.

7 Conclusion

We present WONDERBREAD, the first benchmark for evaluating multimodal models on common process mining tasks. It includes 2928 human demonstrations across videos, images, and text, along with step-by-step written guides (SOPs) and full action traces. We focus on applying these models to three BPM tasks that have been overlooked by existing ML benchmarks for workflow automation – documentation, knowledge transfer, and process improvement. WONDERBREAD features an automated evaluation harness with programmatic metrics and LLM-based assessments, providing baseline results for state-of-the-art multimodal models. Our work aims to inspire further efforts to support workers by augmenting rather than replacing human labor.

Acknowledgments and Disclosure of Funding

MW is supported by the NSF Fellowship, a Stanford HAI Graduate Fellowship, and Stanford Healthcare. AN is supported by the Knight-Hennessy Fellowship and the NSF fellowship. We thank Neel Guha, Dan Fu, Mayee Chen, Eric Nguyen, Jordan Juravsky, Jerry Liu, and Sabri Eyuboglu for providing helpful feedback on this manuscript. We gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF2247015 (Hardware-Aware), CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); US DEVCOM ARL under Nos. W911NF-23-2-0184 (Long-context) and W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under Nos. N000142312633 (Deep Signal Processing), N000141712266 (Unifying Weak Supervision), N000142012480 (Non-Euclidean Geometry), and N000142012275 (NEPTUNE); Stanford HAI under No. 247183; NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), and members of the Stanford DAWN project: Facebook, Google, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.

References

  • [1] Simone Agostinelli, Andrea Marrella, and Massimo Mecella. Towards intelligent robotic process automation for bpmers. arXiv preprint arXiv:2001.00804, 2020.
  • [2] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
  • [3] Renat Aksitov, Sobhan Miryoosefi, Zonglin Li, Daliang Li, Sheila Babayan, Kavya Kopparapu, Zachary Fisher, Ruiqi Guo, Sushant Prakash, Pranesh Srinivasan, et al. Rest meets react: Self-improvement for multi-step reasoning llm agent. arXiv preprint arXiv:2312.10003, 2023.
  • [4] Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 2024.
  • [5] American Hospital Association. Assessing the regulatory burden on health systems, hospitals and post-acute care providers, 2017.
  • [6] Adriano Augusto, Raffaele Conforti, Marlon Dumas, Marcello La Rosa, Fabrizio Maria Maggi, Andrea Marrella, Massimo Mecella, and Allar Soo. Automated discovery of process models from event logs: Review and benchmark. IEEE Transactions on Knowledge and Data Engineering, 31(4):686–705, 2018.
  • [7] Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. Screenai: A vision-language model for ui and infographics understanding. arXiv preprint arXiv:2402.04615, 2024.
  • [8] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023.
  • [9] Alessandro Berti and Mahnaz Sadat Qafari. Leveraging large language models (llms) for process mining (technical report). arXiv preprint arXiv:2307.12701, 2023.
  • [10] Joseph R. Biden. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence. 2023.
  • [11] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • [12] Erik Brynjolfsson. The turing trap: The promise & peril of human-like artificial intelligence. Daedalus, 151(2):272–287, 2022.
  • [13] Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, and Bryan A. Plummer. Mobile app tasks with iterative feedback (motif): Addressing task feasibility in interactive visual environments, 2021.
  • [14] Cheng-Han Chiang and Hung-yi Lee. Can large language models be an alternative to human evaluations? arXiv preprint arXiv:2305.01937, 2023.
  • [15] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023.
  • [16] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web, 2023.
  • [17] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024.
  • [18] Marlon Dumas, Fabiana Fournier, Lior Limonad, Andrea Marrella, Marco Montali, Jana-Rebecca Rehse, Rafael Accorsi, Diego Calvanese, Giuseppe De Giacomo, Dirk Fahland, et al. Ai-augmented business process management systems: a research manifesto. ACM Transactions on Management Information Systems, 14(1):1–19, 2023.
  • [19] Marlon Dumas, Marcello La Rosa, Jan Mendling, Hajo A. Reijers, et al. Fundamentals of business process management, volume 2. Springer, 2018.
  • [20] Dirk Fahland, Fabian Fournier, Lior Limonad, Inna Skarbovsky, and Ava J. E. Swevels. How well can large language models explain business processes? arXiv preprint arXiv:2401.12846, 2024.
  • [21] Linxi Fan, Guanzhi Wang, Yunfan Jiang, Ajay Mandlekar, Yuncong Yang, Haoyi Zhu, Andrew Tang, De-An Huang, Yuke Zhu, and Anima Anandkumar. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343–18362, 2022.
  • [22] Keith Ferrazzi. Technology can save onboarding from itself. Harvard Business Review, March 2015.
  • [23] Fabian Friedrich, Jan Mendling, and Frank Puhlmann. Process model generation from natural language text. In Advanced Information Systems Engineering: 23rd International Conference, CAiSE 2011, London, UK, June 20-24, 2011. Proceedings 23, pages 482–496. Springer, 2011.
  • [24] Michael Grohs, Luka Abb, Nourhan Elsayed, and Jana-Rebecca Rehse. Large language models can accomplish business process management tasks. In International Conference on Business Process Management, pages 453–465. Springer, 2023.
  • [25] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models, 2024.
  • [26] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914, 2023.
  • [27] Mordor Intelligence. Business process management market - size, share and industry analysis, 2024.
  • [28] Mordor Intelligence. Robotic process automation market - size, share and industry analysis, 2024.
  • [29] Yixing Jiang, Jeremy Irvin, Ji Hun Wang, Muhammad Ahmed Chaudhry, Jonathan H. Chen, and Andrew Y. Ng. Many-shot in-context learning in multimodal foundation models. arXiv preprint arXiv:2405.09798, 2024.
  • [30] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web, 2024.
  • [31] Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier. A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925, 2023.
  • [32] Nataliia Klievtsova, Janik-Vasily Benzin, Timotheus Kampik, Juergen Mangler, and Stefanie Rinderle-Ma. Conversational process modeling: Can generative ai empower domain experts in creating and redesigning process models?, 2024.
  • [33] Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks, 2024.
  • [34] Volodymyr Leno, Artem Polyvyanyy, Marlon Dumas, Marcello La Rosa, and Fabrizio Maria Maggi. Robotic process mining: vision and challenges. Business & Information Systems Engineering, 63:301–314, 2021.
  • [35] Xing Han Lù, Zdeněk Kasner, and Siva Reddy. Weblinx: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024.
  • [36] Amy Maitland, Ross Fowkes, and Stuart Maitland. Can chatgpt pass the mrcp (uk) written examinations? analysis of performance and errors using a clinical decision-reasoning framework. BMJ Open, 14(3):e080558, 2024.
  • [37] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore, December 2023. Association for Computational Linguistics.
  • [38] Jorge Munoz-Gama, Niels Martin, Carlos Fernandez-Llatas, Owen A. Johnson, Marcos Sepúlveda, Emmanuel Helm, Victor Galvez-Yanjari, Eric Rojas, Antonio Martinez-Millana, Davide Aloini, et al. Process mining for healthcare: Characteristics and challenges. Journal of Biomedical Informatics, 127:103994, 2022.
  • [39] Vinod Muthusamy, Yara Rizk, Kiran Kate, Praveen Venkateswaran, Vatche Isahagian, Ashu Gulati, and Parijat Dube. Towards large language model-based personal agents in the enterprise: Current trends and open problems. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6909–6921, 2023.
  • [40] OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [41] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  • [42] Panopto. Workplace knowledge and productivity report, 2018.
  • [43] Lucia Rahilly, Melissa Valentine, Brooke Weddle, and Bryan Hancock. Human-centered ai: The power of putting people first, Dec 2023.
  • [44] Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control, 2023.
  • [45] Hope Reese. A human-centered approach to the ai revolution, 2022.
  • [46] Lars Reinkemeyer. Process mining in action. Process Mining in Action: Principles, Use Cases and Outlook, 2020.
  • [47] Grand View Research. Business process management (bpm) market size, share report 2030, 2024.
  • [48] Grand View Research. Robotic process automation market size, share report 2030, 2024.
  • [49] Yara Rizk, Praveen Venkateswaran, Vatche Isahagian, Austin Narcomey, and Vinod Muthusamy. A case for business process-specific foundation models. In International Conference on Business Process Management, pages 44–56. Springer, 2023.
  • [50] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [51] Wil van der Aalst. Data science in action. Springer, 2016.
  • [52] Wil M. P. van der Aalst. Process mining in the large: a tutorial. Business Intelligence: Third European Summer School, eBISS 2013, Dagstuhl Castle, Germany, July 7-12, 2013, Tutorial Lectures 3, pages 33–76, 2014.
  • [53] Maxim Vidgof, Stefan Bachhofner, and Jan Mendling. Large language models for business process management: Opportunities and challenges. arXiv preprint arXiv:2304.04309, 2023.
  • [54] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023.
  • [55] Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, and Qun Liu. Aligning large language models with human: A survey. arXiv preprint arXiv:2307.12966, 2023.
  • [56] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • [57] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • [58] Michael Wornow, Avanika Narayan, Krista Opsahl-Ong, Quinn McIntyre, Nigam H. Shah, and Christopher Re. Automating the enterprise with foundation models. arXiv preprint arXiv:2405.03710, 2024.
  • [59] Danny T. Y. Wu, Nikolas Smart, Elizabeth L. Ciemins, Holly J. Lanham, Curt Lindberg, and Kai Zheng. Using ehr audit trail logs to analyze clinical workflow: a case study from community-based ambulatory clinics. In AMIA Annual Symposium Proceedings, volume 2017, page 1820. American Medical Informatics Association, 2017.
  • [60] Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456, 2024.
  • [61] Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024.
  • [62] Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039, 2023.
  • [63] An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562, 2023.
  • [64] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023.
  • [65] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
  • [66] Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, et al. Ufo: A ui-focused agent for windows os interaction. arXiv preprint arXiv:2402.07939, 2024.
  • [67] Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey, 2023.
  • [68] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
  • [69] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023.

Appendix A Dataset

A.1 License & Availability

We license our code and dataset under the Apache 2.0 license. The authors bear all responsibility in case of violation of rights. Our code and data are available here: https://github.com/HazyResearch/wonderbread.

Our dataset is based on the excellent WebArena benchmark [69], which also has an Apache 2.0 license and is available here: https://github.com/web-arena-x/webarena

A.2 Dataset Curation

1. Workflow Selection. We begin with the WebArena [69] benchmark, which is a collection of 812 workflows instantiated from 187 workflow intents. For example, the template "Search for (term)" could have instantiations "Search for jacket" and "Search for coat". These 812 tasks require an agent to navigate fully functional open source clones of popular websites. In this dataset we use the e-commerce, content management system (Adobe Magento), forum (PostMill), and developer tool (GitLab) sites provided by WebArena. We find that many workflows in WebArena are designed to be impossible, are de facto impossible, are underspecified, or have incorrect evaluations, and we purposely exclude these workflows from our dataset. We also ignore WebArena tasks that include multiple websites, leaving us with a total of 598 workflows.

2. Annotator Recruitment and Training. We enlisted 13 human annotators from a pool of approximately 60 applicants (all students at Stanford University) to participate in our data collection process. All selected annotators, who are undergraduate or graduate students at Stanford University with proficient computer literacy skills, were fully informed and consented to the publication of their complete demonstrations. They were also given the opportunity to review the entire codebase, experiments, and manuscript prior to submission. Annotators were aware that their full screen recordings would be made public and were advised to remove any personally identifiable information before recording. Prior to applying, they were informed that there would be no monetary compensation, as their participation would be on a voluntary basis for a research project.

An important distinction between the demonstrations in our dataset and those in prior work is that our annotators were explicitly instructed not to perform “zero-shot" recordings, meaning annotators were told to rehearse each task before recording to ensure that the collected demonstrations were free of mistakes. More specifically, annotators were told to follow these principles:

  • We are simulating expert users of the interface.

    ◦ Complete each task in the optimal (i.e. most direct) way.

    ◦ Ensure that your demonstration contains no wasted clicks / typing.

    ◦ Ensure that your demonstration has no mistakes – if you make a mistake while performing the demonstration, stop recording and re-record from scratch.

  • We want a clean dataset.

    ◦ When you record, ensure that the selected interface within Google Chrome is visible.

    ◦ Ensure you do not show any other applications.

    ◦ Ensure you do not show personal information.

Therefore, the final dataset has a 100% task completion rate. In contrast, in the original WebArena benchmark [69], untrained human annotators could only complete 78% of tasks.

3. Data Collection. Each annotator utilized a custom Python script to record demonstrations of approximately 300 unique tasks. This script operated in the background while the annotator completed the demonstration, capturing and outputting four primary types of data: (1) a JSON trace detailing all user actions (clicks, keystrokes, and scrolls), including the precise HTML state of the website at the time of each action and attributes of the elements interacted with; (2) a video of the full screen recording of the annotator’s computer; (3) a collection of screenshots corresponding to each recorded action; and (4) an initially blank Standard Operating Procedure (SOP) file.
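For illustration, one record in such a JSON trace might look like the following; the field names are assumptions rather than the script's exact schema, but the logged content (action type, element attributes and coordinates, and before/after page state) follows the description above.

```python
# One illustrative record from the JSON action trace (field names are assumptions).
example_action = {
    "type": "click",                      # "click" | "keystroke" | "scroll"
    "timestamp": 1699991234.52,           # used to align key frames with the recording
    "element": {
        "tag": "button",
        "text": "Add to Cart",
        "xpath": "//*[@id='product-addtocart-button']",
        "bounding_box": {"x": 612, "y": 418, "width": 140, "height": 44},
    },
    "state_before": "<html>...</html>",   # HTML of the page before the action
    "state_after": "<html>...</html>",    # HTML of the page after the action
}
```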

Once the recording was complete, each annotator filled out the SOP file, creating a detailed, step-by-step list of the actions they performed. Annotators were directed to explain these steps with the simplicity, clarity, and level of detail that a five-year-old would need to follow in order to complete the task. Finally, annotators assessed the difficulty of each task, classifying them as Easy, Medium, or Hard. On average, each annotator dedicated approximately 30 hours to this process, amounting to a collective total of nearly 300 man-hours of labeling over several months.

4. Demonstration Ranking. After completing the dataset collection process, we chose a subset of 162 tasks (all derived from different task templates) to form our collection of “Gold Tasks". Each annotator was then tasked with watching the demonstrations of approximately 15 “Gold Tasks" and ranking the demonstrations of each task relative to one another, from 1 (best) to 5 (worst). The annotators then developed a more thorough SOP, which we call a "Gold SOP", based on the demonstration that received the top ranking. This process resulted in 162 tasks in our dataset containing relatively ranked demonstrations, along with high-quality “Gold SOPs" that we use as the highest-quality SOP representation of each “Gold Task"’s demonstrations. More details about this ranking procedure are included in Appendix A.5.

5. Quality Assurance. A key contribution of WONDERBREAD is high quality human task demonstrations. A review of existing benchmarks for web navigation tasks found consistently low quality demonstrations that have inaccurate annotations (e.g. misplaced bounding boxes for HTML elements) [58]. This made quality assurance a key concern while curating WONDERBREAD. We performed three rounds of quality assurance checks over the course of two months using a combination of automated scripts, manual review, and cross-referencing demonstrations across annotators. We had annotators redo any tasks that were of insufficient quality, and discarded any tasks that had less than 4 successful demonstrations. Additional details are available in the Appendix A.3.

6. Workflow Understanding Questions. To enable deeper evaluations of a model’s workflow understanding, we also created a set of 11 free-response question templates, which are listed in Appendix A.4. These questions attempt to simulate actual inquiries that a BPM consultant might ask. Examples include “Explain what the most common failure modes might be for a user performing this task” and “Why does the user click the “Commits” button in step #5?". We created 10 instances of each question template, plus an additional 10 instances for question template #2, giving a total of 120 questions. We then had a set of annotators write brief free-form answers based on the corresponding task.

A.3 Quality Assurance

We ran a series of automated scripts to flag systematic errors and had our annotators redo any tasks that were flagged; for example, we verify that all actions occur within Google Chrome. We also resolve major disagreements between annotators on each task: we cross-reference task demonstrations across annotators and redo tasks where one annotator marked a task as infeasible while another marked it as feasible. We also conduct manual review of all demonstrations corresponding to the 179 Gold tasks, as well as a random sampling of 300 other demonstrations across all tasks.

A.4 Question Answering Dataset Questions

Listed below are the free-response question templates that we created for our Question Answering task, largely inspired by prior work in the process mining literature [20, 9].

  1. Explain what the most common failure modes might be for a user performing this task.

  2. Here are two demonstrations, one of which is more efficient than the other. Please describe ways to improve the less optimal workflow.

  3. How would a user completing the task know that the workflow is completed?

  4. What is the purpose of doing this workflow?

  5. What if instead of X we wanted to do Y. How would you change this workflow to accomplish that?

  6. Why does the user click the button X in step #Z?

  7. Why does the user click the button X in screenshot #Y?

  8. Why does the user type the string X in step #Z?

  9. Why does the user type the string X in screenshot #Y?

  10. Here are two workflows. Please identify the key differences between them.

  11. Given the following three concatenated workflows, how would a user completing the workflow about X know that the workflow is completed?

A.5 Factors for Quality of Gold SOPs

Listed below is the information given to annotators to aid them with writing high-quality Gold SOPs.

  1. Coverage of edge cases – help the user complete the task by making note of ways in which the interface might change, and how to adapt:

    ◦ e.g. If a task involves looking through a table of shipping orders to find a specific order, and your specific order just happens to be the first one, you should still make a note that the user might have to scroll / paginate through the results until they find the correct shipping order.

    ◦ e.g. If you need to click a button at the bottom of a page, you should not assume that the user’s browser window has the same size as yours, so you should let them know that they might need to scroll down if they can’t see the button.

    ◦ Example: Instead of “Click on the toggle labeled ‘Enable Product’”, you should write “Look for the toggle labeled ‘Enable Product’, which should be directly below the ‘Quantity’ field. If the toggle is currently green, that means the product is currently enabled, which means you should click the toggle in order to disable the product. The toggle should change to a grey color to indicate the product is disabled. However, if the toggle is already greyed out, then do nothing since the product had already been disabled.”

  2. Detailed localization of UI elements – let the user know exactly where to find the element:

    ◦ e.g. “Click the ‘Go to Result’ button” is not sufficient. You must be extremely detailed in your specification of each element, i.e. its relative position on the screen, its proximity to other landmark elements, its color, what type of element it is, etc.

    ◦ Example: Instead of “Click the ‘Edit’ link”, you should write “Click on the blue ‘Edit’ link at the far right-hand side of the row corresponding to the ‘Configurable Product’ we previously found."

  3. Generalizability – the instructions should be written so that they could apply to any instantiation of the Intent Template corresponding to the task:

    ◦ e.g. The instructions should be written generally, providing task-specific information as asides.

    ◦ Example: Instead of “Type ‘Out of Office’ in the ‘What’s your status?’ input box.”, you should write “Type the desired Gitlab status in the ‘What’s your status?’ input box. In this case, we should type ‘Out of Office’.”

  4. Explanations of each action – briefly explain why we take each step (in the context of the next action, or the larger task):

    ◦ e.g. What is the point of each individual action?

    ◦ Example: Instead of “Click the ‘From’ text field”, you should write “Click the ‘From’ text field to focus it.”

    ◦ Example: Instead of “Click on the toggle labeled ‘Enable Product’”, you should write “Click on the toggle labeled ‘Enable Product’ to disable the product.”

Appendix B Benchmark Tasks

For clarity, we define the following notation: Our dataset contains a set of workflow demonstrations 𝒟. Each demonstration d ∈ 𝒟 is defined as d = (w, sop, v, (s_1, a_1, s_2, a_2, ..., a_{n-1}, s_n)), where w is the "Intent", i.e. the natural language description of the workflow being performed; sop is a manually written step-by-step guide describing the steps taken in the demonstration; v is the screen recording of the demonstration; s_i is the i-th state of the webpage (i.e. a screenshot extracted from the screen recording of the demonstration); and a_i is the action taken at state s_i (i.e. a 'click', 'keystroke', or 'scroll' event extracted from the trace). There are multiple demonstrations d for each workflow, so w is not unique. However, sop, v, and (s_1, a_1, s_2, a_2, ..., a_{n-1}, s_n) are unique across different demonstrations.
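For readers who prefer code to notation, the dataclass below mirrors this definition of a demonstration d; it is purely illustrative and not the benchmark's actual data structures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Demonstration:
    """d = (w, sop, v, (s_1, a_1, ..., a_{n-1}, s_n)) as a simple container (illustrative)."""
    intent: str          # w   - natural-language goal, shared across demos of a workflow
    sop: str             # sop - annotator-written step-by-step guide (unique per demo)
    recording: str       # v   - path to the screen recording (unique per demo)
    states: List[str]    # s_1 ... s_n     - screenshots of the webpage at each step
    actions: List[dict]  # a_1 ... a_{n-1} - click/keystroke/scroll taken at each state
```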

Task | Input | Output | Eval
Documentation
SOP Generation | 1 Demo | SOP | LLM
Demo Segmentation | 2+ Demos | Clustering | ARI
Knowledge Transfer
Question Answering | Question & 1+ Demos | Free text | LLM
Demo Validation | 1 Demo with SOP | Binary label | F1
Improvement
Demo Ranking | 3+ Demos | Ranking | Kendall τ
SOP Improvement | 1 Demo & SOP | SOP | LLM

B.1 Documentation

These subtasks assess a model’s ability to generate documentation for existing workflows.

  1. SOP Generation - Given specified components of a workflow demonstration, the model is tasked with generating a new SOP that documents the steps of that workflow. This evaluates a model's ability to generate written documentation.

    • Input: Given a demonstration $d = (\textsc{w}, \textsc{sop}, \textsc{v}, (s_1, a_1, s_2, a_2, \ldots, a_{n-1}, s_n))$, we provide the model with either $(\textsc{w})$, $(\textsc{w}, (s_1, \ldots, s_n))$, or $(\textsc{w}, (s_1, a_1, \ldots, a_{n-1}, s_n))$.

    • Output: A new SOP, denoted $\textsc{sop}'$, describing the steps of demonstration $d$.

    • Evaluation: Pairwise per-line comparison between the reference $\textsc{sop}$ and the generated $\textsc{sop}'$ that determines precision and recall, as described in Appendix Section C.2.

  2. Demonstration Segmentation - Given multiple demonstrations from separate workflows concatenated into a single sequence, identify when each demonstration starts and ends. This evaluates the model's ability to disambiguate between different workflows occurring in sequence.

    • Input: A concatenated sequence of $k$ demonstrations $\{d^i\}_{i=1}^{k}$, represented as either $(s^1_1, \ldots, s^1_n \,\|\, \ldots \,\|\, s^k_1, \ldots, s^k_n)$ or $(s^1_1, a^1_1, \ldots, a^1_{n-1}, s^1_n \,\|\, \ldots \,\|\, s^k_1, a^k_1, \ldots, a^k_{n-1}, s^k_n)$.

    • Output: Assign each frame $s$ in the provided input to one of the $k$ demonstrations, yielding a clustering that maps frames to demonstrations. For example, given 20 frames from three demonstrations ($A$, $B$, $C$), an output clustering might map frames 1-5 to demonstration $A$, frames 6-10 to demonstration $C$, and frames 11-20 to demonstration $B$.

    • Evaluation: Given the $k$ clusters of frames, measure the adjusted Rand index (ARI) against the ground-truth assignment (a minimal sketch follows this list).
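As an illustration of this evaluation, the snippet below computes the adjusted Rand index between a predicted frame-to-demonstration assignment and the ground truth using scikit-learn; the example labels are invented.

```python
from sklearn.metrics import adjusted_rand_score

# Ground-truth assignment of 10 frames to demonstrations A/B/C (toy example).
true_labels = ["A", "A", "A", "B", "B", "B", "B", "C", "C", "C"]
# A model's predicted assignment; label names need not match, only the grouping does.
pred_labels = ["x", "x", "x", "y", "y", "y", "z", "z", "z", "z"]

ari = adjusted_rand_score(true_labels, pred_labels)
print(f"Adjusted Rand index: {ari:.3f}")  # 1.0 = perfect, ~0.0 = random assignment
```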

B.2 Knowledge Transfer

These subtasks assess a model’s ability to apply knowledge of workflows in practical scenarios.

  1. Question Answering - Given a question about one or more workflow demonstrations, generate a natural language answer.

    • Input: A brief question (instantiated from one of the templates in Appendix A.4), and one or two demonstrations, where each demonstration is represented as either $(\textsc{sop})$ or $(s_1, a_1, \ldots, a_{n-1}, s_n)$.

    • Output: A natural language answer to the question.

    • Evaluation: Using GPT-4-as-a-judge, compare a human-written reference answer to the generated answer and assign a score for each specified criterion on a scale from 1 (bad) to 3 (good). The criteria are completeness (the response fully answers the question), soundness (the response is logically consistent), clarity (the response is unambiguous), and compactness (the response is concise).

  2. Demonstration Validation - Given a demonstration and SOP, determine whether (a) the workflow was successfully completed; and (b) whether the demonstration exactly followed the steps of the SOP. For (b), it is not sufficient to merely complete the workflow; the steps taken to complete it must also align with its corresponding SOP.

    • Input: For (a), we create "positive" examples by sampling full sequences $(s_1, a_1, \ldots, a_{n-1}, s_n)$ from our dataset, and create "negatives" by truncating some sequences by a random number of frames to get $(s_1, a_1, \ldots, a_{k-1}, s_k)$ where $k < n$. Given this sequence, we prompt the model to provide a binary assessment of whether the workflow was completed. For (b), we create "positives" by sampling full sequences $(s_1, a_1, \ldots, a_{n-1}, s_n)$ from our dataset, then either randomly shuffle or randomly delete frames from these sequences to generate "negative" examples. We prompt the model with the sequence and the SOP, and have it output a binary assessment of whether the sequence exactly followed the SOP (a sketch of this example construction follows this list).

    • Output: For (a), a binary assessment of whether the given sequence was truncated. For (b), a binary assessment of whether the given sequence exactly followed the steps of its associated SOP.

    • Evaluation: Binary classification metrics (i.e. accuracy, F1-score).
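To make the example construction for (a) concrete, the sketch below builds positive/negative pairs by truncation and scores an arbitrary completion classifier with F1. The helper names and the use of scikit-learn are assumptions for illustration, not the benchmark's actual harness.

```python
import random
from typing import Callable, List, Sequence, Tuple

from sklearn.metrics import f1_score


def build_completion_examples(
    traces: Sequence[Sequence[str]], seed: int = 0
) -> List[Tuple[Sequence[str], int]]:
    """Label full traces as completed (1) and randomly truncated copies as incomplete (0)."""
    rng = random.Random(seed)
    examples: List[Tuple[Sequence[str], int]] = []
    for trace in traces:
        examples.append((trace, 1))            # full sequence -> positive
        cut = rng.randint(1, len(trace) - 1)   # drop a random suffix -> negative
        examples.append((trace[:cut], 0))
    return examples


def score_validator(
    predict_completed: Callable[[Sequence[str]], int],
    examples: List[Tuple[Sequence[str], int]],
) -> float:
    """F1 of a binary 'was this workflow completed?' judgment over the examples."""
    y_true = [label for _, label in examples]
    y_pred = [predict_completed(seq) for seq, _ in examples]
    return f1_score(y_true, y_pred)
```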

B.3 Improvement

These subtasks evaluate a model’s capacity to improve a given workflow’s efficiency.

  1. SOP Ranking - Given a set of SOPs written by different human annotators for the same workflow, rank the SOPs in order of quality.

    • Input: A set of $k$ SOPs $\{\textsc{sop}^i\}_{i=1}^{k}$ written by different annotators for the same workflow.

    • Output: A ranking of the quality of the SOPs from $1 \ldots k$, where $1$ is best and $k$ is worst.

    • Evaluation: Given a provided ground-truth ranking, compute the Spearman correlation and Kendall's Tau between the predicted ranking and the ground truth (see the correlation sketch after this list).

  2. SOP Improvement - Given a demonstration, a low-quality SOP, and a rubric, generate an improved SOP that better captures what is shown in the demonstration.

    • Input: One demonstration $d^1$, a low-quality $\textsc{sop}^{\prime 1}$ generated by a human, and an SOP generation rubric $r$.

    • Output: An improved SOP $\textsc{sop}^{\prime 2}$ that better aligns with the provided rubric.

    • Evaluation: LLM-based evaluation, where the model generates a rating from 1.0 to 5.0 conditioned on the rubric.
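A minimal illustration of the SOP Ranking correlation metrics using SciPy is shown below; the example rankings are invented.

```python
from scipy.stats import kendalltau, spearmanr

# Ground-truth quality ranking of k = 4 SOPs (1 = best) and a model's predicted ranking.
true_rank = [1, 2, 3, 4]
pred_rank = [2, 1, 3, 4]

rho, _ = spearmanr(true_rank, pred_rank)
tau, _ = kendalltau(true_rank, pred_rank)
print(f"Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```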

Appendix C Evaluation

C.1 Compute

We rely on the publicly available APIs for each of the multimodal FMs we benchmark in this report: GPT-4, Claude3 Sonnet, and Gemini Pro. Thus, we did not require any GPUs to run our benchmark. In terms of cost, the Gemini Pro 1 API was free to use, the Claude 3 API cost roughly $400 in credits, and the GPT-4 API cost roughly $1,000 in credits.

C.2 LLM-Based Evaluation

SOP Generation
The automated evaluation for the SOP Generation task utilized a pairwise per-step comparison over the generated SOP and the reference high-quality SOP. Through a series of iterative prompts, GPT-4 was tasked with identifying whether the intention of a step in the new SOP was encapsulated in any step of the reference SOP, and vice versa. The record of which steps were not matched in the other SOP was then used to calculate the per-step precision, recall, and F1-score.
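In code, this matching stage could look like the sketch below, where `is_covered` stands in for the iterative GPT-4 prompt that judges whether a single step's intention appears in the other SOP; the wrapper and its signature are assumptions rather than the benchmark's actual interface.

```python
from typing import Callable, List, Tuple


def match_steps(
    new_sop: List[str],
    ref_sop: List[str],
    is_covered: Callable[[str, List[str]], bool],
) -> Tuple[List[bool], List[bool]]:
    """Ask an LLM judge, step by step, whether each step is captured in the other SOP.

    is_covered(step, other_sop) is a hypothetical wrapper around a GPT-4 prompt
    that returns a binary yes/no decision for a single step.
    """
    new_matched = [is_covered(step, ref_sop) for step in new_sop]  # generated -> reference
    ref_matched = [is_covered(step, new_sop) for step in ref_sop]  # reference -> generated
    return new_matched, ref_matched
```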

The precision (P), recall (R), and F1-score (F1) are calculated as follows:

$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = 2 \times \frac{P \times R}{P + R}$$

Where:

  • $TP$ (True Positives) is the number of steps in the new SOP that correctly map to steps in the reference SOP.

  • $FP$ (False Positives) is the number of steps in the new SOP that do not map to any step in the reference SOP.

  • $FN$ (False Negatives) is the number of steps in the reference SOP that do not map to any step in the new SOP.
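Given those binary match decisions (e.g. the `new_matched` / `ref_matched` lists from the matching sketch above), the per-step metrics reduce to a few lines; this is a minimal illustration, not the released evaluation harness.

```python
from typing import List, Tuple


def sop_precision_recall_f1(
    new_matched: List[bool], ref_matched: List[bool]
) -> Tuple[float, float, float]:
    """Compute per-step precision, recall, and F1 from binary match decisions."""
    tp = sum(new_matched)                     # generated steps that map to a reference step
    fp = len(new_matched) - tp                # generated steps with no reference match
    fn = len(ref_matched) - sum(ref_matched)  # reference steps with no generated match
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```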

For the SOP Generation task, we found that our LLM-based evaluator achieved high correlation with human raters out of the box, as shown in Appendix Table 9. We hypothesize that this is because the SOP Generation evaluation only requires the model to make a binary decision over an atomic fact, rather than assess the quality of an open-ended response as in the Question Answering task; this pattern is also observed in other work on LLM-based evaluation [37].

Question Answering
We rate each answer on a scale from 1 (bad) to 3 (good) on the following four criteria: completeness (the response fully answers the question), soundness (the response is logically consistent), clarity (the response is unambiguous), and compactness (the response is concise). Our original LLM-based evaluators had low correlation with human raters – an average Pearson correlation of 0.56 for scoring free-response questions on a scale of 1 (low quality) to 3 (high quality) across the four axes of soundness, completeness, clarity, and compactness. We noticed that GPT-4 tended to be overly generous in its ratings. Adding three few-shot examples to our evaluation prompt (one for each possible score) and refining the prompt to "score harsher" helped increase the average correlation with human raters by 54% (to 0.86), as shown in Appendix Table 8.
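As a rough sketch of what such a GPT-4-as-a-judge call could look like, the snippet below builds a single-criterion scoring prompt and parses an integer score. The prompt wording, rubric placeholders, and helper names are illustrative assumptions, not the exact prompts used in the benchmark.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are grading an answer about a workflow demonstration.
Score the answer for {criterion} on a scale of 1 (bad) to 3 (good).
Be harsh: only give a 3 if the answer clearly satisfies this definition: {criterion_definition}
{few_shot_examples}
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Respond with a single integer (1, 2, or 3)."""


def judge(question: str, reference: str, candidate: str,
          criterion: str, criterion_definition: str, few_shot_examples: str) -> int:
    """Ask GPT-4 to rate one answer on one criterion (illustrative prompt only)."""
    prompt = JUDGE_TEMPLATE.format(
        criterion=criterion,
        criterion_definition=criterion_definition,
        few_shot_examples=few_shot_examples,
        question=question,
        reference=reference,
        candidate=candidate,
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return int(resp.choices[0].message.content.strip())
```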

Appendix D Additional Results

Model | Completeness | Soundness | Clarity | Compactness | Average Score
Claude3 Sonnet | 1.56 | 1.83 | 2.18 | 2.61 | 2.05
Gemini Pro 1 | 1.81 | 2.15 | 2.83 | 2.95 | 2.44
GPT-4 | 2.20 | 2.51 | 2.96 | 2.85 | 2.63
Human | 3.00 | 3.00 | 2.64 | 2.88 | 2.88
Criteria | Pearson Corr. | Pearson p-value | Spearman Corr. | Spearman p-value
Completeness | 0.84 | 5.38e-09 | 0.86 | 1.12e-09
Soundness | 0.92 | 1.51e-12 | 0.88 | 2.34e-10
Clarity | 0.80 | 1.01e-07 | 0.80 | 1.01e-07
Compactness | 0.89 | 2.07e-13 | 0.89 | 7.41e-11
Criteria | Pearson Corr. | Pearson p-value | Spearman Corr. | Spearman p-value
Precision | 0.84 | 4.63e-09 | 0.85 | 2.80e-09
Recall | 0.88 | 1.63e-10 | 0.82 | 3.97e-08

D.1 Overall Dataset Stats

[Figures: overall dataset statistics]

D.2 Dataset Stats, Split By Difficulty

Difficulty | Min | Median | Max
Medium | 1 | 7 | 82
Hard | 2 | 10 | 48
Easy | 1 | 5 | 14
Difficulty | Min | Median | Max
Medium | 12 | 154 | 1631
Hard | 62 | 240 | 976
Easy | 18 | 114 | 382

D.3 Dataset Stats, Split By Website

Website | Min | Median | Max
shopping_admin | 30 | 163 | 704
gitlab | 12 | 151 | 870
shopping | 18 | 121 | 1631
reddit | 43 | 148 | 382
Website | Min | Median | Max
shopping_admin | 0 | 6 | 29
gitlab | 0 | 6 | 44
shopping | 1 | 4 | 47
reddit | 2 | 6 | 23
Website | Min | Median | Max
shopping_admin | 0 | 1 | 8
gitlab | 0 | 1 | 7
shopping | 0 | 0 | 7
reddit | 0 | 1 | 8
Website | Min | Median | Max
shopping_admin | 0 | 0 | 6
gitlab | 0 | 0 | 7
shopping | 0 | 0 | 3
reddit | 0 | 0 | 1
Website | Min | Median | Max
shopping_admin | 0 | 1 | 13
gitlab | 0 | 0 | 9
shopping | 0 | 1 | 28
reddit | 0 | 0 | 5

Appendix E Instructions for Annotators

The figures below contain the instructions and other training provided to the annotators.

[Figures: annotator instructions and training materials]

References
