What counts as good computer-use data?

Real workflow data matters only when it can be converted into a task package that a lab can inspect, evaluate, and train on.

Good computer-use agent (CUA) data is not just a long screen recording or a large pile of trajectories.

For post-training, the useful unit is a workflow package: task context, action trace, UI state, outcome labels, verifier logic, failure cases, and an eval split that can measure whether the model improves.

The bar for CUA data

The important question is not whether the data looks realistic. The important question is whether a lab can inspect the package and understand why it should move model performance.

That means every workflow package needs a quality story: how the task was sourced, what the model fails at, how success is verified, where the verifier can be wrong, and how held-out variants are measured.

Real-world data is not enough

Real-world data can still be weak. A random recording of somebody clicking through a website may be real, but it may not teach a computer-use agent anything durable.

The stronger source is operational desktop work where a person is solving an actual task. Think invoice reconciliation across PDF, Excel, email, and an internal admin page. The trace has intent, state changes, judgment, artifacts, and a final outcome.

That workflow becomes useful only after it is normalized into a repeatable environment. A lab needs to know the goal, the initial state, the allowed tools, the final state, and the verifier that decides whether the model actually succeeded.
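The five elements named above can be written down as a small task specification. The following is a minimal sketch, not a UseDesktop schema: the `TaskSpec` class, its field names, the state keys, and the invoice ID are all hypothetical, chosen only to mirror the invoice reconciliation example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical state representation: a flat dict of observable facts.
State = Dict[str, object]


@dataclass(frozen=True)
class TaskSpec:
    """One repeatable task environment (all field names are assumptions)."""
    goal: str
    initial_state: State
    allowed_tools: List[str]
    expected_final_state: State
    verifier: Callable[[State], bool]


def invoice_reconciled(final_state: State) -> bool:
    """Pass only if an invoice total exists, matches the ledger, and is marked reconciled."""
    invoice_total = final_state.get("invoice_total")
    return (
        invoice_total is not None
        and final_state.get("ledger_total") == invoice_total
        and final_state.get("status") == "reconciled"
    )


spec = TaskSpec(
    goal="Reconcile invoice INV-001 against the ledger spreadsheet",
    initial_state={"invoice_total": 1250, "ledger_total": 1200, "status": "open"},
    allowed_tools=["pdf_viewer", "spreadsheet", "email", "admin_page"],
    expected_final_state={"invoice_total": 1250, "ledger_total": 1250, "status": "reconciled"},
    verifier=invoice_reconciled,
)

print(spec.verifier(spec.initial_state))         # False: totals differ, status is open
print(spec.verifier(spec.expected_final_state))  # True
```

Keeping the verifier as a plain function of the final state makes it easy to rerun against any trajectory, human or model, that ends in the same environment.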

What should be inside a workflow package?

A useful CUA workflow package should include:

Human workflow traces with screenshots and UI states
Action trajectories across desktop, browser, files, and documents
Labels for outcomes, failures, corrections, and artifacts
Programmatic or rubric-backed verifiers
A train split and a held-out eval split
Failure traces from base models
Reward signals or success criteria for RL
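
One way to make this checklist inspectable is a completeness check over a package manifest. This is only a sketch: the component keys below are hypothetical names for the items in the list, and a real package would need content checks as well as presence checks.

```python
from typing import Dict, List

# Hypothetical manifest keys, one per item in the list above.
REQUIRED_COMPONENTS = (
    "human_traces",
    "action_trajectories",
    "labels",
    "verifiers",
    "train_split",
    "eval_split",
    "base_model_failures",
    "reward_signals",
)


def missing_components(package: Dict[str, object]) -> List[str]:
    """Return the required components that are absent or empty."""
    return [name for name in REQUIRED_COMPONENTS if not package.get(name)]


draft_package = {
    "human_traces": ["trace_001.json"],
    "action_trajectories": ["traj_001.json"],
    "labels": {"trace_001": "success"},
    "verifiers": ["invoice_reconciled"],
    "train_split": ["variant_a", "variant_b"],
    "eval_split": [],  # empty: no held-out variants yet
}

print(missing_components(draft_package))
```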

The package should also include a short evaluation report. The report does not need to claim that every model improves. It needs to show where a base model fails, how the verifier works, and how the same task distribution can be measured after training.

Why does this matter for frontier labs?

Frontier labs do not only need more trajectories. They need task distributions that reveal model weaknesses and produce measurable lift after post-training.

That is why CUA data should be sold with evidence. A lab should be able to ask: What does the model fail at? How was success verified? How many false positives does the verifier allow? Are held-out variants isolated from the training examples? Can this environment run again?
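
Two of these questions can be answered with very little code. The sketch below assumes a small audit set in which a human has labeled each run as a true success or failure, and assumes each workflow variant has a stable ID; both function names are hypothetical. The overlap check only catches exact ID reuse, so near-duplicate variants would need a stronger test.

```python
from typing import Iterable, Set, Tuple


def verifier_false_positive_rate(results: Iterable[Tuple[bool, bool]]) -> float:
    """Share of human-labeled failures that the verifier wrongly passed.

    Each item is (verifier_passed, human_says_success).
    """
    verdicts_on_failures = [passed for passed, truly_ok in results if not truly_ok]
    if not verdicts_on_failures:
        raise ValueError("no human-labeled failures to audit against")
    return sum(verdicts_on_failures) / len(verdicts_on_failures)


def leaked_variants(train_ids: Set[str], eval_ids: Set[str]) -> Set[str]:
    """Variant IDs that appear in both splits; should be empty."""
    return train_ids & eval_ids


audit = [
    (True, True),    # correct pass
    (True, False),   # false positive
    (False, False),  # correct fail
    (False, False),  # correct fail
    (False, True),   # false negative, not counted in this rate
]

print(verifier_false_positive_rate(audit))          # 1 of 3 failures was passed
print(leaked_variants({"v1", "v2"}, {"v2", "v3"}))  # {'v2'}
```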

This is the difference between data that looks realistic and data that is useful for computer-use agent evaluation.

Is synthetic computer-use data enough?

Not on its own. Synthetic tasks are useful for scale and controlled experiments, but they often miss the texture of operational work: incomplete documents, legacy UI patterns, local files, ambiguous intermediate states, and human corrections.

Is a screen recording enough?

No. A recording is an input. The product is the task package built around it: normalized trace, state, labels, verifier, eval split, and reward signal.

How should model improvement be measured?

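Start with pass@1, pass@3, and pass@5 on held-out workflow variants, where pass@k is the probability that at least one of k attempts succeeds. Then inspect failure modes and verifier errors. The strongest proof is not volume. It is lift over the base model on held-out variants of the same workflow.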
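
The article does not say how pass@k should be computed. One common choice, assumed here, is the unbiased estimator used in code-generation evaluation: run n attempts per task, count c verified successes, and compute 1 - C(n-c, k) / C(n, k). The sketch below applies it per task and averages across held-out variants; the attempt counts are made-up illustration data, not results.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled attempts succeeds.

    n: attempts run on the task, c: attempts the verifier passed, k: sample size.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k attempts failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_task: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks given (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task)


# Illustrative counts only: (attempts, verified successes) per held-out variant.
held_out = [(5, 1), (5, 0), (5, 3)]

for k in (1, 3, 5):
    print(f"pass@{k} = {mean_pass_at_k(held_out, k):.3f}")
```

Running the same computation on the base model and the post-trained model, over the same held-out variants, gives the lift figure directly.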

The direction for UseDesktop

UseDesktop is being built around that loop: capture operational desktop workflows, turn them into repeatable task environments, evaluate model failure modes, and export data for training, evaluation, and RL.

The goal is not to sell generic web automation data. The goal is to build verified CUA workflow packages that frontier teams can use to train, evaluate, and improve computer-use agents.

Read more about why this points toward full-stack RL environments for computer-use agents.