Driving Agent Quality with Coding Agent Flywheel

TL;DR

A coding agent now automatically tests and refines your agent's quality using simulated user scenarios, production traces, and adaptive metrics, with no manual test setup.

Key points

1
How the Flywheel Works: The new flywheel skill runs inside your coding agent and evaluates quality in five stages: prepare data, run inference, grade with AutoRaters, analyze failures, and optimize. It uses Google's adaptive AutoRaters to score traces and to generate custom metrics for specific issues, such as whether an agent honors mid-conversation changes. Rather than relying on a single metric, it clusters failures to find root causes. For example, the travel-concierge agent failed because it stored the correct internal state but echoed stale values in its messages to the user. The skill identifies such issues automatically by creating custom rubrics (e.g., 'revision_honored') and triggers fixes when failure rates exceed thresholds. This checks that changes actually improve performance instead of merely looking better on isolated examples.
2
Why Decoupling Matters: The optimizer and the evaluator stay separate to prevent metric gaming. Your coding agent proposes fixes, but an independent service (such as Google's GenAI evaluation) scores them. This avoids the trap where an agent learns to manipulate scores instead of improving. For instance, the travel-concierge agent's internal state was correct, but its final message was not updated, which produced a 21% failure rate. The skill isolates this issue with a custom metric that tracks whether revisions were honored, rather than relying on blended scores from adaptive models. This decoupling keeps fixes focused on real user impact rather than surface-level improvements.
3
Real-World Fix Example: The skill identified a specific failure in the travel-concierge agent: when users revised trip details mid-conversation, the agent kept using the old values. Adding three sentences to the root agent's instruction, telling it to check the latest user input before responding, cut the failure rate from 21% to 5%. Similarly, for the software-bug-assistant agent, the skill found that the agent never told users which tools it had used, and fixed this with a single footer line. These examples show how the skill translates vague concerns into actionable steps, such as creating a custom 'revision_honored' metric, without requiring developers to name metrics upfront. The skill handles the technical details, such as running the User Simulator to generate scenarios, so you only need to describe the problem in plain language.
4
Installation and Use: Install the skill via either `google-agents-cli-eval` (for ADK agents) or `agent-platform-eval-flywheel` (for other frameworks). The skill needs minimal input: describe the problem in plain English (e.g., 'I'm worried about whether travel-concierge honors mid-conversation changes'), and it handles the rest. It selects the appropriate metrics, runs synthetic scenarios with the User Simulator, and proposes fixes. For production use, it skips inference and grades real traces directly. This removes manual setup and lets teams iterate faster: the software-bug-assistant fix, for example, improved tool transparency from 0% to 96% in one cycle. No flags or metric names are needed; the skill chooses the evaluation method based on your goal.
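
The loop described in the key points above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the skill's actual implementation: `Trace`, `toy_agent`, `propose_fix`, `flywheel`, and the 10% threshold are hypothetical names and values chosen for the example, and the `revision_honored` check here is a plain substring test rather than an AutoRater rubric. Failure clustering is left out. What the sketch shows is the separation of roles: the optimizer edits only the instruction, while a grader it cannot modify decides whether the change helped.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """One simulated conversation, reduced to what the rubric needs."""
    latest_value: str   # the value the user settled on after revising
    final_message: str  # what the agent finally told the user


def toy_agent(instruction: str, scenario: dict) -> str:
    """Hypothetical agent: echoes the stale value unless its instruction
    tells it to check the latest user input."""
    if "latest user input" in instruction:
        return f"Booked for {scenario['revised']}."
    return f"Booked for {scenario['original']}."


def run_inference(agent: Callable[[str, dict], str],
                  instruction: str,
                  scenarios: list[dict]) -> list[Trace]:
    """Stage 2: run the agent on every prepared scenario."""
    return [Trace(s["revised"], agent(instruction, s)) for s in scenarios]


def revision_honored(trace: Trace) -> bool:
    """Stage 3 (toy rubric): pass if the final message uses the revised value."""
    return trace.latest_value in trace.final_message


def failure_rate(traces: list[Trace], rubric: Callable[[Trace], bool]) -> float:
    """Stage 4 (simplified): fraction of traces that fail the rubric."""
    if not traces:
        return 0.0
    return sum(1 for t in traces if not rubric(t)) / len(traces)


def propose_fix(instruction: str) -> str:
    """Stage 5: the optimizer edits the instruction and never touches the rubric."""
    return instruction + " Check the latest user input before responding."


def flywheel(agent, instruction, scenarios, rubric,
             threshold: float = 0.10, max_rounds: int = 3):
    """Repeat infer -> grade -> fix until the failure rate is at or below
    the (assumed) threshold, recording the rate measured in each round."""
    history = []
    for _ in range(max_rounds):
        traces = run_inference(agent, instruction, scenarios)
        rate = failure_rate(traces, rubric)
        history.append(rate)
        if rate <= threshold:
            break
        instruction = propose_fix(instruction)
    return instruction, history


if __name__ == "__main__":
    # Stage 1: prepared scenarios (hand-written here instead of simulated).
    demo_scenarios = [
        {"original": "May 3", "revised": "May 10"},
        {"original": "Paris", "revised": "Rome"},
    ]
    _, demo_history = flywheel(toy_agent, "You are a travel concierge.",
                               demo_scenarios, revision_honored)
    print(demo_history)  # prints [1.0, 0.0]
```

Because the rubric is passed in and re-applied to fresh traces after every fix, a change only counts as an improvement when the independently measured failure rate drops.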

What changed

Before this update

Teams manually tested agents with limited cases and tweaked prompts without connecting changes to real-world performance.

After this update

Coding agents automatically run quality cycles using synthetic and production data to identify and fix failures before they impact users.

This is a summary of an official post from the Google Search Central Blog, provided for quick reading. Google and the Google logo are trademarks of Google LLC; My Tool Studio is not affiliated with Google. Always refer to the original announcement for authoritative guidance.