In the rapidly evolving world of AI, building an agentic workflow is just the first step. The real challenge—and where the most value is unlocked—lies in continuous improvement. How do you know if a new prompt, a different model, or a change in logic actually improves performance? The answer is rigorous, data-driven experimentation.
A/B testing, a cornerstone of web and product optimization, is the gold standard for this. But applying it to complex, multi-step AI agents can be a significant engineering hurdle. You need to manage different versions, route traffic, collect metrics, and do it all without disrupting your production environment.
This is where cli.do transforms a complex MLOps problem into a simple, developer-first command-line operation. By embracing the "Business as Code" philosophy, our .do CLI brings the full power of agentic workflow management to your terminal, so you can experiment, iterate, and build better AI agents faster.
For AI agents, A/B testing goes far beyond changing button colors. It’s about scientifically measuring the impact of changes on core business metrics. You can test:
- Prompts: Does a reworded or more personalized prompt improve outcomes?
- Models: Does a different underlying model change quality or cost?
- Logic: Does a change in the workflow's steps make the agent more effective?
By running controlled experiments, you can move from "I think this is better" to "I have data that proves this is 15% more efficient."
Let's walk through a common scenario. Imagine we have a workflow named "customer-onboarding-v1" that guides new users through setting up their account. We hypothesize that a more personalized prompt, which addresses the user by name, will increase the workflow completion rate.
Our goal: Deploy a new version ("customer-onboarding-v2") with the personalized prompt and split traffic 50/50 between the two versions to see which performs better.
First, ensure you're logged into your cli.do account.
# Login to your .do account
$ do login
# You can list your existing workflows to see what's running
$ do workflow list
✓ Found 3 workflows:
- customer-onboarding-v1 (Active)
- billing-alert-agent (Active)
- weekly-report-generator (Paused)
We see our baseline, "customer-onboarding-v1", is active. This is our Version A.
Now, in your local development environment, you'd branch your code, update the prompt in your workflow definition file, and save your changes. Once your new logic is ready, you can deploy it as a separate variant directly from your terminal.
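Before deploying, the local preparation might look something like the sketch below; the branch name, file name, and placeholder syntax are purely illustrative, so use whatever format your workflow definition actually expects.
# Create a branch for the experiment (names are illustrative)
$ git checkout -b personalized-onboarding-prompt
# Update the greeting in your workflow definition, for example from
#   "Welcome! Let's set up your account."
# to a personalized version such as
#   "Welcome, {{user.firstName}}! Let's set up your account."
$ $EDITOR workflow.yaml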
# Navigate to your workflow's directory
$ cd /path/to/my-agent-workflow
# Deploy the new version with a distinct name
$ do workflow deploy . --name="customer-onboarding-v2"
The CLI handles the entire packaging and deployment process, giving you immediate feedback.
# Output:
# ✓ Authenticating...
# ✓ Packaging workflow files...
# ✓ Uploading package (1.3MB)...
# ✓ Deploying agent 'customer-onboarding-v2'...
# ✓ Deployment successful!
#
# API Endpoint: https://api.do/workflows/customer-onboarding-v2
# Status: Active
You now have two independent workflows running: your original control (Version A) and your new challenger (Version B).
This is where the magic happens. With a single command, you can configure a traffic policy to distribute incoming requests between your two versions. Let's create an alias, "customer-onboarding", that splits traffic evenly.
# Create a traffic split policy
$ do traffic split customer-onboarding --targets="customer-onboarding-v1:50,customer-onboarding-v2:50"
# Output:
# ✓ Traffic policy 'customer-onboarding' created.
# Routing 50% of traffic to 'customer-onboarding-v1'.
# Routing 50% of traffic to 'customer-onboarding-v2'.
#
# Entrypoint: https://api.do/workflows/customer-onboarding
Now, all your application calls to the main entrypoint will be automatically routed according to this 50/50 split.
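For example, a call to the entrypoint might look like this; the JSON payload and auth header are illustrative assumptions rather than the documented request schema.
# Illustrative request to the shared entrypoint (payload and auth header are assumptions)
$ curl -X POST https://api.do/workflows/customer-onboarding \
    -H "Authorization: Bearer $DO_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"user": {"name": "Ada", "email": "ada@example.com"}}'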
With the experiment running, you need data. The .do CLI lets you tail logs and check key metrics for each variant right from your terminal.
# Monitor logs for Version A
$ do workflow logs customer-onboarding-v1 --follow
# In another terminal, monitor logs for Version B
$ do workflow logs customer-onboarding-v2 --follow
# Check aggregate metrics like completion rate and token usage
$ do workflow metrics customer-onboarding-v1 --since=24h
$ do workflow metrics customer-onboarding-v2 --since=24h
Once you've gathered enough traffic to reach statistical significance, you can analyze the results. Did Version B's personalized prompt lead to a higher completion rate? Did it use more tokens on average? This data provides a clear answer.
Let's say the data is clear: "customer-onboarding-v2" is the winner, increasing completion rates by 10%. You can now shift 100% of traffic to the superior version with zero downtime.
# Set 100% of traffic to the winning version
$ do traffic set customer-onboarding --target="customer-onboarding-v2"
# Output:
# ✓ Traffic policy 'customer-onboarding' updated.
# Routing 100% of traffic to 'customer-onboarding-v2'.
Finally, you can safely decommission the old version to keep your environment clean.
$ do workflow delete customer-onboarding-v1
The true power of a CLI is automation. Every command shown above is scriptable, making it a perfect fit for your CI/CD pipeline. Imagine a GitHub Actions workflow that automatically deploys a new branch as a 10% canary release, runs an evaluation, and notifies your team on Slack with the results. With cli.do, this level of sophistication is not just possible—it's straightforward.
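As a rough sketch, a CI step for a 10% canary could chain the same commands shown above; the variant naming scheme and split values here are illustrative, and the evaluation and Slack notification steps are left to your pipeline.
#!/usr/bin/env bash
# Illustrative CI canary step using the commands from this post.
set -euo pipefail

# Derive a variant name from the commit being built (naming scheme is an assumption)
VARIANT="customer-onboarding-${GITHUB_SHA::7}"

# Deploy the branch as its own variant
do workflow deploy . --name="$VARIANT"

# Route 10% of traffic to the canary, 90% to the current production version
do traffic split customer-onboarding --targets="customer-onboarding-v2:90,${VARIANT}:10"

# Later, inspect the canary's metrics before promoting or rolling back
do workflow metrics "$VARIANT" --since=24h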
Stop guessing and start measuring. A/B testing is essential for building robust, efficient, and effective AI agents. The .do CLI provides the developer-centric tooling you need to embed this practice directly into your development lifecycle.
Ready to iterate faster? Install the CLI and start deploying smarter workflows today.
Installation is simple:
npm install -g @do/cli
Visit our documentation to learn more and build your first agentic workflow with cli.do.