Test planning
This topic explains how to write a complete test plan for an experiment. Writing the plan before development begins prevents wasted effort and produces higher-quality experiments.
Without test planning: Experiments launch with vague success criteria, missing metrics, or unclear rollback plans. Stakeholders disagree on what “success” means, and development effort is wasted.
With test planning: Every experiment launches with a shared definition of success, validated instrumentation, and a documented decision framework.
Elements of a test plan
A complete test plan contains the following elements:
- A validated problem statement
- A testable hypothesis
- SMART goals
- Treatment design for control and variation
- Primary, secondary, and guardrail metrics
- A risk assessment with early stopping criteria
- A review checklist
Define the problem and hypothesis
Start with the problem your experiment addresses, then write a hypothesis using the following format:
If we [make this specific change] for [this audience], then [this metric] will [increase/decrease] by [target amount] within [time period], because [rationale based on evidence].
Your turn: Draft the hypothesis for your next experiment using the template below. Fill in each component, then combine them into the full sentence.
| Component | Your value |
|---|---|
| Specific change | |
| Target audience | |
| Primary metric | |
| Target amount | |
| Time period | |
| Rationale | |
Your hypothesis: If we ______ for ______, then ______ will ______ by ______ within ______, because ______.
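If you keep hypotheses in version control or a planning tool, it can help to treat the template as structured data. The sketch below is a minimal helper that assembles the sentence from its components; every field value in the test is a placeholder, not a real experiment.

```python
def build_hypothesis(change, audience, metric, direction, amount, period, rationale):
    """Assemble the standard hypothesis sentence from its components.

    All arguments are free-text strings supplied by the experiment owner;
    the function only enforces the sentence structure.
    """
    return (
        f"If we {change} for {audience}, then {metric} will {direction} "
        f"by {amount} within {period}, because {rationale}."
    )
```

Because the output always follows the same shape, reviewers can scan a backlog of hypotheses and immediately see whether any component is missing.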
Set SMART goals
The following table defines each SMART component:
| Component | Definition | Example |
|---|---|---|
| Specific | Name the exact metric and direction of change. | Increase checkout completion rate. |
| Measurable | Confirm you have instrumentation to track the metric. | Checkout completion events fire on the confirmation page. |
| Achievable | Validate that the expected change is realistic based on prior data. | Similar changes in the industry produced 5 to 15% lifts. |
| Relevant | Connect the metric to a business goal. | Checkout completion directly impacts quarterly revenue targets. |
| Time-bound | Set a duration for the experiment based on traffic and expected effect size. | Run for 4 weeks to reach 95% statistical power. |
Your turn: Fill in the SMART components for the hypothesis you drafted above.
| Component | Your experiment |
|---|---|
| Specific | |
| Measurable | |
| Achievable | |
| Relevant | |
| Time-bound | |
Design treatments
Follow these guidelines when designing treatments:
- Test one change at a time. Multiple simultaneous changes make attribution impossible.
- Match the control to the current experience. The control group must see exactly what users see today.
- Document variations clearly. Include screenshots, copy, or specifications for developers.
- Ensure both treatments are functional. Do not ship broken or partial variations.
Select metrics
Choose three types of metrics for every experiment:
- Primary metric: The single metric your hypothesis predicts will change. This determines whether the experiment succeeded.
- Secondary metrics: Related metrics that reveal the full impact. For example, average order value alongside checkout completion.
- Guardrail metrics: Metrics that should not degrade. For example, page load time, error rate, or support ticket volume.
Confirm that all metrics are instrumented and reporting correct data before launch.
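One lightweight way to enforce this check is to record the metric plan as data and block launch until every metric is instrumented. The metric names below are placeholders for the checkout example, not required identifiers.

```python
# Hypothetical metric plan for a checkout experiment; names are placeholders.
metrics = [
    {"name": "checkout-completed",  "type": "primary",   "instrumented": True},
    {"name": "average-order-value", "type": "secondary", "instrumented": True},
    {"name": "page-load-time",      "type": "guardrail", "instrumented": True},
    {"name": "error-rate",          "type": "guardrail", "instrumented": False},
]

def launch_blockers(plan):
    """Return the names of metrics that are not yet instrumented.

    Launch only when this list is empty and the plan has exactly one
    primary metric, since the hypothesis predicts a single outcome.
    """
    assert sum(1 for m in plan if m["type"] == "primary") == 1
    return [m["name"] for m in plan if not m["instrumented"]]
```

Running this check in CI or a pre-launch review makes the "all metrics instrumented" requirement mechanical instead of a matter of memory.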
Validate instrumentation with an A/A test
If this is your first experiment on a new application, service, or SDK integration, run an A/A test first. An A/A test serves both groups the identical experience and validates that your metrics pipeline, flag evaluation, and user assignment work correctly end to end. Always analyze A/A tests using frequentist statistics, not Bayesian. Bayesian priors can report a “winning” variation even when both groups receive the same experience.
A successful A/A test shows no statistically significant difference between the two groups. If you see a significant result, investigate before proceeding. Common causes include duplicate metric events, inconsistent context keys, metric events that fire before the SDK initializes, or incorrect flag evaluation logic. To learn more about when to run A/A tests, read the A/A testing guidance in Building a culture of experimentation.
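A common frequentist analysis for an A/A (or A/B) comparison of conversion rates is the pooled two-proportion z-test. The sketch below uses only the Python standard library; the counts in the usage example are made-up illustrative numbers.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test comparing two conversion rates.

    Returns (z, p_value). In a healthy A/A test the p-value should be
    well above 0.05 most of the time; a significant result is a signal
    to investigate the metrics pipeline, not a real effect.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Example with placeholder counts: two identically treated groups.
z, p = two_proportion_z_test(1030, 10000, 1002, 10000)
```

Remember that at a 0.05 significance level, roughly 1 in 20 clean A/A tests will still show a significant result by chance, so rerun before assuming the pipeline is broken.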
Your turn: Identify the metrics for your experiment. Defining these now ensures your instrumentation is complete before development starts.
| Metric type | Metric name | How it is measured | Current baseline |
|---|---|---|---|
| Primary | | | |
| Secondary | | | |
| Secondary | | | |
| Guardrail | | | |
| Guardrail | | | |
Assess risks
Identify risks before launch. Common technical risks include performance regressions from the flag implementation, incomplete metric instrumentation, and insufficient sample size. Common business risks include a negative user experience, conflicts with active experiments, and premature decisions based on early results.
For each risk, define a mitigation strategy and early stopping criteria. Document your rollback plan, which typically means turning off the flag variation.
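Early stopping criteria work best when they are written down as explicit thresholds rather than judgment calls. The sketch below is a hypothetical guardrail monitor; the metric names, baselines, and allowed increases are all placeholder assumptions you would replace with your own.

```python
# Hypothetical guardrail thresholds; baselines and limits are placeholders.
GUARDRAILS = {
    "error_rate":       {"baseline": 0.010, "max_relative_increase": 0.20},
    "p95_load_time_ms": {"baseline": 1200,  "max_relative_increase": 0.10},
}

def breached_guardrails(observed):
    """Return the names of guardrails whose observed value exceeds its limit.

    If this list is non-empty, execute the rollback plan: serve the
    control experience to 100% of traffic by turning off the flag
    variation, then investigate.
    """
    breaches = []
    for name, rule in GUARDRAILS.items():
        limit = rule["baseline"] * (1 + rule["max_relative_increase"])
        if observed.get(name, 0) > limit:
            breaches.append(name)
    return breaches
```

Checking these thresholds on a schedule during the experiment turns the early stopping criteria into an automated decision instead of a debate.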
Review checklist
Use this checklist on your actual test plan before starting development:
- The problem is validated with data, not assumptions
- The hypothesis follows the standard format and includes a rationale
- SMART goals are defined with specific targets and a time period
- The control matches the current production experience
- Each variation changes only one variable from the control
- Primary, secondary, and guardrail metrics are defined
- All metrics are instrumented and firing correctly
- Sample size and experiment duration are estimated
- Technical and business risks are documented
- Early stopping criteria and rollback plans are in place
- If this is the first experiment on this application, an A/A test has passed
- The test plan has been reviewed by at least one other team member
Your turn: Review your test plan draft against this checklist. For any item you marked “no,” note the action needed to resolve it before development begins.
| Checklist item | Status | Action needed |
|---|---|---|
| Problem validated with data | | |
| Hypothesis follows standard format | | |
| SMART goals defined | | |
| Control matches production | | |
| Single variable per variation | | |
| Metrics defined | | |
| Metrics instrumented | | |
| Sample size estimated | | |
| Risks documented | | |
| Stopping criteria in place | | |
| A/A test passed (if first experiment) | | |
| Peer review completed | | |
To learn more about building the process around experiment intake and review, read Experimentation process design.