Test planning

This topic explains how to write a complete test plan for an experiment. Writing the plan before development begins prevents wasted effort and produces higher-quality experiments.

Without test planning: Experiments launch with vague success criteria, missing metrics, or unclear rollback plans. Stakeholders disagree on what “success” means, and development effort is wasted.

With test planning: Every experiment launches with a shared definition of success, validated instrumentation, and a documented decision framework.

Elements of a test plan

A complete test plan contains the following elements:

  • A validated problem statement
  • A testable hypothesis
  • SMART goals
  • Treatment design for control and variation
  • Primary, secondary, and guardrail metrics
  • A risk assessment with early stopping criteria
  • A review checklist

Define the problem and hypothesis

Start with the problem your experiment addresses, then write a hypothesis using the following format:

If we [make this specific change] for [this audience], then [this metric] will [increase/decrease] by [target amount] within [time period], because [rationale based on evidence].

Your turn: Draft the hypothesis for your next experiment using the template below. Fill in each component, then combine them into the full sentence.

| Component | Your value |
| --- | --- |
| Specific change | |
| Target audience | |
| Primary metric | |
| Target amount | |
| Time period | |
| Rationale | |

Your hypothesis: If we ______ for ______, then ______ will ______ by ______ within ______, because ______.
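The template above is mechanical enough to express as structured data. The sketch below is illustrative only (the class and field names are not part of any product API); it refuses to render a hypothesis with a missing component, which mirrors the point of the exercise:

```python
from dataclasses import dataclass, fields

@dataclass
class Hypothesis:
    # field names are hypothetical, chosen to match the template components
    change: str     # the specific change
    audience: str   # who sees it
    metric: str     # the primary metric
    direction: str  # "increase" or "decrease"
    target: str     # target amount
    period: str     # time period
    rationale: str  # evidence-based reason

    def render(self) -> str:
        # an incomplete hypothesis is not a hypothesis; fail loudly
        missing = [f.name for f in fields(self) if not getattr(self, f.name)]
        if missing:
            raise ValueError(f"incomplete hypothesis: {missing}")
        return (f"If we {self.change} for {self.audience}, "
                f"then {self.metric} will {self.direction} by {self.target} "
                f"within {self.period}, because {self.rationale}.")

h = Hypothesis(
    change="add a one-click checkout button",
    audience="returning mobile users",
    metric="checkout completion rate",
    direction="increase",
    target="5%",
    period="4 weeks",
    rationale="session recordings show drop-off on the payment form",
)
print(h.render())
```

The rendered sentence is exactly what goes into the test plan; the structured fields are what later sections (metrics, duration, rationale) draw from.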

Set SMART goals

The following table defines each SMART component:

| Component | Definition | Example |
| --- | --- | --- |
| Specific | Name the exact metric and direction of change. | Increase checkout completion rate. |
| Measurable | Confirm you have instrumentation to track the metric. | Checkout completion events fire on the confirmation page. |
| Achievable | Validate that the expected change is realistic based on prior data. | Similar changes in the industry produced 5 to 15% lifts. |
| Relevant | Connect the metric to a business goal. | Checkout completion directly impacts quarterly revenue targets. |
| Time-bound | Set a duration for the experiment based on traffic and expected effect size. | Run for 4 weeks to reach 95% statistical power. |
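The Time-bound component can be estimated rather than guessed. The sketch below uses the standard normal-approximation sample-size formula for comparing two proportions; the baseline rate, expected lift, and daily traffic figures are assumptions for illustration, not values from this document:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.95) -> float:
    """Sample size per group to detect p1 -> p2 in a two-proportion test
    (normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

# assumed inputs: 20% baseline checkout completion, expect a lift to 22%
n = sample_size_per_group(0.20, 0.22)          # roughly 10,800 per group

# assumed traffic: 800 eligible visitors per day, split across two groups
daily_visitors = 800
days = math.ceil(2 * math.ceil(n) / daily_visitors)
```

With these assumed inputs the estimate lands near four weeks, which is where a duration like the one in the example above should come from: traffic and effect size, not convention.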

Your turn: Fill in the SMART components for the hypothesis you drafted above.

| Component | Your experiment |
| --- | --- |
| Specific | |
| Measurable | |
| Achievable | |
| Relevant | |
| Time-bound | |

Design treatments

Follow these guidelines when designing treatments:

  • Test one change at a time. Multiple simultaneous changes make attribution impossible.
  • Match the control to the current experience. The control group must see exactly what users see today.
  • Document variations clearly. Include screenshots, copy, or specifications for developers.
  • Ensure both treatments are functional. Do not ship broken or partial variations.

Select metrics

Choose three types of metrics for every experiment:

  • Primary metric: The single metric your hypothesis predicts will change. This determines whether the experiment succeeded.
  • Secondary metrics: Related metrics that reveal the full impact. For example, average order value alongside checkout completion.
  • Guardrail metrics: Metrics that should not degrade. For example, page load time, error rate, or support ticket volume.

Confirm that all metrics are instrumented and reporting correct data before launch.
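One way to make that confirmation mechanical is to keep the metric plan as data and diff it against what instrumentation actually reports. This is a sketch, not a real SDK API; the metric names and the `preflight` helper are hypothetical:

```python
# hypothetical metric plan: metric name plus the direction we care about
metric_plan = {
    "primary":   [("checkout_completion_rate", "increase")],
    "secondary": [("average_order_value", "increase")],
    "guardrail": [("page_load_time_ms", "must_not_increase"),
                  ("error_rate", "must_not_increase")],
}

def preflight(plan: dict, instrumented: set[str]) -> list[str]:
    """Return planned metrics that have no instrumentation yet.
    A non-empty result should block launch."""
    wanted = {name for group in plan.values() for name, _ in group}
    return sorted(wanted - instrumented)

# assumed state: only two of the four planned metrics currently report data
missing = preflight(metric_plan, {"checkout_completion_rate", "error_rate"})
```

Running a check like this at plan-review time catches the "we forgot to instrument average order value" problem before development starts rather than after launch.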

Validate instrumentation with an A/A test

If this is your first experiment on a new application, service, or SDK integration, run an A/A test first. An A/A test serves both groups the identical experience and validates that your metrics pipeline, flag evaluation, and user assignment work correctly end to end. Always analyze A/A tests using frequentist statistics, not Bayesian. Bayesian priors can report a “winning” variation even when both groups receive the same experience.

A successful A/A test shows no statistically significant difference between the two groups. If you see a significant result, investigate before proceeding. Common causes include duplicate metric events, inconsistent context keys, metric events that fire before the SDK initializes, or incorrect flag evaluation logic. To learn more about when to run A/A tests, read the A/A testing guidance in Building a culture of experimentation.
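A minimal frequentist check for an A/A result on a conversion metric is a two-proportion z-test. The sketch below uses only the standard library; the counts are made-up example data, not output from any real experiment:

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int,
                           conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates
    (pooled two-proportion z-test)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# assumed A/A data: both groups serve the identical experience
p = two_proportion_p_value(conv_a=510, n_a=5000, conv_b=492, n_b=5000)
healthy = p >= 0.05  # no significant difference is the expected outcome
```

Here the p-value is well above 0.05, which is what a healthy A/A test looks like. A p-value below 0.05 on an A/A test is the signal to go hunting for the pipeline problems listed above.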

Your turn: Identify the metrics for your experiment. Defining these now ensures your instrumentation is complete before development starts.

| Metric type | Metric name | How it is measured | Current baseline |
| --- | --- | --- | --- |
| Primary | | | |
| Secondary | | | |
| Secondary | | | |
| Guardrail | | | |
| Guardrail | | | |

Assess risks

Identify risks before launch. Common technical risks include performance regressions from the flag implementation, incomplete metric instrumentation, and insufficient sample size. Common business risks include negative user experience, conflicts with active experiments, and premature decisions based on early results.

For each risk, define a mitigation strategy and early stopping criteria. Document your rollback plan, which typically means turning off the flag variation.
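Early stopping criteria are easiest to honor when they are written down as concrete thresholds. The sketch below is hypothetical (the guardrail names, limits, and monitoring values are illustrative, not from any real system); the key property is that a breach maps directly to the documented rollback action:

```python
# hypothetical early-stopping criteria, agreed on before launch
STOPPING_RULES = {
    "error_rate": 0.02,         # stop if error rate exceeds 2%
    "p95_load_time_ms": 3000,   # stop if p95 page load exceeds 3 seconds
}

def breached_guardrails(current: dict[str, float]) -> list[str]:
    """Return guardrails whose current value exceeds its stopping threshold.
    Any breach triggers the rollback plan: turn off the flag variation."""
    return [metric for metric, limit in STOPPING_RULES.items()
            if current.get(metric, 0.0) > limit]

# assumed monitoring snapshot during the experiment
breached = breached_guardrails({"error_rate": 0.035, "p95_load_time_ms": 2100})
if breached:
    # rollback: everyone gets the control experience again
    print(f"stopping early, guardrails breached: {breached}")
```

Codifying the criteria this way removes the mid-experiment debate about whether a degradation is "bad enough": the decision was made in the test plan.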

Review checklist

Use this checklist on your actual test plan before starting development:

  • The problem is validated with data, not assumptions
  • The hypothesis follows the standard format and includes a rationale
  • SMART goals are defined with specific targets and a time period
  • The control matches the current production experience
  • Each variation changes only one variable from the control
  • Primary, secondary, and guardrail metrics are defined
  • All metrics are instrumented and firing correctly
  • Sample size and experiment duration are estimated
  • Technical and business risks are documented
  • Early stopping criteria and rollback plans are in place
  • If this is the first experiment on this application, an A/A test has passed
  • The test plan has been reviewed by at least one other team member

Your turn: Review your test plan draft against this checklist. For any item you marked “no,” note the action needed to resolve it before development begins.

| Checklist item | Status | Action needed |
| --- | --- | --- |
| Problem validated with data | | |
| Hypothesis follows standard format | | |
| SMART goals defined | | |
| Control matches production | | |
| Single variable per variation | | |
| Metrics defined | | |
| Metrics instrumented | | |
| Sample size estimated | | |
| Risks documented | | |
| Stopping criteria in place | | |
| A/A test passed (if first experiment) | | |
| Peer review completed | | |

To learn more about building the process around experiment intake and review, read Experimentation process design.