A/B testing


If you’ve ever read articles with advice on any aspect of marketing your eCommerce business (and if you’ve read any of our previous lessons), you’ve seen the suggestion that you should “test to see what works for your customers” come up a lot. While there are broad statistics, conclusions, suggestions, and pieces of advice that are valuable for eCommerce stores, those are really just general jumping-off points for you to test what works with your specific eCommerce store.

Maybe you’ll find your audience doesn’t like standard 10 AM emails, but instead responds way better at 4 PM. Maybe you’ll find a green call-to-action button gets four times the clicks of a blue one. Maybe you’ll find that segmenting by gender has no effect on your sales, but segmenting by geography does. Whatever you find, you’ll find it through A/B testing.

A/B testing, which is also known as split testing and bucket testing, is where you send two different versions of an email (the “A” version and the “B” version) to two different groups of subscribers, then compare the results. Although only 31 percent of companies do A/B testing on their email marketing, it’s extremely important and powerful—and a way to unlock some results you might never otherwise find.

Every stat in every lesson so far has been based on averages, usually across large numbers of users. Consider those numbers to be the results of testing on a generic audience. With A/B testing, you learn what works for your specific audience. You can use it to continually improve your email marketing to maximize its value. And the best part: It’s relatively easy and cheap to do, but yields highly valuable insights.

In this lesson we’ll go over the four steps of running an A/B test: Figuring out what to test, finding a sample size, running the test, and analyzing the results.

How to run an A/B test

Step 1: Figure out what you want to test

You can test virtually any aspect of your email marketing. For example:

  • Timing and frequency of sends.
  • Subject line tweaks, like capitalization, length, wording, emojis, and personalization.
  • Content, like plain text versus HTML, segmentation, design layouts, headlines, and types of offers.
  • Calls-to-action, like button versus link, placement, button wording, button color, and the number of CTA buttons in an email.

So with a seemingly infinite (or, at least, very large) number of things to test, how do you decide which to do? You use a special ranking system to prioritize.

Using an ICE score to prioritize A/B testing

An ICE score is a prioritization framework where you score different items on three criteria:

  • Impact. How much of an impact on your future business do you think the test will have?
  • Confidence. How confident are you that the test will answer a big question and, therefore, provide valuable insight you can implement going forward?
  • Ease. How easy will it be to run this test?

Give each possible A/B test a score of 1-to-10 in all three of those criteria, then average the results to get the ICE score.

Make a list of possible A/B tests—you don’t need to make it exhaustive, you can start with 20 or 25 things you’d like to test—and give each item a 1-to-10 ranking in those three categories.

For example, let’s say you want to try using all capital letters in a call-to-action button versus your current sentence case. You’re not expecting that to transform your business (probably just a little bit of an improvement in clicks if it works), so for impact, that’s a 4. You’ve been wondering about changing your button like that for a while, and you know that after this test you’ll be pretty confident in the right direction going forward, so for confidence, you give it a 9. And tweaking capitalization on a button is easy to do, so for ease, you give it a 10. So you average those (4 + 9 + 10 = 23, divided by 3) for an ICE score of 7.67.
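The scoring above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function name `ice_score` and the two extra test ideas (with their ratings) are made up for the example; only the all-caps button ratings come from the lesson.

```python
def ice_score(impact: int, confidence: int, ease: int) -> float:
    """Average the three 1-to-10 ratings into a single ICE score."""
    ratings = (impact, confidence, ease)
    for rating in ratings:
        if not 1 <= rating <= 10:
            raise ValueError("each rating must be between 1 and 10")
    return sum(ratings) / len(ratings)


# Hypothetical backlog of test ideas: (impact, confidence, ease).
# Only the first entry uses ratings from the lesson; the rest are invented.
ideas = {
    "All-caps CTA button": (4, 9, 10),
    "Send at 4 PM instead of 10 AM": (7, 6, 8),
    "Plain text versus HTML": (6, 5, 3),
}

# Sort the ideas from highest to lowest ICE score.
ranked = sorted(ideas.items(), key=lambda item: ice_score(*item[1]), reverse=True)

for name, ratings in ranked:
    print(f"{ice_score(*ratings):.2f}  {name}")
```

The idea at the top of the printed list is the one to test first. A spreadsheet with an AVERAGE column does the same job; the point is only that the ranking is mechanical once the three ratings are in place.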

Testing multiple things at once

You can test more than one thing at a time. However, it’s not recommended. The goal with any form of scientific testing is to find out how a variable affects results. When you introduce multiple variables, suddenly it becomes a lot harder to know which one is making a difference. Say you run a test where one version of an email uses different photos and a different headline and a different call-to-action. It performs better than the other version. Great—but what led to the improvement? Was it the photos? The headline? The call-to-action? Maybe the headline with your old photos and call-to-action would’ve led to even bigger improvements, because the new photos and call-to-action dragged it down. You can’t be sure—and now, with future emails, you don’t know which of those three things to continue to change to make a positive difference.

When you just test one thing at a time, you can confidently measure its positive or negative effect on your results.

Pick a key metric for your test

You need to be able to measure the results of your A/B test. As we discussed in the lesson on analytics, there’s an abundance of email marketing statistics, which will allow you to pick the one that most accurately indicates whether your test was a success, inconclusive, or a failure.

For instance, if you’re testing a capitalized versus un-capitalized call-to-action, the metric you’d want to use is click-through rate—since the purpose of a call-to-action is to entice clicks. If, say, you were testing two different subject lines, you’d want to measure open rate.

Step 2: Find your sample size

You’ve probably heard that the Nielsen TV ratings, which networks, shows, and advertisers continue to live or die by to this day, are based on just a few thousand people’s TV habits. The same is true for all of the most prominent surveys we see, from political polls to “16 percent of Americans have personally seen a UFO.” All of those results are based on representative samples: samples that are large enough to provide a result that is statistically significant, but small enough to be manageable and attainable.

You’re aiming to do the same thing with your A/B testing. You want to send the “A” version and the “B” version to large enough groups that you can say, confidently, the results of your test were not a fluke.

The statistical terms and concepts you need to know

We’re going to dive very briefly into the statistical terms you need to know in order to run a test. We’ll be using these terms quite a bit in this section, so it’s important to know what they are and why we need them.

Sample size. Sample size refers to the number of people you’ll be sending each version of the test to: a “sample” of your overall list population. Your goal in A/B testing is to find the right size sample to give you a statistically representative picture of your list as a whole.

Confidence level. The confidence level describes how reliable your testing method is: if you repeated the same test many times, it is the percentage of those repeats in which your result, plus or minus the margin of error, would capture the true value for your whole list. Which is to say, at a 95 percent confidence level, about 95 out of 100 repeats of the test should land within the margin of error of the true result. We went with the 95 percent confidence level there because that’s pretty darn accurate but can be achieved with a shockingly small sample size (which we’ll get into below). In general, the larger the sample size, the higher the confidence level you can reach for a given margin of error.

Margin of error. The margin of error tells you how far the results from your sample may differ from the results if you actually tested across an entire population. If you have a 95 percent confidence level and a three percent margin of error, that means the result for your whole list should be within three percentage points in either direction of your sample’s result in 95 percent of cases.

Chi-square test. A chi-square test is used to determine whether a result is caused by random chance or if it actually means something. You’ll use it to determine if the results of your A/B test are signal or just noise.

Statistical significance. Ultimately, you’ll want to know if your results are statistically significant—that is, if they are strong enough findings that you can say you proved (or disproved) a hypothesis and go on to use those results in the future.

Finding a sample size

To find your sample size, you’ll use an online calculator where you enter the confidence level you want, the total size of your email list (“population size”), and the margin of error you want. You can play with the numbers and, as you’ll find, if you want a higher confidence level and a lower margin of error, you’ll need to use larger and larger samples.

The most common confidence level in testing is the 95 percent confidence level, as it can be attained through a surprisingly small sample size. A three percent margin of error is common for the same reason—attainable through a relatively small sample but still a pretty accurate result.

For example, if you have a list of 5,000 people, to run a test at a 95 percent confidence level with a three percent margin of error, you only need to send each version to 880 people. 
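If you’re curious what a calculator like that is doing, here is a minimal sketch. It assumes the calculator uses Cochran’s formula with a finite population correction, which is a common choice but an assumption here, since the lesson doesn’t name a specific calculator. The function name `sample_size` is made up for the example, the z-score of 1.96 corresponds to a 95 percent confidence level, and `p = 0.5` is the most conservative guess about the response rate.

```python
import math


def sample_size(population: int, z: float = 1.96, margin: float = 0.03,
                p: float = 0.5) -> int:
    """Estimate how many people should receive each version of the email.

    Assumes Cochran's formula with a finite population correction.
    z      -- z-score for the confidence level (1.96 for 95 percent)
    margin -- margin of error as a fraction (0.03 for three percent)
    p      -- assumed response rate; 0.5 gives the largest (safest) sample
    """
    # Sample size for an effectively unlimited population.
    n_infinite = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Shrink it to account for the actual size of the list.
    n_adjusted = n_infinite / (1 + (n_infinite - 1) / population)
    return math.ceil(n_adjusted)


print(sample_size(5000))  # 880
```

With a 5,000-person list this reproduces the 880 figure above. Raising the z-score or lowering the margin makes the required sample grow, which is the tradeoff you see when playing with an online calculator.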

There are cases, however, where you might want to increase the confidence level and decrease the margin of error. Those are the cases when you’re testing something very small, like the color of a button or a one-word difference in a headline or call-to-action. Changes that tiny aren’t likely to move your results much one way or the other, so you’ll need more precise data to detect any difference they make.

And yes, if you’re wondering, you can skip all of this and just split your list in half and send “A” to one half and “B” to the other half. But you don’t need to do that to get a statistically significant result.

Step 3: Prepare and send the two versions of your email

Design your email, then create the “B” version that has the one tweak you’re A/B testing. Think of the “A” version as the control in the experiment—that’s the standard email, the one you would’ve sent if you weren’t running a test. The “B” version is testing your hypothesis—it contains a change to the variable that’s the focal point of your experiment.

As we said earlier, it’s ideal to keep everything in the emails 100 percent identical except for the one thing you’re testing. That’s your best bet to ensure that the one thing you’re testing is responsible for any performance differences between the two versions of the emails.

Now you’ll want to send your two emails to different, randomly selected groups of subscribers. You determined the size of those groups using the online calculator in the above section, as they fit your desired confidence level and margin of error.
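Most email service providers can pick random groups for you, but if you’re working from an exported list, the random split can be sketched like this. The function name `pick_test_groups` and the placeholder addresses are hypothetical; the group size of 880 is the one from the sample size example earlier in the lesson.

```python
import random


def pick_test_groups(subscribers, group_size, seed=None):
    """Randomly pick two non-overlapping groups of subscribers."""
    if 2 * group_size > len(subscribers):
        raise ValueError("list is too small for two groups of that size")
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    # Draw both groups in one pass so nobody can land in both.
    chosen = rng.sample(subscribers, 2 * group_size)
    return chosen[:group_size], chosen[group_size:]


# Placeholder list standing in for a 5,000-person export.
subscribers = [f"subscriber{i}@example.com" for i in range(5000)]

group_a, group_b = pick_test_groups(subscribers, 880, seed=42)
print(len(group_a), len(group_b))  # 880 880
```

Drawing both groups from a single random sample is what keeps the split fair: splitting alphabetically or by signup date could put systematically different people in each group and muddy the results.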

At this point, you should also determine the end date for your results. Usually about four or five days is good—after that, you can expect your email has disappeared deep into everyone’s inbox and they won’t be interacting with it anymore.

One more note: It can be tricky to A/B test automated emails—unless your email service provider allows for it, the same automations will go out to everyone. In those cases where you can’t run tests concurrently, you can test two emails over different periods of time and compare those results. As long as each test has a large enough sample to be statistically significant, you can use that staggered A/B testing to get actionable results.

Step 4: Analyze your results

Bring in the data and put it in a spreadsheet. (We’ll show you a basic version of how to set up the spreadsheet below.) After that, it becomes time to analyze the data to figure out if your results are statistically significant.

We’re going to demonstrate how to do that with an example below. It can feel a little complicated at points, but there are two things to keep in mind. One, there are plenty of websites that can do these calculations for you, not to mention built-in formulas in Microsoft Excel and Google Sheets that will handle the calculations automatically. And two, there are plenty of paid apps out there that can run A/B tests for you and take virtually all of this work off your hands. We’re explaining the underlying concepts of an A/B test and everything that goes into running one; however, you can pay to not have to do any of this on your own.

Here’s an example with some tangible numbers to help illustrate how to analyze A/B test results.

Let’s say you’re testing two offers: A buy-one-get-one-free (BOGO) coupon versus $10 off. The metric you’re basing your test on is conversions: how many total people use each type of coupon.

You have a list of 5,000 subscribers and you want to test at a 95 percent confidence level with a three percent margin of error. You use an online calculator to help you find your sample size and learn you need to send the “A” version of the email, with the BOGO offer, to just 880 people. You need to also send the “B” version, with the $10 off, to a different 880 people. 

You end the test on day five and find that 100 people used the $10 off offer (11.4 percent of people), and 40 people used the BOGO offer (4.5 percent of people). With your three percent margin of error, that means if you ran this test over and over, 95 percent of the time, somewhere between 8.4 and 14.4 percent of people would use the $10 off coupon, and somewhere between 1.5 percent and 7.5 percent would use the BOGO offer.

It sure looks like the $10 off is a better offer than the BOGO—but is that statistically significant? We’ll use a chi-squared test to figure that out. Here are the actual results from our experiment:

Actual Results

Offer      Used coupon   Did not use coupon   Total sent
$10 off    100           780                  880
BOGO       40            840                  880
Total      140           1,620                1,760

The next step is to calculate the expected conversions if the offer made no difference, as if you hadn’t been testing two different methods and, instead, had just sent one offer. We figure that out for each box with (column total * row total) / total recipients. In this case, the expected results for both offers are identical because we used identical sample sizes.

Expected results

Offer      Used coupon   Did not use coupon   Total sent
$10 off    70            810                  880
BOGO       70            810                  880
Total      140           1,620                1,760
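The expected-results calculation can be sketched in Python as follows. The observed counts come from the example (100 of 880 used the $10 off coupon, 40 of 880 used the BOGO coupon); the function name `expected_counts` and the dictionary layout are just one illustrative way to organize the numbers.

```python
# Observed results from the example: coupon used ("yes") or not ("no").
observed = {
    "$10 off": {"yes": 100, "no": 780},
    "BOGO": {"yes": 40, "no": 840},
}


def expected_counts(observed):
    """Apply (column total * row total) / grand total to every box."""
    # Row totals: how many people received each offer.
    row_totals = {offer: sum(cells.values()) for offer, cells in observed.items()}
    # Column totals: how many people overall did or did not convert.
    outcomes = list(next(iter(observed.values())))
    col_totals = {
        outcome: sum(cells[outcome] for cells in observed.values())
        for outcome in outcomes
    }
    grand_total = sum(row_totals.values())
    return {
        offer: {
            outcome: row_totals[offer] * col_totals[outcome] / grand_total
            for outcome in outcomes
        }
        for offer in observed
    }


expected = expected_counts(observed)
print(expected["$10 off"])  # {'yes': 70.0, 'no': 810.0}
print(expected["BOGO"])     # {'yes': 70.0, 'no': 810.0}
```

Both offers come out at 70 expected conversions because 140 total conversions are spread evenly across two equal-sized groups.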

Now we’re ready to use the chi-squared method to determine statistical significance. To do that, for each of the four boxes, we calculate (expected - observed)^2 / expected.

Chi-squared results

Offer      Used coupon   Did not use coupon
$10 off    12.86         1.11
BOGO       12.86         1.11
Total of all four boxes: 27.94

Our chi-squared result is the total of all four yes and no values added up, which comes out to 27.94. You will then compare that number to the threshold value in a chi-squared table for a 95 percent confidence level with one degree of freedom, which is 3.84. Since 27.94 is bigger than 3.84, we can say this test is statistically significant. The $10 offer performed better than the BOGO, and the gap is far too large to chalk up to chance.
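Here is a minimal sketch of the whole chi-squared calculation in plain Python, using the example’s numbers. The function name `chi_squared` and the constant name are made up for the illustration; 3.84 is the threshold from the lesson.

```python
def chi_squared(observed):
    """Sum (expected - observed)^2 / expected over every box in the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, actual in enumerate(row):
            # Expected count if the offer made no difference.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (expected - actual) ** 2 / expected
    return statistic


# Rows: $10 off, BOGO. Columns: used coupon, did not use coupon.
observed = [
    [100, 780],
    [40, 840],
]

# Threshold for a 95 percent confidence level with one degree of freedom.
CRITICAL_VALUE_95 = 3.84

statistic = chi_squared(observed)
print(f"{statistic:.2f}")             # 27.94
print(statistic > CRITICAL_VALUE_95)  # True
```

If you use a statistics library instead, check its defaults first. For instance, SciPy’s `chi2_contingency` applies a continuity correction to two-by-two tables unless you pass `correction=False`, so its default output will be slightly lower than the hand calculation shown here.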

Again, that all can feel pretty complicated, especially if you’re not a math person, but the calculations are all relatively easy (and can be handled automatically). 

Now you’ve run an A/B test. As you go forward, you may find you can start running A/B tests on every email as you continue to refine your strategies and prioritize your tests via their ICE scores.

Re-running tests

It’s important to remember that your A/B test results may not hold true forever—especially as your business evolves and your list and customer base grow. There are other factors that can affect test results as well, like seasonality or what your competitors are doing at that time. And, of course, your list may grow blind to your marketing if you use the same style or techniques repeatedly. (After all, a subject line that says “URGENT!!! 🚨🚨” may win an A/B test on open rates—but that subject line will lose its effectiveness if you use it on every email.)

That’s why it’s important to monitor your long-term results as you continue to A/B test. If something that once proved effective no longer seems to be working, you can re-run some A/B tests. For instance, if your open rates drop, it can be time to retest subject lines, preview text, and the from name.

Summary and implementations


A/B testing is the way to experiment with different facets of your email marketing to find what works best for your audience. And doing so is crucial; while there are countless studies that have examined basically every nuance of email marketing, you need to do what works best for your specific audience. And to determine that, you need to run your own tests.

There are four steps to the A/B testing process.

First, figure out one thing you want to test. You can prioritize what to test by assigning different options an ICE score, which takes into account their impact, your confidence in getting usable results, and the ease of implementation. Make sure you pick the metric you’re going to use to judge the results of the test, and only test one thing at once to eliminate noise.

Second, determine your sample size. You can use an online calculator to figure out the size of the “A” and “B” groups for your test. The samples are based on the size of your list, the confidence level you want to test at (95 percent is usually good), and the margin of error you’re comfortable with (three percent is common).

Third, you’ll need to prepare two different versions of your email, where only the one thing you’re testing is different between them. Then send each email out to its sample group, and end the test after a predetermined length of time.

And fourth, you’ll analyze your results. You can use a chi-square test to find whether your results are statistically significant and, therefore, something you should implement in emails going forward.


Step 1: Figure out some things you want to test

  • Make a list of 20 to 25 things you want to test on your emails.
  • Give an ICE score to each of them to figure out your first priority.

Step 2: Find your sample size and run your first test

  • Use an online calculator to find the right sample size for your test.
  • Prepare two versions of your email.
  • Send the versions to the different groups. (Or at different time periods if you’re testing automations in a staggered manner.)
  • Collect the results at the end of your test period.

Step 3: Analyze your results

  • Calculate whether your results are statistically significant.
  • If they are, plan to implement your findings in future emails. If they’re not, decide if you want to run another test on the same item (for instance, if capitalized versus non-capitalized call-to-action text didn’t make a difference, maybe you’ll want to try a call-to-action button versus a text link next).
  • Plan your next A/B test.
  • Re-run tests if results begin to dip.