Key Takeaways
- For small businesses with fewer than 1,000 transactions per month, running experiments may not yield significant outcomes. However, they can still optimize their website with other research methods such as user research, feedback, surveys, screen recordings, heat mapping, and data cohorts.
- Even with limited data, businesses can still adopt a test-and-learn mindset. They can run fake door experiments or buy traffic on Facebook or Google to understand user behavior.
- If you have enough data to run experiments, the number of simultaneous experiments is unlimited. Running multiple experiments on the same page introduces more noise into the data and lowers the percentage of significant results, but the absolute number of significant outcomes will increase, leading to faster learning and growth.
- There are technical challenges to running multiple experiments simultaneously: if too many experiments run on the same page, the page could break. A robust setup is needed, and development limitations should be considered.
- Features like mutually exclusive groups, where a user sees only one experiment on a page, can help manage multiple experiments. However, this requires a lot of traffic.
Summary of the session
The webinar, hosted by Jan Marks from VWO, features Florentien Winckers, Experimentation Consultant at Albert Heijn, and Ton Wesseling, Founder of Online Dialogue, discussing the challenges and strategies of integrating experimentation into product development. They delve into the importance of data quality, trust in data, and the potential pitfalls of overpromising in the experimentation industry.
The speakers also discuss the significance of tracking metrics aligned with a company’s strategy, and the concept of optimizing backwards, starting from the final step of the user journey. The webinar provides valuable insights into the evolving landscape of data usage, experimentation, and optimization strategies.
Webinar Video
Top questions asked by the audience
- How do small and medium-sized businesses optimize if we're looking at a future where Google AdWords, Google Analytics, and Facebook are likely to be banned?
Ton: I believe there are two separate issues here. The cookie-less world is one thing. But this question is also about the European legal issues around storing data at US-controlled companies. Within the next months, European countries may no longer allow data to be stored, even on servers in Europe, if those servers are owned by Google or Facebook, because those companies fall under US surveillance legislation and the US government can obtain that data, which is not allowed from a European perspective. That's going to be a challenge for every company, not only small and medium businesses but also large ones. There's no European cloud yet that can handle all that data, so we have a problem there. If you are still selecting your software, it makes sense, if you're in Europe, to work with European software vendors, but it's hard: there's not enough available to solve this challenge, so we'll have to see what the future brings. The cookie-less world is maybe even more interesting, because if you're not allowed to do anything with cookies, how can you recognize users in your experiments? That's a problem for bigger companies as well as for small and medium businesses. It's third-party cookies that are being banned now, or that can no longer be pushed to the browser; first-party cookies are still doable. So with server-side experimentation, you launch a cookie from your own web server, and I think every vendor nowadays has a solution for that. But if that also gets banned, in the end you want people to log in, to identify themselves to you. If you are a brand that can be trusted and you offer incentives, like a loyalty program, people will log in to your website and identify themselves. That even solves the issue of running experiments on mobile and laptop at the same time for the same user, because cross-device recognition is a challenge we already face in experimentation today. So I think everyone will be pushing for logins more and more.
Florentien: Yeah. I think logged-in users are becoming more and more valuable, so the focus needs to be on that. Maybe your experimentation incentives should be more focused on getting people to log in, or to become more loyal and join your loyalty program. Those things are also of high value.
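To make the first-party cookie approach concrete, here is a minimal sketch (not from the webinar) of how a server-side setup might deterministically assign a variant and persist it in a first-party cookie served from your own domain. The experiment ID, cookie name, and function names are illustrative assumptions, not any vendor's actual API.

```python
import hashlib

# Hypothetical experiment identifier; any stable string works.
EXPERIMENT_ID = "checkout_cta_v1"

def assign_variant(user_id: str, variants=("control", "treatment")) -> str:
    """Hash user id + experiment id so the same user always gets the same variant."""
    digest = hashlib.sha256(f"{EXPERIMENT_ID}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def set_cookie_header(user_id: str) -> str:
    """Build a first-party Set-Cookie header, served from your own web server."""
    variant = assign_variant(user_id)
    return f"Set-Cookie: exp_{EXPERIMENT_ID}={variant}; Max-Age=2592000; SameSite=Lax; Secure"

if __name__ == "__main__":
    # Same input always yields the same variant on every request.
    print(set_cookie_header("user-42"))
```

Because assignment is a pure function of the user ID, the cookie is only needed to recognize the user, which is why logins (a stable first-party identifier) also solve the cross-device problem mentioned above.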
- Many companies think they do not have the volume to test. They would like to know how many tests they can or should run simultaneously, and what the impact on the customer or user could be if they run more tests at once. So, how many tests can or should you run simultaneously? Is there a rule for that?
Florentien: I think it depends on the amount of traffic you have and the number of experiments you want to run simultaneously. Running several tests at once shouldn't be a problem if you have enough traffic, but it also depends on the metric you are testing: for some metrics you need more data to measure any effect, while for metrics where you expect a bigger change, you might need less traffic. So there is not really one answer to this question, in my opinion. You can run simultaneous tests, but you need to think carefully about how much traffic you have, how many tests you want to run, and how much traffic is available for each of them.
Ton: I don't fully agree on the simultaneous tests, but I will pause that answer for a moment, because I believe there are two questions in this question. The first is: when I don't have the number of users to run experiments, what should I do? If you are a small or medium-sized business below 1,000 transactions per month, should you be running experiments? The answer would be no, because you don't have the data to come to significant outcomes, at least from a transaction perspective. Maybe you have enough clicks, and then you can still run experiments on those. You can do fake door experiments. You can buy traffic on Facebook or Google, as long as it's allowed, and use it as research to understand the behavior of your users and visitors. Because, in the end, optimization is a task; experimentation, a testing culture, is a mindset. You still want to test and learn. Maybe you cannot use experiments that often, but you can still use all sorts of other research methods, such as user research, feedback, surveys, screen recordings, heat mapping, and data cohorts, to come up with better alternatives. You will probably not be able to test those against transactions, so you need to take more risk, but you can still run the whole process of optimizing your website.
If you do have the numbers to run experiments, then the number of simultaneous experiments is unlimited. There are two sorts of simultaneous experiments. Say you have a checkout flow of three steps: you can run an experiment on step 1 and step 2, or you can even run four experiments at the same time on step 1. This is the part where most people say you should not do that, or that you should run a multivariate experiment instead, but you can have four A/B experiments at the same time on one specific page. What happens in reality is that there will be more noise in the data, because maybe the combination of variations 1 and 2 adds value to the conversion, and not each one separately; they have to be there together. So you will get more noise, your winning percentage will go down, and the percentage of significant results will drop a little. But instead of one experiment with, say, a 30% win percentage, you run four experiments with a 25% win percentage, which in the end leads to more significant outcomes. We take that noise for granted: the percentage goes down a little, but the absolute number of significant outcomes increases. So you will learn faster, and you will outgrow competitors who run only one experiment at a time.
And of course, there's a technical challenge here. If you run, say, 25 A/B experiments on the same page on all different elements, at some point the page will break. So your setup needs to be really robust, and there will be development limitations on the number of experiments you can run at the same time. I think that's the real limit; it's not a limit from a statistical perspective.
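As a back-of-the-envelope illustration of Ton's arithmetic, a minimal sketch using the win percentages quoted above: even though noise lowers the win rate per experiment, more simultaneous experiments raise the absolute number of significant outcomes.

```python
# Expected number of winning (significant) experiments per cycle,
# using the rates quoted in the answer above.
single_experiment = 1 * 0.30   # one experiment at a 30% win percentage
four_simultaneous = 4 * 0.25   # four noisier experiments at 25% each

print(f"Expected winners, one experiment: {single_experiment:.2f}")  # 0.30
print(f"Expected winners, four at once:   {four_simultaneous:.2f}")  # 1.00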
- Is there going to be a cookie-less world?
Will all browsers ban the possibility of storing information locally on the user's machine? I think that's the main question. It could happen, but they will probably give the controls to the user, so the user can allow you to store local data. The default used to be that you can store information unless it gets rejected; now the default is becoming that you cannot store information unless you get consent. Data may not be stored on a local machine at all anymore. If you look at Tim Berners-Lee's Solid initiative, it creates pods of data hosted in the cloud, controlled by the user. The user can allow websites to use specific data, even data from other websites, which those websites can then use to optimize experiments for that specific user. You are in full control of the data, not locally but hosted in the cloud, and you control who can use and store that information. We will probably move in that direction, but as a website owner, you will still be able to use user behavior when the user allows you to. So if you ask for consent and they give it to you, then you can use it.
- What is your perspective on creating a holdout across simultaneous site page tests, so you understand and maintain a baseline experience?
Ton: Maybe I should first explain the holdout to the attendees. This is the practice where you keep a fixed percentage, say 10%, of your users out of any experiment, so they have the same experience throughout your whole experimentation process. The question is: is this a good thing to do? The only companies asking for this are companies that do not trust experimentation yet. Maybe they were over-promised, coming back to the question at the beginning: they see all these winners, and once those get implemented, they don't see the uplift they were expecting, because they were just adding up the numbers from all the winning experiments without looking at false discovery rates or type M errors. That's why they want to create a holdout: they want to see what's really going on. I don't think that's the best thing to do, because the underlying problem is that they don't trust the statistics. So you have a different problem to solve, and creating a holdout group is not how you want to solve it. You want to build trust in experimentation and data. Teach them, explain it to them, hire someone external who can elaborate on how statistics is done in experimentation. Once trust is there, you can continue with the program, and you don't have to worry about the holdout.
Florentien: Exactly, I agree. A holdout group, if your experimentation platform really works, can cost you a lot of money, since this group isn't exposed to all your winning variants. So it's actually quite expensive. It also makes the code far more complex for developers, because all the old variants need to be kept intact as well. And if you really want to know whether something you added in the past still contributes to your metrics, you can do reverse testing: remove it for a small group to see if the thing you added is still a winning variant, or still adding value.
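For readers unfamiliar with the mechanics, here is a minimal sketch (an illustration, not any platform's implementation) of the global holdout Ton describes: a fixed slice of users is excluded before any experiment assignment happens. The 10% figure matches his example; all names are hypothetical.

```python
import hashlib

HOLDOUT_PCT = 10  # percent of users never entered into any experiment

def in_holdout(user_id: str) -> bool:
    """Hash on a holdout-specific salt so the slice is stable across all experiments."""
    digest = hashlib.sha256(f"global-holdout:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < HOLDOUT_PCT

def experience_for(user_id: str) -> str:
    if in_holdout(user_id):
        return "baseline"   # never sees any experiment variant
    return "eligible"       # proceeds to normal experiment assignment

if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    held = sum(in_holdout(u) for u in users)
    print(f"{held / 10:.1f}% of users in holdout")  # close to 10%
```

Note Florentien's cost argument falls directly out of this design: the "baseline" slice never receives any winning variant, so its revenue lags the rest of the traffic for as long as the holdout runs.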
- What would be your advice on the trade-off between small A/B testing on single elements versus testing a redesign?
Florentien: In my opinion, it's always better to do small experiments than to test a redesign. Of course, sometimes you really want to change the look and feel of your website, but the problem with a redesign is that you're making so many changes that you have no clue which change caused which effect. So I would always recommend making the change as small as possible. If you do want to change a whole page, and I know from experience that sometimes bigger changes need to be made because programs need to be switched or something similar, then try to group together changes that are likely to have the same effect, so you can learn as much as possible from your experiments.
Ton: In the end, you want to do both. You have experimentation for user research, and experimentation for conversion optimization. The work we do is all based on W. Edwards Deming and the quality circle, Plan-Do-Check-Act: small incremental steps to understand what's going on. Once you understand what's going on, you can create a new level of quality. So it makes sense, if you look at a specific product page for instance, to run all these small experiments to really understand what's going on. Once you understand user behavior, you apply those learnings to a new design. That's your new baseline, and of course you test that design too. If it's good, you can continue and optimize from there. So in the end, you do both. If you really want to understand what's causing a difference, you have to run a small test; sometimes you have to take a bigger step. It's not either-or. I've seen small changes cause really big effects, and big changes cause small effects, and vice versa.
- Can every company, independent of its status quo and stage, kick off experimentation?
Florentien: Yes, of course.
Ton: It's a mindset. If you want to become an experimentation-driven company, you can start by doing this, and experimentation-driven does not mean that everything has to be an A/B experiment. If you don't have the data in the beginning, you can still have the experimentation mindset and test, learn, and optimize, and at some point you can also run experiments. If you're really low on data and still at the beginning, a startup, then you have to take more risk, but you can still do user research, screen recordings, heatmaps, data analysis, and so on, and then at some point scale up to experimentation.
- Do you think it's possible to be a CRO specialist without including A/B testing in your operations? - by Royal
Ton: For our own company, we are a small business, so we don't have the data to run experiments on our website. We run experiments on our email links and our advertising, because there we have the data to make proper decisions. But to me, conversion optimization is a task; experimentation is a culture and a mindset. So you can perform conversion optimization without A/B testing. You're just going to have different ways of telling whether you're doing something good. You will be more biased and you will make fewer, lower-quality decisions because you're not A/B testing, but you can still do conversion optimization. We still optimize our website; we just cannot run A/B experiments on it.
Florentien: I think you can do all sorts of research to optimize; A/B tests, controlled experiments, are simply the highest level of evidence. So it depends on the risk you would like to take, and the level of uncertainty you are willing to accept.
- Would you run an experiment if you knew you wouldn't have the time to reach statistical significance? Could you read the results based on the visual shift in user behavior and take action, knowing that the bottom-line conversion rate was not 100% accurate? - by Molly
Ton: Power is defined as the chance that you will find a significant outcome if there is a difference to be detected. If your power is too low and you know upfront that your outcome will not be significant, then you should not run the experiment. You need to take more risk and just implement the change. And in statistics, there is no such thing as "almost significant," and no such thing as "it looks like it's probably heading toward significance." That's statistical nonsense; it's false positives ruining your decision-making. If you don't have the data, you cannot run experiments and you just have to make a decision. That will be faster and riskier, but you're in a stage where you have to take more risk.
Florentien: Definitely. Yeah, I agree with that.
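The "know upfront" part of Ton's answer is a power calculation. As a minimal sketch, the standard two-proportion formula below estimates how many users per variant you would need; the baseline rate, uplift, and defaults of alpha = 0.05 and 80% power are illustrative assumptions.

```python
from scipy.stats import norm

def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.80):
    """Users needed in each arm of a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return (z_alpha + z_power) ** 2 * variance / (p_base - p_variant) ** 2

# e.g. detecting a lift from a 3.0% to a 3.6% conversion rate
n = sample_size_per_arm(0.030, 0.036)
print(f"~{n:,.0f} users per variant")  # roughly 13,900 per arm
```

If your realistic traffic over the test window is far below this number, the experiment cannot reach significance, which is exactly the situation where Ton says you should skip the test and accept the risk.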
- What are your thoughts on the impact of the false discovery rate while calculating the impact of a conversion rate optimization program?
Ton: It's so easy to present one A/B experiment and say: we have a significant effect, a 4% uplift, and if we implement this, within one year we will generate a million in extra revenue. You cannot say this from one experiment, because that experiment can be a false positive: the outcome is significant, but the measured outcome is not the same as reality. The only way to put a monetary value on your CRO program is to look at a whole group of experiments, say 100. Then you can calculate the false discovery rate, because you know upfront, based on your significance levels, how many of those outcomes will probably be false positives. And once you implement them all, the type M error adds to the equation: all your significant results are right-skewed, as we say in statistics, so the measured outcome looks more positive than it is in reality. If you account for those two, you can calculate the added value of your optimization program. But also make sure you add the negatively significant outcomes, the things you did not implement because the experiment told you not to. Add those together and you have a value for your program, but you will not be able to tell whether one specific experiment brought the money; you just don't know. Only if you rerun an experiment six or seven times, like they do in science when publishing a paper, can you say with some assurance that this one experiment made a real difference. But we are in business, not in science; we don't want to rerun the same experiment seven times just to be 100% sure.
Florentien: In addition to that, you also need to take into account which customers were included in your experiments and which were not, so what percentage of your customers the result even applies to, as well as seasonality effects. For instance, Albert Heijn makes a lot of money in December, so if you run your experiments in December, I would recommend rerunning them in an off-season. And keep in mind that if something is going on, say in summer, when baskets are smaller, the effect might also be smaller than it would be in another month.
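A minimal sketch of the false-discovery arithmetic Ton describes, over a program of experiments rather than a single test. The 100-experiment count comes from his answer; the significance level, power, and share of experiments with a real effect are illustrative assumptions.

```python
n_experiments = 100
alpha = 0.05             # significance level -> false positive rate per test
power = 0.80             # chance of detecting a real effect when one exists
true_effect_rate = 0.20  # assumed share of experiments with a real effect

true_wins = n_experiments * true_effect_rate * power         # 16 real winners
false_wins = n_experiments * (1 - true_effect_rate) * alpha  # 4 false positives
fdr = false_wins / (true_wins + false_wins)

print(f"Significant 'winners': {true_wins + false_wins:.0f}")
print(f"Expected false discoveries: {false_wins:.0f} (FDR ~ {fdr:.0%})")  # ~20%
```

Under these assumptions, roughly one in five "winners" is a false discovery, which is why summing the face-value uplifts of winning experiments overstates a program's contribution even before type M errors are considered.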
- Does it make sense to lower the traffic used in a particular experiment if we are seeing a negative uplift in the first days of the experiment? - by Dimitri
Ton: Whether you continue or stop: if you shift traffic during the experiment, you run into Simpson's paradox, as it's called in statistics (you can look this up on Wikipedia), and all sorts of statistical issues with that experiment. So if you see that the experiment has a really low chance of still becoming a significantly positive outcome, just stop the experiment, because it's hurting your business. Maybe something is wrong, maybe something is broken, a bug can also be the issue. You can calculate the chance of the experiment still becoming a positive outcome along the way, as in sequential testing. So if it's really hurting, don't lower the traffic. Just stop the experiment and go back to the drawing board.
Florentien: If the amount of traffic you've collected is far below the sample size you've calculated, the effect might still change over time, because effects can fluctuate before you reach your calculated sample size. So it depends on how much traffic you already have in your test and how long you've been running the experiment. If you see a major change, you should be alerted and stop your experiment. But if it's just a small negative effect, wait at least one week, so you have at least one business week of data, and then see if you have already reached your sample size. Don't look at your results too often, though, because you'll run into the peeking effect. So if you do decide to make changes, I would recommend stopping; otherwise, just wait until you've reached your sample size.
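The peeking effect Florentien mentions can be seen in a short simulation (an illustration, not from the webinar): checking an A/A test, where no real difference exists, after every batch of users and stopping at the first "significant" result inflates the false positive rate well above the nominal 5%. All parameters are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, looks, batch, p = 2000, 20, 500, 0.05  # 20 peeks of 500 users per arm

false_positives = 0
for _ in range(n_sims):
    a = rng.binomial(1, p, looks * batch)  # arm A conversions (no real effect)
    b = rng.binomial(1, p, looks * batch)  # arm B conversions (identical)
    for k in range(1, looks + 1):
        n = k * batch
        pa, pb = a[:n].mean(), b[:n].mean()
        pooled = (pa + pb) / 2
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(pa - pb) / se > 1.96:  # nominal 5% two-sided z-test
            false_positives += 1  # stopped early on a spurious "winner"
            break

print(f"False positive rate with peeking: {false_positives / n_sims:.1%}")
# Prints well above the 5% you'd expect from a single fixed-horizon look.
```

This is why the advice is to precompute the sample size, look only when it is reached, or use a sequential testing procedure explicitly designed for repeated looks.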
- How would you suggest predicting the contribution of experimentation to revenue when multiple tests are being run on the same site?
Some experiments will have a positive significant outcome and some will be inconclusive. The winning ones will add value, unless they are false positives, but you cannot calculate the value of one experiment; you have to look at the whole group of experiments. This sounds like a company that's still a bit immature in experimentation and wants to understand what's adding value. In the end, you will test everything. It's like creating a new medicine: you want to test it first before you ship it to potential users. Otherwise you may ask yourself whether a lot of people will buy, but if you ship a medicine without testing, maybe your clients, your customers, will die. So don't do that.
Transcription
Disclaimer- Please be aware that the content below is computer-generated, so kindly disregard any potential errors or shortcomings.