Designing, Implementing, And Analysing Product Experiments
People say that it's difficult to come up with ideas worth testing, that implementing a randomised controlled experiment is complicated, and that you need math skills to analyse the results. These are dirty filthy lies!
In this talk, Cian will give you the knowledge and tools required to quickly experiment, allowing you to build a demonstrably more useful product.
- How HubSpot does quick-iteration product experimentation
- How to bring these methods to your organisation
Hi, my name is Cian and I'm a Technical Lead at HubSpot, where I work to build delightful onboarding experiences for our free users as part of our growth team. I've been working on growth projects at HubSpot for over five years, quite a long time, including building the very first version of our free marketing tools and also building infrastructure for running and analyzing our product experiments.
During all of that time, our understanding and usage of data to drive product decisions has changed pretty dramatically. I like to think about it in terms of three eras. Back in 2014, when we were first starting to build out our free marketing and sales tools, we built what we felt was right. We have fantastic product managers and designers, so this pretty frequently worked out, but every so often it backfired a little. We got a feature wrong, the experience didn't quite match what our users were looking for, and we didn't really know why.
So, by 2017, we understood the importance of data in the product development cycle, and we started tracking absolutely everything. Every single user interaction from every single user was collected, catalogued, and then never looked at again. Sometimes we did build charts which would show us how usage was changing over time, but this was generally either unscientific or retrospective rather than specific and real time.
Now, in 2020, we have a process which lets us make accurate predictions about how the changes we're making are going to impact user behavior. We mix qualitative data (interviews) and quantitative data (user tracking) with product experimentation to deeply understand how to best serve our users. Most of our experiments boil down to something pretty simple: we give slightly different experiences to different groups, or cohorts, of our users, and then we see what happens. We've used this process to drive meaningful increases in revenue, and because it's been so successful, we often get asked to consult with internal and external teams on how they too can build experiments. That's what we're going to talk about today.
Before we get going, though: because I've done all this consultation and have spoken to a few places about building experiments, I frequently hear about a few problems that people perceive. I hear that it's difficult to come up with ideas that are worth experimenting with. I also hear that all the work involved is just far too much: "I know my users. I know what I should build. I'm just going to go ahead and do that." And I've also heard that all of this data requires a deep understanding of maths, and, "personally, I can't remember how to do long division, so how can I be expected to do product experimentation?" Well, I'm just going to quickly rearrange the heading of this slide and add a few more letters, because these are all dirty, filthy lies. Coming up with ideas is really easy, and I'm going to tell you about a framework that I use to do that. You're probably already doing the required research, the engineering work can be as simple as a few lines, and tools exist which do the analysis for you. I'm going to tell you about some of those as well. By the end of this talk, you're going to walk away with the knowledge and tools required to build a demonstrably more useful product.

We are going to talk about how we come up with ideas, how we implement ideas as experiments, and how we analyze the results. Because I wanted to give you practical examples, I'm going to tell you about two experiments my team has run in the last year. In order to do that, it's going to be useful for you to understand what HubSpot is and what we do, so I'm going to give you a quick primer. If you use HubSpot, or if you've heard of us, feel free to tune out for about 30 seconds, but then please do come back. At HubSpot, we build software to help companies grow in better, more sustainable ways. Our suite includes free CRM tools, sales tools, marketing tools and customer service tools, as well as even more powerful premium versions of each.
While every HubSpot account comes with all of this functionality built in, we know that everyone who comes to us and uses our software is using it, at least initially, to perform one specific job. So, our onboarding teams are split across user missions. The mission of my team is to help teams who are focused on marketing decide that HubSpot is a good solution for them, and, if so, how they can make use of the rest of the suite as well.
But first, let's talk about how we come up with ideas. It can often seem really difficult to brainstorm ideas for experiments. There are so many things that we could change in our apps that it can be hard to choose just one or two. But, like everything, picking experiment ideas gets easier if you introduce some guide rails. So, when we're trying to come up with experiments on our onboarding teams, we set ourselves the following limitations.
Firstly, we say that experiments should obviously always be done with the aim of building a better user experience. I say obviously here because our success is driven by user success. Secondly, we should use experiments to answer risky questions, especially where using existing data to predict the result is difficult. And thirdly, experiments should reach statistical significance. We'll talk about that later, so don't be worrying your head about it just yet. But from the top, let's get this one out of the way first.
At HubSpot, the customer comes first. It's our first and most important product value, and, as I mentioned a minute ago, when a user is successful in growing their business, they're more likely to grow within our suite of tools. So, when we're designing a product experiment, each one needs to have the aim of improving the user experience, whether that means how users understand our tools, how they interact with them, or how they meet their own goals. That one's easy; it's obvious.

Secondly, let's talk about answering risky questions. Before we start coming up with ideas, it is really helpful to realize and understand that only a subset of ideas are worth experimenting on. Running experiments is not free. It takes product manager time, design time, and definitely engineering time. At HubSpot, we also involve product analysts and content designers, and we of course run the risk of shipping a sub-optimal experience to a subgroup of our users. So let's be sure that we limit what we experiment on to the most impactful and high-value ideas.

I've built a chart here. On the X axis is how risky an idea is; I define high risk as likely to impact either the users or our ability to do business. On the Y axis, we have the amount of existing data we have: user interviews, usage analytics and so on. In the lower left hand corner, if an idea is low risk and we're pretty sure we know what's going on, we're just wasting our time running a product experiment. Just build that feature and give it to our users. In the lower right hand corner, if an idea is high risk but we're fairly certain we know what's going to happen, we should probably just do it as well, I think. Keeping our eyes on the chart, there are also some cases here for product experiments. In the top left hand corner, we have a low risk idea where we have no real idea what's going to happen. We should probably consider doing some user research here before we make any moves, and we could also just try it, depending on the risk. But in the top right hand corner, we have an idea which is high risk and we have no idea what's going to happen. That is where an experiment will be really useful. So, with experiments, we should answer high risk questions where we don't have a huge amount of existing data.
Finally, the big scary one, the tough one: reaching statistical significance. You'll be really glad to know that I'm not going to get into any maths here. We just don't have the time. Also, to be honest, I'm not certain I'd be able to explain it. But we're computer people, so we can let computers do the hard work for us. Let's at least get a bit of a gut feeling as to what statistical significance is. Let's imagine that our app has a user base of 10,000 people and we want to perform an experiment on them. So, we pick 1,000 of those users and we show them something different, leaving 9,000 people in our control group, our control cohort. We get a result and we are happy, but how can we be sure that we have gotten the full story? Maybe those randomly chosen one thousand users are special in some way. Maybe a greater than average number of them signed up on a weekend, for example, or maybe they use slower computers than the rest of our users, which take a whole lot more time to load stuff.
Let's take another example. Let's imagine that I suspect that one of my coins is more likely to land on heads than tails when flipped. So, I decide to run an experiment. I've got a control coin which I definitely trust, and when I flip it 10 times, it comes up heads four times, which seems roughly right. Real life isn't always 50/50. Then I flip my suspicious coin. It comes up heads seven times out of 10, which to me feels a little bit like something is up. But if I run the same experiment 100 times, I see that the coins land on heads roughly the same number of times that they land on tails. There isn't actually a thing here, and we see this because we took a larger sample size: we checked more times. The technical term for the probability that our results are not reflective of reality is the P value. At HubSpot, we generally require a P value of less than 0.05, which is a less than 5% chance that the results we're seeing in our experiment don't reflect what would happen if we showed this experience to everyone. By the way, the P value of this coin flip test, before you do the maths, is 0.23, which means there's a 23% chance that the pattern we're seeing, my suspicious coin looking dodgy, doesn't reflect reality. That is far too high a chance for me to make a call either way. I suspect you now have a good instinct as to what statistical significance is, but as I mentioned, I'm not going to show you how to calculate it. There are many tools online which do this for you for free. If you just Google "statistical significance calculator", you'll find some really, really simple ones, and near the end of the talk I'm also going to make a few recommendations based on my experience.
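If you're curious what those calculators are doing under the hood, the coin-flip check above can be sketched in a few lines of Python. This is a minimal one-sided binomial test; the 0.23 quoted in the talk presumably comes from a slightly different test formulation, so treat the exact number as illustrative:

```python
from math import comb

def binomial_p_value(heads: int, flips: int, p: float = 0.5) -> float:
    """One-sided p-value: the probability of seeing at least `heads`
    heads in `flips` tosses of a coin whose true heads-probability is `p`."""
    return sum(comb(flips, k) * p**k * (1 - p) ** (flips - k)
               for k in range(heads, flips + 1))

# The suspicious coin: 7 heads out of 10 flips of a supposedly fair coin
print(binomial_p_value(7, 10))  # 0.171875 -- too likely to be chance to call it rigged
```

Different calculators make different choices (one-sided versus two-sided, exact versus approximate), which is why their answers can disagree slightly; the gut feeling is the same either way.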
"But how do we come up with experiment ideas? We've never done this before," I hear you ask. Something we do on my team is called an Experiment Afternoon. It's called that because it happens in the afternoon; I was jet lagged at the time, I wasn't feeling creative, and the name just kind of stuck. All the engineers on the team lock ourselves in a room for three hours, and we come out with two or three roughly written Experiment Documents (I'm going to talk about those later), which are then perfected over the course of the next week or so. Think of this as much a learning experience in how to do experimentation as a way to improve life for your users.
This is the layout. First, we choose what we would like to achieve. It's really important here that we pick two or three metrics to move, hopefully metrics that are going to move fast. So, for example, daily active users rather than monthly active users, or clicks on a specific link rather than retention over a period of weeks. The reason we want to pick a metric which is going to move fast is that it closes the feedback loop really, really quickly, considering we're using this as a learning opportunity as well as an experiment generation exercise.
Once we've chosen the metrics, let's talk about them. We need to be really, really scrappy: we spend 15 minutes just chatting about each of these metrics and coming up with hypotheses we might have as to how we're going to improve them. We write every single idea we have for each metric, no matter how off the wall, onto a whiteboard, or just write it down somewhere. I guess we're not in person anymore; put it in a Google Doc, I don't know. Third, it's time to start tearing those ideas apart. Be really, really ruthless, hunting for every piece of evidence you can find that each of these suggestions is wrong. We spend about 30 minutes doing this. You can use both qualitative and quantitative data here. So, if you have user interviews, go get those. If you have usage tracking, dig into that and see what you can find. Finally, it's time to write some Experiment Docs.
But before we do that, I want to tell you about a quick experiment that we ran at HubSpot. We gave it the not very catchy yet pretty descriptive name Importing Contacts to HubSpot. Personally, I prefer descriptive over catchy, but some people don't, whatever. This isn't catchy, but I know exactly what it is.
A bit of background context. Once we send users out of the onboarding experience we've built and into the big, bad world of the free HubSpot tools, most of the ways that they can quickly see value require them to have at least some data in our CRM. When we were doing user research on what our users were looking to do first, the vast majority of them told us the very first thing they wanted to do was to import contacts. To give a direct quote: "I want you to organize my life. I want you to get my clients in first. That's the main thing." We've got this checklist of things that we walk you through when you sign up with HubSpot in order to get started, and one of the first things on that checklist is importing contacts. Despite 95% of our interviewees telling us that they want to do it, only 10% actually go through with it. So, there's a disconnect here. People want to import contacts, but they just don't. Importing contacts is like going to the gym. It's not a fun thing to do, but you need to do it to see results. The second you read the words "import contacts" on a button, you immediately start to tune out. You start checking Twitter. Honestly, I'm pretty sorry I brought it up. I've probably lost your attention already, and if I haven't, I'm definitely about to lose it, because I'm about to throw up a chart. Because we do a great job of tracking user interactions, we can see exactly where in the import flow we lose users. There are eight steps to importing contacts, and the first ones, I'm not going to lie, are tough. I'll show you them later, but what I want you to know is this: from the point where the user starts the task (where they show us the intent that they want to do something) to the point where we ask them to upload a file, we lose 67%. 67% of these people who told us they wanted to do this. They're done.
And honestly, these steps aren't all that hard, and once we get users to that step, by the way, they pretty much just finish the process, even though there are more difficult steps ahead of them. So, that's the problem we were tackling. We found the problem and quantified it; now it's time to build an experiment. The first part of building an experiment is writing an Experiment Doc. An Experiment Doc serves as the history of your experiments and also helps keep you honest about your methods and how you're going to measure success. Also, as humans we tend not to want to remember the things that don't work, but if what you're doing is innovative and risky, most experiments you run are going to show your hypothesis to be flawed; like 90% of them are not going to pan out. So, Experiment Docs help us remember, and tell others about, previous experiments that we've run. Not going to lie, you're about to see a whole load of text, because we're using a real experiment, the one I just told you about, as the example. I've also messed around with the numbers a little bit for legal reasons. If you want to revisit the slides, I believe they're going to be online later. An Experiment Doc lays out five things.
Firstly, it lays out: what is your hypothesis? Next it asks: what evidence do you have to support it? Next we say what change we're going to be making, how we're going to measure success in this experiment, and then what is the minimum improvement we'd accept in order to consider the experiment a success. That's where statistical significance comes in. We're going to attack each of these one by one, but taking it from the top, we need to say exactly what our hypothesis is. This lays out why we are running the experiment and gives the reader a quick overview of what the experiment is about. Somebody outside your team should be able to read your hypothesis and have a general idea of what is being tested. You should also, by the way, call out any assumptions you are making here, just to get them on the table. 'We know that importing contacts is the best way to get started using HubSpot and we know that users want to get their contacts into the system. But we believe that it is too hard, which we believe discourages users from doing so.'
We've laid out our hypothesis: what we know and what we believe. Next up is trying to explain why we're running the experiment. What evidence is there that our hypothesis is correct? Gut feeling here does not count. We do a lot of user interviews. Kelly, who's a fantastic researcher on my team, spoke to over a dozen users as we were coming up with this hypothesis, and almost every single one of them said the first thing they wanted to do was import contacts into the system. But only one out of every five users who start importing finish. We also know that if we can just get users to the point where they choose a file, they're pretty likely to finish the import. So, what we need to do is get them to there: the point where they have selected a file to import. Next, it's time to lay out the change that we're going to be making. At HubSpot, we do this in two ways: we use both language and a table, just so everything is super clear. To use our example here again: we will split our signup cohort from July 1st to July 14th in three. Control will see the existing onboarding experience. Variant one will see a quick import banner (I'm going to show you that in a sec), and variant two will have the new quick import checklist item. That's the language; in the table you can see pretty much the same: control, no change, 33%; variant one gets the in-your-face banner variant, 33%; variant two gets the integrated variant, 33%.
So, what does that actually look like? This is our getting started checklist. In the control, we can see the regular "import your contacts" task in the red square to the far left; this is the one I just showed you that eight-step chart for, by the way, with the 67% drop. Variant one is this huge, big, in-your-face banner which tells you the benefits of importing your contacts as well as offering this new super speedy import flow. Variant two links to that same super speedy import flow, but from the old interface.
So, what actually are these import flows, this thing that we're trying to change, that we're trying to encourage users to complete? I'm going to show it to you. It is important that I say right now that this is actually a good flow. It's long, it's a bit convoluted, but if you follow it through step by step, we know you're going to get what you need done, and it also handles a lot more cases than just importing contacts, which kind of explains the extra complication around it. First of all, we ask: do you want to import contacts to use now, or a list of people who've opted out of your marketing? Secondly, we ask: do you want to import one file or multiple files? Next, we ask how many types of things you're importing. Just one type of thing, or maybe contacts, companies, deals and tickets all at once? I, by the way, chose just one type. Next we ask which type of thing, and I said contacts here. Now, upload: here is the point where, effectively, we've lost people. By the time they get here, 67% of them are gone. If we can just get them to choose that file, we're home free. Next, we ask them to match the columns in their import to how HubSpot thinks of contact properties. We have some machine learning stuff in here which makes this really easy. And next we ask them to give the import a name. So, this, as I said, is not fantastic, but it is meticulous. If somebody follows this flow, they will get their stuff imported into HubSpot. But we know we're asking them to import contacts, and we know we're asking them to import a single file of contacts, so we can take those first five steps and just kind of mash them into one, which is what you can see here. This is our quick contact import flow, and step one drops you right into that import contacts section, the select file section. We've also put a bit of educational content there. Step two: match columns to properties. Step three: give it a name. So, this is the quick import flow.
Next up: what is your success metric? Holding yourself accountable to a specific metric makes it really easy to decide if an experiment is a success or not. It also helps us avoid confirmation bias, that perfectly human tendency to look for successes where they don't exist. "Sure, the number of upgrades is the same, but Android users in Canada went up by 3%! Success?!" No, that's probably just chance. We will judge success on the percentage of users who complete an import from the Getting Started checklist. It's simple and straight to the point.
Finally, as I mentioned before, running experiments is not risk-free. It takes product manager time, design time, and also engineering time. Of course, as I said, at HubSpot we also involve product analysts and content designers, and we run the risk of shipping a sub-optimal experience to a subgroup of our users. So, when we're writing our doc, let's make sure that the improvement we're chasing is actually worth all of the time we're putting into it. Our minimum improvement is a 25% improvement on the existing metric. By the way, when I say a 25% improvement, I mean relative: a 25% improvement on four is five. Cool. In order to detect a minimum improvement of 25% with 95% significance, that's that P value of 0.05 I mentioned earlier, we need to run our experiment for 14 days. We calculated this using one of those calculators. We have an internal one, but the ones you'll find on Google are just as good: if you feed in the number of users you expect to be passing through your experiment every single day, you'll find out how long you need to run it for.
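For the curious, here is roughly what those duration calculators compute. This is a standard two-proportion sample-size formula under a normal approximation, not HubSpot's internal tool, and the baseline rate and daily traffic below are made-up numbers for illustration:

```python
from statistics import NormalDist

def sample_size_per_variant(p_base: float, rel_improvement: float,
                            alpha: float = 0.05, power: float = 0.8) -> float:
    """Rough per-variant sample size to detect a relative improvement in a
    conversion rate, using a two-sided two-proportion z-test approximation."""
    p1 = p_base
    p2 = p_base * (1 + rel_improvement)       # e.g. 25% relative lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

# Hypothetical: 10% of users complete an import today; we want to detect a 25% lift
n = sample_size_per_variant(0.10, 0.25)
users_per_day = 500  # hypothetical daily traffic into the experiment
print(f"{n:.0f} users per variant, about {n / users_per_day:.0f} days")
```

Feed in your own baseline rate and daily traffic and you get a duration, which is exactly the trade the talk describes: smaller effects or less traffic mean longer experiments.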
So, that is a huge amount of data, and I am really sorry. Let's run through it one last time. An Experiment Doc states what our hypothesis is. It tells us what evidence we have to support it. It says what the change we'll be making is. It defines what our success metric is, and finally it states what the minimum improvement we'd accept is in order to consider the experiment a success. I have made available a kind of toned-down version of what an Experiment Doc template looks like at HubSpot. You can head to that link or scan that QR code. I hope it's really useful for you and your organization when you decide to run product experiments.
We've made it this far: we've identified a risky experiment to run, we've written our Experiment Document, and now it's time to build the experiment itself. Other than building the experience, there are three important things you have to do when you're building an experiment. The first is assigning users to cohorts, or saying what this user is going to see. Cohort assignment should be random, stateless and functional. That's a whole load of engineering speak for saying you should be able to run the code which assigns a user a cohort given a user ID, and it should return the same cohort for that user and experiment every single time. You also, by the way, have to consider whether you want to assign based on the user or, if you're like us and accounts have multiple users, whether you want to assign the account to the cohort rather than just the individual user. Once you've assigned a user a cohort, you need to let your analytics tool know by sending an event with the assigned cohort and experiment as properties, so you can build your charts later. And finally, once your user performs the action you're trying to impact, send that to the analytics tool too. If you're already on board with user interaction tracking, you're probably already doing this bit, which is nice.
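As a sketch of what "random, stateless and functional" assignment can look like (this is not HubSpot's actual service, just one common hashing approach), you can hash the account ID together with the experiment name and map the hash into weighted buckets. The cohort names and weights below match the import experiment's 33/33/33 split:

```python
import hashlib

def assign_cohort(account_id: str, experiment: str,
                  cohorts: list[tuple[str, float]]) -> str:
    """Deterministically assign an account to a weighted cohort.

    Hashing (experiment, account_id) means the same account always lands
    in the same cohort for a given experiment, with no state to store:
    random across accounts, stateless, and functional."""
    digest = hashlib.sha256(f"{experiment}:{account_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform in [0, 1)
    cumulative = 0.0
    for name, weight in cohorts:
        cumulative += weight
        if bucket < cumulative:
            return name
    return cohorts[-1][0]  # guard against float rounding

cohorts = [("control", 0.34), ("banner", 0.33), ("checklist", 0.33)]
print(assign_cohort("account-123", "quick-import", cohorts))
```

Because the function keys on the account ID rather than a user ID, every user in the same account sees the same experience, which is the account-level assignment the talk mentions.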
At HubSpot, we're at the point where we're running so many experiments at once that managing them all through code was becoming a bit of a chore, to be honest. So, we spun up an infrastructure team as part of growth, and they built us an experiment service which is still backed by code in the background but which is fully controllable by all members of our team through a UI. Now, the front end of our app makes a request that says, "Tell me what experiment cohorts this user is in," then handles the response and shows the correct experience to the user. Everything else is configured through the UI by engineers, designers or PMs, and the UI can also do automatic analysis and builds a library of our Experiment Docs, which is absolutely fantastic. In the last, say, 14 days across the onboarding group, we switched on, I think, 11 experiments. So, not having to manage all of those through code is a lifesaver. It makes life really, really easy in comparison to what we were doing before.
So, we've waited out our experiment duration as per our statistical significance calculator, and now we're going to want to analyze the data. If you're using an analytics platform like Mixpanel or Amplitude, this should be pretty easy; they pretty much do the heavy lifting for you. Google Analytics does a pretty decent job too, though I've never used it, and as far as I know you're going to need to manually calculate whether you've hit statistical significance; there might be a plugin to help you there. So, let's take a look at the results for the experiment we ran on our import flow. Before I show you the chart, though, I should note that HubSpot is a public company, so I haven't been able to share the actual results here, but I'll show you something which is almost the actual results.
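If your analytics tool doesn't do the heavy lifting, one common way to check significance by hand is a pooled two-proportion z-test. A minimal sketch, with hypothetical conversion numbers (not HubSpot's real results):

```python
from statistics import NormalDist

def two_proportion_p_value(conversions_a: int, n_a: int,
                           conversions_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled two-proportion z-test (normal approximation)."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical: 10.0% vs 12.7% import completion over 10,000 users per cohort
p = two_proportion_p_value(1000, 10_000, 1270, 10_000)
print(p < 0.05)  # True -- significant at the 5% level
```

A p-value below your threshold (0.05 in the talk) means the difference between cohorts is unlikely to be down to chance; above it, you can't make the call either way.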
Two weeks have passed, and looking at our results we can see that variant one, the big banner, has absolutely crushed the control and variant two, beating the control by 27%. 27%! We've got a 27% better chance of the user uploading their contacts. That's pretty good, and it's statistically significant, by the way, with our P value of 0.05. So, we are happy that the results reflect reality, and we have a winner. We've productized this variant, variant one, for several of our user groups and are slowly rolling it out across our other cohorts as we learn more about how our users interact with it. So, that is how we do experiments at HubSpot.
I want to give you another example, tell you about another experiment, one that I find kind of fun, kind of silly, and so kind of interesting. When somebody signs up for a free HubSpot account, we throw them into a sandbox-like environment that my team built to help them understand how HubSpot works. We bring them on this whirlwind tour through the product, tailored specifically towards their needs based on what they told us when they signed up. On this tour, a user might, for example, create and send a marketing email, they might dig into our reporting features, and they might play around with our deal tracking tools. They get to do all of this in the knowledge that they're not going to mess up their actual account.
Speaking with users after they've experienced this, we've learned that at the end of the tour, they assume they're going to be asked to pay for this functionality, which is not the case. Everything we show in this tour is free. We found that this causes anxiety around whether we're going to suddenly start asking them for a credit card, for money, for the tools we've just shown them, and so they just don't try the tools.
So, let's try and see if we can change this behavior. Our hypothesis is: 'We believe that if we reiterate throughout the demo tour that all functionality is free, users will be more motivated to try those tools.' Thanks to Kelly, we have a huge amount of user research to suggest there is confusion here. User research indicates that after completing the demo tour, some users are uncertain what free tools they have access to, and it also indicates that after testing the demo, some users say they assumed the tools shown required payment to use. So that's our evidence. Let's take a look at the change we're going to be making.
Once again, that's in language and in a table. We will split our signup cohorts for two weeks into: control, who will see the existing demo tour copy, and the variant, who will see updated demo tour copy which mentions throughout that all the demonstrated functionality is free. And I really do mean throughout. Every time we show the user something new, we hammer home: this is free, you will never be asked to pay for this. So, in the table: 50%, no change; 50%, reiterate through the demo that everything's free. Cool. Okay.
Our success metric is that we're going to judge success based on activation of our free marketing tools. At HubSpot, or on my team I should say, we define activation as a user seeing success: generating a new contact, making a lead, that kind of thing. Now, when we ran this experiment, we used to have much stricter requirements on P value. We used to require a P value of 0.02, that is, a 2% chance the results don't reflect reality. We've since relaxed that to a 5% chance, because we had the sudden realization that our experiments don't quite need to be medical grade. Our minimum improvement is 0.4 percentage points over the control on activation, and we needed to run the experiment for 14 days in order to reach a P value of 0.02. So, it's an interesting experiment. I just think it's kind of funny that our users think our tools cost money when they don't. Once again, I'm about to show you a chart, and I'm not even going to be able to show you the numbers on this one, because this is an activation metric, but it is reflective of reality.
We saw a very small improvement in activation on our variant, but nowhere near enough to suggest that the results reflected reality. We didn't consider this experiment a success, and we reverted to the control for all our users. This honestly felt like it would be a home run to us. People were telling us, "We think your tools cost money." So, to discover that we can't just tell them our tools don't cost money, that that message doesn't resonate, was a huge surprise. We do know that nine out of ten experiments we try don't lead to a metric improvement. So, we have attacked this quite a few other ways since, with follow-on experiments, and we were eventually able to reduce the number of people telling us they thought our free tools were paid. I just wanted to tell you about that experiment. I think it's a good example of something going wrong despite seeming to be a slam dunk.
Thank you very much. Despite everything going on, we're still hiring like the absolute clappers. We're hiring across the board, for every single one of the roles you heard me mention today, and we are particularly interested in chatting to senior-level UX experts.
I'm going to make my Twitter unprivate for the duration of UXDX so come say hello. Thank you.