In the last post, we discussed how a healthy experimentation culture drives healthy innovation within your product.

One key aspect of this is the trustworthiness of your experimentation platform. This trust extends across teams, stakeholders, and technical systems and is critical to achieving reliable and actionable results.

Defining “Trust”

“Trust” in the context of technical experimentation is earned (or lost) in a few key areas: technical stability, observability, and data reliability.

Technical Stability

Technical stability is, as the name implies, the stability of the underlying technical framework that supports experimentation for your organization. This may be a third party vendor, or it may be in-house tooling.

An example: your organization’s experiments frequently suffer from segmentation bugs. There are statistically significant cases of over- or under-bucketing of users (what should be a 50/50 split actually ends up being a 47/53 split, for example). Users are allocated treatments without the corresponding exposure data being recorded. Users are seeing multiple treatments more often than expected.
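That 47/53 example is what is often called a sample ratio mismatch (SRM), and it can be detected mechanically with a chi-squared goodness-of-fit test. Below is a minimal sketch, assuming a two-variant experiment and hypothetical counts:

```python
# Sample-ratio-mismatch (SRM) check via a chi-squared goodness-of-fit test.
# Counts and the 50/50 expected split below are hypothetical examples.

def srm_check(control: int, treatment: int, expected_ratio: float = 0.5) -> bool:
    """Return True if the observed split deviates significantly from expected."""
    total = control + treatment
    expected_control = total * expected_ratio
    expected_treatment = total * (1 - expected_ratio)
    chi_sq = ((control - expected_control) ** 2 / expected_control
              + (treatment - expected_treatment) ** 2 / expected_treatment)
    # Critical value for p < 0.05 with 1 degree of freedom.
    return chi_sq > 3.841

# A 47/53 split over 100,000 users is a highly significant mismatch:
print(srm_check(47_000, 53_000))  # True
```

Running a check like this automatically for every experiment turns "the split looks off" from a gut feeling into an alert.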

These things erode trust, and that erosion has tangible side effects:

  • Your tech team may spend more time than they’d like validating experiment health.
  • Your analytics team may spend cycles communicating with tech and product stakeholders to understand anomalies in the data.
  • Your decision makers may not trust that the experiment results adequately reflect reality.

Observability

Observability is your organization’s ability to follow the experimentation process and identify if and when there are issues.

Specifically, this represents things like understanding how many users have interacted with your experiment, how many have been allocated to each treatment, how many (and which, so you can handle them appropriately during analysis) users have encountered multiple variants, etc.
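The per-experiment numbers above can be derived directly from raw exposure events. A minimal sketch, assuming a hypothetical event shape of `(user_id, variant)` pairs:

```python
# Summarize exposure events: count unique users per variant and flag
# users who encountered multiple variants. Event shape is hypothetical.
from collections import Counter, defaultdict

def summarize_exposures(events):
    """Return (users-per-variant counts, set of multi-variant users)."""
    variants_seen = defaultdict(set)
    for user_id, variant in events:
        variants_seen[user_id].add(variant)
    counts = Counter()
    for variants in variants_seen.values():
        for v in variants:
            counts[v] += 1
    multi = {u for u, vs in variants_seen.items() if len(vs) > 1}
    return counts, multi

events = [("u1", "control"), ("u2", "treatment"),
          ("u1", "treatment"), ("u3", "control")]
counts, multi = summarize_exposures(events)
print(sorted(counts.items()))  # [('control', 2), ('treatment', 2)]
print(multi)                   # {'u1'}
```

Knowing not just *how many* users saw multiple variants but *which* ones lets you exclude or segment them appropriately during analysis.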

Observability also extends beyond the scope of individual experiments: how have your organization’s experiments moved core business metrics over time? Which teams are responsible for the most significant lifts? How much time, on average, does your organization spend running experiments? What methodologies and approaches does your organization use to run and analyze experiments?

All of these things are important, though not equally so. As a general rule, the more robust your observability footprint, the more confidence your team will have in the end results.

Data Reliability

Data reliability is your organization’s trust in the integrity and reliability of the data generated by your experiments.

It should go without saying that if we don’t trust the integrity and reliability of our experimentation data, then we can’t really trust ourselves to make the right decision for our product and our users.

Unreliable data can take many forms, including missing exposure events, more users in the test population than expected, or records being dropped or duplicated.
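Many of these failure modes can be caught with a simple reconciliation between the assignment log and the exposure log before analysis begins. A sketch, assuming hypothetical user-ID sets and an expected population cap:

```python
# Reconcile assignment and exposure logs to surface common reliability
# issues. The data shapes and threshold here are hypothetical examples.

def reconcile(assigned_users: set, exposed_users: set, expected_max: int):
    """Return (assigned-but-unlogged users, logged-but-unassigned users,
    whether the population exceeded expectations)."""
    missing_exposures = assigned_users - exposed_users
    unassigned_exposures = exposed_users - assigned_users
    over_population = len(assigned_users) > expected_max
    return missing_exposures, unassigned_exposures, over_population

assigned = {"u1", "u2", "u3"}
exposed = {"u2", "u3", "u4"}
print(reconcile(assigned, exposed, expected_max=2))
# ({'u1'}, {'u4'}, True)
```

Checks like this are cheap to run on every experiment and surface problems while they are still fixable, rather than after a decision has been made.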

The most toxic aspect of data unreliability is that when it is discovered, we may start to question all of our historical decision making.

What is a “Trustworthy” Experimentation Platform?

The hard part is that a truly healthy experimentation platform needs all of these things, and it needs them to be verifiable. That is to say, if someone asks whether the experiments you run can be trusted, how will you prove it?

Missing the mark on any one of these core tenets of trustworthiness plants seeds of doubt among stakeholders:

  • “We trust our analytics pipelines, but the technical platform is constantly having stability issues…”
  • “It seems like our experiments are running as expected, but we don’t really understand how to verify without waiting for the Analytics team to summarize the data…”
  • “We’re seeing twice as many users in this experiment as we expected…”

A trustworthy experimentation platform leaves no room for comments like these to be raised without also providing mechanisms for discovery and resolution.

Building Trust Through Transparency

Step zero in building a healthy experimentation footprint is transparency.

At a minimum, “transparency” means documenting the testing methodologies that your organization supports, communicating best practices for designing, implementing, and analyzing experiments, and providing accessible metrics for experiment health.

It should be easy to discover the underlying business metrics impacted by an experiment and if/how they led to the success or failure of that experiment. You also want your teams (and not just your PMs / Analytics stakeholders) to be able to discover how their experiments, in aggregate, have moved core business metrics over time. Bringing your developers closer to the business value that they generate is an important and often overlooked aspect of a healthy culture of experimentation.

To support meaningful and accessible transparency, find ways to bring experimentation data closer to your stakeholders at ALL levels - from the engineers who implement the experiments in code, to the executives who will make decisions based on the outcomes.

Building Trust Through Thoughtful Experimentation Systems

There are three primary components of a technical experimentation platform:

  1. The “segmentation” layer. How users are “bucketed” or randomly assigned a given treatment. This could be as simple as a coin flip + a row in a database table, or as advanced as deterministic assignment via hashing and modulus operations over subject IDs.
  2. The “persistence” layer. How assignment and exposure data is recorded for analysis. This could be as simple as a database table, or this could include a data warehouse where experiment data is stored for long term analysis.
  3. The “analysis” layer. How we will access and analyze our experiment data. This includes everything from the engine we use to fetch our experiment data, to the statistics and algorithms that we’ll use to analyze it.
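The deterministic assignment described in the segmentation layer can be sketched in a few lines. This is a minimal illustration, not a production implementation; the experiment and user names are hypothetical, and salting the hash with the experiment name keeps assignments independent across experiments:

```python
# Deterministic bucketing via hashing + modulus over subject IDs.
# Experiment/user names are hypothetical examples.
import hashlib

def assign(user_id: str, experiment: str, buckets: int = 2) -> int:
    """Deterministically map a user to a bucket for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

# The same user + experiment always yields the same bucket, with no
# database lookup required:
print(assign("user-123", "checkout-redesign"))
```

Deterministic assignment is generally preferable to the "coin flip + database row" approach at scale: it is stateless, reproducible, and auditable after the fact.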

While we won’t prescribe specific technologies or dive deeply into each of these layers, we can discuss what makes a system “trustworthy”.

As with any discussion on architecture, the answer for what architectural characteristics matter most to you and your organization will be, “it depends”… but not entirely.

As it relates to technical experimentation frameworks, there are a few architectural characteristics that have an outsized level of importance:

  • Reliability. A trustworthy experimentation platform cannot be flaky, drop events, or introduce risk to the underlying treatments. Focus on uptime, data integrity, and fault tolerance.
  • Accuracy. Even with 100% availability, a technical experimentation platform is without value if assignment logic and metrics are incorrect. Focus on unbiased randomization, correct metrics computation, and safeguarding against contamination of results.
  • Auditability. Stakeholders may want to be able to verify how results were derived to rule out forms of manipulation, or just to understand more about the outcome. Focus on logging, transparent configuration, and immutable data.
  • Observability. When something does go wrong with an experiment, stakeholders will want to understand what and why. Being able to diagnose issues quickly will keep teams efficient and unblocked. Focus on coherent metrics, accessible data sources, and user-friendly dashboards.
  • Usability. An otherwise perfect system will suffer if it is easy to misconfigure experiments or misinterpret results. Focus on a clear user/developer experience, guardrails against common mistakes, and an accessible knowledge base.

Without reliability, stakeholders will not use your platform. Without accuracy, teams may use your platform, but they may make the wrong decisions. Without auditability/observability, any issues or anomalies will require escalation rather than the ability to self-diagnose. Without usability, teams may opt to bypass your platform entirely.

Building a Culture of Trust at Scale

Beyond the technicals and building a culture of experimentation, it is important to foster a culture of trust. This ensures not only that stakeholders at different levels of the experimentation process are collaborating, but that they feel comfortable doing so.

Ensure that there is a tight feedback loop across Analytics, Product, and Engineering stakeholders. It can be valuable to get your stakeholders into a room (or at least into the same Slack channel) to mitigate any issues or points of clarification that arise. Don’t make your PM play telephone between Analytics and Engineering!

Encourage open feedback and discussion around experiment design and interpretation, and consider upskilling stakeholders across areas of expertise by offering knowledge transfer sessions encouraging statistical & methodological literacy, deeper technical understanding of how your experimentation systems work, and how core business metrics influence the decisions that you make.

As your organization’s experiments become more frequent and complex, so will those tricky edge-case bugs and unexplained anomalies in your data. Without a foundation of observability (as discussed earlier) and healthy organizational communication, these issues can erode whatever trust you have earned up until now.

This makes it important to document issues inherent to your experimentation patterns, and either form a plan of action to address them immediately or identify a way to monitor them over time to understand the true impact as your experiments scale. Don’t force future engineers to re-discover what you already know, or Product/Analytics stakeholders to start to doubt the validity of past and present experiments.

Conclusion

Trust is the cornerstone of a healthy experimentation platform and a healthy culture of experimentation. Establishing trust as a core tenet of experimentation within your organization, by investing in your teams and stakeholders, the tools they use, and the processes for evaluating outcomes, is essential.

Our next post in this series will discuss diagnosing unhealthy experimentation, offering signs that there are deep-seated issues and solutions to consider when you encounter common problems.