Companies—the ones that thrive and survive in today’s mobile and web-based ecosystems—are forever searching for better ways to offer the best user experience in their apps. Such companies may already be aware of the benefits of A/B testing or experimentation, but they may not be aware of newer concepts like chaos testing—a method that introduces controlled chaos into their systems to simulate real-world failure conditions. The companies that want to remain on the cutting edge and care about how their brand is perceived can reap dividends by investing in experimental frameworks.
The process of experimentation
The normal process of experimentation can be broken down into six steps:
- Decide to conduct an experiment. Typically, this decision follows from the nature of a new feature or product being launched. It involves identifying the key variables for the experiment, such as the color or type of a button (each of these could spawn multiple experiments). Experiments might also span multiple geographies, to learn how users in different countries respond to different interfaces and experiences, or examine other differentiating factors.
- Define the goal metrics. When a variable is changed, something measurable should change with it, such as orders dropping off or user activity decreasing.
- Schedule the experiment. Many experiments are ruined by interference caused by overlapped scheduling, so be sure to keep experiments separated. Scheduling also helps notify other team members that an experiment is being planned, and they will know not to plan any competing experiments during the proposed time.
- Approve and monitor. Buy-in is always important, and it alerts the team to what’s happening. Continuously monitor the experiments. If an experiment goes badly, there should be a way to turn it off without delay.
- Examine the analytics. A good experiment would have a “control” group not used in the experiment that is measured against the “treatment” group(s).
- Determine how to proceed. Experiments are often an iterative process that helps fine-tune the results of A/B testing.
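Two of the steps above, scheduling without overlap and splitting users into control and treatment groups, can be sketched in a few lines of Python. This is a minimal illustration, not how any particular platform does it: the `Experiment` class and the `overlaps` and `assign_group` functions are hypothetical names, and hashing the user ID is one common way to make assignment repeatable.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Experiment:
    """Hypothetical record of a scheduled experiment."""
    name: str
    start: date
    end: date  # inclusive


def overlaps(a: Experiment, b: Experiment) -> bool:
    """Two inclusive date ranges overlap if each starts on or before the other ends."""
    return a.start <= b.end and b.start <= a.end


def assign_group(experiment_name: str, user_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically place a user in 'control' or 'treatment'.

    Hashing the experiment name together with the user ID means the same
    user always lands in the same group for a given experiment.
    """
    digest = hashlib.sha256(f"{experiment_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# Usage: check a proposed experiment against one already on the calendar.
button_color = Experiment("button-color", date(2024, 3, 1), date(2024, 3, 14))
button_shape = Experiment("button-shape", date(2024, 3, 10), date(2024, 3, 20))
if overlaps(button_color, button_shape):
    print("Reschedule: these experiments would interfere with each other.")
```

Because assignment depends only on the experiment name and the user ID, no lookup table is needed to keep a user in the same group across sessions.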
Facilitating this process of experimentation is a laudable goal for companies that want to perfect user experiences. Netflix, Uber, and Grab have well-documented in-house experimentation platforms that automate many of the standard processes. Automation adds guard rails to experiments and allows for a greater degree of control if something goes wrong. It is best to make sure any solution is lightweight and facilitates the creation of experimentation, is developer-friendly, and allows even non-technical users to schedule experiments and see metrics. Educating the team about these tools and the benefits they bring is also a great way to create a culture that embraces experimentation in a company.
Experimentation and chaos testing
So, what does chaos testing have to do with experimentation? The short answer is that they share some goals and often run on the same platform used for experimentation. Chaos testing is the introduction of targeted software or system failures that mimic not just system and hardware issues but also application errors that might lead to a poor experience for end-users.
For example, a chaos test might increase latency on a subset of the machines where a service runs, or send bad data to an application.
The primary benefit is that chaos testing provides a programmatic way of inducing these failures in an environment, allowing experimenters to test different hypotheses. For example, is an Amazon order flow affected if a non-essential service is unavailable? Chaos testing allows experiments like this to be conducted so engineers can develop solutions that mitigate disaster when unforeseen circumstances arise.
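The kind of hypothesis described above can be tested with a small fault-injection wrapper. The following Python sketch is illustrative only: `with_chaos`, `InjectedFault`, `fetch_recommendations`, and `place_order` are hypothetical names, and a real platform would inject faults at the network or infrastructure level rather than inside the application.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def with_chaos(call, *, latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a dependency call with added latency and a chance of failure."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulate a slow dependency
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")  # simulate an outage
        return call(*args, **kwargs)

    return wrapped


def fetch_recommendations(user_id):
    """Stand-in for a non-essential service."""
    return ["item-a", "item-b"]


def place_order(user_id, recommend):
    """Order flow that should succeed even if recommendations fail."""
    try:
        suggestions = recommend(user_id)
    except InjectedFault:
        suggestions = []  # degrade gracefully instead of failing the order
    return {"user": user_id, "status": "confirmed", "suggestions": suggestions}


# Hypothesis: the order is still confirmed when recommendations are down.
broken = with_chaos(fetch_recommendations, failure_rate=1.0)
result = place_order("user-1", broken)
```

With `failure_rate=1.0` every call to the wrapped service fails, so the order's `status` shows directly whether the flow tolerates the outage.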
Chaos testing has additional benefits too. It helps teams discover unknown couplings and dependencies in a microservices architecture. Sometimes a developer uses libraries or code with hidden dependencies that go unnoticed until testing. Chaos testing can also reveal code regressions, show whether retry logic and circuit breakers are properly configured, and expose unnecessary dependencies between services. It can answer questions such as: Can some of the data be cached rather than fetched from these dependencies every time? Can the service fall back to an older version of the data if a dependency is down?
For those already on board with chaos testing, the focus should be on making chaos experiments easier to run. It also helps to educate others about chaos testing’s primary goal, product resiliency, which is the main difference between it and traditional experimentation.
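The two questions above, caching and falling back to older data, describe a pattern that a chaos experiment can verify. Below is a minimal Python sketch under stated assumptions: `CachedClient` and `DependencyDown` are hypothetical names, the cache is an in-memory dictionary, and the retry policy is a simple fixed number of attempts.

```python
import time


class DependencyDown(Exception):
    """Signals that the upstream dependency could not be reached."""


class CachedClient:
    """Retries a dependency, then falls back to the last known value."""

    def __init__(self, fetch, max_attempts=3, backoff_s=0.1, sleep=time.sleep):
        self._fetch = fetch
        self._max_attempts = max_attempts
        self._backoff_s = backoff_s
        self._sleep = sleep
        self._cache = {}

    def get(self, key):
        """Return (value, 'fresh') or (value, 'stale'); raise if neither exists."""
        for attempt in range(1, self._max_attempts + 1):
            try:
                value = self._fetch(key)
            except DependencyDown:
                if attempt < self._max_attempts:
                    self._sleep(self._backoff_s * attempt)  # linear backoff
                continue
            self._cache[key] = value  # remember the latest good value
            return value, "fresh"
        if key in self._cache:
            return self._cache[key], "stale"  # older data beats no data
        raise DependencyDown(f"no fresh or cached value for {key!r}")
```

A chaos experiment would take the dependency down and check that callers receive `"stale"` results instead of errors, and that the number of retries matches the configured limit.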
Best practices for implementing experimentation and chaos testing
There are some best practices to consider when adding experimentation and chaos testing to a company’s processes.
- Institute a standardized, centralized platform to conduct experiments. A centralized platform provides guardrails that guide experimentation and ensure that personal data is protected from unnecessary use. Developers can also use this platform to create chaos experiments that target a specific component of the system. Some companies, such as Grab, run chaos experiments on their in-house experimentation platform. This cuts down on the number of tools a team needs to learn and helps the team remain agile.
- Consider an out-of-the-box solution. AWS, for example, offers a fully managed service for running chaos experiments, which can bring greater scalability and operational cost savings. Such services are worth using if they fit the company’s requirements.
- Make sure there are good analytics. Don’t just measure control and treatment groups. Calculate and evaluate system performance over time.
- Observe. Check all the variables, constantly, especially in production environments. Efficient observability is especially useful to stop experiments if they inadvertently start causing user pain.
- Look for consistency of results. Experiments with inconsistent results are not useful. Make sure that other experiments aren’t running simultaneously and throwing off results.
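The guardrail and observability practices above can be combined into an automatic stop condition. The Python sketch below is one possible shape for it, with hypothetical names (`Guardrail`, `ExperimentRun`) and an error-rate threshold chosen purely for illustration; a production system would read these signals from its monitoring stack.

```python
from dataclasses import dataclass


@dataclass
class Guardrail:
    """Stop condition: error rate above a limit, once enough traffic is seen."""
    max_error_rate: float
    min_requests: int = 100

    def should_stop(self, errors: int, requests: int) -> bool:
        if requests < self.min_requests:
            return False  # too little data to judge
        return errors / requests > self.max_error_rate


class ExperimentRun:
    """Tracks request outcomes and deactivates itself when the guardrail trips."""

    def __init__(self, name: str, guardrail: Guardrail):
        self.name = name
        self.guardrail = guardrail
        self.active = True
        self.errors = 0
        self.requests = 0

    def record(self, ok: bool) -> None:
        if not self.active:
            return  # experiment already stopped; ignore further traffic
        self.requests += 1
        if not ok:
            self.errors += 1
        if self.guardrail.should_stop(self.errors, self.requests):
            self.active = False  # kill switch: stop causing user pain


# Usage: every request outcome is reported to the running experiment.
run = ExperimentRun("latency-injection", Guardrail(max_error_rate=0.05, min_requests=20))
for _ in range(20):
    run.record(ok=False)
```

The `min_requests` floor keeps a single early failure from halting an experiment, while the threshold bounds how much harm a bad experiment can do before it is turned off.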
Reap the benefits
Chaos testing works best within an experimentation infrastructure, so for the price of one investment, companies can reap the benefits of two tools. Experimentation and chaos testing are powerful tools for gaining insight into a company’s user base and for designing user experiences that hold users’ interest and advance the company’s goals. These tools can help a company develop a fanbase that is devoted to its app or service. The goal is to provide experiences that both enable users to accomplish their tasks quickly and keep critical functions available when supporting systems fail. This is what product resiliency is all about. A company without such tools is missing out on critical data that can help it make more reliable decisions.
About the Author:
Rohan Tiwari is a software engineer manager at a top tech company. He has been a computer technology specialist for 10 years, heading up engineering departments in the United States, India, and Singapore. For more information, please send emails to rohan_tiwari1@hotmail.com.