Automated chaos engineering experiments usually explore weaknesses at the more technical levels, such as the platform and applications, leaving the People, Practices and Process level to Game Days. Then we run the experiment using chaos run…. This is why we celebrate when a chaos experiment "fails": we've found an area for improvement, hurrah! The aim is not just to surface new weaknesses, but also to confirm that you've overcome a weakness in the first place. Currently, we mainly use it to test TiDB clusters. In many system contexts there is a dividing line between those who are responsible for managing a Kubernetes cluster (including all the real costs associated with these resources), let's call them the Cluster Admins, and those trying to deploy and manage applications on the cluster, the Application Team. Using the Chaos Toolkit's experiment format, I first define the steady-state hypothesis for my experiment. Our Pod Disruption Budget worked! I don't have trust and confidence that things will actually be OK. If this situation occurred "for real", now would be a good time for the Cluster Admin to come and talk to the Application Team, or maybe bring another node online so the Application Team can maintain the minimum number of pods. But how do we prove that this budget does what we want? The whole point of automating your chaos experiments is that you can run them again and again to build trust and confidence in your system. In this case we are un-cordoning the node that we cordoned off previously when we asked it to drain itself. Wouldn't it be nice if we could replace each match with a custom replacement?
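To make the pieces mentioned above concrete (a steady-state hypothesis, a node drain, and the un-cordon rollback), here is a minimal sketch that assembles an experiment in the Chaos Toolkit's format as a Python dictionary and serialises it to JSON. The node name, service URL, probe name and tolerance are all hypothetical. The `chaosk8s.node.actions` module and its `drain_nodes` and `uncordon_node` functions are assumed to come from the Chaos Toolkit Kubernetes extension, so check them against the version you have installed.

```python
import json


def build_experiment(node_name: str, service_url: str) -> dict:
    """Assemble a Chaos Toolkit-style experiment as a plain dictionary.

    All names and values here are illustrative assumptions, not taken
    from the original experiment.
    """
    return {
        "version": "1.0.0",
        "title": "Can the service survive a node being drained?",
        "description": "Checks that the Pod Disruption Budget keeps the service available.",
        # Steady state: the service answers with HTTP 200 before and after the method runs.
        "steady-state-hypothesis": {
            "title": "Service is reachable",
            "probes": [
                {
                    "type": "probe",
                    "name": "service-responds",  # hypothetical probe name
                    "tolerance": 200,
                    "provider": {"type": "http", "url": service_url, "timeout": 3},
                }
            ],
        },
        # Method: drain the node, which also cordons it.
        "method": [
            {
                "type": "action",
                "name": "drain-node",
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.node.actions",  # assumed extension module
                    "func": "drain_nodes",
                    "arguments": {"name": node_name},
                },
            }
        ],
        # Rollback: un-cordon the node that the drain cordoned off.
        "rollbacks": [
            {
                "type": "action",
                "name": "uncordon-node",
                "provider": {
                    "type": "python",
                    "module": "chaosk8s.node.actions",  # assumed extension module
                    "func": "uncordon_node",
                    "arguments": {"name": node_name},
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical node name and URL, for illustration only.
    print(json.dumps(build_experiment("node-1", "http://my-service.example/"), indent=2))
```

The resulting JSON could be saved to a file and handed to the Chaos Toolkit to run. Keeping the rollback in the same document as the method means every repeat run leaves the cluster in the state it started in.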

Chaos engineering's goal is to discover and help you overcome weaknesses across the entire sociotechnical system of software development. A while ago, I created a document in which I explained how to use the Levenshtein distance to get the best match from a list of options. But remember: after automating chaos, all you have is automated chaos. Now we can match a date and replace it on the fly with the correct format. On our test projects, I use regular expressions most of the time for matching or testing strings in our Application Under Test. February 15, 2012. As we have a potential People & Process weakness in the system, it makes sense to explore and surface this weakness by executing a chaos experiment. What is not specified, but implicitly agreed (hence the weakness), is that the Application Team knows there must be three pods under any known conditions (such as upgrades), otherwise Bad Things Could Happen(TM). In fact, when you think about it, a chaos experiment is, by definition, trying to find "unknown unknown" weaknesses.
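The idea of matching a date and replacing it on the fly can be sketched with Python's `re.sub`, which accepts a function as the replacement. The input format (day/month/year) and the output format (ISO 8601) are assumptions chosen for illustration, as are the helper names `to_iso` and `normalise_dates`.

```python
import re
from datetime import datetime

# Assumed input format: day/month/four-digit year, e.g. 15/2/2012.
DATE_PATTERN = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def to_iso(match: re.Match) -> str:
    """Return the matched date as YYYY-MM-DD, or the original text if it is not a real date."""
    day, month, year = (int(group) for group in match.groups())
    try:
        return datetime(year, month, day).strftime("%Y-%m-%d")
    except ValueError:
        # Something like 99/99/2012 matches the pattern but is not a date: leave it alone.
        return match.group(0)


def normalise_dates(text: str) -> str:
    """Replace every date-like match with its custom, per-match replacement."""
    return DATE_PATTERN.sub(to_iso, text)
```

Because the replacement is computed per match, each date is validated and reformatted individually, and anything that only looks like a date passes through unchanged.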

NAME                          READY   STATUS    RESTARTS   AGE
my-service-6dc649f897-22p7f   1/1     Running   0          2h
my-service-6dc649f897-7m72z   1/1     Running   0          46m
my-service-6dc649f897-xkmpd   1/1     Running   0          2h

my-service   LoadBalancer   80:31724/TCP   2h
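A listing like the one above can be checked against the implicit "three pods" agreement in code. The sketch below parses pod-listing text and counts the pods that are both `Running` and fully ready. The function names, the `my-service` prefix match and the minimum of three are illustrative assumptions. It works on captured text only and does not talk to a cluster.

```python
def count_running_pods(listing: str, name_prefix: str) -> int:
    """Count pods whose name starts with `name_prefix-` and that are Running and fully ready."""
    running = 0
    for line in listing.splitlines():
        fields = line.split()
        # Pod rows have five columns: NAME READY STATUS RESTARTS AGE.
        if len(fields) < 5 or not fields[0].startswith(name_prefix + "-"):
            continue
        ready, status = fields[1], fields[2]
        current, _, desired = ready.partition("/")
        if status == "Running" and current == desired:
            running += 1
    return running


def budget_satisfied(listing: str, name_prefix: str, min_available: int) -> bool:
    """True when at least `min_available` matching pods are running."""
    return count_running_pods(listing, name_prefix) >= min_available


LISTING = """\
NAME                          READY   STATUS    RESTARTS   AGE
my-service-6dc649f897-22p7f   1/1     Running   0          2h
my-service-6dc649f897-7m72z   1/1     Running   0          46m
my-service-6dc649f897-xkmpd   1/1     Running   0          2h
"""

if __name__ == "__main__":
    print(budget_satisfied(LISTING, "my-service", min_available=3))
```

A check of this kind could act as a steady-state probe, so that a drain which pushes the service below three pods shows up as a failed experiment.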

