
Release to QA, not to Production (Part I)

Written by Petru Rebeja | Jan 16, 2023 12:10:00 PM

Introduction 

It's Friday afternoon, the end of the sprint. A few hours before the weekend starts, the QA engineers are performing the required tests on the last Sprint Backlog Item (SBI). The developer responsible for that item, confident that the SBI meets all acceptance criteria, is already in weekend mood.

Suddenly, a notification pops up: there is an issue with the feature being tested. The developer jumps on it to see what the problem is and, after discussing it with the QA engineer, finds out that the issue is caused by leftover data from the previous SBI.

Having identified the problem, the developer spends a few minutes crafting an SQL script that will clean up the data and hands it to the QA engineer. The QA engineer runs the script on the QA database, retests the SBI from the beginning, and confirms that the system is "back to normal".
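For illustration, such a cleanup script might look like the sketch below; the table and column names are invented for this example and are not taken from any real system.

```sql
-- Hedged sketch of the cleanup script from the story.
-- Table and column names are hypothetical.
BEGIN;

-- Remove the leftover rows created while testing the previous SBI.
DELETE FROM order_items
WHERE order_id IN (SELECT id FROM orders WHERE created_by = 'qa-previous-sbi');

DELETE FROM orders
WHERE created_by = 'qa-previous-sbi';

COMMIT;
```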

Both sigh in relief while the SBI is marked as "Done" and the weekend starts. Bliss! 

1. Getting to the root problem 

Although the day and the sprint goal were saved, I would argue that applying the cleanup script only fixed an issue and left the root problem untouched. To get to the root problem, let's take a closer look at what happened.

The database issue stems from the fact that, instead of being kept as close as possible to the production data (as best practices suggest), the QA database has grown into an entity of its own because the team has not kept it tidy.

When testing an SBI involves changing some of the data in the database, those changes are rarely reverted once the SBI leaves the QA environment. With each such change the two databases (production and QA) grow further and further apart, and the probability of having to apply a workaround script increases. The paradox, however, is that the cleanup script, although it solves the immediate issue, is itself yet another change to the data, one that widens the gap between the QA and production databases even further.

And therein lies our problem: not in the workaround script itself, but in the practice of applying workarounds to patch the proverbial broken pipes instead of building actual deployment pipelines.

But the problem goes one level deeper: sure, we can fix the issue at hand by restoring the database from a production backup, but to solve it once and for all we need to change how we look at the QA environment.

But our root-cause analysis is not complete yet. We can't just say "let's never apply workarounds", because workarounds are a necessary evil of sorts. Let's look at why that is, shall we?

2. Why and when do we apply workarounds in production?  

In the Production environment, a workaround is applied only in critical situations, because ad-hoc changes carry a high risk of breaking the running system.

Unlike the QA environment, where a broken system affects only a few users (namely the QA engineers), a halted Production system incurs much, much higher downtime costs. A DELETE script with an improper or forgotten WHERE condition can wipe out whole tables of data and render the system unusable; in the happiest case, this only leads to frustrated customers who can't use the thing they paid for.
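To make the risk concrete, here is a minimal SQL illustration (the table name is hypothetical): the only difference between a routine cleanup and a disaster is the WHERE clause.

```sql
-- Intended: remove only the rows created by one test run.
DELETE FROM orders WHERE created_by = 'qa-test-user';

-- Forgotten WHERE clause: the same statement now wipes the entire table.
DELETE FROM orders;
```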

As such, in every critical situation first and foremost comes the assessment: is a workaround really needed? 

When the answer is yes (i.e., there is no other way of fixing the issue now), there are usually procedures to follow. Sticking with the database script example (a defensive sketch of such a script follows the list), the minimal procedure would be to:

  1. create the workaround script, 

  2. have that script reviewed and approved by at least one additional person, and 

  3. have the script executed on Production by someone with proper access rights. 
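As a hedged illustration of what such a reviewed script might contain, the sketch below (with hypothetical names) wraps the change in a transaction and makes its impact visible before anything is committed.

```sql
-- Defensive workaround sketch: run inside a transaction, check the impact,
-- and let the operator decide whether to commit. Names are hypothetical.
BEGIN;

-- Reviewer/operator checks that this count matches expectations.
SELECT COUNT(*) AS rows_to_delete
FROM orders
WHERE created_by = 'qa-previous-sbi';

DELETE FROM orders
WHERE created_by = 'qa-previous-sbi';

-- Inspect the result, then run exactly one of the following:
-- COMMIT;
-- ROLLBACK;
```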

OK, now we're settled: workarounds are necessary in critical situations, and they are applied only after assessment, review, and approval. Then, going back to our story, the following question arises:

 Why do we apply workarounds in the QA environment?

The QA environment is isolated from the Production environment and, by definition, has far fewer users. Furthermore, those users have deep technical knowledge of how the system runs, and they always have something else to do (like designing and writing test cases) while the system is being brought back to normal.

Looked at from this point of view, there is almost never a critical situation that requires applying a workaround in the QA environment.

Sure, missing the sprint goal may seem like a critical situation, because commitments are important. But on the other hand, going back to our example: if we're applying a workaround in QA just to promote a feature towards Production, are we really sure that the feature is ready?

Now that the assessment of criticality is done, let's get back to our topic and ask: 

 What if we treated the QA environment like Production?

Production and QA environments are different (very different, I might add); there's no doubt about that. What makes them different, from the point of view of our topic, is the fact that when a feature is deployed to the Production environment, all the prerequisites are known and all the preliminary steps have been executed.

On the other hand, when deploying to the QA environment we don't always have this knowledge, nor are all the preliminary steps always completed. Furthermore, deploying to QA may require additional steps that Production does not, e.g. restoring the database from the latest Production backup, anonymizing the data, etc.
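As a rough illustration of those QA-only steps, the sketch below assumes SQL Server (T-SQL); the database name, backup path, and anonymization rules are all invented for the example.

```sql
-- 1. Restore the latest Production backup over the QA database (hypothetical names and path).
RESTORE DATABASE [ShopQA]
FROM DISK = N'/backups/shop_prod_latest.bak'
WITH REPLACE;

-- 2. Anonymize personal data before the restored copy is used for testing.
USE [ShopQA];
UPDATE dbo.customers
SET email     = CONCAT('user', id, '@example.test'),
    full_name = CONCAT('Customer ', id),
    phone     = NULL;
```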

But the difference in the number of unknowns can be compensated for by the difference in the number of deployments, and by the fact that a failure in the QA environment is not critical. In other words, what we lack in knowledge when deploying to QA can be compensated for by multiple deployment trials, each getting closer and closer to success.

And when it comes to doing repetitive tasks… Automation is Key. 

3. Automation is key

To alleviate the differences between (successive) environments you only need to do one thing, although I must say from the start that achieving it can be really hard: automate everything.

If a release workflow is properly (read: fully) automated, then the difference between environments is reduced mainly to:

  • The group of people who have the proper access rights to run the workflow on a specific environment. With today's tools, this difference simplifies even further: it comes down to the group of people who are allowed to see or push the "Deploy" button.

  • The number and order of "Deploy" buttons a person has to push for the deployment to succeed.

Although we strive to have all our environments behave the same, they are still inherently different. As such, it goes without saying that not everyone will have the rights to deploy to Production, and, due to various constraints, some environments may require additional actions to deploy. Nonetheless, the objective remains the same: avoid manual intervention as much as possible.

Stay tuned for the second part of the article.