What are Test Data Gold Copies

You lean back in your chair with a satisfied grin. You did it. It wasn’t easy, but you did it. You diagnosed and fixed the bug that kept defying your team. And you have the unit tests to prove it.

The grin slowly fades from your face as you realize that you still need your code to pass the integration tests. And you need to get data to use in them. Not your favorite activity.

You can put that grin back on your face because there is another way: using a gold copy.

Read on to learn what a gold copy is and why you want to use one. You will also find out how it can help you work on an application with low test coverage. You know, the dreaded legacy systems.

What Is a Gold Copy

In essence, a gold copy is a set of test data. Nothing more, nothing less. What sets it apart from other sets of test data is the way you use and guard it.

You only change a gold copy when you need to add or remove test cases.
You use a gold copy to set up the initial state of a test environment.
All automated and manual tests work on copies of the gold copy.

A gold copy also functions as the gold standard for all your tests and for everybody testing your application. It contains the data for all the test cases that you need to cover all the features of your product. It may not start out as comprehensive, but that’s the goal.

Building a comprehensive gold copy isn’t easy or quick. But it’s definitely worth it, and it trumps using production data almost every time.

Why You Don’t Want to Test in Production

Continuous delivery adepts rave about testing in production. And yes, that has enormous benefits. However:

It requires the use of feature toggles to restrict access to new features and changed functionality.
Running the automated tests in your builds against a production environment is not going to make you any friends.
The sheer volume of production data usually is prohibitive for a timely feedback loop.
Giving developers access to production data can violate privacy and other data regulations.

There’s more:

Production data changes all the time, and its values are unpredictable, which makes it unsuitable as a base for automated testing.
Finding appropriate test data in production is a challenge. Testing requires edge cases, when users and thus their data tend to be much more alike than they would like to know.
To comply with privacy and other data regulations, extracts need to be anonymized and masked.

Contrived Test Data Isn’t Half as Bad as It Sounds

Contrived examples usually mean that you wouldn’t encounter the example in the real world. However, when it comes to testing, contrived is what you want. A contrived set of test data:

has only one purpose—verifying that your application works as intended and expected and that code changes do not cause regressions
contains a limited amount of data, enabling a faster feedback loop even for end-to-end tests
can be made to be self-identifying and self-descriptive to help understand what specific data is meant to test
contains edge cases that willtrip you up in the real world but are generally absent from production data by their very definition
can be built into a comprehensive, optimized, targeted set of data that fully exercises your application

Of course, production data can be manipulated to achieve the same. But extracting it stresses production, and manipulating it takes time and effort. And you really don’t want to be doing that again and again and again.

That’s why you combine contrived data and gold copies. You start your gold copy with an extract from production data that is of course anonymized and otherwise made to conform to privacy and data regulations. Over time, you manipulate it into that optimized, targeted set of data. But using that initial set of test data as a gold copy will bring you benefits immediately.

Benefits of Gold Copies

In addition to the benefits of contrived data, using a gold copy gets you these benefits:

You can easily set up a test environment with a comprehensive set of test data
You can easily revert the data in a test environment to its original state
The ability to automate spinning up test environments
Automated regression testing for legacy systems

Everyone working on your application will appreciate it. They no longer have to hunt for good data to use in their test cases. And they no longer have to create test data themselves. A good thing, because creating test data and tests that produce false positives (i.e., tests that succeed when they should fail) is incredibly easy. You only have to use the same values a tad too often.

The ability to automate spinning up a test environment is what makes using a gold copy so invaluable for large development shops and shops that need to support many different platforms. Just imagine how much time and effort can be saved when providing teams and individuals with comprehensive, standard test data that can be automated. For example, using containers and a test data management tool like Enov8’s.

Finally, gold copies can help reduce the headaches and anxiety of working with legacy code. Here’s how.

Slaying the Dreaded Legacy Monster

Any system that does not have enough automated unit and integration tests guarding it against regressions is a legacy system. They are hard to change without worrying.

The lack of tests, especially the lack of unit tests, allowed coding practices that now make it hard to bring a legacy system under test. Because bringing it under test requires refactoring the code. And you can’t refactor with any confidence if you have no tests to tell you if you broke something.

Fortunately, a gold copy can bail you out of this one. It allows you to add automated regression testing by using the golden master technique. That technique takes advantage of the fact that any application with value to its users produces all kinds of output.

Steps in the Golden Master Technique

How you implement the golden master technique depends on your environment. But it always follows the same pattern, and it always starts with a gold copy.

Use your current code against the gold copy to generate the output you want to guard against regressions. For example, a CSV export of an order, a PDF print of that order, or even a screenshot of it.
Save that output. It’s your golden master.
Make your changes.
Use your new code against the gold copy to generate the “output under test” again.
Compare the output you just generated to your golden master.
Look for and explain any differences.

If you were refactoring, which by definition means there were no functional changes, the comparison should show that there are no differences.

If you were fixing a bug, the comparison should show a difference. The golden master would have the incorrect value, while the output from the fixed code would have the correct value. No other differences should be found.

If you were changing functionality, you can expect a lot of differences. All of them should be explicable by the change in functionality. Any differences that cannot be explained that way are regressions.

Explaining the differences requires manual assessment by a human. It’s known as the “Guru Checks Output” anti-pattern. And it needs to be done every test run if you want to stay on top of things. Marking differences as expected can help. Especially when you can customize the comparison so it won’t report them as differences.

Go Get Yourself Some Gold

Now that you know what a gold copy is and how you can use it to your advantage, it’s time for action. It’s time to start building toward the goal of a comprehensive set of test data and use it as a gold copy.

Your first step is simple: save the data from the test environment you set up for the issue or feature you’re working on now. That is going to be your gold copy. If your application uses any kind of SQL database, you could use that to generate a DML-SQL script that you can add to a repository.

Use your gold copy to set up the test environment for your next issue. Make sure you don’t (inadvertently) change your gold copy while you’re working on that issue. When you’re finished, and if you needed to add test data for the test cases of this issue, update your gold copy.

Rinse and repeat, and soon enough you’ll be well on your way to a truly useful comprehensive set of test data.

Which Test Data Management Method Is Best?

Replicating Data From Production

The first approach we’re going to cover in this post is perhaps the most popular one, at least for beginners. And that makes perfect sense if you think about it. When you first encounter the challenge of coming up with data to feed your testing processes, it isn’t too far-fetched to think you should just copy data from production and be done with it. It’s the easiest way to obtain data that is as realistic as possible. You just can’t get more real than production.

Not everything is a bed of roses when it comes to production data replication. Quite the opposite, actually. The easy access to data is pretty much the only advantage this method has. And what about the disadvantages? These, sadly, abound.

Here Be Dragons: Some Downsides of the Approach

Here’s the first problem: replicating data from production continues to be mostly a manual process. Sure, you can come up with scripts and automated jobs to do most of the heavy lifting for you. But keep in mind that generating the data isn’t the whole job of a TDM management solution. “Availability” is an integral part of the package. That means that the TDM tool is responsible for making sure the data is available where it’s needed, at the right time. A naive approach based on scripts might not be sufficient to manage the demands of a complex testing process, forcing you to rely on a manual process to do so.

Secondly, production replication doesn’t lend itself well to negative test cases. It’d be out of the scope of this post to give a lengthy explanation of negative testing. In a nutshell, negative test cases are tests that validate the system against invalid data. Basically, you throw faulty data at your application to check how well it can handle it. Since production data would (hopefully) be in good shape, this approach isn’t well suited to this type of testing.

Production data replication also doesn’t work…if there is not data replication for you to replicate in the first place! What should you do when you need to test an application that is still in the alpha stage of development or even a prototype? Since no one is actually using the application, there would be no production data for you to copy. That’s a severe downside of this approach since every new application will face this problem.

Here Be Dragons (For Real): Legal Implications

Finally, we have the most serious downside of this approach—data sensitivity. Data compliance is a crucial part of the modern IT landscape since companies are responsible for the data they store and manipulate. It’s up to them to protect their client’s data, ensuring it’s not abused. When replicating data from production, software organizations run the risk of failing to comply with privacy acts, such as GDPR. And that can bring catastrophic consequences, legal, financial, and reputation-wise.

Data Masking

In order to solve the downsides of production data replication (a.k.a the naive approach), test data management tools have come up with more sophisticated methods. One of the
most popular of these approaches is test data masking. As its name implies, tools that adopt this approach enable its users to apply masks to production data. Such masks will remove personally identifiable information (PII) from the data.

Data masking is an improvement over naive production data replication, for sure. But the approach is not without its downsides.

First, consider the “time” variable. Data masking doesn’t reduce the time spent generating (or rather, copying) the data for testing. On the contrary, it increases it because now you have a new added in the process. You could argue—and I’d gladly agree—that it’s time well spent, but it’s more time nonetheless.

Then, you also have to keep in mind that data masking isn’t a standalone approach on its own. Instead, it complements the previous approach by solving one of its more serious issues. The problem is data masking can’t fix every problem that the production replication approach has. For instance, if you intend to test an application still in development, for which there is no production data at all, data masking is powerless to help you.

Synthetic Data Generation

Synthetic data generation is yet another method of test data management. As its name suggests, this approach consists of generating “fake”—or synthetic—data from a data model. Tools that implement this approach are able to preserve the format of the data. The values themselves, though, are completely disconnected from any original data. What does that imply?

The implication of this is that synthetic data generation’s greatest asset is simultaneously its most significant downside. By populating the database with entirely “made-up” values, the approach dramatically reduces (virtually eliminates) the risk of exposing sensitive data. On the other hand, depending on the tool’s sophistication—or lack of—you might end up with data that feels “fake-y.” One of the goals of an excellent TDM strategy is to provide data that is as production-like as possible.

To wrap-up, let’s talk about the biggest advantage of synthetic data generation, namely: speed. Once you have a model in place, you can quickly generate data from it, effectively eliminating the time delays that plague other approaches.

Test Data Management Is More Than Test Data Generation

In this post, we’ve covered some of the most used approaches to generate test data. The list is definitely not exhaustive; there are many more methods that we didn’t cover. However, many of them are variations or combinations of the approaches we did talk about.

Another thing to keep in mind is that test data management is much more than just generating test data. TDM is responsible for ensuring the quality of the test data, its availability, and also its security. In other words: the data must be good, and it must be available at the right place, at the right time. And bad actors shouldn’t be allowed to expose it or misuse it in any way. That’s why, depending on the needs of your organization, you should consider adopting a full-fledged data compliance solution, which can not only supply your data generation needs but also make sure your data adhere to the compliance requirements you must follow.

DataOps Explained

What Is DataOps?

The term DataOps is an abbreviation of the words data operations.

The speed of development and product release has decreased in the last 10 years due to technologies such as DevOps (development operations). As a result, we have a new problem: data and more data. To help draw insight from loads of raw data, companies use data analytics. Of course, there are various types, such as data mining, that help identify trends, patterns, and relationships in large data sets. Unfortunately, in our need-it-now economy, users of data analytics can’t—or won’t—wait for weeks or months to receive new analytics.

With the increased complexity of the emerging data ecosystem and the need to deliver insights more quickly, a new strategy is essential if we’re to gain value from massive amounts of data.

This is where DataOps comes in. It helps improve the delivery speed and robustness of analytics. In other words, DataOps is an automated, process-oriented methodology that helps analytics and data teams improve the quality of data analytics, as well as reduce its cycle time. To achieve this, DataOps combines agile development, DevOps, and statistical process control.

Similar to how DevOps brought together development and operations teams to handle software delivery problems, DataOps seeks to bring together data practitioners to deliver quality data for applications and business processes.

But do we really need another methodology?

Why DataOps Matters

In our current on-demand economy, a company has to rely on data from various sources to better understand their products, customers, and markets. This all sounds good until you factor in the dynamic nature of data. How do you effectively monitor the flow of a company’s data that includes prediction changes, business anomalies, trend changes, and more?

Someone could argue that we already have analytics to handle all of the data issues. But here’s the problem: Data analytics pipelines are in a deplorable state because of

Inadequate automation and orchestration
Minimal code and data reuse
Or a lack of coordination between the involved parties, such as IT, operations, and even business stakeholders.

In the end, we have poor-quality data that’s delivered too late to meet a business’s needs.

As more and more data is collected, the data pipelines become more complex. At the same time, large, more traditional enterprises realize the need to use all the data their company generates. Such information is becoming important even in everyday decisions.

Needless to say, all of these factors make it necessary for an organization to implement a new approach to govern the flow of data through its life cycle.

And here’s one more reason to consider using DataOps. Companies that have already implemented DevOps practices will find that implementing DataOps gives them a higher competitive edge. This is because the DevOps engineering framework may be regarded as preparation for DataOps. Organizations that rely on data need a similar high-quality and consistent framework that’s useful for fast data analysis.

Implementing DataOps in 7 Steps

DataOps is still a rising approach for data-driven organizations. DataKitchen, a company that developed a DataOps platform for data-driven enterprises, suggests seven steps for implementation. And the good news is you don’t have to discard your existing analytics tools.

Here are the seven steps to implementing DataOps.

Add Data and Logic Tests

This step requires that every time you make changes to an analytics pipeline, you have to add a test for the change. Testing applies to data, models, and logic. The idea is to make sure nothing will be broken in the analytics pipeline. These incremental, automated tests ensure that quality and integrity are built into the final output.

Use a Version Control System

In order for raw data to produce useful information, it goes through many processing steps. And all of these steps involve coding. In a similar manner to other software projects, the source files that data analysts use in the data pipeline require maintenance in a version control system such as Git. The aim of version control is to help keep track of changes and revisions. Keeping the code in a repository is also important, as it helps when there is a need for disaster recovery.

Branch and Merge

To maintain coding changes, data analytics should borrow the approach that software developers use to maintain their projects, which is to continuously update code source files. For instance, when a developer wishes to make changes, they pull out the relevant code from the repository. Changes are then made on the local copy (also called a branch) pulled from the repository. Once new changes are made and tested, the local copy (branch) is merged back into the repository.

Use Multiple Environments

Data analytics team members should have their own environment to work from. These environments will allow team members to work on subsets of data while isolating the rest of the organization from any effects of the ongoing maintenance or additions to the existing data.

Reuse and Containerize

Breaking down a data analytics pipeline into smaller components facilitates code reuse and containerization. By doing this, the data analytics team can move quickly as they leverage existing libraries or other code whenever they want to extend or develop new code.

Parameterize Your Processing

Borrowing the idea of parameters from software development will help in designing a robust data pipeline. And a flexible data-analytics pipeline will accommodate varying run-time circumstances.

Use Simple Storage

Simple storage helps make the whole data analytics pipeline readily available, and it eases the updating process.

What About Data Security?

There’s a lot of concern about how to gain insights from raw data in a robust yet fast manner. But we shouldn’t forget the consequences of data breaches across the globe. The costs you may incur for mishandling personally identifiable data is becoming too expensive. As you work toward building more and delivering faster, it’s important to consider the security of the data you handle.

When implementing DataOps, you must protect the data at every stage of its journey. Always keep in mind the bad guys who are ready to grab your data. And don’t forget the issue of accidentally sharing sensitive data that may cause you to fail to meet regulatory compliance.

Thankfully, there are solutions that help take these worries away, such as Data HotSpot—a product specifically designed for those in test data management and those who consume test data. With Data HotSpot, you are assured complete security, customer protection, brand protection, and penalty avoidance. That means you can implement DataOps and stay way ahead of your competitors with real-time or near real-time analytics.

Unlock the Value of Data

Today, there’s a need to avail data in real-time or near real-time because businesses rely on it to retain a competitive edge. As a result, it became necessary to create analytics methods that can quickly provide data for consumption by users or applications.

DataOps is a multidisciplinary approach that helps data analytics teams overcome the challenges of inflexible and poor-quality data. If an organization can implement DataOps properly, they will experience great improvements in producing robust and adaptive analytics.

As we’ve seen, DataOps matters today because it helps organizations create reliable and readily available data flows. And availability plays an important role in unlocking the value of an organization’s data.

Category Archives: Test Data Management