Which TDM Method is Best

Which Test Data Management Method Is Best?

Introduction

Setting up a great test data management strategy is a crucial step for taking your test automation process to its fullest potential. However, many software professionals are still not familiar with the concept of test data management (TDM). Even those who are familiar with TDM might have a hard time putting it into practice. Why is that?

When it comes to test data management, the “what” is relatively straightforward, but we can’t say the same about the “how.” As it turns out, there are several competing methods of managing test data. Which one should you choose? As you’ll see in this post, this isn’t a one-approach-fits-all kind of situation. Each method has its unique strengths and weaknesses and might be more or less appropriate for your use case.

Today’s post will cover some of the existing test data management approaches, listing the advantages and disadvantages of each one. Let’s get started.

Replicating Data From Production

The first approach we’re going to cover in this post is perhaps the most popular one, at least for beginners. And that makes perfect sense if you think about it. When you first encounter the challenge of coming up with data to feed your testing processes, it isn’t too far-fetched to think you should just copy data from production and be done with it. It’s the easiest way to obtain data that is as realistic as possible. You just can’t get more real than production.

Not everything is a bed of roses when it comes to production data replication. Quite the opposite, actually. The easy access to data is pretty much the only advantage this method has. And what about the disadvantages? These, sadly, abound.

Here Be Dragons: Some Downsides of the Approach

Here’s the first problem: replicating data from production continues to be mostly a manual process. Sure, you can come up with scripts and automated jobs to do most of the heavy lifting for you. But keep in mind that generating the data isn’t the whole job of a TDM solution. “Availability” is an integral part of the package. That means that the TDM tool is responsible for making sure the data is available where it’s needed, at the right time. A naive approach based on scripts might not be sufficient to manage the demands of a complex testing process, forcing you to rely on a manual process to do so.

Secondly, production replication doesn’t lend itself well to negative test cases. It’d be out of the scope of this post to give a lengthy explanation of negative testing. In a nutshell, negative test cases are tests that validate the system against invalid data. Basically, you throw faulty data at your application to check how well it can handle it. Since production data would (hopefully) be in good shape, this approach isn’t well suited to this type of testing.
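To make the idea of a negative test case more concrete, here is a minimal sketch in Python. The `validate_signup` function, its field names, and its rules are assumptions invented for this example; they don't come from any real application. The point is only that the faulty records have to be written by hand, because clean production data is unlikely to contain them.

```python
import re

# Hypothetical validation rules, assumed only for this example.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_signup(record):
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = []
    email = record.get("email")
    if not isinstance(email, str) or not EMAIL_PATTERN.match(email):
        errors.append("invalid email")
    age = record.get("age")
    if not isinstance(age, int) or isinstance(age, bool) or not 0 < age < 150:
        errors.append("invalid age")
    return errors


# Faulty inputs that clean production data is unlikely to contain.
NEGATIVE_CASES = [
    {"email": "not-an-email", "age": 30},
    {"email": "ana@example.com", "age": -5},
    {"email": None, "age": "thirty"},
    {},
]

for case in NEGATIVE_CASES:
    # Every faulty record must produce at least one error.
    assert validate_signup(case), f"expected rejection for {case!r}"
```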

Production data replication also doesn’t work…if there is no production data for you to replicate in the first place! What should you do when you need to test an application that is still in the alpha stage of development or even a prototype? Since no one is actually using the application, there would be no production data for you to copy. That’s a severe downside of this approach since every new application will face this problem.

Here Be Dragons (For Real): Legal Implications

Finally, we have the most serious downside of this approach: data sensitivity. Data compliance is a crucial part of the modern IT landscape since companies are responsible for the data they store and manipulate. It’s up to them to protect their clients’ data, ensuring it’s not abused. When replicating data from production, software organizations run the risk of failing to comply with privacy regulations, such as GDPR. And that can bring catastrophic consequences, legally, financially, and reputation-wise.

Data Masking

In order to solve the downsides of production data replication (a.k.a. the naive approach), test data management tools have come up with more sophisticated methods. One of the most popular of these approaches is test data masking. As its name implies, tools that adopt this approach enable their users to apply masks to production data. Such masks will remove personally identifiable information (PII) from the data.
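As a rough sketch of what applying a mask can look like, the Python snippet below replaces the PII values in a single record with mask characters. The record layout, the list of PII fields, and both helper functions are assumptions made for illustration; real masking tools are considerably more sophisticated.

```python
def mask_email(email):
    """Hide the local part of an email address and swap in a neutral domain."""
    local, _, _domain = email.partition("@")
    return "x" * len(local) + "@example.com"


def mask_record(record, pii_fields=("name", "phone")):
    """Return a copy of the record with PII values replaced by mask characters."""
    masked = dict(record)
    for field in pii_fields:
        if field in masked:
            masked[field] = "*" * len(str(masked[field]))
    if "email" in masked:
        masked["email"] = mask_email(masked["email"])
    return masked


# Made-up customer record standing in for a row copied from production.
customer = {"id": 42, "name": "Joe Doe", "email": "joe@corp.io", "plan": "pro"}
print(mask_record(customer))
```

Non-sensitive fields such as `id` and `plan` pass through untouched, so the masked copy keeps the shape of the original data.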

Data masking is an improvement over naive production data replication, for sure. But the approach is not without its downsides.

First, consider the “time” variable. Data masking doesn’t reduce the time spent generating (or rather, copying) the data for testing. On the contrary, it increases it because now you have a new step added to the process. You could argue (and I’d gladly agree) that it’s time well spent, but it’s more time nonetheless.

Then, you also have to keep in mind that data masking isn’t a standalone approach. Instead, it complements the previous approach by solving one of its more serious issues. The problem is that data masking can’t fix every problem that the production replication approach has. For instance, if you intend to test an application still in development, for which there is no production data at all, data masking is powerless to help you.

Synthetic Data Generation

Synthetic data generation is yet another method of test data management. As its name suggests, this approach consists of generating “fake”—or synthetic—data from a data model. Tools that implement this approach are able to preserve the format of the data. The values themselves, though, are completely disconnected from any original data. What does that imply?

The implication of this is that synthetic data generation’s greatest asset is simultaneously its most significant downside. By populating the database with entirely “made-up” values, the approach dramatically reduces (virtually eliminates) the risk of exposing sensitive data. On the other hand, depending on the tool’s sophistication, or lack thereof, you might end up with data that feels “fake-y.” That’s a problem because one of the goals of an excellent TDM strategy is to provide data that is as production-like as possible.

To wrap up, let’s talk about another major advantage of synthetic data generation: speed. Once you have a model in place, you can quickly generate data from it, effectively eliminating the time delays that plague other approaches.
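Here is a minimal sketch of generating rows from a data model, using only Python’s standard library. The `CUSTOMER_MODEL` fields and the `generate_rows` helper are hypothetical and exist only for this example; dedicated tools offer far richer models.

```python
import random
import string

# Hypothetical data model: field name -> generator taking a random.Random instance.
CUSTOMER_MODEL = {
    "customer_id": lambda rng: rng.randint(100000, 999999),
    "name": lambda rng: "".join(rng.choices(string.ascii_lowercase, k=8)).title(),
    "plan": lambda rng: rng.choice(["free", "pro", "enterprise"]),
    "monthly_spend": lambda rng: round(rng.uniform(0, 500), 2),
}


def generate_rows(model, count, seed=0):
    """Generate `count` synthetic rows; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [{field: make(rng) for field, make in model.items()} for _ in range(count)]


rows = generate_rows(CUSTOMER_MODEL, count=3, seed=7)
for row in rows:
    print(row)
```

The names this produces are random letter strings, which is a small-scale instance of the “fake-y” data problem: the format is right, but the values don’t look like anything a real customer would enter.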

Test Data Management Is More Than Test Data Generation

In this post, we’ve covered some of the most used approaches to generate test data. The list is definitely not exhaustive; there are many more methods that we didn’t cover. However, many of them are variations or combinations of the approaches we did talk about.

Another thing to keep in mind is that test data management is much more than just generating test data. TDM is responsible for ensuring the quality of the test data, its availability, and also its security. In other words: the data must be good, and it must be available at the right place, at the right time. And bad actors shouldn’t be allowed to expose it or misuse it in any way. That’s why, depending on the needs of your organization, you should consider adopting a full-fledged data compliance solution, which can not only supply your data generation needs but also make sure your data adheres to the compliance requirements you must follow.

Author Carlos Schults

This post was written by Carlos Schults. Carlos is a .NET software developer with experience in both desktop and web development, and he’s now trying his hand at mobile. He has a passion for writing clean and concise code, and he’s interested in practices that help you improve app health, such as code review, automated testing, and continuous build.

DataOps Explained

Preamble

Companies—especially large internet companies—treat collections of data as an asset. And more and more companies are developing an appetite to leverage their data to compete. There are also increasing customer expectations for the fast release of high-quality products or services.

So how do you balance speed and quality? DataOps is your answer. Let’s take a look at what DataOps is and why it matters.

What Is DataOps?

The term DataOps is an abbreviation of the words data operations.

The speed of development and product release has increased in the last 10 years thanks to practices such as DevOps (development and operations). As a result, we have a new problem: data and more data. To help draw insight from loads of raw data, companies use data analytics. Of course, there are various types, such as data mining, that help identify trends, patterns, and relationships in large data sets. Unfortunately, in our need-it-now economy, users of data analytics can’t, or won’t, wait for weeks or months to receive new analytics.

With the increased complexity of the emerging data ecosystem and the need to deliver insights more quickly, a new strategy is essential if we’re to gain value from massive amounts of data.

This is where DataOps comes in. It helps improve the delivery speed and robustness of analytics. In other words, DataOps is an automated, process-oriented methodology that helps analytics and data teams improve the quality of data analytics, as well as reduce its cycle time. To achieve this, DataOps combines agile development, DevOps, and statistical process control.

Similar to how DevOps brought together development and operations teams to handle software delivery problems, DataOps seeks to bring together data practitioners to deliver quality data for applications and business processes.

But do we really need another methodology?

Why DataOps Matters

In our current on-demand economy, a company has to rely on data from various sources to better understand their products, customers, and markets. This all sounds good until you factor in the dynamic nature of data. How do you effectively monitor the flow of a company’s data that includes prediction changes, business anomalies, trend changes, and more?

Someone could argue that we already have analytics to handle all of the data issues. But here’s the problem: data analytics pipelines are in a deplorable state because of

  • Inadequate automation and orchestration
  • Minimal code and data reuse
  • A lack of coordination between the involved parties, such as IT, operations, and even business stakeholders

In the end, we have poor-quality data that’s delivered too late to meet a business’s needs.

As more and more data is collected, the data pipelines become more complex. At the same time, large, more traditional enterprises realize the need to use all the data their company generates. Such information is becoming important even in everyday decisions.

Needless to say, all of these factors make it necessary for an organization to implement a new approach to govern the flow of data through its life cycle.

And here’s one more reason to consider using DataOps. Companies that have already implemented DevOps practices will find that implementing DataOps gives them a sharper competitive edge. This is because the DevOps engineering framework may be regarded as preparation for DataOps. Organizations that rely on data need a similarly high-quality and consistent framework that’s useful for fast data analysis.

Implementing DataOps in 7 Steps

DataOps is still a rising approach for data-driven organizations. DataKitchen, a company that developed a DataOps platform for data-driven enterprises, suggests seven steps for implementation. And the good news is you don’t have to discard your existing analytics tools.

Here are the seven steps to implementing DataOps.

Add Data and Logic Tests

This step requires that every time you make changes to an analytics pipeline, you have to add a test for the change. Testing applies to data, models, and logic. The idea is to make sure nothing will be broken in the analytics pipeline. These incremental, automated tests ensure that quality and integrity are built into the final output.
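As an illustration of this step, the sketch below pairs a data test with a logic test for a tiny, made-up pipeline stage. The `check_orders` and `total_revenue` functions and the order fields are assumptions for this example only, not part of any specific DataOps platform.

```python
def check_orders(rows):
    """Data test: return a list of problems found in a batch of order rows."""
    problems = []
    seen_ids = set()
    for index, row in enumerate(rows):
        if row.get("order_id") in seen_ids:
            problems.append(f"row {index}: duplicate order_id {row.get('order_id')}")
        seen_ids.add(row.get("order_id"))
        if row.get("amount") is None or row["amount"] < 0:
            problems.append(f"row {index}: missing or negative amount")
    return problems


def total_revenue(rows):
    """Pipeline logic under test: sum the order amounts."""
    return sum(row["amount"] for row in rows)


batch = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 5.5},
]

# Data test: the incoming batch must be clean before it flows downstream.
assert check_orders(batch) == []
# Logic test: the transformation must give the expected result on known input.
assert total_revenue(batch) == 25.5
```

Running checks like these automatically on every pipeline change is one way to catch a broken batch or a broken transformation before it reaches the final output.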

Use a Version Control System

In order for raw data to produce useful information, it goes through many processing steps. And all of these steps involve coding. In a similar manner to other software projects, the source files that data analysts use in the data pipeline require maintenance in a version control system such as Git. The aim of version control is to help keep track of changes and revisions. Keeping the code in a repository is also important, as it helps when there is a need for disaster recovery.

Branch and Merge

To maintain coding changes, data analytics should borrow the approach that software developers use to maintain their projects, which is to continuously update code source files. For instance, when a developer wishes to make changes, they pull out the relevant code from the repository. Changes are then made on the local copy (also called a branch) pulled from the repository. Once new changes are made and tested, the local copy (branch) is merged back into the repository.

Use Multiple Environments

Data analytics team members should have their own environment to work from. These environments will allow team members to work on subsets of data while isolating the rest of the organization from any effects of the ongoing maintenance or additions to the existing data.

Reuse and Containerize

Breaking down a data analytics pipeline into smaller components facilitates code reuse and containerization. By doing this, the data analytics team can move quickly as they leverage existing libraries or other code whenever they want to extend or develop new code.

Parameterize Your Processing

Borrowing the idea of parameters from software development will help in designing a robust data pipeline. And a flexible data-analytics pipeline will accommodate varying run-time circumstances.
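For example, a pipeline step might expose its run-time options as parameters instead of hard-coding them. The sketch below uses Python’s standard `argparse` module; the parameter names (`--environment`, `--sample-fraction`, `--run-date`) and the `run_step` function are hypothetical choices made for this illustration.

```python
import argparse


def build_parser():
    """Expose the run-time knobs of a hypothetical pipeline step as parameters."""
    parser = argparse.ArgumentParser(description="Run one analytics pipeline step.")
    parser.add_argument("--environment", choices=["dev", "test", "prod"], default="dev")
    parser.add_argument("--sample-fraction", type=float, default=1.0)
    parser.add_argument("--run-date", default="2020-01-01")
    return parser


def run_step(rows, sample_fraction):
    """Process only the leading fraction of the rows, as requested by the caller."""
    keep = int(len(rows) * sample_fraction)
    return rows[:keep]


# Parse an explicit argument list here so the example runs without a command line.
args = build_parser().parse_args(["--environment", "test", "--sample-fraction", "0.5"])
print(args.environment, run_step(list(range(10)), args.sample_fraction))
```

The same code can then run against a small sample in a test environment and against the full data set in production, with only the parameters changing.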

Use Simple Storage

Simple storage helps make the whole data analytics pipeline readily available, and it eases the updating process.

What About Data Security?

There’s a lot of concern about how to gain insights from raw data in a robust yet fast manner. But we shouldn’t forget the consequences of data breaches across the globe. The cost of mishandling personally identifiable data is becoming too high. As you work toward building more and delivering faster, it’s important to consider the security of the data you handle.

When implementing DataOps, you must protect the data at every stage of its journey. Always keep in mind the bad guys who are ready to grab your data. And don’t forget the issue of accidentally sharing sensitive data that may cause you to fail to meet regulatory compliance.

Thankfully, there are solutions that help take these worries away, such as Data HotSpot—a product specifically designed for those in test data management and those who consume test data. With Data HotSpot, you are assured complete security, customer protection, brand protection, and penalty avoidance. That means you can implement DataOps and stay way ahead of your competitors with real-time or near real-time analytics.

Unlock the Value of Data

Today, there’s a need to make data available in real time or near real time because businesses rely on it to retain a competitive edge. As a result, it has become necessary to create analytics methods that can quickly provide data for consumption by users or applications.

DataOps is a multidisciplinary approach that helps data analytics teams overcome the challenges of inflexible and poor-quality data. If an organization can implement DataOps properly, it will experience great improvements in producing robust and adaptive analytics.

As we’ve seen, DataOps matters today because it helps organizations create reliable and readily available data flows. And availability plays an important role in unlocking the value of an organization’s data.

Author: Alice Njenga

This post was written by Alice Njenga. Alice’s areas of expertise include technology, artificial intelligence, IoT, cloud computing, security, and telecommunication. She especially enjoys converting dense technical material to articles that are easy for the layman to understand.

Test Data Management

5 Steps to Better Test Data Management

Foreword

I always say that it's important to test in production because nothing compares to a production environment. But it wouldn't be very professional of you to test only, and directly, in production. Testing only in production usually gives the impression that you didn't care enough to test before you reached the production stage.

But I'd say that in order for you to even dare to test something in production, you need to have run a set of tests previously in a similar environment—including all the data you need for testing.

That's where test data management (TDM) comes in.

TDM is the process of creating test data that's similar to the real data being used in a production environment. Developers and testers use this non-production data to be more confident that the new software changes aren't going to break anything during the release.

A good testing strategy is usually accompanied by testing data taken from production. Developers sometimes use funny, dummy, and non-real data as well. But what are the steps that you need to follow in order to create good test data?

Top 5 Considerations

#1 Only Use the Data You Really Need

If you don't know where to go for your next vacation, you could just book the next flight and start packing. You might have the best experience of your life...or the worst. Likewise, if you don't know what data you really need for testing, you might just use it all. When you test software without knowing which scenarios you need to cover, an exact copy of the production database is tempting because it's the easiest way to start testing with real data. The downside is that you'll end up spending too much time and money waiting to get your copy of the data.

When you start creating your testing process by building the list of test cases you'll need, it becomes pretty obvious how much and what type of data you're going to need. More importantly, think about your testing process as an iterative one. If you start testing the login page, you don't need to have all the information from the user for that test case, such as their birthdate or home address.

As you keep iterating, you're going to need more testing data. And as you find more bugs, you're going to need more real data. Unless you need to run stress tests, subsetting data is going to be enough for the majority of the test cases. And even if you still need to validate that the system can handle high waves of traffic, you can also generate varied static data for that purpose. More on this later.

Taking small sets of your production database should be enough for most of the tests you'll run to validate the software. You'll also reduce costs and complexity when building only the test data you really need at the moment.
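To show what subsetting can look like in practice, here's a minimal sketch using Python's built-in `sqlite3` module. The `users` and `orders` tables, their columns, and the `copy_subset` helper are assumptions made for this example, and the "production" database is just an in-memory stand-in filled with made-up rows.

```python
import sqlite3


def copy_subset(source, target, limit):
    """Copy the first `limit` users, plus only their orders, into the target database."""
    target.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    target.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
    users = source.execute("SELECT id, email FROM users ORDER BY id LIMIT ?", (limit,)).fetchall()
    target.executemany("INSERT INTO users VALUES (?, ?)", users)
    for user_id, _email in users:
        orders = source.execute(
            "SELECT id, user_id, total FROM orders WHERE user_id = ?", (user_id,)
        ).fetchall()
        target.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    target.commit()


# Stand-in for a production database, populated with made-up rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL)")
source.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"user{i}@example.com") for i in range(1, 101)],
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, (i % 100) + 1, 9.99) for i in range(1, 301)],
)

subset = sqlite3.connect(":memory:")
copy_subset(source, subset, limit=10)
print(subset.execute("SELECT COUNT(*) FROM users").fetchone()[0])
```

Copying the related orders along with each user keeps the subset internally consistent, so tests that join the two tables still behave sensibly.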

#2 Avoid Having Sensitive Data for Testing

We've seen a lot of recent GDPR-related lawsuits involving big companies in Europe. Europe is taking data protection more seriously than other regions. Pretty soon, regulations like GDPR will likely be implemented on other continents too. If GDPR is already affecting other companies, we'd better avoid having unprotected sensitive data in our testing environments.

SOX compliance regulation fosters a separation of duties within an organization. I've worked with these types of regulations. In my experience, auditors mainly want to see that only certain people have access to the production environments. These people with privileged permissions are legally responsible for what happens with customer data.

Even with regulations in place, data is still leaked. We have to be prepared for that, so you should operate as if you expect the information you're storing will be stolen someday. Mask any data that could identify a person, or what's also called personally identifiable information (PII).

Use irreversible methods to mask data so that it can't be unmasked. And make sure you're constantly checking that PII is protected. Managing test data will be simpler and easier if you create subsets of the data to fulfill your different testing needs. And you won't have to worry about giving sensitive data to developers or whoever needs production data for testing.

Ideally, try to avoid having sensitive data. But since sometimes you can't avoid it, try to keep PII data at a minimum, and securely mask the data you need to have.
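One common way to mask values so they're hard to reverse is a keyed one-way hash. The sketch below uses Python's standard `hmac` and `hashlib` modules; the `pseudonymize` helper, the 16-character token length, and generating the key on the spot are all assumptions made for this illustration, not a recommendation for how to manage keys in a real system.

```python
import hashlib
import hmac
import secrets

# Secret key held only by whoever runs the masking job; generated here for the demo.
MASKING_KEY = secrets.token_bytes(32)


def pseudonymize(value, key=MASKING_KEY):
    """Replace a PII value with a keyed, one-way token."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


token_a = pseudonymize("joe@example.com")
token_b = pseudonymize("joe@example.com")
# The same input always maps to the same token, so joins across tables still work.
assert token_a == token_b
# Different inputs map to different tokens.
assert token_a != pseudonymize("jane@example.com")
```

Because the token can't be recomputed without the key, someone who only sees the test data can't simply hash a list of known emails and look for matches.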

#3 Build Synthetic Data for Better Efficiency

Even if you've decided to mask sensitive data (especially data that's going to be used for testing purposes), you want to make the security gap as small as possible by not including sensitive data in your tests at all, even if it's masked. One way to improve security is to replace real data (like credit card numbers) with autogenerated dummy data. That's synthetic data, and it will help you test more efficiently.

You can take advantage of the synthetic data approach by using more realistic data than just dummy data. For example, you might have a user called Joe in your records, but for testing purposes, you decided Joe will be called Jeremy. This gives you a chance to run machine learning experiments where you can learn more about "Jeremy's" preferences without knowing that Jeremy is actually Joe. You're protecting Joe, even if the data is leaked or misused.

Synthetic data makes real data more shareable because you only have the data you need. Why would you need to know a person's name if you're just trying to replicate a bug in production? You're only interested in knowing which paths through the system's workflow the user took. What matters is why the data ended up in a certain state that caused the software to break. You can then decide either to ignore the person's name or replace it with other "real" data.

If you need to have large amounts of data for performance testing, you can use synthetic data to double the size of the database. Along with the previous benefits we discussed, synthetic data makes your tests more efficient by only using the data you need to cover specific test scenarios.

#4 Create Test Data As a Self-Service Model

DBAs are often in charge of generating testing data. They know the best ways to do it and what data is shareable among teams (as I explained in the previous section), and sometimes they're the only ones who have access to production databases. When this happens, the DBAs become a bottleneck, and the time spent in the testing stage increases.

That's why you should create test data as a self-service model. It's not just so you don't constantly interrupt DBAs when a developer or tester needs data. The ability to automatically have testing data will let you parallelize the boring task of manually generating data for testing. Do you need to reduce the testing time? Fine. Create more subsets of testing data in parallel and distribute the test cases.

Another benefit of having a self-service model is that you can easily drop and re-create environments on demand. By doing this, you ensure repetition and predictable results when preparing testing data. It's also easier to include TDM in your CI/CD pipeline, which brings you closer each day to one-click deployments.
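As a small sketch of what dropping and re-creating an environment on demand could look like, the function below builds a throwaway test database with repeatable contents. The `create_test_database` helper, the table layout, and the use of an in-memory SQLite database are assumptions made for this example.

```python
import random
import sqlite3


def create_test_database(user_count=50, seed=123):
    """Build a fresh, throwaway test database with repeatable contents."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT, credits INTEGER)")
    rows = [
        (user_id, rng.choice(["free", "pro"]), rng.randint(0, 100))
        for user_id in range(1, user_count + 1)
    ]
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


# Any developer or tester can call this on demand, without waiting for a DBA.
first = create_test_database()
snapshot = first.execute("SELECT * FROM users ORDER BY id").fetchall()
first.close()  # drop the environment

second = create_test_database()  # re-create it
# A fixed seed means the re-created environment holds exactly the same data.
assert second.execute("SELECT * FROM users ORDER BY id").fetchall() == snapshot
```

Because the function takes no manual steps, a CI/CD job could call it before each test run in the same way a person would.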

Creating a self-service model is far from an easy task. So it's important that DBAs, developers, and testers work together to create this model. Not all of them have the same needs and objectives. Combine their experience, knowledge, and skill to create a better model for test data.

#5 Keep Testing Data up to Date

Last but not least, keep your testing data up to date. Your software will continue evolving, so the test scenarios and the data they need will keep changing over time too. Some test scenarios will become obsolete, and so will their data. Try to always keep the house clean by making sure you're only generating the testing data that's still relevant.

This process takes discipline, and good communication within the team always helps. Developers need to inform everyone which tests are no longer needed and when it's OK to remove them. And either DBAs or testers need to keep confirming that the data they're using for tests is still valid and relevant.

Keeping data fresh might seem like common sense. But I've seen delivery pipelines where tests continue to grow, even though some of the features no longer exist. Sometimes we get too extreme about trying to have a high percentage of test coverage, which isn't efficient.

Having up-to-date testing data will help you have higher-quality TDM.

Benefits: Better Test Results With Better Testing Data

I'd say that testing is the most important stage of any software release life cycle. The more quickly you can verify that everything is still working, the better. Always keep the mindset that parallelizing testing will help you to speed up the process. For that, you need to have better test data quality, and it's not always necessary to have an exact replica of what you have in production. In fact, if you don't, it may help you in the cost, security, or speed departments.

It's important that you start by defining what you truly need and iterate from that. Automation helps with repetitive and boring tasks, but you still need to account for the human side of generating data for testing purposes.

TDM helps you provide only the data you need, on time and securely.

Author: Christian Melendez

Further reading suggestions: Holistic Test Data Management