Just Enough ITSM (or ITSM for Non-Production)

Preamble

We’ve all experienced the frustration that comes from too much or too little service management in a test environment. Lately, the DevOps engineer in me has been thinking about how we end up in one of those states. How can we get just enough service management in non-production environments?

Production environments require more service management than non-prod environments. But we shouldn’t throw the baby out with the bathwater when it comes to service management in non-prod. I’m a software developer who practices DevOps, so I do a lot of work involving operations, deployment, and automation. I interface with many groups to achieve a good workflow within the organization.

Operations and development often have contradictory goals. Fortunately, we can all find common ground by working together. Understanding each other’s needs and goals through communication is the key to success!

But before we get into that, let’s explore the world of IT service management (ITSM) for a bit. In this post, I’ll discuss different levels of service management in non-prod environments and borrow some fundamental DevOps principles that can help you get the right amount of ITSM. Let’s start with an overview of non-production environments.

What Are Non-Production Environments?

We use non-production environments for development, testing, and demonstrations. It’s best to keep them as independent as possible to avoid any crosstalk. We wouldn’t want issues in one environment to affect any of the others.

These environments’ users are often internal—for the most part, we’re talking about developers, testers, and stakeholders. It’s safe to assume that anyone in the company is a potential user. It’s also safe to assume that anyone providing a service to the company might have access to non-production environments. But there could also be external users accessing these environments, perhaps for testing purposes.

Unless you have the environment in question tightly controlled, you may not know who those users are. That’s a big problem. It’s important to understand who’s using which environments in case someone inadvertently has access to unauthorized information. Or maybe you just need to know who needs to stay informed about changes or outages in a specific environment.

That’s where service management comes in. The next section explains how bad things can be when there is no service management in non-production. This exercise should be fun…or it might make you queasy. Better have a seat and buckle up just in case!

When You Have Zero Service Management in Non-Prod

Let’s call this the state of anarchy. Here’s what it looks like:

  • Servers are running haywire and no one knows it.
  • Patches are missing.
  • Security holes abound!
  • The network is barely serviceable.

Can anyone even use this environment? How did it get like this, anyway? I have a couple of theories…

  1. Evolutionary Chaos: This model was chaos from the start. Someone set up an environment for testing an app a long time ago. It did its job and was later repurposed. Then, it got repurposed again. And again. Eventually, it started to grow hair. Then an arm sprouted out of its back. Then it grew an extra leg. Suddenly, it began to “self-organize.” Now it seems to have a mind of its own. It grew out of chaos!
  2. Entropic Chaos: Entropy is always at play. It takes work to keep it from causing decay. In this theory, things were great in the past. But over time, service management became less and less of a priority for this environment. Entropy won the day, and the situation degraded into chaos.

However the environment got into its current chaotic state, the outcomes are the same. Issues are resolved slowly (if at all). Time is wasted digging up information or piecing it together. Data becomes lost, corrupted, and insecure. Owning chaos is a burden and a huge risk in many respects. We don’t want to end up here!

If you’ve made it this far and still have your lunch in tow, you’re past the worst of it. You can uncover your eyes, but be wary! Next, we’re going to look at a fully locked-down environment and how it can go wrong in other ways.

When You Have Too Much ITSM in Non-Prod

It’s better to have too much service management than not enough. But it’s still not ideal. For one thing, it’s wasteful. For another, it causes morale to suffer. Granted, it’s reasonable to default to production-level service management at first. But staying on that default is a symptom of a big problem: a communications breakdown. The root cause of having too much ITSM lies partly in human nature and partly in organizational legacy.

Here are my two theories on how organizations end up here:

  1. Single-Moded Process: Service delivery, operations, and all other departments focused on service management are hell-bent on making sure the customer is absolutely satisfied with their service. Going the extra mile to make sure the customer is happy is a good thing! Operations folks are trained on production-level service management, so their priority is to keep the trains running. With this in mind, operations management systems are set up for production environments. It’s easiest to use that same default everywhere. For better or worse, every environment is treated like a production environment!
  2. Fractured Organization: Organizations are sub-divided into functional groups. When these groups aren’t aligned to a shared purpose, they’ll align to their own purposes. They even end up competing with each other. They’ll center up on their own aims, tossing aside the needs of others.

How You Know When There’s a Problem

The fractured organization theory may explain what happened to a friend of mine recently. Let’s call him Fabian.

Fabian was the on-call engineer this past June. The overnight support team woke him up several nights in a row for irrelevant issues in the development environment. He brought this up to operations, who were responsible for managing the alert system. Unfortunately, the ops engineer wasn’t sympathetic to his concerns in the slightest. Instead, he put it on Fabian to specify what the alert system should do. That’s understandable, but Fabian had no information to that end, and the ops engineer wouldn’t share anything with him or collaborate on putting a plan together.

This story illustrates a misalignment between operations and development. Problems like this crop up all over the place. Usually, we can remedy or even avoid these situations by taking just a bit more time to understand the other side.

The four theories I’ve presented tell us about extremes. And yes, these extremes push the boundaries and aren’t likely to occur. Still, an organization sitting somewhere in the middle may not have the right service management in non-production. As we’ve seen with Fabian’s story, this is often an issue of misaligned goals.

So how do we get to just enough service management? Maybe the answers lie in what’s working so well for DevOps! Let’s see how.

Just Enough Service Management

IT teams have members with specialties suited to their functional area. Operations folks keep the wheels turning. QA makes sure the applications behave as promised. There are several other specialties—networking, security, and development are just a few examples. Ideally, all of these teams interact and work together toward a well-functioning IT department. But it doesn’t just happen. It takes some key ingredients.

Leadership

Working together effectively takes good leadership. Leadership happens at all levels in an organization. Remember, a leader is a person, not a role.

Shared Vision

It’s also critical to have a shared vision and shared goals. Creating a shared vision is part of being a leader. Here are a few points to remember about vision:

  • A shared vision creates alignment.
  • The vision should be exciting to everyone.
  • You have to do some selling to get everyone aligned with the vision.

Your vision for the test environment could be something like: “Our test environment will be a well-oiled machine.” Use metaphors like “Smooth Operators” or “Pit Crew” to convey the right modes of thinking.

Open Communications

Keep communications open and honest. Open, honest communications can be one of the most significant challenges you’ll face in implementing the right amount of service management. Many of us have a hard time being honest for fear of looking weak in the eyes of others. That fear is difficult to overcome, especially in an environment where we don’t feel safe and secure. Managers have the vital task of creating an environment where employees feel safe and able to communicate openly. Trust is essential to success.

One Last Look

Getting the wrong amount of service management in any environment is a problem. Too little opens up all kinds of risks. Too much ITSM results in wasted time and resources. In this post, I presented four theories for how an organization might end up with the wrong amount of service management in non-prod and discussed what changes you can make to correct that.

ITSM doesn’t happen in a bubble. It takes alignment between many stakeholders. There are three main things we can do to get alignment: wear your leader hat, share the vision, and converse honestly. You can accomplish any goal when you’re set up to win—even with something as challenging as achieving just enough service management.

Author: Phil Vuollet

This post was written by Phil Vuollet, who uses software to automate processes to improve efficiency and repeatability.

The 8 Dimensions of the EMMI (Environment Management Maturity Index)

If you’re interested in IT and test environments management, then you have probably heard of the Environment Management Maturity Index (EMMI), the de facto standard for measuring one’s test environment management capability.

If not, then let me summarize: the EMMI is a maturity index that provides you with a standard frame of reference to help you assess your strengths, weaknesses, opportunities, and threats.

It’s a powerful tool for assessing your environment and operational capability across your enterprise, and it helps you quickly identify opportunities to improve.

As shown in the diagram, the EMMI does this by scoring you on eight key performance areas (KPAs). Today, I’ve decided to dive deeper into each of those key performance areas so that you can make a well-informed assessment.

KPA 1: Environment Knowledge Management

First up is environment knowledge management. This refers to your ability to understand how your projects move through all your environments, including development, testing, staging, demoing, and production.

However, this is about more than just one software team. This is about understanding how your systems are connected in each environment across multiple software systems and business units. You will likely need a few models of both low-level relationships and higher-level connections of your systems to gain a strong understanding.

When you know how your software systems are connected as they move through environments, you can avoid many problems. You reduce the risk of disruption when a team needs to release to a new environment. For example, if your billing system is dependent upon your product catalog and the product team releases a new version to QA, you may suddenly see network timeouts when you call the service. That timeout is probably due to a performance bug. If you understand how these systems are connected in QA and if you know the process well, you’ll avoid hours or days of triaging, trying to figure out why your tests are intermittently failing.
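A knowledge map like this can start very small. Here's a hedged Python sketch of one way to record which systems depend on which in each environment and ask who is affected by a change; the service names and structure are illustrative, not a prescribed model:

```python
# Sketch: a minimal environment knowledge map. The service names and the
# per-environment structure are illustrative, not a prescribed model.
DEPENDENCIES = {
    "qa": {
        "billing": ["product-catalog", "auth"],
        "product-catalog": ["auth"],
        "auth": [],
    },
}

def impacted_by(env: str, system: str) -> set[str]:
    """Every system in `env` that directly or transitively depends on `system`."""
    deps = DEPENDENCIES[env]
    impacted: set[str] = set()
    changed = True
    while changed:
        changed = False
        for svc, needs in deps.items():
            if svc not in impacted and (system in needs or impacted & set(needs)):
                impacted.add(svc)
                changed = True
    return impacted

# Who should hear about a new catalog release in QA?
print(sorted(impacted_by("qa", "product-catalog")))  # → ['billing']
```

In the timeout scenario above, a lookup like this tells the billing team immediately that a new catalog release is the likely suspect.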

KPA 2: Environment Demand Awareness

Next up we have environment demand awareness. This is not about how much load is on your environments. It’s about why you have those environments. Ideally, you should know who’s using them and why. Some environments may have obvious uses, like development. However, other uses may be surprising.

Take QA, for example. I was once on an engagement where we developers thought it was our job to test out new features before we released to production. So we kept changing the setup to suit our needs. Eventually, a flock of business analysts came our way, yelling and waving their arms for us to stop. It turned out that many of our customers used QA to vet significant pieces of data before staging them into production, and we were deleting their hard work. Knowing who’s using your environments and why will prevent these kinds of things from happening.

When you know who’s placing demands on your environments, you can also plan better. You may know of a new group of users coming in the pipeline. Or perhaps your environment is taking a hit from many users at once. If you realize you have two different sets of users in that environment, you can split that environment. You can even tailor each environment depending on those users’ needs.

KPA 3: Environment Planning & Coordination

Once you know who’s connected to your environments as well as who’s making demands of them, you can plan for their needs and yours. It’s key to be able to consistently plan and roll out environmental changes to meet upcoming milestones across your enterprise.

Imagine if one of the product team members decided to load test their catalog system and generated five million fake products in their QA environment. This ripples forth to your QA, and none of the purchasing testers can actually do any work. This in turn clogs up their deployments and delays your ability to launch. We can avoid these types of problems with good planning and coordination.

It’s also important that your planning and coordination is consistent across teams. When you have a consistent process, all the teams will know when to share knowledge and when to synchronize efforts.

KPA 4: Environment IT Service Management

It’s not enough to deliver and manage your environments. Since you have users who demand these environments, we need to put on our customer service hat and support their ongoing use. We should diligently manage incidents, changes, and releases to ensure our users are getting what they need. If we neglect the ongoing support and operations of our users in these environments, the growing pile of incidents and user demand may threaten to overwhelm us.

When we spin up a new environment, we need to ensure the appropriate teams own it end to end. They need to have the necessary tooling and operational support to maintain this environment for its entire lifetime. This means well-understood communication on incident resolution and criticality. And it means well-understood processes to manage changing environmental needs.

KPA 5: Application Release Operations

Alright, this one gets a little tricky. It’s healthy to have consistent and repeatable processes across your enterprise for releasing applications. But it’s an easy risk to read this and interpret it as “standardize your deploys.” I want to be clear: application releases and deploys are not the same thing.

Your deploys are all about getting packaged source code to the right place. But application release is about exposing new functionality to customers. At the lowest maturity, this happens only during deployment. But with mature teams, we can use tooling and processes to separate the idea of deploying code from activating it for customers.

This means you want to ensure your software teams are equipped to continually deliver code to production and to do it in as automated a fashion as possible. Once your teams are doing this, we can shift our focus to how to activate—or release—this code to our customers. There are many tools to help you make this change. It’s this process that you want to standardize across your organization. That way, customers know what to expect, and they’ll understand how to check if new features have arrived.
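The most common mechanism for separating deploy from release is a feature flag. Here's a minimal Python sketch of the idea; the in-memory `FLAGS` dict stands in for whatever flag service or config store a real system would use:

```python
# Sketch: separating deployment from release with a feature flag.
# The in-memory FLAGS dict stands in for a real flag service or config store.
FLAGS = {"new-checkout": False}   # the code ships dark

def checkout(cart_total: float) -> str:
    """Route to the new flow only when the flag is on."""
    if FLAGS["new-checkout"]:
        return f"new flow: {cart_total:.2f}"
    return f"old flow: {cart_total:.2f}"

print(checkout(10))           # deployed but not released: old flow
FLAGS["new-checkout"] = True  # "release" with no redeploy
print(checkout(10))           # now customers see the new flow
```

The deploy put `checkout`'s new branch in production; flipping the flag is the release, and it needs no build or rollout at all.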

KPA 6: Data Release & Privacy Operations

Let’s talk about another key performance area: data release.

Data release across your environments is just as important as application release. But it’s often neglected. Each application team ideally owns its own data, but teams need to be explicit about how they manage that data across their environments.

Time for another story. I knew a team that was quickly delivering high-value financial software, but they depended on a few backend services. Some of these services had a data refresh that occurred once a quarter or so. However, they didn’t make this known to the team, so the team had set up their QA environment with a test bed of data to give them a speedy turnaround time on user stories. This data refresh hit them like a punch in the gut. It killed their velocity for weeks.

It’s healthy to avoid such problems in your enterprise. We want to ensure data release processes are well known and consistent across teams. It’s also a good idea to automate as much as possible to ensure this consistency stays intact, letting our teams work on more valuable efforts.

KPA 7: Infrastructure & Cloud Release Operations

In the same vein as data releases, infrastructure releases have an indirect but profound impact on your teams’ applications. How you handle your infrastructure has a ripple effect across multiple applications. If managed well, you can provide a cushion of protection for software systems to run and fail in isolation. If mismanaged, it can bring down a whole ecosystem of applications.

One would think I’d be out of stories by now, but I have another: I was on an engagement at a Fortune 10 company that, as far as I know, is still mismanaging its infrastructure releases. They built an in-house cloud platform from the ground up, but they didn’t consider their environment demand, nor did they create an automated and repeatable system. Instead, they created a system that requires every application team on it to move every few months. And every move brings with it different problems. They provide no tooling to automate this move. At one point, they lost a data center every week for three weeks straight. Not only was the platform unstable, but it also actively hampered application teams from delivering because they were too busy migrating their infrastructure.

There are many tools to help us manage this effectively. We can take advantage of external cloud platforms. We can practice infrastructure as code principles. Also, we can use configuration management tools to ensure our environments are consistent and we can always go back to a fresh state.
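At the heart of those configuration management tools is an idempotent "apply" loop: declare the state you want, converge toward it, and a second run changes nothing. Here's a hedged Python sketch of that loop, using an invented "packages" resource model purely for illustration:

```python
# Sketch: the idempotent apply loop at the heart of configuration management
# tools. The "packages" resource model is invented here for illustration.
def apply(current: dict, desired: dict) -> dict:
    """Converge `current` toward `desired`; a second run changes nothing."""
    new_state = dict(current)
    for pkg, version in desired.items():
        new_state[pkg] = version          # install or upgrade
    for pkg in list(new_state):
        if pkg not in desired:
            del new_state[pkg]            # remove drift
    return new_state

desired = {"nginx": "1.25", "postgres": "15"}
state = apply({"nginx": "1.18", "telnet": "0.17"}, desired)
assert apply(state, desired) == state     # idempotent: safe to run any time
print(state)  # → {'nginx': '1.25', 'postgres': '15'}
```

Because applying the same declaration twice is a no-op, teams can always converge an environment back to a fresh, known state.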

Think of your infrastructure releases as a bed frame, and you want your software teams to feel like they are lying on a comfortable mattress, not a bed of rocks.

KPA 8: Status Accounting & Reporting

Complex systems are quickly becoming table stakes in the world of IT. This complexity makes it harder and more valuable to stay on top of your system health and behavior. Yet the faster you can make decisions about your systems and react to problems, the more competitive you will be.

Throughout your teams, you want to ensure you have ways of understanding team health. That way, you can support troubled areas. You want to monitor system health so that you can triage and fix defects before your customers even know. And you want to get real-time data on your system behavior so that you can react faster than your competitors and get new features out quickly.

This is connected with the infrastructure release key performance area, as you want to equip your software teams with standard tooling to accomplish all of this. The more consistent your tooling, the more you can aggregate data and see behavior across multiple systems.
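To make the payoff of consistent tooling concrete, here's a small Python sketch of aggregating health signals across systems; the metric names and thresholds are illustrative, but the point is that one query finds trouble when every team reports the same shape of data:

```python
# Sketch: aggregating health signals across systems. With consistent tooling,
# every team reports the same metric names, so a single query finds trouble.
# (Metric names and thresholds here are illustrative.)
METRICS = {
    "billing": {"error_rate": 0.002, "p95_ms": 120},
    "catalog": {"error_rate": 0.031, "p95_ms": 480},
    "auth":    {"error_rate": 0.000, "p95_ms": 45},
}

def unhealthy(metrics: dict, max_error_rate: float = 0.01,
              max_p95_ms: int = 300) -> list[str]:
    """Systems breaching either threshold, ready for triage."""
    return sorted(
        name for name, m in metrics.items()
        if m["error_rate"] > max_error_rate or m["p95_ms"] > max_p95_ms
    )

print(unhealthy(METRICS))  # → ['catalog']
```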

Multi-Dimensional Success

Getting a handle on these key performance areas across your organization is a potentially tough but worthy endeavor. Mismanaging any of these will cause pain, but handling them well will create a cohesive, value-focused set of teams.

Ready to take the next step? If you’re feeling confident about your environments or you’re just curious, go ahead and calculate your environmental maturity. The results will give you insight into what area most needs your attention.

The Author: Mark Henke.

Mark has spent over 10 years architecting systems that talk to other systems, doing DevOps before it was cool, and matching software to its business function. Every developer is a leader of something on their team, and he wants to help them see that.

Test Environment Management and DevOps

Why DevOps Needs Test Environment Management

Testing is an essential component of software development. Modern software developers live by the mantra, "If it isn't tested, it isn't done." There's a lot of focus on unit and acceptance testing, but organizations often slip when it comes to practical system testing. That is, many projects fail to put all the parts together and check their interactions.

Sometimes the reason for this is simple: we don't have a good testing environment.

In this post, I will discuss why this is and detail some of the side effects of missing out on that good environment. I'm going to start by creating some context, then I'll talk about fundamental practices that can be applied to the general problem, and I'll close out with a discussion on testing environments.

Why Don't We Have Good Testing Environments?

Organizations face a number of competing factors when it comes to software development and deployment. There are the first and obvious issues of building the right thing and having a stable solution that users like. Within our organizations, we are always working to balance the cost of development and the cost of operations. We find that cost minimization is hard to achieve when we have these two goals.

Further, the creation and maintenance of each environment is a complicated and time-consuming activity. Coordinating multiple environments has a multiplicative effect on cost. Employees also suffer from fatigue and distraction. Being consistent and thorough becomes more and more difficult as complexity increases. Each of these factors leads to increased cost through waste and rework.

So that's the problem. How do we fix it?

Tradition vs. the New Way

Traditionally we establish one or more test environments. Often our test environment is a smaller version of production. It may not contain the same volume of data. There may not be the same amount of network traffic. The servers might not have the same number of cores or amount of RAM.

This is not the most effective way of testing the system as a whole—we all know that. But there are always reasons that we do it.

The new way of doing things is to create a production-equivalent environment on demand and run the tests. That is, we automate the infrastructure to the degree that we can create an entire environment with a single button click and execute our test suite.
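The shape of that workflow can be sketched in a few lines of Python. Here, `create_env`, `run_suite`, and `destroy_env` are stand-ins for your real infrastructure automation (Terraform, cloud APIs, and so on), not any particular tool's API:

```python
# Sketch of the "new way": create a whole environment, test it, tear it down.
# create_env, run_suite, and destroy_env are stand-ins for your real
# infrastructure automation (Terraform, cloud APIs, and so on).
def create_env(name: str) -> dict:
    return {"name": name, "services": ["api", "db"], "up": True}

def run_suite(env: dict) -> bool:
    return all(svc in env["services"] for svc in ("api", "db"))

def destroy_env(env: dict) -> None:
    env["up"] = False

def test_in_fresh_env(name: str = "pr-123") -> bool:
    env = create_env(name)
    try:
        return run_suite(env)
    finally:
        destroy_env(env)   # nothing lingers between runs

print(test_in_fresh_env())  # → True
```

The key design point is the `try/finally`: every test run gets a fresh environment, and nothing survives to contaminate the next run.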

We should be building our test environments with the new way in mind as our ideal. That said, there are still many issues that we need to deal with in order to achieve this goal. The following are the heavy hitters on our issue management list.

What Do We Need for Better Testing Environments?

First, it's essential that you carefully lay out what tools you'll need and how they will be used. Having a solid foundation to start with will be helpful later on, so don't skimp on the thinking here. Identify the capabilities you are looking for in setting up an environment. Then ensure that you have the tooling in place to support that. It's much easier to build things into the system in the first place.

Having said that, like with all things, plan only for what you know you need. Don't be overly speculative or overly ambitious. Focus on what you know to be true about the end-state and work to make that a reality.

There is an ever-growing list of tools available to help with every aspect of managing your environment. First off, cloud providers universally provide working APIs for every aspect of configuration, allocation, deployment, and provisioning. On top of those APIs, there are often whole SDKs and CLIs to make using them even easier. Beyond that, there are third-party tools that make the use of those SDKs almost transparent.

As we consider how to create a good test environment, there are a number of considerations that we need to keep in mind.

What Makes for a Good Testing Environment?

The problem you might encounter is that there are so many tools you can't keep track of them all. Further, not all of the tools and components may be entirely in your control. The difficulty here is balancing a lean solution against the vast array of available tools for managing that solution. Finding a tool that is light and easy to apply is a first-order knowledge problem: how do you make a decision about an ever-changing environment over which you have little control without arriving at a possibly irreconcilable conflict?

This conundrum can be resolved with a light touch. If you can create a lean solution that satisfies your platform requirements, you have a basis for discovering cost savings without sacrificing capabilities.

This is where Test Environment Management (TEM) comes into your plan. At least in part, TEM can help you wrangle all these components and manage their use and deployment.

For good TEM, you'll need the following components.

  1. Testability

Modern software is tested software. Over the last 20 years, we have changed the way we make software by adding a tremendous focus on testability. While the debate rages on about what the most effective means of testing is, one thing we can count on is: there will be tests.

Building a systems infrastructure that supports testability is absolutely necessary for a modern delivery pipeline. So when we think about the capabilities we will need, it is somewhat of a foregone conclusion that we will be able to test the infrastructure before deployment.

So our test environment itself must be testable. Validation of the environment itself is a critical feature of our solution.
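In practice, that validation often takes the form of smoke checks that run before any application tests do. Here's a hedged Python sketch; the specific checks and fields are illustrative examples, not a complete checklist:

```python
# Sketch: validating the environment itself with smoke checks before any
# application tests run (the checks and fields are illustrative).
def smoke_checks(env: dict) -> list[str]:
    """Return a list of reasons this environment is not fit to test against."""
    failures = []
    if not env.get("db_reachable"):
        failures.append("database unreachable")
    if env.get("pending_migrations", 0) > 0:
        failures.append("schema migrations not applied")
    if env.get("app_version") != env.get("expected_version"):
        failures.append("wrong application version deployed")
    return failures

env = {"db_reachable": True, "pending_migrations": 0,
       "app_version": "2.4.1", "expected_version": "2.4.1"}
print(smoke_checks(env))  # an empty list means the environment checks out
```

Gating the real test suite on an empty failure list means you never waste a test cycle debugging the environment instead of the application.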

  2. Configuration Management

If we're going to use automated releases, we have to have good configuration management. Because we will make all our environments essentially the same way, this should be a straightforward process of identifying the configuration and codifying that into our build process. When we have done this, all environments going forward will be consistent.

  3. Release Management

Just as we would with a production release, our test environment is going to need release management. We need to know what features and fixes are contained in a release so we understand what we should be testing. This requires us to integrate our change management system, source control tools, and release process.

  4. Networks

Network configuration is a concern we must also address. Each deployed environment needs to function with a minimum amount of customization. That is, just because the deployment is to a test environment doesn't mean we should have to reconfigure every service. Virtual networks, Kubernetes, Docker-Compose, and other tooling can minimize these customizations.

  5. Load and Volume Testing

One thing that can be difficult to emulate in our test environment is message volume. In order to test load and performance, we will need some means of creating a transaction volume similar to production. For many web applications, this isn't overly complicated, but for an IoT solution with hundreds of thousands of devices, this can be a daunting task. Careful consideration of these needs is required in planning a test environment. There will be a lot of heavy lifting.
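The skeleton of a load generator is simple enough to sketch in Python. Here `fake_request` is a stand-in for real traffic against your test environment; serious volume testing usually calls for a dedicated tool, but the shape is the same:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Sketch: a minimal load generator. fake_request is a stand-in for real
# traffic against your test environment; real volume testing usually calls
# for a dedicated tool, but the shape is the same.
def fake_request(i: int) -> float:
    start = time.perf_counter()
    time.sleep(0.001)                    # simulate service latency
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=20) as pool:
    latencies = list(pool.map(fake_request, range(200)))

p50 = sorted(latencies)[len(latencies) // 2]
print(f"{len(latencies)} requests, median latency {p50 * 1000:.1f} ms")
```

Even this toy version surfaces the two knobs that matter in planning: concurrency (`max_workers`) and total volume, both of which must scale far beyond this for something like an IoT fleet.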

  6. Incorporation of Databases

Similar to message volume, test data is often a challenge. When planning a testing environment, we need to accommodate not only the database configuration, but also the volume of data in order to ensure that we have a proper simulation of the real world.

One approach is to develop a data loader that simulates real data. This loader is executed between the environment creation and test execution steps. Of course, this can be a challenging task for large systems. An alternative is to make copies of production systems. There are several laws we need to be sure we observe when we make copies of systems related to financial and privacy regulations; data masking can be as challenging as simulated data loading.
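For the production-copy approach, masking can be sketched as a deterministic one-way transform over the sensitive fields. The field names below are illustrative, and any real masking scheme must be reviewed against the financial and privacy regulations that apply to you:

```python
import hashlib

# Sketch: masking a copy of production data before loading it into a test
# environment. Field names are illustrative, and real masking must be
# reviewed against the financial and privacy regulations that apply to you.
def mask(value: str) -> str:
    """Deterministic one-way mask so masked rows stay internally consistent."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def mask_customers(rows: list[dict]) -> list[dict]:
    return [
        {**row,
         "name": mask(row["name"]),
         "email": mask(row["email"]) + "@example.test"}
        for row in rows
    ]

prod_copy = [{"id": 1, "name": "Ada Lovelace", "email": "ada@example.com"}]
masked = mask_customers(prod_copy)
print(masked[0]["id"])  # ids and structure survive; identities do not
```

Determinism matters here: the same input always masks to the same output, so foreign keys and joins in the masked data set still line up.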

For greenfield development, getting ahead of these issues will save you a lot of pain and suffering. In the brownfield, developing a careful plan will help you immensely; organic growth in this area yields results, but often with the consequence of interrupted or delayed deployments as issues arise and data is backfilled into the process.

  7. Production's Security Settings

A final issue to be considered is security. In order to get a realistic test of our system, we need to include all of the security settings our production environment has. This includes establishing users with different roles, server certificates, network restrictions, and all of the other settings and configurations we have in production. Because we have automated the deployment process, this shouldn't be difficult to do, but it does increase the number of things we need to keep track of.
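One way to keep track is a parity check that diffs production's security settings against the test environment's. A hedged Python sketch, with example settings rather than an exhaustive checklist:

```python
# Sketch: checking that test carries production's security settings.
# The settings shown are examples, not an exhaustive checklist.
PROD = {"tls": True, "roles": {"admin", "auditor", "user"}, "ip_allowlist": True}
TEST = {"tls": True, "roles": {"admin", "user"}, "ip_allowlist": False}

def security_gaps(prod: dict, test: dict) -> list[str]:
    """Settings present in production but missing or weakened in test."""
    gaps = []
    if prod["tls"] and not test["tls"]:
        gaps.append("tls disabled in test")
    for role in sorted(prod["roles"] - test["roles"]):
        gaps.append(f"missing role: {role}")
    if prod["ip_allowlist"] and not test["ip_allowlist"]:
        gaps.append("ip allowlist not enforced in test")
    return gaps

print(security_gaps(PROD, TEST))  # lists what test is missing
```

Run as part of environment validation, a check like this catches the classic failure mode of tests passing against a more permissive environment than production.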

What Are the Risks of Not Having a Good Testing Environment?

I've described a complex system of testing environments and automation in very abstract terms. I'll add to those generalizations an important takeaway: if you don't have a consistent, reliable, and fast testing environment, you are at great risk for failure.

I don't mean your project will fail. I mean you are at risk that any particular deployment won't go well. If you don't manage your test environment well, it's easy to can get wrapped up in the test cycle with systems that won't deploy or tests that cannot execute. You might even release bugs because your test environment is tolerant of things that production won't allow.

It is essential that your organization puts effort into the creation, growth, and maintenance of an automated testing environment in order to maximize the effectiveness of your development efforts.

So, Why TEM?

All of the above are necessary components of a modern software delivery pipeline. As organizations move toward continuous deployment, the need for automation grows, and more tooling is necessary to enable that automation. Test environments specifically need additional management in order for things to run smoothly and cost-effectively.

If you want to get into more detail, there are a number of articles and posts elsewhere on the general topic of Test Environment Management (TEM). If you are looking to dig deeper into the topic, I can suggest this article that describes the Use Case for TEM and this one discussing the cost of an inefficient test environment.

Failure to create these test environments puts the organization at risk and can be very costly. In order to create test environments with any reasonable amount of consistency, you must manage them. Therefore, test environment management needs to be a required component of your delivery process.

Author: Rich Dammkoehler

This post was written by Rich Dammkoehler. Rich has been practicing software development for over 20 years. In the past decade, he has been a Swiss Army knife of all things agile and a master of agile fu. Always willing to try new things, he’s worked in the manufacturing, telecommunications, insurance, and banking industries. In his spare time, Rich enjoys spending time with his family in central Illinois and long-distance motorcycle riding.