Top 5 DevOps Metrics

When people start talking about DevOps, the idea of metrics usually comes along for the ride. To be able to monitor software after release, we need to know what data is important to us. There are so many options, it may seem overwhelming to know where to look. However, we can limit our options based on two key factors: what decisions we’ll make and how customer-focused they are. With that in mind, I’ll share what I believe to be the five most important DevOps metrics.

Metrics Are for Decisions

The thing about metrics is that they’re useless on their own. People often say, “We need to track this data!” But you need to ask them only one question: what decisions will you make with that data? You may be surprised how often, usually after some mumbling, the answer is “I don’t know.” Any metric that doesn’t support a decision, or set of decisions, that we’ve identified ahead of time is simply noise. We want to eliminate that noise and focus on what guides our team’s decisions.

Customers First, Then Everything Follows

Knowing what decisions our metrics will support is a good start, but it’s not enough. There are millions of decisions we could make about what we’re seeing. We need a North Star, a guiding light, that will be the anchor from which we can derive a strong set of metrics. This anchor is our customers. For any metric we use, we should be able to point back to how it helps our customers. After all, we ultimately owe them our existence.

Top Five Metrics

Without further ado, I give you the top five DevOps metrics you probably should measure for your team:

  • Customer usage
  • Highest and average latency
  • Number of errors per time unit
  • Highest lead time
  • Mean time to recovery

Customer Usage

The first metric on our list is customer usage. This is any measurement that tells us how much our customers, internal or external, are using our features. When delivering new or enhanced features, it’s important to get to production as soon as possible. But we can’t assume customers want or will use a feature just because we put it in production. This is true even if they specifically ask for the feature. We can weigh how popular a feature actually is against how popular someone claimed it would be or what we estimated it would be.

It’s helpful for us to know how often customers use a feature—even one they requested—after we release it to production and inform them of its existence. Customers often think they need something “right away.” This can cause us to scramble, putting this feature on the top of our backlog. The feature might then sit, inert, for weeks or months because the customers reprioritized their desires.

Internal customers commonly are on a longer cadence, unable to use the feature until they get to it in their own backlog. Tracking customer usage allows us to say, “I know you said this is really urgent, but the last time you said that, it took you six weeks to start using it. Please be sure this is as urgent as you say it is.” We can also use this data to enhance the feature, watching usage go up or down, using hypothesis-driven development.

A good application performance monitoring (APM) tool can track this metric for you. It usually comes in the form of request counts or percentage of traffic.
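
If you don’t have an APM tool yet, a rough version of this metric is easy to approximate. The following is only a minimal sketch, assuming you can export a list of requests labeled by feature; the feature names and the `usage_share` helper are hypothetical and not part of any particular tool.

```python
from collections import Counter

# Hypothetical request log: one entry per request, labeled by feature.
requests = ["search", "export", "search", "checkout", "search", "export"]


def usage_share(feature_requests):
    """Return {feature: (request_count, percent_of_traffic)}."""
    counts = Counter(feature_requests)
    total = sum(counts.values())
    return {
        feature: (count, round(100 * count / total, 1))
        for feature, count in counts.items()
    }


usage = usage_share(requests)
for feature, (count, percent) in sorted(usage.items()):
    print(f"{feature}: {count} requests, {percent}% of traffic")
```

Comparing these numbers with what was estimated or promised for a feature is what turns the raw counts into a decision.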

Highest and Average Latency

Knowing how often customers use your features is a great start. But how do we know if customers are delighted or frustrated with our applications? This is a hard question to answer, but our next metric can hint to us that customers may be frustrated. One of the leading causes of frustration is an application’s slowness. When the response time—that is, the latency—is too high, customers are likely to go elsewhere for their needs.

We want to give our applications the best chance to make customers happy. They’ll appreciate it and likely stick around. If you have internal customers, it may be tempting to say, “They have to use my application, so I don’t need to worry about latency.” Putting aside the potential ethics issue of not caring whether your users have a pleasant experience, that mindset is folly. Even if your direct customers are internal, it’s likely that they or a downstream app are responding to external customers. So, slowness for them is still ultimately hurting your organization’s success. Even if this isn’t the case, enough complaints to the right people may get your applications scrapped.

Two major signals to look for when measuring latency are average latency and the slowest five percent or so of requests. Looking at the average gives you a nice bird’s-eye view of the application as a whole. But even one feature or subset of requests can be enough to create disgruntled customers. This is why it’s also important to keep an eye on your slowest requests.

We can decide where to tune performance with this information. An APM tool can handily monitor all of this for you, in addition to usage.
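
To see why the average alone can mislead, here is a small sketch that reports both signals. It assumes latencies are available as a plain list of milliseconds; the sample values and the `latency_summary` helper are made up for illustration.

```python
import math
from statistics import mean


def latency_summary(latencies_ms, slow_fraction=0.05):
    """Return the average latency and a summary of the slowest requests."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Always keep at least one request in the "slowest" group.
    slow_count = max(1, math.ceil(len(ordered) * slow_fraction))
    slowest = ordered[-slow_count:]
    return {
        "average_ms": mean(ordered),
        "slowest_threshold_ms": slowest[0],
        "slowest_average_ms": mean(slowest),
    }


# Hypothetical sample: 19 fast requests and one very slow one.
samples = [100] * 19 + [2000]
summary = latency_summary(samples)
print(summary)
```

In this sample the average looks healthy at under 200 ms, while the slowest five percent of requests take two full seconds. That gap is the signal to go looking for the feature or request type behind it.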

Number of Errors Per Time Unit

In the same vein of finding out whether our customers are happy, our next metric is the number of errors per time unit. The benefits of this should be pretty clear. Errors with high business impact not only cost your organization money, but they can erode customer trust. Looking at our error rates helps us nip these in the bud and find abnormalities that even our tests can’t catch.

Note that I said “errors with high business impact.” Not all errors are created equal. Your error metrics should differentiate between types of errors. Small glitches and errors are unlikely to erode customer trust or cost a lot of money. For example, if the screen is green instead of blue, that usually won’t be a problem for most people. Also, some errors are caused by users and should be expected. User errors are still good to track because they can provide information about how hard a feature is to use. Just be sure to keep them separate in your monitoring tool.

With this metric in hand, we can decide where to enhance our resiliency. If we can’t control the source of an error, we can decide to escalate that error to the appropriate team. For user errors, we can decide where to focus our efforts on increasing usability.

APM tools are also a great fit for this metric.
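
As a rough illustration of keeping error types separate, the sketch below counts errors per one-minute window and per category. The event format, the "system" and "user" categories, and the `errors_per_window` helper are assumptions for the example, not the output of a real monitoring tool.

```python
from collections import Counter

# Hypothetical error events: (seconds since start, category).
errors = [
    (5, "system"), (42, "user"), (61, "system"),
    (75, "system"), (118, "user"), (130, "system"),
]


def errors_per_window(events, window_seconds=60):
    """Count errors per (window index, category)."""
    counts = Counter()
    for timestamp, category in events:
        window = int(timestamp // window_seconds)
        counts[(window, category)] += 1
    return counts


per_minute = errors_per_window(errors)
for (window, category), count in sorted(per_minute.items()):
    print(f"minute {window}: {count} {category} error(s)")
```

Keeping the category in the key means a spike in user errors, which hints at a usability problem, never hides behind or inflates the system error rate.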

Highest Lead Time

Ideally, the work you deliver in your team is set up as a value stream, creating a flow of work from inception to customer usage. This lets us easily identify the individual steps it takes for a piece of software, usually a user story, to reach the customer’s hands. Think of it like an assembly line, but for software features. It’s helpful for us to look at the lead time that a user story takes to go through each step. This helps our customers by increasing the speed at which we get features into their hands.

If we adopt a Theory of Constraints approach, there’s always one highest lead time in our value stream. If we keep finding and reducing that highest lead time, we’ll be ever faster in our ability to deliver software. Say, for example, our value stream has a “coding” step and a “QA testing” step. We can record each step as part of a Kanban board and record which user stories are in “coding” versus “QA testing.” At the end of our iteration, we may see that cards sit in “QA testing” for three days on average, whereas cards sit in “coding” for only two days. “QA testing” is our highest lead time. We can then inspect why it takes so long to do QA testing and make improvements from there.

Lead time comprises two factors: process time and wait time. Process time is the time someone is actively doing something with the user story. Wait time is how long the user story sits idle, finished from the previous step and waiting to be picked up by the next step. Knowing both of these values separately will help the team know what actions they can take to improve the lead time. The decisions you may take on this are varied, but it’s good to have a system in place to frequently inspect and adapt to this metric. A sprint retrospective is a great example of such a system. And, as stated earlier, a Kanban board is a great way to track this metric.
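
If your board can export how long each card waited and how long it was actively worked on in each step, finding the highest lead time takes only a few lines. This is a sketch under that assumption; the record layout, the sample numbers, and the `lead_time_by_step` helper are hypothetical.

```python
from statistics import mean

# Hypothetical records: (story, step, wait_days, process_days).
records = [
    ("story-1", "coding", 0.5, 1.5),
    ("story-1", "QA testing", 2.0, 1.0),
    ("story-2", "coding", 1.0, 1.0),
    ("story-2", "QA testing", 2.5, 0.5),
]


def lead_time_by_step(rows):
    """Average wait, process, and lead time (wait + process) per step."""
    by_step = {}
    for _story, step, wait, process in rows:
        by_step.setdefault(step, []).append((wait, process))
    summary = {}
    for step, pairs in by_step.items():
        avg_wait = mean(w for w, _ in pairs)
        avg_process = mean(p for _, p in pairs)
        summary[step] = {
            "wait": avg_wait,
            "process": avg_process,
            "lead": avg_wait + avg_process,
        }
    return summary


steps = lead_time_by_step(records)
# The step with the highest average lead time is the current constraint.
bottleneck = max(steps, key=lambda s: steps[s]["lead"])
print(bottleneck, steps[bottleneck])
```

With these sample numbers, “QA testing” averages three days against two for “coding,” and most of that time is wait time rather than process time. That points toward fixing the handoff, not the testing itself.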

Mean Time to Recovery

The final metric, mean time to recovery, is somewhat of an extension of our error count metric. While it’s good to know how many errors we’re getting, it’s also important to know how fast we can resolve these errors. This goes back to business impact. Business impact is a function both of how often we receive an error and how long it takes to recover from that error. One error that lingers for minutes could have more impact than 20 errors that last only a few milliseconds.

Having both of these metrics will give us a good line of sight into the business impact of errors. Mean time to recovery is also a good indicator of how equipped your team is to handle operational issues, which is often an underinvested portion of a team’s tooling.

We can use this metric to decide where we want to improve our insight into our application, such as by adding more logging context. We can also use this metric to help us decide how to simplify our architecture or make our code more readable.

Many tools specialize in error tracking to make it easy to see how quickly the team resolves issues. Some APM tools also have error tracking features.
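
The calculation itself is simple: average the time between detecting an issue and resolving it. The sketch below assumes you have those two timestamps per incident; the incident data and the `mean_time_to_recovery` helper are invented for the example.

```python
from datetime import datetime, timedelta

# Hypothetical incidents: (detected_at, resolved_at).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),
    (datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 15, 30)),
    (datetime(2024, 1, 20, 22, 0), datetime(2024, 1, 20, 23, 0)),
]


def mean_time_to_recovery(events):
    """Average duration between detection and resolution."""
    if not events:
        raise ValueError("need at least one incident")
    total = sum(
        (resolved - detected for detected, resolved in events),
        timedelta(),
    )
    return total / len(events)


mttr = mean_time_to_recovery(incidents)
print(f"Mean time to recovery: {mttr}")
```

Tracking this number per iteration, rather than once, is what lets us tell whether added logging context or a simpler architecture actually helped.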

Strength in Measurement

The key to good measurement is to understand what decisions we’ll be making. These decisions will be most effective when we center our customers. Drawing from this, we can derive a set of strong metrics that ensure our team operates at its best. With these metrics, no challenges will stand in our way for long.

Author: Mark Henke

Mark has spent over 10 years architecting systems that talk to other systems, doing DevOps before it was cool, and matching software to its business function. Every developer is a leader of something on their team, and he wants to help them see that.

Common DevOps Myths and Misconceptions

“Wait, what actually is DevOps?”

If only I had a dime for every time someone asked me this. For many, the term DevOps comes loaded with misconceptions and myths. Today, we’re going to look at some of the common myths that surround the term so that you have a better understanding of what it is. Armed with this knowledge, you’ll understand why you need it and be able to explain it clearly. And you’ll be equipped to share its ideas with colleagues or your boss.

So, What Is DevOps?

Before we go through the myths of DevOps, we’ll need to define what DevOps actually is. Put simply, DevOps is the commitment to aligning both development and operations toward a common set of goals. Usually, for a DevOps organization, that goal is to have early and continuous software delivery.

The Three Ways of DevOps

DevOps is not a role. And DevOps is not a team. But why?

We’ll get to that in just a moment. But before we explain the myths, let’s build on our definition of DevOps by looking at “the three ways” of DevOps: flow, feedback, and continual learning.

  1. Flow—This is how long it takes (and how difficult it is) for you to get your work from code commit to deployed. Flow is your metaphorical factory assembly line for your code. And achieving flow usually means investment in automation and tooling. This often looks like lots of fast-running unit tests, a smattering of integration tests, and then finally some (but only a few!) journey tests. This test setup is what is known as the testing pyramid. Additionally, flow is usually facilitated by what’s known as a pipeline.
  2. Feedback—Good flow requires good feedback. To move things through our pipeline quickly, we need to know as early as possible if the work we’re doing will cause an issue. Maybe our code introduces a bug in a different part of the codebase. Or maybe the code causes a serious performance degradation. These things happen. But if they’re going to happen, we want to know about them as early as possible. Feedback is where concepts like “shift left” come from. “Shift left” is the idea that we want to move our testing to as early in the process as possible.
  3. Continual Learning—DevOps isn’t a destination. DevOps is the constant refinement of the process toward the early delivery of software. As we add more team members, productivity should go up, not down. Continual learning comes by having good production analytics in place. In practice, this could look like conducting post-mortems following an outage. Or it could look like performing process retrospectives at periodic intervals.

The three ways are abstract, that I’ll concede. But it’s the process of converting these abstract ideas into concepts and tools that has created confusion en masse throughout the industry.

So, without further ado, let’s do some myth busting!

Myth 1: DevOps Is a Role

As we covered in the introduction, DevOps is the commitment to collaboration across our development and operations. Based on this definition, it’s fundamentally impossible for DevOps to be a role. We can champion DevOps and we can even teach DevOps practices, but we can’t be DevOps.

Simply hiring people into a position called “DevOps” doesn’t strictly ensure we practice DevOps. Given the wrong organizational constraints, setup, and working practice, your newly hired “DevOps” person will quickly start to look like a traditional operations team member that has conflicting goals with development. A wolf in sheep’s clothing! DevOps is something you do, not something you are.

DevOps is not a role.

Myth 2: DevOps Is Tooling

For me, this is easily the most frustrating myth.

If you’ve ever opened up the AWS console, you know what it feels like to be overwhelmed by tooling. I’ve worked on cloud software for years, and I still find myself thinking, “Why are there 400 AWS services? What do all of these mean?” If tooling is often abhorrent for me, it’s definitely hard for non-technical people.

Why do I find this myth so frustrating? Well, not only is describing DevOps through tooling incorrect; it’s also the fastest way to put a non-technical stakeholder to sleep. And if we care at all about implementing DevOps ideas into our work, we desperately need to be able to communicate with these non-technical people on their terms and in their language. Defining DevOps by cryptic-sounding tooling creates barriers for our communication.

Tools are what we use to implement DevOps. We have infrastructure-as-code tools that help us spin up new virtual machines in the cloud, and we have testing tools to check the speed of our apps. The list goes on. Ever heard the phrase “all the gear and no idea”? Defining DevOps by tooling is to do precisely this. Owning lots of hammers doesn’t make you a DIY expert—fixing lots of things makes you a DIY expert! DevOps companies use tooling, but…

DevOps is not tooling.

Myth 3: DevOps Doesn’t Work in Regulated Industries

DevOps comes with a lot of scary, often implausible-sounding practices. When I tell people that I much prefer trunk-based development to branch models, they usually recoil in disgust. “You do what?” they exclaim, acting as if I just popped them square in the jaw. “Everyone pushes changes to master every day? Are you crazy?” they say.

No, I’m definitely not. The proof is in the pudding. When you have a solid testing and deployment pipeline that catches defects well, having every developer commit to the same branch every single day makes a lot of sense. Don’t believe me? Google does it with thousands of engineers.

Many believe that these more radical approaches don’t work in a regulated environment or in scaled environments, like finance. But the evidence is abundantly clear. Applications that are built with agility in mind (meaning it’s easy and fast to make changes) are less risky than their infrequently delivered counterparts.

Yes, it might feel safer to have security checkpoints and to have someone rifle through 100,000 lines of code written over six months. But security checkpoints are little more than theater. They make us feel safe without really making things that much safer. What does reduce security risk is automating your testing process, making small changes, putting them in production frequently, and applying liberal monitoring and observability.

DevOps works in every environment.

Myth 4: DevOps Replaces Ops

Implementing DevOps doesn’t mean you need to go fire your system admins and operations staff. In fact, on the contrary, you need their knowledge. Knowing absolutely everything about development and operations is almost impossible. So, you’ll need people who have different specialties and interests.

Rather than fire our operations teams, we need to make sure their goals are aligned with the development teams’ goals. Everyone should be driving toward faster delivery of high-quality software. A good waiter has tasted the food on the menu, but not all waiters need to be chefs.

DevOps doesn’t mean removing Ops.

Wrapping Things Up

So, there you have it. The top four myths about DevOps: busted. Hopefully, this clears things up a little and you now know what DevOps is and isn’t. It’s a set of beliefs and practices first, with tooling, roles, and teams being secondary.

Every company can and should incorporate ideas of DevOps into their business. It will lead to happier engineers and happier customers.

This post was written by Lou Bichard. Lou is a JavaScript full stack engineer with a passion for culture, approach, and delivery. He believes the best products emerge from high performing teams and practices. Lou is a fan and advocate of old-school lean and systems thinking, XP, continuous delivery, and DevOps.

The Cat and the Map

Why Map Your IT Environments?

“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
“so long as I get somewhere,” Alice added as an explanation.
“Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”
Lewis Carroll, Alice’s Adventures in Wonderland

Preamble

Running a high-functioning IT team or tech company requires you to be clear in your mind where you want to take your team. If you’re not clear about that, then just like Alice in the quote above, it doesn’t matter which way you go—or, in the context of the increasingly complex tech ecosystem, it doesn’t matter which methodology or tools you adopt. Then you end up implementing this technology or that methodology halfheartedly, which leads to you switching to new technology and methodology, and the cycle repeats. This leads to a form of techno-methodology whiplash for your team. Is that what you want for your team? I hope not.

Know Your Destination, Know Your Landscape

What the Cheshire Cat didn’t point out is that for most of us dealing with complex situations, knowing the destination isn’t enough. We need to know the landscape to plot our way to success. In this article, I will cover the top four reasons why you need to properly map your IT and test environment to help your team perform at a high level.

One View to See It All

When you map your IT and test environment, you essentially establish the landscape of the situation. A good map lets you bring together various priorities and interests of your team and organization in a single view. The benefits of doing so can’t be overstated. Miller’s law states that the average human mind can hold only about seven things at any one time. Without a map to oversee the entire landscape, how could you possibly navigate your team around risks of deployment, development, and the day-to-day running of the IT and test environments?

In addition, you can build a map that contains multiple levels. Imagine that at the organization overview you map out the various key structures, such as business, ops, IT environment, and test environment. Then you can drill in further by adding in the substructures, such as system instances, applications, data, and infrastructure. All these structures and substructures will interact among themselves, which is why you need to add in the relationships among these structures, the projects, and the teams in your organization.

Now imagine you have this map right now. Wouldn’t that make it a lot easier to think about your decisions and weigh your options? You can almost literally trace how a possible solution would impact which system and which team—so before you even encounter objections, you can anticipate them. That’s the power of a single view of your landscape captured in a map.
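
To make the idea of tracing impact concrete, here is a minimal sketch that treats the map as a graph and walks everything downstream of a change. The system and team names, the `dependents` structure, and the `impacted_by` helper are all hypothetical; a real map would come from your own inventory.

```python
from collections import deque

# Hypothetical map: each key is depended on by the nodes in its list.
dependents = {
    "billing-db": ["billing-api"],
    "billing-api": ["web-portal", "finance-team"],
    "web-portal": ["support-team"],
    "test-env-1": ["qa-team"],
}


def impacted_by(node, edges):
    """Breadth-first walk of everything downstream of `node`."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in edges.get(current, []):
            if neighbor != node and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


impact = impacted_by("billing-db", dependents)
print(sorted(impact))
```

In this made-up map, a change to the billing database reaches two teams that never touch it directly. Those are exactly the objections worth anticipating before the change goes out.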

Spotting Existing Gaps and New Opportunities

When you have a map, it almost immediately shows you some low-hanging fruit to pick. Gaps and opportunities to improve your existing operations show themselves easily, and they can give you and your organization some quick wins.

Some typical quick wins would be:

  1. Identify waste and save costs. For example, you may identify system instances being maintained but not used.
  2. Identify underutilized resources and consolidate them. This happens quite frequently as well. For example, you have a bunch of system instances that constantly have low utilization. You can decide to consolidate them to bring about a better return on your expenditure on these resources.
  3. Identify undersized systems or applications and reallocate buffer resources. Once you reduce waste and consolidate underutilized resources, you can deploy some of the freed-up resources to the undersized systems. Typically, people complain that these undersized systems are constantly stretched and that no resources can be spared because of budget. In other words, you can reallocate your resources better simply by having this map.
  4. Identify the high-growth areas and enable them to grow faster. With a map, you can view how certain systems or applications are growing quickly because they are driven by fast-growing demand. When you can link these high-growth areas with how they help the organization, you will be able to convince management that adding more budget makes business sense. Or you can redeploy resources from other structures facing slowing growth. In either case, a map bolsters the strength of your decision.
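
The first three quick wins above can be roughed out as a simple classification once utilization figures are attached to the map. This is only a sketch: the instance names, the utilization values, and the 20 percent and 85 percent thresholds are assumptions you would tune for your own environment.

```python
# Hypothetical inventory: instance name -> average utilization (0.0 to 1.0).
utilization = {"app-1": 0.0, "app-2": 0.12, "app-3": 0.55, "app-4": 0.93}


def classify(avg_utilization, low=0.2, high=0.85):
    """Label an instance using assumed utilization thresholds."""
    if avg_utilization == 0:
        return "unused"          # candidate waste: maintained but not used
    if avg_utilization < low:
        return "underutilized"   # candidate for consolidation
    if avg_utilization > high:
        return "undersized"      # candidate to receive freed-up resources
    return "ok"


findings = {name: classify(value) for name, value in utilization.items()}
for name, label in sorted(findings.items()):
    print(f"{name}: {label}")
```

Even a crude pass like this gives the team a shortlist to discuss, which is usually more productive than debating budget in the abstract.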

Streamline and Simplify Processes

Everyone has a story about dealing with silly, ridiculous bureaucratic processes. However, as a civilization, progress means more processes are needed for things to run smoothly. Running your IT and test environment successfully means having good processes in place. Think Value Stream Mapping. The key is to discover when these processes become less effective or even outright unnecessary, and then to retire or remodel them. Nip increasingly ineffective processes in the bud.

So, study the stats from your troubleshooting and logs and add those to your map. Talk to your various teams from business and customer support. Add anecdotes in as well. In a single view, you would be able to allow both data and personal stories to drive your decision on how to simplify running your IT and test environments. Streamlining and pruning away processes that used to be (but are no longer) necessary would release more resources back to your budget. This kick-starts a virtuous cycle as freed-up resources can then be redeployed for growing opportunities.

Better Impact Analysis and Scenario Planning

Once you take advantage of the single view to quickly exploit new opportunities, uncover waste, improve utilization of resources via reallocation, and streamline processes, you have established the credibility of mapping. Imagine earning all that success without even using the methodological or technological fad of the day.

Now it’s time for the exciting stuff—planning the future. Once again, the mapping will help greatly. You can plan several scenarios and strategies in a playbook and then check them against the map. The check would involve some kind of impact analysis. The scenario planning exercise is widely used by some of the top-performing organizations in the world. Having a map of your IT and test environment improves the effectiveness and efficiency of the exercise. No more guessing about potential impact of brainstormed strategies for future scenarios; you can immediately check and verify obvious drawbacks and benefits. Scenario planning is better because impact analysis becomes better with a map of your environments.

Conclusion

In Enterprise IT intelligence, “environment mapping” represents a highly beneficial and foundational exercise all IT teams and tech companies should perform at least once every quarter or so. It provides high visibility to the many interrelated structures and their relations in your organization. It is not easy to discern these structures and their relations without the map. The increase in visibility delivers great benefits. Agility, smooth delivery, greater collaboration, and good operational and business decision-making all flow from the greater visibility of the landscape surrounding your team and organization. Buy-in becomes simpler when everybody can be on the same page—and when everybody is looking at the same map as well.

Mapping your environments is key to your organization’s success. Bear in mind that maps are imperfect, but they are still very useful. Mapping helps you and your team become better at your jobs simply because you did the exercise of mapping. The exercise surfaces the differences in the thinking between the members in your team. Therefore, don’t wait until you come up with the perfect map. Your team automatically becomes better with more practice mapping. Your team and your organization will thank you for that when they start to see the uptick in results.

Author: TJ Simmons

This post was written by TJ Simmons. Kim Sia writes under the nom de plume T.J. Simmons. He started his own developer firm five years ago, building solutions for professionals in telecoms and the finance industry who were overwhelmed by too many Excel spreadsheets. He’s now proficient with the automation of document generation and data extraction from varied sources.