Introduction to Automation Metrics

Another topic that's come up a lot over the past year or so is metrics. I'm sure by now you've heard of the Four Keys developed by Google. These metrics get a lot of time in the spotlight, but should everyone use them? Not everyone operates at Google's scale, and more often the question sparks a conversation about what's actually worth measuring in your own organization.

Now, there's already plenty of coverage on this topic, so I'm not going to reinvent the wheel. Instead, here's the advice I'd offer those starting an automation/DevOps journey:

Your metrics should track a value item or showcase an outcome that's meaningful to you, your team, and your organization. Common metric categories often relate to things like the percentage of devices under management, job success rates, and time saved through automation.

All of these are off the top of my head, without searching, as I write this; I'm sure a room full of people brainstorming would come up with many more. The important piece is to pick a few that, in aggregate, help tell the story you want to tell. It's also important not to lean on metrics as a crutch, as if they were the only way value can be understood and measured. These metrics are complementary to your story, not the whole story. And the story will likely evolve, mature, and change over time; that's a sign of healthy, iterative improvement.

With respect to Ansible, I wanted to show an example of gathering a first metric without a formal observability tool. Where best to pull your metrics from is another conversation altogether; suffice it to say there will likely be options and overlap, and that's okay too.

I'll start with the first one mentioned: the percentage of devices under management. As you begin an automation practice, there's often an initial onboarding phase where you connect to every device in your infrastructure, working through access patterns, networking, authentication, and so on. This can take some time to sort through, especially if other groups are responsible for some of those devices and need convincing before allowing them to be onboarded.

In this example, I show a basic method of performing a connectivity and authentication test against three kinds of devices:

  1. Linux Servers
  2. Secure Servers
  3. Windows Servers

The reason for separating these out is that you may authenticate to each type differently, or they may live in different infrastructure environments. You may also have thousands of them and want to structure or throttle the connections and the reporting in different ways. Here's what I did:
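To give a concrete feel for it, here's a minimal sketch of what the Linux variant could look like. The inventory group name (linux), the template and report file names, and the NFS path are all placeholder assumptions of mine, not necessarily what's in the repo; a Windows version would swap ansible.builtin.ping for ansible.windows.win_ping.

```yaml
---
# linux_check.yml (hypothetical name): test connectivity and authentication
# against an assumed inventory group called "linux".
- name: Test connectivity to Linux servers
  hosts: linux
  gather_facts: false
  ignore_unreachable: true            # keep going when a host is down
  tasks:
    - name: Attempt a full SSH login and module round-trip
      ansible.builtin.ping:
      register: ping_result

- name: Build the report
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Count reachable hosts against the whole group
      ansible.builtin.set_fact:
        total_hosts: "{{ groups['linux'] | length }}"
        reachable_hosts: >-
          {{ groups['linux']
             | map('extract', hostvars, ['ping_result', 'ping'])
             | select('defined') | list | length }}

    - name: Render the report to an assumed NFS mount
      ansible.builtin.template:
        src: report.j2                            # placeholder template name
        dest: /mnt/nfs/reports/linux_report.txt   # placeholder path
```

Note that ansible.builtin.ping is more than an ICMP ping: it logs in and executes a trivial module on the target, so a successful result exercises both connectivity and authentication.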

Each device type got its own playbook and Jinja template, and I write the resulting file to an NFS server. Here's how the report looks:

```text
Hosts Under Management Metric(s):

Reachable Linux Hosts: 7 / 9  (77.78%)
Reachable Secure Hosts: 2 / 2  (100.0%)
Reachable Windows Hosts: 2 / 2  (100.0%)
```
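For completeness, a report template along these lines would produce that kind of output. Again a sketch: the variable names (reachable_hosts, total_hosts) are my assumptions, and the Secure and Windows playbooks would render their own lines the same way.

```jinja
Hosts Under Management Metric(s):

Reachable Linux Hosts: {{ reachable_hosts }} / {{ total_hosts }}  ({{ ((reachable_hosts | int) / (total_hosts | int) * 100) | round(2) }}%)
```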

Now, this is pretty basic, but again, it's an example of how quickly you can populate a metric that helps reinforce the value of an automation practice. The result could easily be expanded by publishing it to a web service or a time series database, or to a README file in a git repo, potentially accompanied by an auto-updating badge.
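For example, assuming an InfluxDB 1.x instance were available (the URL and database name below are placeholders), a follow-up task could push the same numbers as a data point using the uri module:

```yaml
- name: Publish the metric to a time series database (hypothetical endpoint)
  ansible.builtin.uri:
    url: "http://influxdb.example.com:8086/write?db=automation_metrics"
    method: POST
    # InfluxDB 1.x line protocol: measurement,tag field=value,...
    body: "hosts_under_management,os=linux reachable={{ reachable_hosts }},total={{ total_hosts }}"
    status_code: 204    # InfluxDB returns 204 No Content on a successful write
  delegate_to: localhost
```

From there, a dashboard or badge can read the series without any changes to the playbooks themselves.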

Here's my GitHub repo with all the code examples:

https://github.com/aludwar/ansible/tree/master/metrics

Happy automating!