Risk, Measured: 7 Characteristics of Good Metrics

This is a podcast episode titled, Risk, Measured: 7 Characteristics of Good Metrics. The summary for this episode is: Continuing our miniseries into Risk, Measured: we go back to statistics class and discuss some of the characteristics of good metrics to help people understand what you should be looking for when you want to meaningfully quantify cybersecurity phenomena, program performance, or anything really.
Lies, Damn Lies, and Statistics
01:23 MIN
Understanding Good Data Has Never Been So Important
02:05 MIN
Meandering Discussion About Good and Bad Metrics
07:49 MIN
Statistics Is Philosophy If You're Doing It Right
00:12 MIN
Characteristic #1: Bounded
05:32 MIN
Characteristic #2: Scales Metrically
05:49 MIN
Characteristic #3: Objective
04:25 MIN
Characteristics #4 & 5: Valid & Reliable
05:58 MIN
Characteristic #6: Context Specific (resistant to gamification)
06:21 MIN
Characteristic #7: Computed Automatically
02:38 MIN
Closing Thoughts & Visit Kennasecurity.com/blog To Get (ISC)2 CPE Credit
01:02 MIN

Dan Mellinger: Today on Security Science, the characteristics of good metrics. Thank you for joining us. I'm Dan Mellinger, and today we continue our miniseries, Risk, Measured, where we break down the concepts of quantifying cybersecurity. In this podcast, we cover a lot of pretty complex, or at least very unintuitive, concepts. When it comes to measuring and tracking cybersecurity, and risk in particular, we've seen a ton of cybersecurity research reports lately, and it can sometimes be difficult to understand, let alone employ, metrics that are good, and to know which could be inaccurate or even misleading. I love this quote popularized by Mark Twain: there are three kinds of lies, lies, damn lies, and statistics. Today we want to go back to statistics class a bit and discuss some of the characteristics of good metrics to help people understand what you should be looking for when you want to meaningfully quantify cybersecurity phenomena, program performance, or really anything. Myself, I'm far from qualified to teach anyone anything related to math, so my guest today is our head of measurement magic, our Chief Data Scientist, Michael Roytman. How's it going, Michael?

Michael Roytman: Pretty good. Pretty good. This one is really exciting. This was work I did maybe almost 10 years ago now, and it's useful for anything, like metrics for measuring your Apple Health goodness. How good are you at meeting your health and fitness goals? You also need those metrics to be statistically valid and useful. People rarely think about this. This all came about when I was looking at security metrics that existed before we started Kenna. As we were kicking off reporting in the platform and figuring out how to build it out, all these metrics kept popping up, like mean time to remediate, days to patch. I was looking at those being like, where did these come from? I started to realize that nobody had thought about whether they're good or not. They were just what was available at the time.

Dan Mellinger: Or what you think you want to know about something. Right?

Michael Roytman: Yeah. Sometimes that leads you down pretty bad rabbit holes, as I think we'll get to today.

Dan Mellinger: I think America discovered that over these last four years, honestly, which, I mean, is interesting. At first, when we were talking, or thinking about doing this concept for a podcast, right? I was talking to you about this earlier. I was like, okay, we're going to help people understand some of these characteristics and what they mean, right? I thought it was going to be a kind of boring but necessary primer on this underlying, foundational thinking you should have when you're looking at numbers.

Michael Roytman: You thought it was going to be a statistics class.

Dan Mellinger: I hate math, right? I'm really bad at math. I went into comms for a reason. But the more I've been laying out the show notes and doing some of my research on this, I'm like, this is probably going to be a good episode, honestly, and I think it's kind of more necessary now than it has been. We have more data points now. Companies are using data to market like crazy. Politicians are using "data points" to sell policy and create emotion and do all this kind of stuff. To go back to Mark Twain's quote, which, by the way, I know he didn't originate. I tried to look it up; no one knows for sure who originally wrote it.

Michael Roytman: Mark Twain's like the guy you attribute a quote to when you don't know who said it, and it's like a little witty, a little funny, and has some meaning.

Dan Mellinger: Exactly. He was really good at finding other people's quotes, apparently, as well. Anyway, I did want to call that out. I know that it is not necessarily his quote per se, but I'm actually pretty excited about this. I think it's a really good time in history for people to understand what makes good data, how you test for that, and how you spot that something may not be so good. Just because it's a stat doesn't mean it's right. There are a lot of ways to interpret this stuff. Anyway, Michael, before we begin, I know I kind of stole your thunder with some of the reasons why this is important, but I wanted to get your feedback as a data scientist who looks at the numbers day in, day out. Why is it so important to understand some of these underlying characteristics of good metrics?

Michael Roytman: Absolutely. So, why are we measuring stuff at all to begin with? Let's ask that question, and what do we want to measure as a security industry? I would say the answers are actually very simple, they just aren't very easy to implement. We want to see if we're more or less secure. I think naively, before, we used to think, are we secure or not? Now everybody knows that's not a reasonable thing to measure, which is your first clue as to bad metrics. What we want to measure is our progress along that continuum of good and bad security in a way that makes sense, is quantifiable, and is qualitative enough for us to make financial decisions on. Because now we have security budgets, we have CISOs and CIOs, and we have businesses that we ultimately support. In a naive world, if you could have anything you wanted, if you had a genie, the first thing I would wish for is a metric that measures how secure I am, and then another one that tells me how I'm trending. From a very basic perspective, that's your position and your velocity, right?

Dan Mellinger: Yep.

Michael Roytman: Your basic physics metrics. Position and velocity are awesome metrics because they very accurately describe something that matters. But position by itself might not be that useful of a metric because you don't know what you're relative to, and that's where the whole how-secure-are-you concept starts to enter. We'll walk through all of these. We have seven of these things that a good metric should include. Very intuitively, we know that some metrics are bad. Very intuitively, we know that some metrics just don't apply. Like, how hydrated are you? That's probably a pretty good metric, but how do you measure that? Percent hydration? No. You just say, I had some water today, or I had enough water. It's a yes or no. There's no number of hydration or probability of being hydrated at any given moment. Those things don't make sense. Even when you look at your Apple Watch. I see, what do you have there? Fitbit and Android thing?

Dan Mellinger: Oh no, this is... Oh, what's it called? Garmin.

Michael Roytman: Perfect. So, it's measuring your steps per day, your miles per day, right?

Dan Mellinger: Yep. crosstalk good stuff.

Michael Roytman: When you look at that Apple Watch, like those three rings they've come up with, they really tell you just to close the rings. Like, did you walk enough today or not? When I think about something like, I don't know, how many vulnerabilities have exploits on them across my whole environment, that metric might not be very useful, because if I tell you it's 20,000 or 40,000, you don't know the difference, or how much more or less secure you are. You probably know that one is worse than the other, and that's about it.

Dan Mellinger: Or how they're distributed. Right?

Michael Roytman: Right. But if I do it and just say, what percentage of your... Or here's a better one. I know that none of the vulnerabilities on this application have remote code execution Metasploit modules. That's a yes or no, and the presence of it tells you something very concrete about that environment. The absence of it tells you something. Anything in between, I don't know if it's more or less meaningful. I don't know if having one is 10 times better than having 10, right?

Dan Mellinger: Yeah.

Michael Roytman: So, maybe the metric of the number isn't that useful to us.

Dan Mellinger: That's interesting. We just recorded an episode with Ed, where he's talking about how context around these scores, these measurements really is what matters in a lot of cases. Sometimes the number is literally just a number. It doesn't mean anything without the proper context around it.

Michael Roytman: Yep. I mean, if you told me something about like electrolytes I got from Gatorade, because I drank that many Gatorades, and you ask me how hydrated I am. I would be like, well, let me go Wikipedia what electrolytes means. I don't know.

Dan Mellinger: Yeah. Well, I mean, and again, talking about context, if you had a very salty breakfast, maybe you've started with more electrolytes than you needed in your system to begin with. Or if we're talking about hydration. Do I need to drink this next cup of water if I just drank three gallons over the course of the last four hours? Probably not. That context matters.

Michael Roytman: That is actually awesome. I was thinking about why we call them metrics and not like, I don't know... There's so many different words for this, business intelligence, analytics, but metrics are...

Dan Mellinger: KPIs.

Michael Roytman: Yeah, they're calculated things, but what are they actually? They're not statistics. They are a combination of information that we want to monitor over time. You just mentioned this in two different ways. One is, what happened this morning? The other is, what has been happening throughout the day? That's context that's relevant to the metric. A metric by itself might not be very useful, but a metric over time might be incredibly useful. A GPS signal with a timestamp is data. It's a metric. Your GPS signal over the past three years tells me everything I need to know about you.

Dan Mellinger: Interesting.

Michael Roytman: What you measure over time also matters a lot. I remember, I think yesterday, wow, it feels like it was weeks ago, I was looking at a ServiceNow vulnerability response module with a bunch of people at Kenna, and we were looking at some of the metrics that come out of the box there, and one was number of vulnerabilities per week. What does that mean to you?

Dan Mellinger: Discovered like inherent, remediated.

Michael Roytman: Instantly, you're like, give me more context, but let's say it's discovered.

Dan Mellinger: Okay.

Michael Roytman: And it's just a graph of like a thousand here, 10,000 here, a thousand here, 20,000 here.

Dan Mellinger: Are they scanning one week or not in another? What are the sources? What are people doing? Did they just onboard a whole bunch of iPads or something?

Michael Roytman: I have a lot of questions. There's like 14 different ways you could affect that metric, which is why it's a bad measurement of the thing we're trying to measure, which is allegedly security.

Dan Mellinger: Unbounded.

Michael Roytman: Not bounded makes it... Oh, let's talk about the first one. But before we do that, number of vulnerabilities per week is a terrible metric because time is not a very useful way to measure the number of vulnerabilities. It is a description of a state, and you're telling me, I'm going to pick one description of the state of a really complex system, an enterprise, and measure it over time. You probably actually want to measure like 14 different things over time to get a good picture of the system. Time is a very useful metric for measuring things like probability or risk because inherently those things are time- based. The probability of something grows over the exposure period, it has time baked into the very definition of the thing, but open vulnerabilities or closed vulnerabilities have nothing to do with time. It's like that thing is blue. What you're really looking at is a graph that says it was blue, it was green, it was blue, it was green. That's not information that you can make decisions off of.
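Michael's point that probability has time baked into its definition can be sketched with a toy calculation. Everything here is illustrative: the 1%-per-day exploitation probability is a made-up number, not anything cited in the episode.

```python
# Probability of at least one exploitation event over an exposure window.
# Assumes a constant, independent per-day exploitation probability p
# (hypothetical); cumulative risk is 1 - (1 - p)**days, so time is part
# of the metric's very definition, unlike a raw vulnerability count.

def cumulative_exploit_probability(p_daily: float, days: int) -> float:
    """Chance of at least one event in `days` independent daily trials."""
    return 1 - (1 - p_daily) ** days

if __name__ == "__main__":
    p = 0.01  # hypothetical 1% chance of exploitation on any given day
    for days in (1, 30, 90, 365):
        print(f"{days:>3} days exposed -> {cumulative_exploit_probability(p, days):.2%}")
```

The same vulnerability count means very different risk at day 1 versus day 365, which is why exposure-style metrics trend meaningfully over time while raw state counts do not.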

Dan Mellinger: Yeah. It's functionally irrelevant to the end goal, right?

Michael Roytman: Yup. So, we meandered a lot, but before we kick into the characteristics of crosstalk.

Dan Mellinger: I love how we got into like a philosophical discussion on why it's important to understand good metrics.

Michael Roytman: Statistics is philosophy if you do it right. Or at the very least, your philosophy influences which statistics you look at.

Dan Mellinger: I'm stealing that as well. Statistics is philosophy if you're doing it right.

Michael Roytman: Well, if you're thinking about it deeply enough, you get back to first principles eventually. We want to measure security. We want to measure how secure we are, we want to measure the state of a system, and we want to measure the derivative, how much that system is changing: our position and our velocity. We don't know what metrics are good at describing that, but so far, what we do know is that time is useful when it comes to some metrics. For example, your mean time to remediate. If that's changing over time, that's telling you something about how quickly your teams are responding to vulnerabilities. Of course, there are other confounding things there, like what types of vulnerabilities you're responding to. But time is not a useful metric for other things. Time is not a particularly useful metric for the color of car that I own, because I owned a black one a couple of years ago, and now I own a silver one, and the color of my car over time tells you absolutely nothing. It would probably just confuse us. Similarly, the number of systems might change, the organization might change, the vulnerabilities released might change.

Dan Mellinger: Time, since you remediated a vulnerability, doesn't matter.

Michael Roytman: I think so. I think we actually overvalue time-based metrics in security in general because a lot of them don't apply. They do apply to probability and risk. This is part of the Risk, Measured series after all. Let's talk about some good and bad metrics.

Dan Mellinger: Cool. All right. Number one, we've got seven characteristics that you've identified. The first one is a bounded metric. So, bounded means it has limits. Let's break that down, Michael.

Michael Roytman: Some things naturally have limits. The percentage of oil in my engine has a natural limit. It goes from the right amount, 100% capacity, down to 0%. As that changes over time, that tells you something about the state of the system. Some things naturally don't have limits. The number of tickets sold to a concert might not have a natural limit, it could be the whole population, but some things grow unbounded over time. Sometimes you want to impose a bound on it. This is true in every discipline. I didn't come up with these; I found them when reading about what makes metrics good across disciplines. The classic security example is days to patch. Your days to patch can be infinity. It can grow over time. If you've never patched a vulnerability, it's going to keep growing. You might want to make that number lower because you think it's indicative of how secure you are, but it's an unbounded metric. If you lowered your average days to patch from 100 to 99, did you fix one vulnerability really quickly and forget the rest? Did you actually make anything more secure, or did you happen to do one anomalous event? Are you seeing the relationship to the flaw of averages crosstalk.

Dan Mellinger: Yep.

Michael Roytman: If your distribution was completely uniform and you had one thing you were monitoring, maybe for the one car that I drive, my time between oil changes is a good metric, and it might be bounded or unbounded. But when you're talking about a statistical population, an unbounded metric is very susceptible to flaws in the distribution. So, it's not a particularly good metric.
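The flaw-of-averages problem with an unbounded days-to-patch metric is easy to demonstrate with invented numbers: one anomalous data point moves the mean while the underlying risk barely changes.

```python
from statistics import mean, median

# Hypothetical days-to-patch for ten vulnerabilities: nine steady performers
# and one straggler that has never been patched. The straggler drags the
# unbounded average far from the typical case.
days_to_patch = [100] * 9 + [1000]
print(mean(days_to_patch))    # 190.0 -- the mean
print(median(days_to_patch))  # 100   -- the typical vulnerability

# "Fix one vulnerability really quickly and forget the rest": the average
# improves noticeably even though nine of eleven items are untouched.
quick_win = days_to_patch + [1]
print(mean(quick_win))        # lower average, essentially unchanged risk
```

The mean looks like progress after one anomalous quick fix, which is exactly why an unbounded metric over a skewed population is a poor guide for decisions.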

Dan Mellinger: Especially when it comes to making decisions based off of that. Right?

Michael Roytman: Yeah, exactly. What's the classic example? If you've got, I don't know, three people doing remediation, your time to patch is 90 days. If you hired three more people with the same amount of systems, would it now be 50, or would it stay at 90?

Dan Mellinger: There's a lot of mitigating factors that would actually impact that. Yeah.

Michael Roytman: Well, there's a lot that's obscured in the metric, and so the metric by itself is not particularly useful to our decision-making. Now, if you knew that it took some amount of money to reduce the time to patch a vulnerability by some percentage, an extra day costs us this much, or is this much effort, that's very different. Just measure the effort it takes to patch a vulnerability instead of the time, because that time could be... You could be on PTO, who knows.

Dan Mellinger: Interesting. Bounded metrics inherently kind of exist on a scale, so it allows you to define, and I'm reading some of your background work on this, the state of no people hired or many people hired, or the state of patching well or patching poorly. When they're bounded, you've set the limits one way or the other, and now you have that gradient, I guess, in the middle of, are we more or less secure.

Michael Roytman: It assumes that you've looked at the statistical population of the thing, all of the times to remediate vulnerabilities. Maybe you read a Cyentia Prioritization to Prediction report. And you've assigned meaning to certain values of the metric, and you have essentially said, if we're at 90 days or faster for every vulnerability, that's a state we want to be at. How close are we to getting to that state? That is a valuable metric, but it's bounded by that decision you already made by studying the population and assigning some value to the particular thing. But if it can grow infinitely, you don't know how much progress you're making, and you don't know whether that progress is actually linear or not. A really good example of this is, back in the early 2000s, I was looking at scanner software, and I was looking at some score that... It has changed 50 times since then, in one of the scanning vendors. It was something like, an asset might have a hundred vulnerabilities, might have 200 vulnerabilities on it. They were assigning a score to each vulnerability, and they were adding them up for every asset. I was looking at it. They don't do that anymore, so this is a relic of the security industry, but I think it was something like, some assets would be scored 1.1 million points of stuff, and others would be scored 90, and others would be scored a thousand. It all depends on how many vulnerabilities were on it and how risky those vulnerabilities were. This is a classic mistake: they were summing up probability or risk that really is multiplicative or has some way of interacting with itself. Maybe it's just the riskiest one that matters. Maybe it's some permutation of it, but it created an unbounded metric. If you looked at a spreadsheet and it had 10 assets in it, and the scores ranged from nine to 90, to 1.1 million, to 400,000, you've got to make some decisions about them.
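The scoring mistake Michael describes, adding up per-vulnerability risk into an unbounded asset score, can be contrasted with a bounded probabilistic combination. The risk values below are hypothetical, and treating per-vulnerability risks as independent probabilities is a simplifying assumption.

```python
# Two ways to roll per-vulnerability risk up to an asset score.

def summed_score(risks):
    """The relic approach: a sum that grows without bound."""
    return sum(risks)

def bounded_score(risks):
    """Treat each risk as an independent probability of exploitation;
    the asset score is the chance at least one vuln is exploited,
    which stays bounded in [0, 1]."""
    prob_safe = 1.0
    for r in risks:
        prob_safe *= (1 - r)
    return 1 - prob_safe

few = [0.9]          # one very risky vulnerability
many = [0.01] * 500  # hundreds of trivial ones

print(summed_score(few), summed_score(many))      # 0.9 vs ~5.0: off the scale
print(bounded_score(few), bounded_score(many))    # both stay inside [0, 1]
```

The summed version produces scores like 5.0 that no longer mean anything on their own, while the bounded version keeps a fixed threshold for decision-making regardless of how many vulnerabilities pile up.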

Dan Mellinger: Go for the one with the most zeros behind it.

Michael Roytman: I mean, I guess so, you go for the biggest number. That's what people naturally do, but that does not mean that that's a good decision. The second you start to bound that metric, you have a threshold for decision-making. The bound itself is that threshold.

Dan Mellinger: And that leads to the next characteristic number two, which is it needs to scale metrically. The difference between two values of any metric should offer some kind of information, right? You'll do a better job of explaining this, but there should be some kind of relationship that is scalable and measurable between two different numbers within this bounded scale.

Michael Roytman: Jay Jacobs actually taught me this and drilled it into me years ago. A calibrated probability is a probability where, if you say something has a 50% chance of happening, it happens five out of 10 times. If you say something has a 10% chance of happening, it doesn't never happen; it happens one out of 10 times. By calibrating probabilities, we're able to articulate the difference between something having a 50% chance of happening and a 100% chance of happening, and how much worse it is, or how much more likely it is to happen. Similarly, there's a concept, the name of which I'm now forgetting, but it's essentially for communicating the difference between any two values of a metric. If you can't tell me how much worse a seven is than an eight on some score you came up with, then that's not a very useful score to me because it's not a good guide for decision-making.
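The calibration idea Michael credits to Jay Jacobs can be checked mechanically: bucket predictions by their stated probability and compare against the observed frequency. The prediction/outcome pairs below are invented purely for illustration.

```python
from collections import defaultdict

def calibration(preds):
    """preds: list of (stated_probability, event_happened) pairs.
    Returns observed frequency per stated probability; for a
    well-calibrated forecaster the two should roughly match."""
    buckets = defaultdict(lambda: [0, 0])  # stated_p -> [hits, total]
    for p, happened in preds:
        buckets[p][0] += int(happened)
        buckets[p][1] += 1
    return {p: hits / total for p, (hits, total) in buckets.items()}

# Made-up track record: 50% calls that came true half the time,
# 10% calls that came true one time in ten.
preds = (
    [(0.5, True), (0.5, False), (0.5, True), (0.5, False)]
    + [(0.1, False)] * 9 + [(0.1, True)]
)

print(calibration(preds))  # {0.5: 0.5, 0.1: 0.1} -- well calibrated
```

A forecaster whose 10% bucket came true four times in ten would show up immediately here, which is the point: calibration makes the distance between metric values trustworthy.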

Dan Mellinger: There should be a relatively obvious relationship between what a five is and what an eight is on whatever scale you have.

Michael Roytman: It doesn't have to be linear, right? It could be logarithmic. It could describe any number of actual natural phenomena that happen. But if I have one cat or I have three cats, it's not three times worse. It becomes much, much crazier when I have three cats.

Dan Mellinger: That's an exponentially escalating problem.

Michael Roytman: Right. If I want to measure a chaos created in my home by cats, the number of cats is probably a bad metric.

Dan Mellinger: That is interesting. Can you assign them chaos points on a scale of one to a hundred?

Michael Roytman: Cat chaos is a good metric. It might also be bounded if you construct it right.

Dan Mellinger: Yeah. That's interesting. Litter tracked all over the place. But basically you're just saying there should be a meaningful and understandable relationship between the relative numbers in the scale. Like, I know what a one means versus a 10 on a scale of one to 10. When you're asking people, rate your pain from one to five, one being the least, five being the worst, now I understand there's a semantic differential between those, and I can logically understand what you're trying to ask me to describe.

Michael Roytman: Or the Richter scale for earthquakes is a really good one. A seven is 10 times worse than a six.

Dan Mellinger: Yes.

Michael Roytman: A six is 10 times worse than a five, and people can intuitively understand that it's an order-of-magnitude shift that you've gone through for an earthquake.
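The Richter example is easy to make concrete: the scale is logarithmic, so each whole step corresponds to a tenfold increase in measured shaking amplitude, and the difference between any two values carries consistent meaning.

```python
# On a base-10 logarithmic scale like Richter magnitudes, the ratio of
# measured amplitudes between two readings is 10 raised to their difference.

def amplitude_ratio(mag_a: float, mag_b: float) -> float:
    """How many times larger the amplitude of mag_a is versus mag_b."""
    return 10 ** (mag_a - mag_b)

print(amplitude_ratio(7, 6))  # 10.0  -- a seven shakes ten times harder than a six
print(amplitude_ratio(7, 5))  # 100.0 -- two steps, two orders of magnitude
```

That consistent step-to-step meaning is what "scales metrically" asks of a score: the gap between a seven and an eight tells you exactly as much as the gap between a five and a six.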

Dan Mellinger: I remember when I learned that as a kid, I was like, whoa, mind blown.

Michael Roytman: That's a California thing. When I learned that as a kid, I was in an IHOP, and it was like a 2.9 in Illinois, and people were like, things can shake on their own. Amazing.

Dan Mellinger: What is it? It's not the Palace of Fine Arts. Anyway, they have an earthquake simulator down here in Golden Gate Park. You can go in and it'll walk you through each of the big ones, what it would feel like, and for how long.

Michael Roytman: I thought the whole state was an earthquake simulator. I can't believe they actually built one.

Dan Mellinger: Sometimes. I have some pretty good stories on that. Anyway, anything else on the scales metrically piece or?

Michael Roytman: Well, it flows very naturally from the bounds. That's the other thing I want to say. Once you bound a metric, you've kind of constrained your decision-making; you've constrained the information that you're looking at to be useful to you. Now, you need to figure out what happens on the in-betweens. The coolest thing about this is that if your metric is a yes or no, does this vulnerability have a Metasploit module, what percentage of vulnerabilities have one, it's a binary variable. That scales very easily. The one is bad, the zero's good. You have a very easy understanding of the difference between the two possible outcomes. Now, if you have a thousand different outcomes, you also need to understand the difference between the thousand possible outcomes, and that takes some work and some crafting. Percentiles are very good at communicating that difference. You're in the 10th percentile, you're in the 20th percentile. Think about how they grade people on a curve. They don't say your score was bad. They say you were in this percentile. You might've gotten a 75 on a test, but that's an A because it's in the top 10th percentile or something. It's useful because the difference between scores is much better communicated than the actual number of questions you answered correctly.
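Michael's grading-on-a-curve example boils down to percentile rank: a raw score of 75 means little on its own, but "better than 90% of the class" does. The class scores below are made up for illustration.

```python
def percentile_rank(score: float, population: list[float]) -> float:
    """Percentage of the population scoring strictly below `score`."""
    below = sum(1 for s in population if s < score)
    return 100 * below / len(population)

# Hypothetical class results on a hard test: nobody broke 80,
# so a raw 75 is actually the best result in the room.
class_scores = [40, 45, 50, 52, 55, 58, 60, 62, 65, 75]

print(percentile_rank(75, class_scores))  # 90.0 -- top of the class
print(percentile_rank(40, class_scores))  # 0.0  -- bottom of the class
```

The same transformation works for security metrics: "you remediate faster than 80% of comparable teams" communicates more than any raw days-to-patch figure.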

Dan Mellinger: Yeah, because that would require additional context, right?

Michael Roytman: Mm-hmm (affirmative). It could be that, like, three of the questions were unanswerable. So what if you got six of them right instead of 10?

Dan Mellinger: Yeah, or if it was only three questions, period.

Michael Roytman: Great analog to security. Say you're patching 20% of your vulnerabilities, you're patching 20 vulnerabilities per asset, but the other 170 vulnerabilities on that asset don't have patches because of end-of-life software. You might still be doing really great in terms of your remediation because you've remediated everything that you possibly could.

Dan Mellinger: But you might want to cut off the internet connection to that device anyway.

Michael Roytman: Right. So, it might be a huge risk, but if you're measuring performance of the team that's responsible for managing that, just because it's risky, doesn't mean they're doing bad.

Dan Mellinger: No. Yeah. There may not be a solution to that other piece. Interesting.

Michael Roytman: The metrics game is long and complicated, but it's all interconnected.

Dan Mellinger: Well, going to the next one: they should be objective. This is pretty intuitive, but ultimately, being objective just means they should be easily testable and easily reproduced, ideally by people other than you, if you're going by the scientific method. I like this one: they're resistant to changes in input data.

Michael Roytman: This is nice because some things depend a lot on how you view them. There's a technical mathematical definition of objectivity that our folks can Wikipedia, and it has some functional definition of how functions interact with each other: input, output, operators. But if you really think about it, what variable you use to define the metric can vary based on the source of the data. What's a good subjective one? One of the CVSS sub-metrics could be very subjective, like, what's the difficulty of exploiting this vulnerability? Say you're measuring somebody's remediation performance and your only input metric is exploitation complexity, or access complexity, for a vulnerability, and you piped in NVD data about it. The National Vulnerability Database tells you this one's easy to exploit, but then next month you bought a subscription to IBM X-Force, and you have better, more specific CVSS judgments. All of a sudden, the underlying metric might shift because your subjective dataset shifted. Time is really good because it's super objective. The number of days this vulnerability has been open is not a very objective metric unless you clearly define what you mean by open. Is it when your scanner found it? Is it when it existed on the system? Is it when you first cut a ticket for it? There's a fixed, objective point which you can use to define a metric, and it's important to make sure that it's not really gameable by the data.

Dan Mellinger: Which is another thing we'll get to a little bit later, but yeah.

Michael Roytman: Yeah. People can game metrics, but I think a lot of folks, unless they're deep in it, fail to recognize that your data can game your metrics too.

Dan Mellinger: Yeah, that's interesting. We were talking to The Register, and he actually wrote a story from... I'll try to look it up for the show notes, but there was a data scientist doing an analysis on the objectivity of CVSS-based score components. They were showing how a lot of these things, where they may appear objective, are all being interpreted by human analysts. When you test and have multiple analysts look at the same vulnerability and try to assign the same CVSS score, you get a massive range. We already know that CVSS, especially V3, I think, is the worst offender of this from a distribution standpoint: like 90% of scores are between five and 10, and the sub-components will vary wildly based on the person who did the analysis.

Michael Roytman: Yeah. A much better example, I think, is something like, how important is this machine to your business? Ten people might look at it and give you 10 different answers. Douglas Hubbard, Doug Hubbard, in his book The Failure of Risk Management, has a lot of good strategies for calibrating expert opinions and finding ways to make a dataset that you may have gotten from a bunch of business owners more objective. There's a lot about how, if people are doing it right after lunch, they're going to assign different scores than if they're hungry. So, a lot of metrics are susceptible to this. A lot of failures occur because the input metric is subjective, but we don't realize that. This was my first big revelation when I was looking at security data. I was looking at CVSS sub-metrics like they were the Ten Commandments handed down to us from the mountain. But when you literally look at what's happening on the mountain, those people are making judgments.

Dan Mellinger: I have the tablets. Yeah, interesting. Oh, I mean, that makes perfect sense. It's, again, one of the more intuitive ones, but also not necessarily the easiest to identify in practice, to your point. Then, closely related, the fourth key characteristic is valid. The metric should be valid. In this case, that means it actually measures what you want it to measure, which is not so cut and dried.

Michael Roytman: I love this one because you read it and you're like, well, duh, of course it should be valid. I know what validity means. But if you try to define it, there's again a very technical definition of validity. If you give me two inputs that produce two outputs, and the difference between them is the same, then the next time you give me those two inputs, the outputs should be the same. That makes a metric valid because it's reliable... or that's a completely different one. Maybe we should switch the order. Let's package them up together: valid and reliable. Reliable means you put something in, you get the same thing out, time and time again, assuming all of the other conditions are equal. That's not true for all metrics, because of that subjectivity sometimes, or because the distribution might change a lot, or because the bound keeps going up. You could have metrics that are recursive, where a new vulnerability came out and the score that you get on that asset might be based entirely on everything else that's on it. That's a bad way to measure something like progress toward a past state of a system, based on new information that's coming out about it. Reliability is all about an input giving you a predictable output. You know how you got it. So, neural networks will sometimes struggle with this, but usually give you the same output for the same input, and that's how you can figure out what contributes to the models. Similarly, validity is, if you've got two inputs and two outputs, the distance between them has to be the same.

Dan Mellinger: Oh, so it's the interrelation between the metrics. Because they're bounded and they're scaled, the way they're represented when you do the number crunching and validation on the back end should be consistent between the two. The differences you set out to measure should consistently come out that way, and then reliable is consistency, right?

Michael Roytman: One's talking about your input-output function, and the other is talking about the relationships among the results. If you have a metric that's not scaled, it's likely not going to be all that valid. If you have a metric that's not bounded, it's likely not going to be reliable. These are very technical terms, and I don't know the deep, deep technical statistics behind them. Our listeners probably don't need to either, but the Wikipedia articles are out there; we'll link to them. It's interesting that there are very precise mathematical definitions for things we use colloquially. We'll say things like, oh yeah, that's totally valid, I looked at that data and it makes sense, or, I rely on this metric. That's not what you mean. You mean something very precise and technical, and we need to make sure our security data is that too.

Dan Mellinger: Absolutely. Which is really hard to do, because one of the biggest challenges isn't even measurement, it's collection: how dirty all these data sets can be. You may put in specific queries and not get a consistent or reliable output from that either, and so that-

Michael Roytman: For a number of technical reasons that have nothing to do with your metric or its poor construction, but it's hard to query every GitHub repository in the world for an exploit.

Dan Mellinger: Ah, we've done a podcast on that one too.

Michael Roytman: Good. Jay will tell you about the problems he ran into trying to do that. You query GitHub for a CVE, and you can't crawl all the code in the world at once with that API request. If your metric's reliability is based on that query rather than on the output you got, it might look very different depending on what response you get from the API. That's it.

Dan Mellinger: To top it all off, while we're talking about the interconnectedness of all these characteristics, they had to have humans go look and make a judgment call, one that was not necessarily objective, on what was and wasn't an exploit, because there's a lot of gray area when you're looking at GitHub. Was it a student whose professor asked them to go create a new vulnerability in something? It was theoretically possible, but was it a CVE? Probably not. Jerry is all over this right now with CVE stuffing. So that ties back to the whole objectivity point: humans have to make a judgment call, and now your metric isn't even objective anymore, nor is it reliable, and probably not valid.

Michael Roytman: The crazy train leaves the station very quickly, because you start to realize it's all about how you define your data, what you count as valid or correct, and where it's coming from, and you have to follow it all the way down to the beginning. Metrics are quantitative ways to describe things that are ultimately subjective. When I came up with this list, I think it was for the SOURCE conference in Boston in 2014 or 2015, I remember thinking every metric must meet every one of these criteria or else it's terrible. Over the past five or six years, I've started to realize that even the metrics we create are susceptible to degrees of weakness across these criteria. Reliability might trade off with objectivity, and you have to find a way to deal with that, because you still need a way to make decisions that's as reliable as it can be. You're not going to score a hundred on all of these, because data is messy, because false positives exist, because humans make mistakes, and because there are humans trying to make us make mistakes with a lot of these metrics.

Dan Mellinger: I love when I come to these philosophical understandings mid-podcast. I think that happened on our last one too, on power law distributions; things just randomly clicked. Hopefully that's happening for the audience out there. That takes us into number six, which sums a lot of this up: the metric needs to be context specific, meaning resistant to gamification, to humans trying to obfuscate, or to some object you're trying to measure being able to manipulate the data. This ties back to almost everything in here, but it's the deepest, most difficult concept for me to understand, so I'll pass it back over to you.

Michael Roytman: It's a continuum. Again, foolish young me, trying to come up with these, thought you had to be context specific or not, but it's really a continuum. What it means is there are really two types of metrics. Type one metrics describe a really controlled environment. A really good type one metric is: what percentage of your workforce will respond to this phishing email? The environment is really controlled. A type two metric might be: how often do you get phished? That's not just about how likely Dan is to click on this phishing email; it's also about what kinds are coming in, what the threat environment looks like, whether people are even sending you phishing emails.

Dan Mellinger: Who do they have access to?

Michael Roytman: So, you're measuring two different things. A good metric will include both types, or rather, a good decision-making framework will include both types of metrics. Ultimately, a type one metric really shouldn't be influencing policy. It's describing a base rate, and a model, that you can use to calibrate your type two metric. CVSS is a type one metric; it describes vulnerabilities really well, but it tells you nothing about how attackers are interacting with them. The data that would build a temporal model is a type two metric. That's all about going out, measuring the world, and bringing that into my scoring metric.

Dan Mellinger: Got it. So, okay. I think I'm understanding this. Type one, going back to your phishing example, which is... We know this is a terrible, terrible thing to do to your staff as a security training, by the way, so I'll just put that out there as CYA.

Michael Roytman: Do you have something you need to tell us on this podcast?

Dan Mellinger: No, no, no, but I did joke about being a red hat on our blue voices in security podcast by not updating my system on purpose. Now, going back to the phishing example: if you're conducting a phishing test in your environment, as a security practitioner, you are the person who defined who you're sending this phishing email to, whether it's the entire sample, what countries, what geos, whether it's only going via email, and there isn't the greater context, which would be the type two, right? So type one is super controlled, and then you get this base rate: this many people clicked the link I sent them via their corporate email, from an official-looking email address, to North American employees who are working from home, so they know that. Then type two is more of the interaction with the surrounding environment. Realistically, if you're getting phished, people might do it through LinkedIn, or they might look and get more details on specific people who make their LinkedIn public. Now I know this person does this at this company, versus a lot of the other people I don't know about. So there are other factors that could lead to success in phishing beyond that very controlled type one metric, right?

Michael Roytman: Type one metrics are a controlled experiment to try to remove a part of what's happening in the real world so that we can learn about it, and type two metrics make validity really hard, because attackers change their behavior all the time. If you have a great type one metric, and this is why we need CVSS, this is why we need CVEs, we can figure out: if this stays fixed and the real-world environment is changing, what's actually changing? And we can construct valid metrics using both of these. Context specific is also really interesting, because the best way to detect that something's not context specific is when it's really gameable. If you're measuring somebody's performance on a metric where they can find a nice, lazy cheat code to not do any work but still have the metric increase or decrease or whatever, that probably means your metric isn't context specific enough. It doesn't include enough of type two, the outside threat environment, or it doesn't have a good enough type one base rate to actually measure the behavior you're trying to measure. I will reveal a truth about what we've learned. We spent 10 years getting very good at measuring risk, and we have a lot of customers who have started measuring the performance of their IT ops teams by asking, how risky is this group of assets that you're responsible for? Well, it turns out that sometimes the risk in a group of assets will go up like crazy toward the end of the year because new vulnerabilities came out, and all of a sudden the IT ops folks being measured on risk look bad. But that's because risk is not a context-specific metric for the performance of that team. They need to measure their actual performance.
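The type one plus type two framing can be sketched with a toy calculation. Everything here is illustrative: a controlled click rate (type one) multiplied by the observed real-world phishing volume (type two) gives an estimate neither metric could produce alone.

```python
# Sketch: combining a type one (controlled) metric with a type two
# (environmental) metric. All names and numbers are made up for illustration.

def expected_compromises(click_rate: float,
                         phish_per_user_per_month: float,
                         users: int) -> float:
    """click_rate: base rate from a controlled internal test (type one).
    phish_per_user_per_month: observed real-world volume (type two).
    Returns expected successful phishes per month, treating every
    email as an independent chance to click."""
    return click_rate * phish_per_user_per_month * users
```

With a 4% measured click rate, 2 phishing emails per user per month, and 500 users, the combined estimate is 40 expected compromises per month. Either input can move independently, which is exactly why the controlled base rate alone shouldn't drive policy.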

Dan Mellinger: They couldn't control that this new zero day came out that was very, very risky and was on their systems. They had no control over that.

Michael Roytman: Exactly. If you had a metric that was very context specific to how likely Mellinger is to get phished, that metric would not tell me much about how likely Jerry is to prevent Dan from getting phished. You can't affect the gamification of your likelihood to get phished, and Jerry certainly can't. The context specificity of metrics has some technical definition, but it's more about: is this really measuring what I need to measure? If it's doing that right, then people shouldn't be able to get between the truth and me.

Dan Mellinger: That makes a lot of sense. I know we're going to come up on time here pretty soon, so I did want to get into the last one. It makes sense to anyone who does anything with cybersecurity data: the metric needs to be computed automatically, right? Because there's just too much data, at too high a velocity, with too many factors, for humans to do this. You can't balance your checkbook by having an army of accountants doing this stuff.

Michael Roytman: That's my only contribution to this whole discussion. Everything else I read in some paper somewhere or pulled from some slide deck that some professor had for a management class. I added this one for cybersecurity, I added it in 2015, and I believe in it even more strongly today. Metrics in cybersecurity are not very useful if you have to go in and manually change the score on every asset, because most organizations have hundreds of thousands of assets. Report Confidence in the CVSS v2 and v3 temporal metrics is a terrible metric, because it means somebody has to make a subjective evaluation manually; there's no other way to do it. You can automate whether or not the report exists, and that piece of the metric is good. You can automate whether or not the exploit exists, but you cannot automate confidence. Whose confidence? What are we talking about? Does one person have to score every vulnerability for this to be a valid metric? You certainly can't automate it. I think for security specifically, and maybe for most other fields, start here: figure out which metrics you can compute automatically. Because if you wait to construct a metric and then try to figure out how to compute it automatically, you might end up in a place where you won't be able to cover what you need to cover.
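The automatable-versus-subjective split can be sketched like this. The field names below are illustrative, not actual CVSS sub-metrics: a pipeline scores only the signals it can populate without a human, and flags the rest for review instead of guessing.

```python
# Sketch: keeping a metric automatically computable by scoring only
# fields a pipeline can populate without human judgment.
# Field names are illustrative, not real CVSS sub-metrics.

AUTOMATABLE = ("exploit_published", "report_exists", "patch_available")
SUBJECTIVE = ("report_confidence",)  # would need a human per vulnerability

def auto_score(vuln: dict) -> float:
    """Fraction of automatable signals present; subjective fields ignored."""
    hits = sum(1 for field in AUTOMATABLE if vuln.get(field, False))
    return hits / len(AUTOMATABLE)

def subjective_fields_present(vuln: dict) -> list:
    """Flag fields that would require manual scoring, for human review."""
    return [field for field in SUBJECTIVE if field in vuln]
```

The design choice is that subjective inputs never silently enter the score; they surface as a review queue, which is what keeps the metric itself computable at hundreds of thousands of assets.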

Dan Mellinger: Absolutely. And even then, as in your example, you start to lose objectivity and introduce more error, even just data entry error, right?

Michael Roytman: Well, that's the other thing. The things we need to make something automatically computable are objective data, good data pipelines, places to store it, and places to process it. If you have those things, you're right, you're much less likely to make mistakes on everything else.

Dan Mellinger: Yeah. Awesome. Well, I know this has been a complex but really enlightening episode for me. We'll probably end up putting something out on the blog talking about the different characteristics and maybe some examples, but I think we ran through a bunch. Any final thoughts on measurement and these characteristics of good metrics before we hop off?

Michael Roytman: These are the seven deadly sins of metrics. Most people listening to this podcast, most people in security, interact with some kind of dashboard, either created by them or created decades before, that they need to look at to make decisions. Take 10 minutes to look through these seven criteria, look at the dashboard you're using, and figure out how good it is at meeting all of them. Nothing's going to be perfect, but some things might stick out that were obscured by the language of validity and reliability that we don't normally use, right?

Dan Mellinger: Awesome, great advice, and I also want to give everyone else some advice. If you have an (ISC)2 certification and need CPE credits, you can get them by listening to this podcast. Go to kennasecurity.com/blog; this podcast episode will be there as one of the blog posts, and you can fill it in with your email and (ISC)2 code, and you will get credits for listening to us ramble about metrics. Michael, thank you very much, and we look forward to the next one. Take it easy.

Michael Roytman: Thank you. Thank you.
