Risk, Measured: Power Laws and Security
We're picking back up on our "Risk, Measured" series, where we dive into specific concepts used to measure risk within the context of cybersecurity. In this episode, we discuss the application of power law distributions in cybersecurity.
Dan Mellinger: Today on Security Science: power laws and cybersecurity. Hello, and thanks for joining us. I'm Dan Mellinger, and we're picking back up on our Risk, Measured series, where we dive deep into specific concepts used to measure risk within the context of cybersecurity. Today, we're discussing the application of power law distributions to cybersecurity. With me, our power law of applied mathematics, chief data scientist at Kenna Security, Michael Roytman. What's up, Michael?
Michael Roytman: Hey Dan, this is a good one. I'm excited for this one.
Dan Mellinger: Yeah, this should be fun. I don't imagine this being too long, but I found this very interesting, just because it's kind of counterintuitive to the way at least I — and I think a lot of humans — interact with and think about risk in the world. Power law distributions. So, what is a power law, if we could start there?
Michael Roytman: So it is a distribution — a shape that measurements of some quantities take on statistically. Everybody's probably familiar with the bell curve, which is your most basic distribution. That's called the normal distribution. And the key takeaway, and the trick here, is that some things are normally distributed, like the heights of people. And what that lets you do is drive to an average. So like the average man is 5'10" or something. And if you meet somebody who's seven feet tall, you're like, oh, that's pretty extreme. That's a couple standard deviations out. But if you meet somebody that's 10 feet, you're like, that's not possible. It's not a human. The probability of that happening is extremely low. In a power law distribution, or in a whole set of fat-tailed distributions — log normal is another example — it's actually nearly as likely that you meet somebody that's 10 feet tall as somebody that's seven feet tall. So the distribution doesn't converge on that average. And this is where things get a little wild, and the internet is generally a place where things get wild — statistically as well.
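[Editor's note: to make that tail comparison concrete, here's a quick sketch — my numbers, not Michael's. It compares the chance of exceeding 7 feet versus 10 feet under a normal model of height and under an illustrative Pareto (power law) model. The mean, standard deviation, minimum, and the alpha = 2 exponent are arbitrary assumptions for illustration.]

```python
import math

def normal_tail(x, mu, sigma):
    """P(X > x) for a normal distribution with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

def pareto_tail(x, x_min, alpha):
    """P(X > x) for a Pareto (power law) distribution with minimum x_min."""
    return (x_min / x) ** alpha

# Hypothetical numbers: heights in inches, mean 70", std dev 3".
seven_ft, ten_ft = 84.0, 120.0

# Under a normal model, 10 ft is astronomically less likely than 7 ft.
normal_ratio = normal_tail(ten_ft, 70, 3) / normal_tail(seven_ft, 70, 3)

# Under a power law (alpha = 2, purely illustrative), the two tail
# probabilities are the same order of magnitude.
pareto_ratio = pareto_tail(ten_ft, 60, 2) / pareto_tail(seven_ft, 60, 2)

print(f"normal: P(>10ft) / P(>7ft) = {normal_ratio:.3g}")
print(f"pareto: P(>10ft) / P(>7ft) = {pareto_ratio:.3g}")  # → 0.49
```

Under the normal model the ratio is vanishingly small (roughly 10^-56); under the power law it's about one half — which is the "doesn't converge on an average" behavior Michael describes.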
Dan Mellinger: That's interesting. So that's exactly why I wanted to tackle this topic, because I think as humans we're really good at categorizing things. We like to put things into buckets, and there's a ton of psychology on how our brains automatically organize things by similar nature, all that good stuff. We're good at seeing means. We're good at trying to find averages and being like, this is normal. And that's how people think about the world — in bell curves. IQ is a bell curve, where the median, the average, what most people are is distributed right in the middle, and the farther out you get, the less likely. Cybersecurity is not really like that. And power laws — I like to think of it more in the VC realm, venture capital. There's a lot of research showing that a small number of companies ultimately capture the lion's share of money within any given industry. So let's talk about cloud: AWS has 33%, Microsoft is 10%, something like that, just barely double digits. And then you've got a bunch of single-digit earners below that, and a whole lot of nothing else.
Michael Roytman: Well, this is actually a really good way to think about distributions that comes from the Black Swan guy, Nassim Taleb. So the key to all this — why we care — is whether or not the outcomes are predictable. You've got a normal distribution and a very simple outcome, like I want to predict a height: that's super predictable. You've got a normal distribution and a complex outcome — like, let's imagine that company success was normally distributed, and the outcome we were looking for was: get listed on the stock exchange, perform well, have four quarters in a row of success, retain talent. That's predictable, but it's tricky, because you've got to layer a lot of distributions together. When you've got a power law distribution and a really simple outcome, you can still predict the outcome. But when the outcome is complex and the distribution is fat-tailed, there's very little chance that there's any predictability in the outcomes. And I think a lot of what happens in venture capital, a lot of what happens in company formation, we think is predictable but probably isn't. Probably a lot of it has to do with chance, or some distributions that we just can't measure. This is why you have 400 different podcasts about success factors that lead to outcomes for startups, and we still don't know what it is that actually causes startups to be successful.
Dan Mellinger: Yeah, that's interesting, because in terms of VC investment, VCs can't reliably pick winners. And so their whole thing is trying to construct portfolios that can generate returns. Which requires that they understand power laws: of all of the companies they provide investment in, their goal is to have the highest probability that one of those is going to really amp up the returns, give them $2 billion.
Michael Roytman: Now we're getting actually into the... I see where you're going with this. We're entering the security realm now because they do have a strategy. They're not just shooting in the dark. They've come up with a way to deal with this phenomenon. And it's certainly not on average, we think you're going to perform like this. It's let's spread our bets in a way that covers the distribution.
Dan Mellinger: Yes, exactly right. So of the thousands of vulnerabilities, the vast majority pose no threat, but there's a small number that do. And for those, the potential downside... With VC, it's a good thing: they're trying to predict the winner, and they get a bunch of money. We made $2 billion, and it covers the spread of investment in all of these other companies that gave us almost nothing, or folded, or did well but nothing spectacular. In cybersecurity, the outcome of some of these breaches could be catastrophic. Or — I know you don't like using averages — what's the average cost of a breach? People put it in the $50 million realm, that kind of thing. So the outcome can be severe, very dramatic comparatively.
Michael Roytman: Well, this is why we're always surprised year after year when there's a new biggest breach. If you think about Home Depot or Target being the biggest breaches — the next year there's a bigger one, and the next year there's a bigger one. And we're constantly surprised, because the average that we're looking at does not accurately describe the distribution. It's just as likely we're going to get one that's 2x or 10x bigger. And this is the key point: what you're measuring matters a great deal. If you're measuring loss, that's one distribution, the distribution of loss. There's actually some work by Thomas Maillart out in Geneva — this was a while ago, probably 2013, 2014, and he has continued that work since — a great study of which measurements are actually power law distributed, and which are log normal or just fat-tailed, which is another form of distribution that's a little more skewed than normal. Loss, especially individual loss, is power law distributed in cybersecurity, and we can't really assign averages to it. But when we talk about vulnerabilities, we're not necessarily measuring loss. We might be measuring things like how many machines does this new vulnerability affect — and Heartbleed affected a ton, but then Meltdown affected even more because it was super cross-library, or things like NotPetya, affecting so many different systems and organizations, affected even more. The impact of a vulnerability — how you define the measurement — will determine which distribution you have. Because if it's CVSS severity that you're using to describe the vulnerability, yeah, there is an average there, it's like 7.7. That's just not a very useful measurement. If instead it's how many machines in the world does this vulnerability affect or put at risk, that's definitely going to be a fat-tailed distribution, but a really hard one to measure.
Dan Mellinger: Interesting. So given that, what are some of the challenges with trying to... I mean, we make predictions kind of like that right within the platform. What are some of the challenges with trying to predict and measure risk when they're typically mapping back to this kind of power law distribution, where the ramifications are inherently harder to predict and quite costly, should they happen?
Michael Roytman: I really like that question, because I think I would have answered it differently five years ago than I'm going to answer it today. The fundamental thing with power law distributions: if you think you're in one, there is no average. The mathematical statement is that there's no finite second moment, which means your average isn't going to accurately describe what's actually happening in the distribution. So there are ways we have to deal with that. One way is, instead of saying this is the average vulnerability and here's how many standard deviations that one is out, you could say: we think the 40th to the 60th percentile of vulnerabilities looks like this. And as you go out to the right, as you look at the 90th percentile, we don't really know — our confidence decreases as we move further into the tail of the distribution. We see that because only a small percentage of vulnerabilities pose a risk to most organizations. And we want to make sure that the scoring system, for example, that we build — we actually have a thing that one of our security engineers, Sam [inaudible], built, called Mr. Radar, which monitors how the distribution looks — to make sure that the scoring system we're coming up with, which is really just a construct we've created to guide people's actions and decisions, looks fat-tailed. Because if it starts to look normal, that means maybe one of our data sources is wrong, maybe our measurement system is wrong. We know that risk isn't normal, so the distribution that describes it shouldn't be normal. And we should be looking at the percentiles. We don't measure that distribution by the mean or the average; we actually look at the 10th percentile, the 20th percentile, the 80th percentile, the 90th percentile, to make sure there are no big shifts in those. That sounds simple: monitor the distribution and measure it.
But it turns out that observability, which is the hot term in Silicon Valley nowadays, usually applies to how applications are performing. Few people apply observability to the data itself, or to the models, or to data quality. I know there are machine learning startups and platforms that are thinking about this; we've been doing it internally for five years. And I think because we're in this hard-to-predict power law regime, security is especially complicated. We're constantly aware that if we're looking at an average or a normal distribution, we might be wrong, and we might give bad guidance to our customers.
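[Editor's note: Kenna hasn't published Mr. Radar's internals, so here is a purely hypothetical sketch of the idea Michael describes — fingerprint a score distribution by percentiles rather than its mean, and flag when it stops looking fat-tailed. The 3x top-decile-to-median threshold and the simulated score distributions are my own illustrative choices.]

```python
import random

def percentile(sorted_vals, p):
    """Nearest-rank percentile of a pre-sorted list."""
    idx = min(len(sorted_vals) - 1, int(p / 100 * len(sorted_vals)))
    return sorted_vals[idx]

def distribution_fingerprint(scores, checkpoints=(10, 20, 50, 80, 90)):
    """Summarize a score distribution by percentiles instead of its mean."""
    s = sorted(scores)
    return {p: percentile(s, p) for p in checkpoints}

def looks_fat_tailed(scores):
    """Crude sanity check: in a fat-tailed distribution the top decile
    should sit far above the median (here: > 3x, an arbitrary threshold)."""
    fp = distribution_fingerprint(scores)
    return fp[90] > 3 * fp[50]

random.seed(0)
# Risk scores drawn from a Pareto-like (fat-tailed) distribution...
fat = [random.paretovariate(1.1) for _ in range(10_000)]
# ...versus scores that collapsed into a normal shape (the red flag
# that suggests a data source or measurement problem).
normalish = [max(0.01, random.gauss(5, 1)) for _ in range(10_000)]

print(looks_fat_tailed(fat))
print(looks_fat_tailed(normalish))
```

In a monitoring job you'd compare today's fingerprint against yesterday's and alert on large shifts in any percentile — the "no big shifts" check Michael mentions — rather than watching a mean that the tail can swamp.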
Dan Mellinger: Interesting. So average is bad in general, I think is a theme I'm picking up here.
Michael Roytman: Yeah. I mean, ultimately it all comes down to how you're measuring your decision support and your guidance. If you're looking at a dashboard from a security vendor in the 2000s, it might have a whole bunch of mean time to remediate, or averages, or average CVSS score — how many points above average is it? Things like that, we now know after looking at 20 years of data, are not that useful, because the distribution is not normal, because our human thinking seems to fail when we think about the scale and interconnectivity of the internet.
Dan Mellinger: That brings me to another thought on this. So humans — I said at the beginning, we naturally like to create means and averages, and this-is-what-normal-looks-like. And that seems to predominantly drive some of the metrics that are traditionally looked at within cybersecurity. So mean time to remediate — we literally have a whole slew of mean-time-tos. What I'm hearing from you is that averages and means are generally not super helpful as they relate to cybersecurity, because a lot of the consequences, and a lot of the way these things map out, are power law, which is anything but average. I'm just curious on your thoughts: how efficacious are these mean-time-tos, and should people be thinking about some of these metrics a little bit differently as they pertain to security?
Michael Roytman: So this is where my thinking has changed over the past five years or so. I used to think the right answer was: spread out your strategy, learn your capacity, and focus it on things that aren't necessarily in the average, because one breach, one vulnerability, could actually be the only event you care about. Now what I'm learning is that most organizations, even when they think about risk and start using a risk-based approach to — forget vulnerability management — security in general, recognize their limited capacity and try to look at... I don't want to say they're looking at the fat tail. They're looking at the things that are most probable, the riskiest events. But what's key here is that any one vulnerability that pops up could actually end up being riskier — 100 times riskier — than the last one you saw, because there's no stable average. And so how you respond to risk starts to become very important. And this is the work that we did about two years ago with the Cyentia Institute and have since put into the product. When we think about the SLA that you assign to a risky vulnerability, it is imperative that the riskiest stuff gets a time to remediate that is orders of magnitude faster than whatever your next risk bucket is. So here's an example. If you've got vulnerabilities with a 90% probability of exploitation or higher — that's what your model is telling you — you remediate those in 30 days, and then from 70% to 90% you remediate those in 60 days. You've staggered your capacity. You're essentially saying, let's do these first, let's do those second. But that's not precise enough, given how risky the next one could be. So I would say: let's say your organization has the capacity to remediate one vulnerability a month within three days. That's a huge effort — you're scrambling, it costs a lot. Then find the things you think are riskiest and do those in two to three days. Maybe it's three vulnerabilities.
Maybe it's one, because eliminating it quickly might actually be life or death when the next one gets exploited within six days instead of your average of 14 days. When we built remediation guidance, we looked at the mean time for an attacker to exploit a vulnerability after it comes out, and that's what we based that remediation guidance on. But that's not enough. You also have to stagger that mean time to exploit by risk level, and try to go as fast as possible for the smallest subset, because ultimately just one vulnerability could be the thing that causes a hundred-million-dollar breach at your business.
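[Editor's note: the staggered-SLA idea reduces to a simple lookup. The probability bands and day counts below loosely echo the example numbers in the conversation — they are not Kenna's actual product configuration. The point is the shape: each band down gets an order-of-magnitude looser deadline.]

```python
# Hypothetical risk-based SLA tiers: (minimum predicted exploitation
# probability, remediation deadline in days). Illustrative values only.
SLA_TIERS = [
    (0.90, 3),    # >= 90% predicted exploitation: all hands, days
    (0.70, 30),   # 70-90%: fast, within a month
    (0.30, 90),   # moderate risk: a quarter
    (0.00, 365),  # the long tail: eventually
]

def remediation_sla_days(p_exploit: float) -> int:
    """Map a predicted exploitation probability to a remediation SLA."""
    for threshold, days in SLA_TIERS:
        if p_exploit >= threshold:
            return days
    return SLA_TIERS[-1][1]

print(remediation_sla_days(0.95))  # → 3
print(remediation_sla_days(0.75))  # → 30
print(remediation_sla_days(0.05))  # → 365
```

Note the deadlines themselves are fat-tail-shaped: the riskiest sliver gets days, the bulk gets months or more, rather than one uniform 30/60/90 schedule.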
Dan Mellinger: Interesting. So it's even less about the numbers game and more about the speed to address potentially highly active, highly exploitable vulnerabilities, something that's hyper risky.
Michael Roytman: Yeah, that's absolutely right. A lot of this theory comes from the counter-terrorism world, where they realized a long time ago that terrorist attacks are fat-tailed distributed and that traditional defensive measures don't necessarily work against that. So militaries — since the '60s and '70s, some militaries have developed methods for responding really quickly to incidents, because they realized that coverage, attempting to look at every scenario, isn't going to work. That's related to the Red Queen hypothesis. They're essentially saying: if we can quickly mitigate a threat, if we can increase our speed of response, then those things that end up in the fat tail, the riskiest or most impactful events, won't happen. We'll fix them sooner, before they make it down that chain. An example of this in security that I really like: you can remediate vulnerabilities as fast as you want, but you're never going to be as good as a completely cloud-based company that can spin down and spin up an instance of a machine with a new version instantly. So Netflix or Etsy or Airbnb have some of this capacity built in, where it's all immutable infrastructure. You've got something that's vulnerable and you need to patch it — you don't patch the box. You just shut it off, turn on a new one instantly, and roll over. That speed of response might do more to alleviate the fat tail of risk than anything else you could do about measuring it or describing it. Of course it's also expensive, and of course not everybody can do this. But there are definitely ways that any organization can increase its speed of response. The simplest is just to say: the things we want to be quick about — let's make that a smaller subset. Let's be very deliberate about what subset of things we respond quickly to.
Dan Mellinger: Interesting. So it's: be very aggressive for the very few things that are probabilistically highly, highly critical and respond ultra fast, and then everything else you can kind of... That's interesting. So it's like a power law response to a power law.
Michael Roytman: Yeah, that's exactly what I was thinking. You've learned how the world behaves. The world does not behave with averages in mind — attackers certainly don't. And so you've got to tailor your response to have a similar distribution.
Dan Mellinger: Interesting. That's super... yeah, I love when something literally solidifies in my brain mid-podcast. Jumping back — I did want to elucidate a little bit the small number of vulnerabilities that are ever exploited. When we're talking about this, there are roughly 150K CVEs, just over, I think, now where we are in 2021. 69% of all those vulnerabilities we don't even see in scanned environments — no enterprise we observe has any of that roughly 70% of vulnerabilities. That means 31% of that 150K exists within some environment, and you need to pay attention to it in some way, shape, or form.
Michael Roytman: Well, let's talk about them like that: 69% don't pose any risk to an organization — they're just not there. And 31% pose some risk, and now we've got to get into how much.
Dan Mellinger: Yep. And so out of that entire gamut of 150K, 4% both exist in business environments and have an exploit developed. So that's our first benchmark. If you've listened to this podcast at all, or read any of our Cyentia research — the Prioritization to Prediction reports — if an exploit is developed, that's one of our low watermarks, I would say, in terms of risk. That's where we say you should pay attention to this stuff. We would classify all of that as high risk, because there's something like a seven-times-higher likelihood of exploitation if an exploit is developed.
Michael Roytman: And essentially that means that it is possible to easily use that vulnerability to exploit somebody. It doesn't mean it's going to happen still. But the chance of that event occurring has increased significantly.
Dan Mellinger: Exactly. And even of that subset, I think what we see is that roughly 1.8% of all the CVEs ever reported are actually used in attacks — call it 2%. So that's a very, very small overall number. And then on top of that, to your point, Michael, we almost never see... I think there's an article Jerry was talking about, where the NSA hasn't dealt with a true zero day used by another nation state in years. We get these articles all the time, but it's typically something that exists, something where an exploit exists and we know about it. So I just wanted to put some of that stuff out there.
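[Editor's note: the funnel Dan is walking through is worth seeing as raw counts. This sketch just applies the episode's rough percentages to the ~150K figure; the numbers are the hosts' approximations, not exact NVD statistics.]

```python
# The CVE funnel, using the episode's rough percentages against ~150K CVEs.
total_cves = 150_000

observed = round(total_cves * 0.31)      # seen in at least one scanned environment
weaponized = round(total_cves * 0.04)    # observed AND exploit code exists
used_in_wild = round(total_cves * 0.018) # actually used in attacks

print(f"published:   {total_cves:>7,}")
print(f"observed:    {observed:>7,}")     # 46,500
print(f"weaponized:  {weaponized:>7,}")   # 6,000
print(f"in the wild: {used_in_wild:>7,}") # 2,700
```

So of roughly 150,000 published CVEs, only on the order of a few thousand are ever used in real attacks — which is why a handful of truly risky vulnerabilities dominates the whole distribution.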
Michael Roytman: Well, when I think about the things that I'd actually point to — Struts is a good example — the things that really affect a ton of enterprises or pose a huge systemic risk across all of our infrastructure, those are few and far between. Of 150,000 CVEs in the National Vulnerability Database, maybe five or six a year are actually panic-worthy events that you need to respond to within days. But if you're looking at average scores or average risk, or if you start to think the distribution is converging, you might be tailoring your response as if 40 or 50 a year are like that. And then you're not quick enough on the ones that are truly risky. So the shift in mindset that really drove this home for me has been operational. It hasn't been on the measurement side — it's actually not what we do day to day, it's how our customers end up using that information. If you're measuring the right things and your response is proportionate to that measurement, your operation looks a lot better. But if your measurement is wrong to begin with, you're never going to have the right operational outcomes. You see that all the time when people are just looking at a spreadsheet and picking off CVSS 10 vulns within 30 days. That's not responding to the real risk out there.
Dan Mellinger: Yeah. So that's interesting as well. There's a case to be made for setting up kind of a SWAT-team-type approach — you were talking about counter-terrorism, so I'll just bring up that analogy — setting yourself up to address a few things like BlueKeep, for example, which is a good one that keeps on giving. When that one came out, we had early indicators in our system and we knew it was going to be bad; we wrote about it, all that good stuff. And if your team was quick enough to address that, then you don't have to worry about it. In theory, those hours and that time spent were much more fruitful than addressing 100 other vulnerabilities. And I think that also ties back to the recent FireEye hack, the SolarWinds hack and all that — FireEye's red team tools. When you look through it, it was all pretty traditional. It was things that you would expect, BlueKeep being one of them; those are the vulnerabilities they were using to pen test other organizations. So those are the ones you should go out and address very, very aggressively, versus the hundreds of others.
Michael Roytman: Yeah, I think that's absolutely right. If your thinking becomes more probabilistic, then how you structure your teams, how you delineate resources, and how you measure their success or failure will also change to be very probabilistic.
Dan Mellinger: Interesting. And you know, when we're talking about defensive strategies for power law distributions versus bell curves — we've talked about this example before and I love it — the Air Force, I think it was World War II, they were looking at bullet hole patterns in airplanes and being like, oh, we should put armor where the bullets are. And then that's...
Michael Roytman: My favorite tweet on that is: if your data scientist doesn't know what this picture means, fire her. And it's the picture of the airplane with bullet holes in the wings.
Dan Mellinger: The only reason we're studying that plane is because it made it home. But I just think it's interesting because, in wartime, you can actually guard against the average type of attack. And so they used armor because, on average, you're going to be hit with this type of bullet — you're an airplane, they're going to be shooting this at you. And that is not how... I see that as kind of like firewalls in cybersecurity, or some of this perimeter-type thinking, perimeter security strategies that defend against the average — against the spray-and-pray type attacks from bad IPs, botnets, DDoS, that kind of thing.
Michael Roytman: It reminds me of a paper I read right when I started at Kenna — so this was like 2012; the paper was written maybe in 2010. It was a simulation they ran: a repeated game-theoretic model of an attacker and defender interacting. The defender's got this network, many different nodes in it and interconnections between them. And what they found was that the network topology does not matter at all if the attacker's cost is sufficiently low. The whole concept is that the attacker is just going to keep trying stuff. And if your defense is air gaps or firewalls or some kind of network perimeter defense, the reality is that they're going to hit every node eventually anyway, and that's going to cause the exploitation. So you're defending against what you think, with our puny little human minds, is the average type of attack — like, this thing is less vulnerable because on average it takes more paths to get there. Well, it turns out when you're looking at all the interconnections of a network, that doesn't matter much.
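[Editor's note: here is a toy reconstruction of the dynamic Michael describes — not the paper's actual model. An attacker probes nodes at random, and each cheap attempt succeeds with some small probability. Topology never enters the picture: given enough low-cost attempts, every node falls; the only question is how many attempts it takes.]

```python
import random

def attempts_to_compromise_all(n_nodes: int, p_success: float,
                               rng: random.Random) -> int:
    """Count random probes until every node has been exploited at least once."""
    compromised = set()
    attempts = 0
    while len(compromised) < n_nodes:
        attempts += 1
        target = rng.randrange(n_nodes)  # attacker picks any node, ignoring topology
        if rng.random() < p_success:     # each cheap attempt sometimes lands
            compromised.add(target)
    return attempts

rng = random.Random(1)
# Even at a 5% per-attempt success rate, all 50 nodes eventually fall.
print(attempts_to_compromise_all(50, 0.05, rng))
```

Raising the attacker's per-attempt cost (Dan's point in the next exchange) shows up here as lowering `p_success` or rate-limiting attempts — it stretches the timeline but never changes the endpoint, which is why speed of response matters so much.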
Dan Mellinger: Interesting. Well, I mean, we still need those in place, because you're trying to raise that cost so you're reducing the overall likelihood of an average-style attack. So what are some of the other defense strategies, outside of reacting extremely aggressively, in the cybersecurity realm? I think the challenge here is scale.
Michael Roytman: Scale, speed, operations — those are great defenses. I mean, we've got to try to predict, even if it's difficult. I think Richard Seiersen said this at the SIRAcon conference: our only asymmetrical advantage against attackers is analytics. Everything else is kind of symmetrical and proportionate — they develop an exploit, we develop a countermeasure; they find a vulnerability, we write a patch. And when we think about measuring the overall statistical distribution of vulnerabilities and exploits, that's an asymmetrical advantage that we have. We can look at what is more likely, what is more probable, what is more impactful over time.
Dan Mellinger: Interesting. And then countering that, I think scale is almost a detriment to us there. There are a ton more CVEs published every single year; the entire gamut is getting bigger. And we talked about this in the 2010 episode with Ed — the raw volume of what gets exploited is actually about the same, but the overall gamut of choices, all the vulnerabilities you have to choose from every year, is getting exponentially larger over time. So it's a smaller proportion of a much larger pie, with the same volume that you're having to monitor, analyze — to your point about analytics — and then take action on.
Michael Roytman: The National Vulnerability Database, at its core, is just measurement. These vulnerabilities don't exist because they're in the National Vulnerability Database — they've existed all along; we've just started to measure them. And the measurement device has changed: more CVE numbering authorities, more vulnerabilities being published, more exploits in GitHub than in Exploit DB — some work that Jay Jacobs at Cyentia recently did. The measurement device has shifted, but our thinking is still this idea of an average that we got from an older measurement device, from 1999 or 2005. And if our actions are guided by that inaccurate distribution in our minds, we're very likely to be surprised. And we don't want that.
Dan Mellinger: Yeah. Because the consequences are dire.
Michael Roytman: Well more and more dire every year.
Dan Mellinger: Yeah, more and more dire every year. Well, thanks, Michael. I figure we can close this out, because you had a couple of really good pieces of advice for practitioners — we're talking about needing to reorient your thinking to something that's a lot less intuitive for our puny human brains. Do you mind giving some advice for practitioners to take away from this episode?
Michael Roytman: So there's a really simple snippet — and I think this might be our shortest podcast yet — but the biggest takeaway is: think about what it means to model something that has no average. When you start to look at everywhere you're reporting on averages — going to your CISO and saying, on average per month we do this, or on average we've seen this many exploits come out for this operating system — start to recognize that that's not actually accurate. It might be under- or overestimating what's happening. It really makes you question everything you're doing and every way you're measuring it. And the answer should be a shift to a probabilistic view: what's the probability this will happen? How much higher is it? Is my measurement device good? That's a huge shift in thinking. Just stop, reevaluate, take a look at what you're looking at. Modeling risk differently is the solution to these non-normal distributions — you can't model risk using the same models you would use to measure the average height of a person. And intuitively — I do this too — you take some data, you throw it into Tableau, you look at the distribution, and you're like, okay, this is the decision guidance, this is the average of this thing. It's not good enough for cybersecurity. We have a much more complex environment to deal with. So don't use those averages; think about where they could be affecting the way you're viewing the world. Ultimately, focus on identifying the riskiest things, the things that live in the fat tail of the distribution, and make a response that's more aggressive for the things that are likely to become riskier. Not always easy to do, not easy to model, but just shifting your thinking about the distribution will cause some shifts. And I think the most obvious one to me is rethinking SLAs.
If your SLAs are still 30, 60, 90 days, those are based on an average view from an ancient measurement device that has since shifted many times. And if you really pause and update them, and think about the other types of distributions that are out there, you might find that your SLAs should be more like 7 days, 30 days, two years — because that's a much more fat-tailed distribution of remediation.
Dan Mellinger: Interesting. That's super, super interesting. And I will link to some resources to help out. You brought up thinking about risk and vulnerabilities in a more probabilistic sense, so I will link back to the Exploit Prediction Scoring System calculator. There's also a white paper there that you can read, from Jay Jacobs and Michael and a bunch of people at Cyentia and RAND Corp and Virginia Tech — a really, really good white paper they all worked on. And it's a really cool calculator that looks at the probability of exploitation within the next 12 months for any given vuln, based on the characteristics of the vulnerability and the benefit of hindsight in seeing what was actually exploited based on those characteristics. So that's a good start for thinking about things more probabilistically. And then there's this concept of risk-based SLAs: picking a smaller number of things to attack and remediate aggressively gives you a lot more freedom to deprioritize the very, very long tail of small things — like the 69% that don't exist in your environment. Awesome. Anything else before we hop off, Michael?
Michael Roytman: No, I think that's it. I think this is a small, pithy insight, and a fun path to go down that's also super nuanced and super complex. It might not be exactly a power law distribution, but it's certainly not a normal one.
Dan Mellinger: Absolutely. So yeah, I will put that out there: if you are a statistician and you take some issue with some of the nuance here — we're trying to think about how you would address something that is atypical, like a power law distribution. So there may be some nuanced statistical inaccuracies here from a mathematical standpoint, but I'm not a mathematician, so it's all good. Michael, thank you for joining us today. Have a good day.
Michael Roytman: Thanks Dan.