Risk, Measured: Epidemiology for Cybersecurity
Dan: Today on Security Science, we discuss epidemiology for cybersecurity. Thanks for listening to the second episode in our Risk, Measured series, where we discuss all the nitty-gritty, nuanced, and highly technical concepts for measuring risk as it pertains to technology. Today's topic should prove particularly interesting given current world events: epidemiology and how it relates to cybersecurity. Joining me today is everyone's favorite data scientist, and soon-to-be cold brew coffee provider, Michael Roytman. How's it going, Michael?
Michael: Pretty good. These keep sounding like just ads for my coffee company.
Dan: Well, you're always doing something new, each time. So, I've kind of got to plug it, right?
Michael: I've got to stop doing stuff, otherwise this is just going to be a coffee spiel. Do you want to be sponsored by a coffee company?
Dan: That's what I've been quietly angling towards, over time.
Michael: Let's talk about it offline.
Dan: We also have a special guest joining us today. He's an expert in complex systems and network science, covering a vast range of topics including infectious diseases, forecasting and predictive modeling, disease genomics and transcriptomics, outbreak surveillance, and decision-making under uncertainty. It's my pleasure to welcome assistant professor at the Network Science Institute at Northeastern University, Samuel Scarpino. Thank you for joining us, Sam.
Samuel Scarpino: Yeah, thanks so much for having me. I'm glad to be here.
Dan: And I know you get a lot of questions around COVID, we will try to not ask you about COVID on this podcast. I can't promise anything, but that's the goal from my side.
Samuel Scarpino: Well, I thought I was here because of some issues I've been having with my browser. Is that not what this is for?
Dan: Oh, Michael does the IT help desk after the podcast recording session.
Samuel Scarpino: Okay. All right, well.
Michael: That's your payment for being on this podcast. It's IT help desk.
Dan: Absolutely. So, I'm going to start off the podcast like we do pretty often with complex topics: with a probably completely out-of-whack definition of, what is epidemiology? According to the CDC, epidemiology is the study (scientific, systematic, and data-driven) of the distribution (frequency and/or pattern) and determinants (the causes and risk factors) of health-related states and events, not just diseases, in specific populations. Sam, what does that actually mean?
Samuel Scarpino: Well, what epidemiology actually means, and I think this is part of the reason that we're having this conversation today, has grown and changed quite a bit over the last 10 years. So, the definition that you read, when I translate that in my head, it's leveraging data and statistics to understand why some populations have a certain disease and others don't, why the prevalence of a certain disease is higher in some places than others, why a disease is reemerging or persisting. And that word 'cause' is really important, because we want to try to understand what's causing those differences, so that we can suggest public health interventions.
Dan: Interesting. So, essentially trying to find patterns and complex interdependencies within specific groups?
Samuel Scarpino: Yes, although the only thing that I would add to that is that we ideally want those patterns to get as close to causality as possible, so that we can have targeted, effective interventions that will improve the public's health.
Michael: And this is where we're getting into security already. So, you said one thing in Sam's intro which stuck with me, which was "decisions under uncertainty" and "diseases across populations." Both of those sound an awful lot like security practice. So, if you just replace disease with a vulnerability, or insecurity: why do some organizations or some verticals behave differently security-wise, and how can we measure and design interventions across those populations? In security, we're pretty good at taking a particular vulnerability or a particular event and deciding what to do about it from a technical perspective. We're not so good at zooming out and looking at the statistical causes of it, or even the statistical correlations of it, across populations. That's why we wanted to have Sam on the podcast, too. Is security, or insecurity, anything like a disease? Or am I just seeing parallels where they don't exist?
Samuel Scarpino: No, I think that there are a lot of similarities. And that's actually one of the things that I'm really excited to talk about today: whether those similarities are superficial, and I don't mean that in a negative way, or whether they're actually foundational, in the sense that some of the same kinds of things that might make a particular population of individuals, or a particular office building, for example, more vulnerable to a COVID-19 outbreak... whether some of those same kinds of patterns would apply to the way the technology infrastructure, or the technology practices or policies, of a particular organization could make it more vulnerable to particular kinds of security issues. So, an example of a complex system would be a human society, where you have individuals who are going about their daily lives: they're going to the grocery store, they're going to the park, they're going to work. But the way in which a disease is going to move through those populations might be very difficult to predict just based on having some understanding of the way individuals decide to go to the grocery store, or the way individuals decide to go to a particular office building, or take a particular route to work. And so, even if you understand a lot about the component parts, the resulting behavior of the aggregate system is often very difficult to predict.
Dan: Could you give us a good example of a model, or how you would model a complex system like that? I know we were talking earlier and you used the example of power grids, and how a lot of people, systems analysts and the like, like to analyze how power failures can cascade through systems, because they're not as easy to predict as you would think from the outside.
Samuel Scarpino: Right. So, power grids in some ways are actually easier to predict than other things we might think about as complex systems. And the reason power grids are a really classic example is that you have this physical network structure that describes the connectivity between substations and transformers, and telephone poles, and all of the wires that are connecting our houses to each other, and back to the power-generating sources. And if there's a blackout, and you want to predict where the next blackout is going to occur on the power grid, it's often very difficult or impossible to do that just based on understanding the physical connections between the different substations and the different telephone poles. You actually have to layer on top of that a physics model of the way electricity works and how electricity is going to flow through a network after there's been a [inaudible]. And this is very common across lots of different models of [inaudible] systems, where we need to understand something about how things are connected to each other, how human social networks are formed, how the connections between different internet-of-things devices arise. But then also, how the dynamics of anything that we're interested in are going to flow across those networks. And so, whether that's a hashtag trending on Twitter, or whether that's a virus, we need to understand both the connections and how things move over those connections in order to start to make reasonable predictions about what's going to happen.
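Sam's point that you need both the wiring and the dynamics can be illustrated with a toy contagion model. This is a minimal sketch with invented nodes and probabilities, not anything computed in the episode:

```python
import random

def simulate_spread(network, seed_node, p_transmit, steps, rng=None):
    """Toy SI (susceptible-infected) contagion on a network.

    network: dict mapping node -> list of neighbors
    p_transmit: per-contact, per-step transmission probability
    Returns the set of affected nodes after `steps` steps.
    """
    rng = rng or random.Random(42)  # fixed seed so runs are repeatable
    infected = {seed_node}
    for _ in range(steps):
        newly_infected = set()
        for node in infected:
            for neighbor in network[node]:
                if neighbor not in infected and rng.random() < p_transmit:
                    newly_infected.add(neighbor)
        infected |= newly_infected
    return infected

# One "physical network" (a line of four substations), two sets of dynamics:
grid = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
certain = simulate_spread(grid, "A", p_transmit=1.0, steps=3)    # cascades fully
unlikely = simulate_spread(grid, "A", p_transmit=0.05, steps=3)  # usually stalls
```

Knowing `grid` alone only tells you what can fail; the dynamics (`p_transmit`) determine what is likely to fail, which is the layered model Sam describes.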
Dan: Interesting. So, that's a lot of data sources.
Michael: Yeah, are there situations where just having one or the other yields something fruitful? Or do you usually need both?
Samuel Scarpino: I would say that that's often a research question that people are very interested in, and it's quite often an active area of debate amongst scientists and practitioners. So, I might tell you the reason I can't predict the next location of a blackout is because we have to have both the physical structure of the connections between the power sources and also a good physics model of the way electricity works. Well, we understand quite a bit about the physics of electricity, so we 'know' that's the right answer, right? I said 'know,' and I'm putting scare quotes around it; you can't see them, but I'm putting scare quotes around 'know.' But if we didn't understand as much about electricity, maybe you would come back and say, "Well, if you just had information on all of these new solar panels that we've put up, that maybe aren't in your dataset, maybe then your prediction would be good." And so, not only is there a lot of debate, from a theory-of-knowledge perspective, or a philosophical perspective, or a complex-systems perspective, about whether you could predict with only the physical connections and/or only the dynamics, it often then becomes an argument about: how do we really convincingly say it's a bad prediction without both of them, as opposed to just a missing-data problem?
Michael: Okay, okay. So, tracking and trying to create an analog to security... I'm cheating a little because I've read some of your work. Are there systems for which we can't issue predictions?
Samuel Scarpino: Well, first I would say it depends quite a bit on what we mean by a prediction, right? So, I can certainly make a prediction, especially if you're not going to evaluate it, right? If there's nothing on the line, I can make all the predictions that I want. And then, even if I was going to make a prediction... Let's say I'm going to make a prediction about whether the next coin flip that Michael does is heads or tails. And so is Dan. But Dan's going to lose $50 if he's wrong, and I'm going to lose 10 cents if I'm wrong. We may have very different thresholds for error in our predictions under those different settings, right? And so, one of the things that we try to formalize is how we actually make those decisions, given that the accuracy of the prediction, or the necessary accuracy of the prediction, is typically dependent upon what decision you're trying to make and your own risk tolerance.
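Sam's coin-flip example reduces to a simple expected-cost rule. A minimal sketch: the $50 and 10-cent stakes come from the conversation, while the probability and the cost of hedging are assumptions for illustration:

```python
def worth_hedging(p_wrong, loss_if_wrong, cost_of_hedge):
    """Hedge (insure, remediate early, double-check) when the expected
    loss from a bad prediction exceeds the cost of protecting against it."""
    return p_wrong * loss_if_wrong > cost_of_hedge

p_wrong = 0.5  # a fair coin: either caller is wrong half the time

# Same forecast, same error rate, different stakes (hedge assumed to cost $1):
dan_hedges = worth_hedging(p_wrong, loss_if_wrong=50.00, cost_of_hedge=1.00)  # True
sam_hedges = worth_hedging(p_wrong, loss_if_wrong=0.10, cost_of_hedge=1.00)   # False
```

The point of the sketch: identical predictions can rationally lead to opposite actions, purely because the decision-makers' stakes and risk tolerances differ.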
Michael: Okay. I think a lot of security practitioners probably laughed at "no cost to a bad prediction." I know Dan and I both did.
Michael: That's because in security, every decision or action you take either has a cost, or some huge cost if you made the wrong one. So, I'm thinking about making predictions on the defender side. Getting really tactical here, this is what we do in security. When we try to [inaudible], we try to predict a type of malware; maybe it's polymorphic and it's changing, and a new strain will come out, a new exploit will come out for a vulnerability. But what we're really doing there is predicting attacker behavior. There are huge payoff functions to that, right? If you're really good at it, you can save an organization billions of dollars. If you're bad at it, you might waste resources by having somebody remediate tons of vulnerabilities that don't end up mattering at all. But security is actually a game between an attacker and a defender. And in our case, unlike the CDC definition of epidemiology, our disease vector is actually a sentient opponent; it's somebody who's thinking and making predictions on their own. To them, there isn't any cost to a bad prediction, because they rarely, if ever, get caught. And when we're talking about something like writing an exploit, you never get caught, right? You can spray and pray across the internet all day, and fire off an exploit against every box connected to the internet, and you might make some money that way. There's no cost to being wrong in that scenario. So, I'm wondering if that's a pretty unique system, or if that's something that we've thought about before? Where you've got two folks who are building models of the opponent, and one has a really high cost and one has a really low cost. Attacker asymmetry would be the SEO search term we want Google to mention when transcribing this podcast.
Samuel Scarpino: Well, there's certainly an analogy on the infectious disease side, right? So, you think about COVID-19, and it's a little bit challenging, I think, to get the terminology right. So I'll be a little bit loose, with apologies to any evolutionary biologists that are listening. There's relatively little to no cost to COVID for an individual virus dying. And let's just set aside the fact that I'm saying a virus is dying, and not talk about whether they were alive in the first place. But there's basically no cost to an individual virus dying. To the human population (we'll just stay at the population level), there is a cost, often a very high cost, certainly an emotional and societal cost, to an individual dying. And once we start getting into much higher numbers, there's still probably little cost if a million viruses die, yet there would be a massive cost, and there has been, when you have hundreds of thousands of individuals dying from a disease. And so, I do think that you often find these kinds of asymmetries, and those are really active and interesting areas of research in evolutionary biology and complex systems: what kinds of evolutionary mechanisms do we see arising in human defenses against viral pathogens, where the asymmetry between the costs of an individual losing is vast? And I think also, to your question around modeling complex systems, those are one of the areas where we look for simplifications, right? And I would assume that there's probably a similar analogy in the security space: you want to have a model that's as complex as necessary, but no more, right? I'm paraphrasing a famous quote. Because we want to try to, and this is going back to the epi side, we want to try to understand what's going on, what's causal, so that we can actually intervene.
And again, we could probably have a different conversation around whether you need to actually understand causality to make good predictions or not. Maybe you don't. But, I think certainly if you understood the causality, you would at least know whether you could in principle make good predictions. And so, I think that that's something that is very similar.
Michael: Okay. Let's talk about operationalizing predictions. You're right, in security we try to make models as simple as they can be, so that we can hand off a prediction to somebody and they'll do something about it. And this might get into some tricky territory, but let's say I had a perfectly predictive model of vulnerability exploitation, and I could demonstrate that, and I published that research paper and I had a product that did that, and then the security team was sending tons of tickets for somebody to go update a system or to switch out a box. But the underlying reason for that prediction? It would be very difficult to recover. Let's say it included 400 different variables, 300 of them coming from closed sources that I'm purchasing somewhere, or from scanning the dark web. If you couldn't explain why you're recommending that action, you would have a much smaller probability of the IT operations team actually executing the recommendations that your really good model is predicting. On the flip side, if you had a model that was 95% precise, but only included 10 variables, maybe they would look at those and say, "Oh, I intuitively understand the causality here, I understand why this prediction or recommendation is being made. I'll act on it." So, there's the question of whether this will get operationalized, whether this is helpful to the person actually acting on it. This is also something that I see as an analog: the epidemiologists are not the doctors treating patients. They're not even the patients themselves; they're making statements about the population as a whole, and then hoping that those recommendations and policies and interventions get undertaken. I think security works the same way, we just don't think of it that way. We're making recommendations to a giant IT organization, and then hoping that those take hold.
I'm wondering if you've found, in epidemiology, methods, practices, modeling techniques that make recommendations more likely to actually get taken or followed?
Samuel Scarpino: That is certainly one of the biggest challenges in the public health space: translating what we learn from the data into actionable policies that will be implemented and followed. We've seen this with COVID-19, right? It was initially a scientific challenge to understand the benefits of mask wearing, and then after we understood the benefits of mask wearing, it became a very complex policy problem to try and convince individuals to wear masks. And part of the reason why is that it's not very much fun to wear a mask. And so, one of the things that we saw right from the beginning was that we were using the same kinds of mechanisms to try to get people to wear masks that we know don't work for [inaudible] efforts, like we've seen with safe sex, right? It's well known that you can't shame people into wearing condoms more often. And for masks, we saw the same kinds of things. And so, there are lessons that we can learn. But oftentimes, it's been done through sort of a trial-and-error process. And now, we take something that we know in sort of a quasi-[inaudible] way, something that we know works for something that is a little bit like mask wearing, and then we try those same policies out on the mask-wearing side of things.
Michael: That makes sense. That reminds me of a trade-off in security between security, functionality, and usability. That triangle is kind of [inaudible] in security. And masks seem to fit really well into that. Any security policy isn't really something people want to be doing. It trades off with usability: how sweet would it be if you never had to log in to anything? You just show up to the site and can use all the functionality there. Or, it trades off with some of the functionality: let's say you blacklist a bunch of websites, so you can't access those, and you can't use your browser to go to the dark web. But we know that those trade-offs are worth it, because essentially, for every intervention, we're conducting ROI experiments on them. And that's where I think epidemiology is super useful in security, if we have large enough datasets of populations. Surprisingly, mass scanning of the internet is a pretty recent thing, maybe the past five, six years is when folks have started doing it. Mass scanning an organization is also a pretty recent thing, maybe in the past 10 years. We now have these large datasets: organizations as a population, the finance industry as a population, or the entire publicly accessible internet as a population. We can start to test whether certain security protocols or recommendations are actually worth the ROI.
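Michael's "ROI experiments on interventions" can be made concrete with one line of arithmetic. A hypothetical sketch, where every number below is invented for illustration rather than taken from the episode:

```python
def intervention_roi(p_incident_before, p_incident_after,
                     incident_cost, intervention_cost):
    """ROI of an intervention (security or public health):
    expected loss avoided, net of the intervention's cost, per dollar spent."""
    expected_loss_avoided = (p_incident_before - p_incident_after) * incident_cost
    return (expected_loss_avoided - intervention_cost) / intervention_cost

# Hypothetical: a patching campaign cuts annual breach probability from 10%
# to 4%, against a $1M expected incident cost, for $20k of remediation effort.
roi = intervention_roi(0.10, 0.04, incident_cost=1_000_000, intervention_cost=20_000)
# Expected loss avoided = 0.06 * $1M = $60k, so ROI = ($60k - $20k) / $20k = 2.0
```

With population-scale scan data, the `p_incident_before` and `p_incident_after` terms are the part you can actually estimate empirically, which is exactly the epidemiological move Michael is describing.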
Samuel Scarpino: Well, that's a really interesting observation, and I think perhaps a difference between epidemiology and public health, and security, and then the analogy back to the virus trying to replicate and spread through the population. Or maybe even to high-frequency trading, where it's perhaps less important that the causality be understood, just that it's working, right? So, the entire process kind of runs based on an objective function: as long as things are going up, you continue to do them. And if they start going down, then maybe things change very, very quickly, and it's less important that you understand the causality. It is interesting to think about whether causality is important because we need accurate predictions, or whether causality is important because we need to understand how to get buy-in to implement policy, so that we actually effect the change that we care about.
Michael: I'm trying to think about that objective function changing. The best example we have in security is probably the emergence of ransomware, where the payoff to attackers changed once that started becoming common practice, and they could all of a sudden get a ransom from somebody instead of waiting to sell off an account, or sell off documents that they got from getting into a system. So, the types of vulnerabilities that were [inaudible] changed. That seems similar to how a virus behaves. There's a very simple payoff function there. And it changed, so the underlying mechanism, the underlying way that folks exploit vulnerabilities, also started to change, or started getting used in different ways.
Samuel Scarpino: Yeah. I guess I would be interested in which side of the equation, the host or the parasite, the attacker or the defender, actually changed, right? So, you could imagine... And this is probably true to a certain extent, and we actually know this is true: the coronavirus that's here today has been around in a very similar form for at least part of a year, and probably in a fairly similar form for quite some time. There are pretty constant exposures to coronaviruses, novel coronaviruses, all over parts of the world, happening regularly. We know, for example, with HIV that there were probably regular exposures going back decades and decades before it took hold in the US. With the coronavirus, we see that the strain that's here now probably showed up in February and was not the one that originally came; it's the same strain, but it's not the same introduction as back in January. And the question really is whether something about the way the hosts are behaving had changed, or whether it's something about the virus. Or, and I've used three different examples that illustrate all three of these: the host, the virus, or just bad luck, right? And sometimes that's the really hard part for all of us. Our brains are wired to see patterns even if they're not there, and the hard part is convincing yourself that the reason we're seeing the coronavirus now isn't just that it's been a waiting game for 10 years and it finally came up tails.
Michael: Right. A low-probability event actually occurring, causing the big security breach, even though you had a really good security policy and were fixing all the high-probability vulnerabilities. Happens all the time.
Samuel Scarpino: That's right, that's right. And then you end up switching your entire security policy because you got unlucky once, right? But what I was thinking on the ransomware side is: I don't know how effective that would have been, even five years ago. But now, we're certainly in a situation where people have vast troves of highly valuable, sensitive documents that exist exclusively in digital form, and by taking over those systems, you could extract a very high price, and often have people not wanting to admit that it happened. And I don't know if that scenario was as likely to be true across nearly every organization 10 years ago as it is today.
Dan: Well, and there are some technologies that unintentionally relate to that proliferation as well, right? Like, would this be possible without the emergence of cryptocurrency, for example? A way to wash and transfer these kinds of digital funds that may not have existed before, roughly eight, nine years ago. So, how do you control and/or identify some of these unintended consequences that really do change the dynamics of some of these models?
Samuel Scarpino: Yeah. Well, I'm thinking back to high school, when we used to play a game called tall tales, where almost everybody was going to say something that wasn't true, and you kind of had to guess who was telling the truth. So, I'm going to do that right now. What I'll say is that there are many other things that have changed. And one of them is related to Michael's earlier point about ease of use and installation versus security, right? So, I started using Zoom because nobody had to install anything. You just click on something, and it works for everybody, instead of no one being able to sign in to whatever video conferencing platform we were using back then. And we know that that introduced so many security vulnerabilities. And I think probably even those of us like me, who don't really even have an armchair understanding of security, realized that this thing probably wasn't a safe way to be doing business. But it made our lives easier, and we've got this fierce competition over video conferencing. And one way to win is to make things super easy and frictionless, even if you lose on the security side for a little bit before the Michael Roytmans of the world catch up to you. And so, we have had that push, right? And so, maybe there is just such a large diversity of ways in which an attacker could gain entry to one of these companies and take over a computer that didn't exist before. And so, maybe where you could have an IBM patching your whole system fairly successfully 10 years ago, that's just never going to happen in most organizations now. And so, not only do you have mechanisms like cryptocurrency, but you also have that proliferation of security vulnerabilities through all of these apps and everything that we have installed on all of our different devices. And that's the complex systems piece there, right?
It's how do we understand the way all of these different pieces are sliding together to generate the massive growth in ransomware. And then, if we actually care about intervening, how do we figure out which one of these things, or which combination of these things, is actually the cause? So that we don't shut down the Zooms when really the issue is the cryptocurrency. In fact, it's all of the above, and so you've got to target everything, or you're never going to make progress.
Michael: I mean, one way that this is all coming together, too, is that COVID-19 is changing the way people work. Work from home being [inaudible] to organizations that might not be used to it is introducing an entirely new attack surface, an attack vector. And that's changing the system too. So, to bring it all together, maybe more semantically than we'd like, the epidemiology is also affecting the security of certain systems and certain vulnerabilities. I want to take a second and move back up the chain, to something I think about very often, which is day zero. Day zero for vulnerabilities is when we first find out that they exist. And at that time, usually all we've got is a description, a document that somebody wrote up saying, "Hey, there's a vulnerability here. I've given it an identifier. This is some of the mechanism, this is some of the [inaudible] effects. Good luck." Of course, over time, we build detection signatures, we build protocols around them, scanners will find them in systems when you run a scan, IDS signatures will detect if somebody's attacking them, people will write exploits. There's a whole life cycle that happens. But I really care about day zero, because if I could at that moment tell you, "This one's going to be an important one, go remediate it," I am now using epidemiology, really, but I'm using security analytics, to be steps ahead of the attacker instead of just responding to them.
Samuel Scarpino: Why do you think the whole rest of the world, for the most part, is doing better at COVID than we are, right? The genome of the virus was published by the Chinese just moments, for all practical purposes, after we discovered these unidentified pneumonia cases in the seafood market. And immediately, governments, NGOs, and businesses around the world started building molecular diagnostic tests to screen for COVID-19. And in the United States, our strategy was, "Well, if we don't test, then it won't be here." Right? And I'm sure there are companies that have that security policy, right? "If we don't ask Michael, nothing bad will happen." But we know that that doesn't work. So, I think the analogy there is real.
Michael: What's interesting about that analogy is, you might have the best analytics practice in the world. But if you're assessing every vulnerability once that data's already available about it, if you're not doing anything predictive, if you're not actually using some of the more sophisticated analytic techniques that have been developed over the past 20, 30 years, then even still, you might be fighting an exponential battle way too far behind when you're reacting to vulnerabilities, instead of proactively getting them out of your system when you know very little about them.
Samuel Scarpino: Absolutely. People are going to be arguing for decades about when exactly we knew, with pretty high certainty, that COVID was likely to go pandemic. And certainly, we will decide that it was sometime by early February. And that's not when the United States started its COVID response. You could argue that we missed the early cases on the East Coast, and so the East Coast hit the panic button after it was almost too late. But then really, the United States didn't start doing much until it slammed into Florida, Texas, Arizona. And even still, we hadn't done very much. We are very much in a situation where we just continue to react. And I would imagine the same is true on the security side: had we put in a little bit more investment back in March and April, we would be looking much more like Australia than the United States. And that's better for the economy, it's better for our society, the kids would be able to go back to school in person. But it's hard. It's really hard, even when you don't have the political landscape that we do in the United States, to convince people to invest in preparation and proactive measures, even though they're orders of magnitude less expensive than reactive measures.
Michael: Okay. So, here comes the most interesting part of this whole podcast, to me, and why I wanted to talk to Sam about this. You're right, it's very hard to convince people to invest in prophylactic care, or preventative care. It has been, in security, very difficult for 30 years to get people to invest in security. There were times in the early 2000s when Fortune 1,000 organizations had one security person on staff, if that. But now, because of a lot of wide-scale breaches, most IT organizations have security departments and are investing in them. [inaudible] are joining the boards now. And that's largely because of the singularity events that have occurred, that have caused dramatic shifts in the way that boards view security policy. One could argue that, while we've had many diseases in the past, this pandemic is also an example of that in the public health sphere. I want to ask you about early warning systems, and how we can design them in ways that look ahead. So let's say, now that security's had these events, and public health has always had these events, and they're top of mind, we know that we need to invest. We know that we need to convince people to invest more. What kinds of systems should we be building? What kind of data should we be getting to facilitate those systems? And how do we get stakeholders to engage? Not just now, but over time, even when there isn't a pandemic that's top of mind.
Samuel Scarpino: Well, certainly the answer to that question is something that many of us are working on actively now, and have been working on for years and years. And hopefully you're right that there will be more collective understanding around the importance of surveillance, of early warning systems, of effective interventions. I do think that a big part of the issue on the pandemic side is that a lot of our collective wisdom is around influenza, where we sort of expect that we would be able to have a vaccine within six months. And we know for influenza that, because of the way it spreads through the population, it's very hard to stop flu until there's a vaccine; it just marches along too regularly. Whereas for a disease like COVID, we've certainly heard it described as kind of the Goldilocks virus, in this sweet spot of all these different epidemiological parameters. One of them is that it's just deterministic, just regular enough, that it can cause these big outbreaks, but it's still pretty reliant on chance events, on these superspreading events, that we can tamp down with non-pharmaceutical interventions that can actually be implemented, like mask wearing at 80%, like limiting gathering sizes to 15, 20 individuals. So, I think what it's going to take to convince individuals is, first, convincing individuals that there's something we can actually do if we're right about the early warning. And obviously on the... Maybe not obviously. I'm saying obviously because I don't understand security as well as I understand epidemiology; anytime you hear me say obviously, that just means that I don't understand whatever I'm getting ready to say as well as something else. Obviously, on the security side, there's a lot more you could do with an early warning system. At least, that would be the conventional wisdom.
But I think now we understand on the infectious disease side that there is a lot more that we can do, if what we want to have happen is to prevent a wholesale lockdown like we've experienced globally, right? We can have really effective, low cost, high throughput screening that can lead to case isolation. We can have mask wearing. We can have all kinds of different targeted interventions that are going to allow us to minimize the costs of an intervention. However, I do think the other piece is actually figuring out how to generate those warnings. And that's something that we've struggled with for a long time. So, you can find papers back to the 1980s showing that individuals in West Africa have antibodies to Ebola. And that doesn't mean Ebola was spreading in Guinea, Sierra Leone and Liberia prior to the outbreak in the 2014/15 period, but it means that there was risk there. Either individuals were going to areas with Ebola and coming back, or there had been introductions. But the risk wasn't zero. You think back to the 2009 H1N1 pandemic: we were laser focused on wild bird flus, because those are highly pathogenic, it's your H5N1s, your H7N9s that we hear about. But that's not where flu jumped from. Flu jumped from pigs in Mexico, and Mexico wasn't even really on the surveillance map for influenza. But we've known for decades that pigs are an important evolutionary intermediary for flu. So, part of it is actually building those warning systems to generate enough true positives, without generating too many false positives, so that we're not in the Homer Simpson, "Everything all right?" mode, and people are listening. Coupling those with interventions that people will actually get behind, that we can do something about. And hopefully, we're going to coalesce around that.
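Sam's point about tuning a warning system to generate enough true positives without too many false positives can be sketched with a toy alert threshold. The signal values, outbreak labels, and thresholds below are entirely invented for illustration:

```python
# Toy sketch of the true-positive / false-positive trade-off in an
# early warning system. All data here are synthetic.
def alert_counts(signals, labels, threshold):
    """Count (true_positives, false_positives) for a given alert threshold."""
    tp = sum(1 for s, real in zip(signals, labels) if s >= threshold and real)
    fp = sum(1 for s, real in zip(signals, labels) if s >= threshold and not real)
    return tp, fp

signals = [0.2, 0.9, 0.4, 0.8, 0.1, 0.7]            # surveillance signal strength
labels = [False, True, False, True, False, False]   # was it a real event?

print(alert_counts(signals, labels, 0.5))   # (2, 1): catches both real events, one false alarm
print(alert_counts(signals, labels, 0.85))  # (1, 0): quieter, but misses one real event
```

Lowering the threshold catches more real events at the cost of more Homer Simpson moments; the design question is where on that curve people will still pay attention.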
Dan: So, what's the best bang for our buck?
Samuel Scarpino: Well, I think two things. One of the things that... Michael's probably understood about me, because we've known each other for a few years now, and I think people find out pretty quickly, is that I get super obsessed about something, and the thing that I'm super obsessed about right now is waste water surveillance. And that's because even if you can't transmit COVID from feces, you're still shedding evidence of your COVID infection in your feces, in your urine perhaps. And that goes through the waste water systems, and we can detect that, at low cost. And because it's not a clinical diagnostic tool, it does not have to have FDA approval, because we're not diagnosing anyone with anything, we're just looking to find out if something is there.
Dan: Looking for the signal.
Samuel Scarpino: That's right. And so, you imagine in Boston, it probably doesn't matter too much if we have a waste water surveillance system right now for COVID, because we know it's here. Now, you might say if we had a really sophisticated system, that would still be a lot less expensive than a lot of the testing we're doing, we could narrow down to exactly where the COVID is. I mean, we could in the end have like, smart toilets that were telling us everything we're shedding as we're going to the bathroom. But, you think about the places of the world that had eliminated COVID. Like, what if we had a waste water surveillance system for a whole bunch of infectious diseases, in all the airports, hooked up to all the toilets. Now, we couldn't say who had it, but we could certainly say if it's moving through, no pun intended. And, this is also a place where I do think we can start... A lot of this is like, how long is it going to be until I say deep learning? Here it comes. Deep learning's really good at pattern recognition. You could convince me that you could build a pattern recognition system that could look for novel viral pathogens that could be a risk. And we could set that up in a waste water surveillance system, and start the screening process going. So, those are the kinds of... And maybe let's just think about this as an analogy, but these are the kinds of things that we need to be thinking about. Low cost, easy to implement, passive. In this case, it's not maybe as critical that the false positive rate be super low, so long as we have enough people in the middle to be able to kind of make some executive decisions around whether we care about the alerts or not.
Michael: So, low cost, passive decision support systems that work as early warning?
Samuel Scarpino: That's right.
Michael: We're thinking about this the same way. I think security is really similar too, there's a lot of metadata. And that's what waste water is: it's metadata about individuals that will generate great results and great predictions, in our case about vulnerabilities, without identifying any individual company, organization, or machine. But that metadata has largely been discarded by the security industry because, I mean, just like in healthcare, you can't medically bill for metadata. You can't remediate a piece of metadata about an organization. But it turns out to be a really great leading indicator. This is awesome, because you've just described the other side of John Snow's cholera investigation. Instead of the drinking water, it's the waste water. And you're using that as a signal for the spread.
Samuel Scarpino: Well, what's really interesting is that the individual who I'm collaborating with on this at Northeastern University, Professor [inaudible], his research area is drinking water, primarily. He's retooled a bit to work on waste water with COVID, but one of the things you might imagine is like, what if we had passive surveillance systems set up in Flint, Michigan looking for lead contamination in the drinking water? And even if we didn't have low cost systems for looking for lead contamination, I'm sure that the microbial community in the water is different if there's a heavy lead contamination than if there isn't, right? And so, that's the other reason why, with these high volume of data, low cost systems, you can play the pattern recognition game, right? Even if I couldn't tell you that there was a lead problem in Flint, I could surely tell you that the water for the last six months, and hopefully this would have been 10 years ago, but the water for the last six months doesn't look like the water used to look six months before that, and we need to have somebody come out and figure out why.
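The "water doesn't look like it used to look" idea is essentially baseline-drift detection: you don't need to know what's wrong, only that recent readings differ from a historical window. A minimal sketch, with invented readings and an arbitrary z-score threshold:

```python
# Hedged sketch of baseline-drift detection. Readings and the
# z-score threshold are illustrative, not a real monitoring system.
from statistics import mean, stdev

def drifted(baseline, recent, z_threshold=3.0):
    """Return True if the recent window's mean sits more than
    z_threshold baseline standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(recent) != mu
    return abs(mean(recent) - mu) / sigma > z_threshold

# Six months of "normal" microbial readings vs. two recent windows.
baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 10.0, 9.9]
print(drifted(baseline, [10.0, 10.1, 9.9]))   # False: water looks the same
print(drifted(baseline, [14.8, 15.2, 15.0]))  # True: send someone to figure out why
```

The alert only says "this doesn't look like six months ago"; diagnosing the cause (lead, a pathogen, a sensor fault) is a separate, human step, which matches the "people in the middle" role Sam describes.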
Dan: Interesting. Well, and Michael, I mean, you kind of kicked off with this whole, "What's the best bang for the buck." But it seems like the whole metadata approach to cyber security, like you said, most companies kind of discard it, right? Threat intel's more fun, right? We're looking for IOCs, we want to go threat hunting. Let's do the sexy stuff. What is the not sexy stuff in the metadata that has given you more insight when you're looking at trying to make predictions?
Michael: That's a great question. I think the number of IDS signatures for a particular vulnerability. Or even the number of people writing about that vulnerability early on, the number of references, links, things like that. That metadata is often more indicative than anything about the actual vulnerability, like the code execution or the type of vulnerability that it is. And that contradicts all conventional wisdom. You would think that a Microsoft vulnerability that behaves in this particular way is what attackers are going after, but it turns out that a much better predictor, a much more effective mechanism, is that passive data: what signatures are being written for it, how often are those signatures being triggered? Or, just how many links are there that mention this vulnerability across the internet on day two? That ends up being more useful.
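As a toy illustration of ranking by metadata rather than intrinsic severity: the field names, weights, and counts below are invented for the example, not Kenna's actual model.

```python
# Hypothetical sketch: score vulnerabilities by cheap metadata signals
# (signature counts, references, early mentions) instead of by the
# type of vulnerability itself. Weights are arbitrary illustrations.
def metadata_score(vuln, w_sigs=2.0, w_refs=0.5, w_mentions=0.1):
    """Combine early metadata counts into a single exploitation-risk score."""
    return (w_sigs * vuln["ids_signatures"]
            + w_refs * vuln["references"]
            + w_mentions * vuln["day2_mentions"])

vulns = [
    {"cve": "CVE-A", "ids_signatures": 0, "references": 2, "day2_mentions": 5},
    {"cve": "CVE-B", "ids_signatures": 4, "references": 12, "day2_mentions": 90},
]
ranked = sorted(vulns, key=metadata_score, reverse=True)
print([v["cve"] for v in ranked])  # ['CVE-B', 'CVE-A']: more early chatter, higher priority
```

In a real system the weights would be fit against observed exploitation data and re-fit continuously, which is exactly the "keep retraining" point that comes up later in the conversation.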
Dan: So, it starts to imply activity and/or velocity?
Michael: Yeah. I mean, it's the exact same thing as waste water, if you think about it. There's byproducts of security, those byproducts are usually hits or mentions or analysis. And just their existence in the waste water that is the internet, is often really indicative of problems forthcoming.
Samuel Scarpino: And I would guess that there's another similarity, which is if you're using molecular diagnostic tests on an individual to look for COVID-19... And I'm going to put this in scare quotes for the evolutionary biologists. If the test is highly effective, the virus is going to, quote, "patch" and will start to evade those tests, right? So, if you have a molecular diagnostic test, and if individuals get diagnosed fast enough so that they are quarantined, they will transmit the virus on to fewer individuals. If there is a mutational variant of the virus that cannot be detected by the test, that virus will spread more than other viruses and will sweep through the population, replacing the virus that could be detected by the test, rendering the test ineffective.
Dan: So like, a natural resistance?
Samuel Scarpino: That's right, yeah. You could think about it as the evolutionary resistance to antibiotics; the evolutionary resistance to anything that interrupts transmission could be the target of the evolution of resistance. So, on the infectious disease side, one of the benefits of the metadata surveillance is that it is more likely to be effective for longer periods of time. One problem is it's not effective for intervention, at least in the way I've described it. It's much more effective for detection, surveillance, what we refer to as situational awareness on the public health side. And so, there are trade-offs, right? So, one reason you don't use antibiotics too often is you don't want to have resistance evolve. One reason we might not want to use too much in the way of active testing is you don't want resistance to evolve, and so you want a mix between this high throughput, always on passive surveillance that we're less concerned about resistance evolving to, with your super powered interventions, where you want to use them and have them be highly effective. And obviously, molecular diagnostics is somewhere in the middle between waste water surveillance and antibiotics. We can deploy pretty broad scale molecular diagnostics, we have lots of ways of avoiding resistance by targeting multiple parts of the genome, et cetera. But I would assume that it's probably a similar analogy in the security space, with the benefits of leveraging metadata.
Michael: They just leave the systems vulnerable that hold the patterns that aren't that valuable. And then let attackers continue to harvest those.
Samuel Scarpino: Or, you make them look really valuable, right? So, you train... And here's the deep learning coming again. You train a neural network to write file names that sound valuable, fill up hard drives with those, and just leave them on the side of the road, so to speak, and see what happens.
Michael: That is a strategy. Honey pots have been around for a while; banks even set up accounts with large dollar amounts that don't actually belong to anybody, just to see if people will start to siphon that money or wire transfer it out. Yeah, that makes perfect sense. I mean, you also want to be... You don't want to change the system too much. It becomes a wicked problem if you're constantly telling attackers to shift their mindset and innovate faster. And then you're moving in a very quick [inaudible] that you might not want to be moving in. You might instead want to say, "Great, Adobe Reader is continuously vulnerable to some things. We can build mitigations on the back end and leave this vulnerability open, because we can detect it passively."
Samuel Scarpino: Well, that's the other reason why we need continuous, always on data: even if we've built a model that gets the causality right today, the causality might not be the same tomorrow. And that's because these systems are evolving. Either because they're actually evolving by natural selection, as happens with pathogens like COVID, or they're evolving by analogous methods in terms of the predator-prey dynamics that happen between attackers and defenders in the cyber security space. But because, as you said, the rules are shifting, the landscape is shifting, your understanding from yesterday doesn't necessarily mean that you're going to understand what's going to happen tomorrow. And you have to continue to gather data, retrain, reevaluate, and adapt as the system shifts. And so, if anything, one of the things that we need to do a better job of on the epi side, and this is work that we've been very active in, and I suspect is something that Kenna and you are very active in, Michael, given your philosophy around data, is that part of the value proposition around data is that if we can be sure of one thing, it's that tomorrow will be different from today. And if we're not gathering data, we're just not going to be prepared.
Michael: Yep. I think until you said, "on the epi side," it just sounded like a pitch for the kind of data analysis that I think is right in security, too. If you score a vulnerability once, that's not very useful to anyone, because by the time an IT operator is actually looking at that system, everything has changed. The attacker behavior has changed, the pay-off functions, what that vulnerability could be paired with, what intelligence exists about it. The data pipeline needs to be real time, that passive and reactive collection, and your model needs to be real time and reactive too, to global changes and what people are doing. You might have had an amazing model in 2004, but ransomware didn't exist. And so, the best threat intelligence, the best model built back then, isn't going to give you great predictions today.
Samuel Scarpino: No, I guess it's like don't listen to anyone over 40, that's not a Forbes 30 under 30, or that didn't train their model in the last 48 hours, right? You really have to have this continual refining process, this reevaluation, this always on surveillance, because things are going to be shifting. And in fact, the better you are at intervening, the more important it's going to be that you have these real time data, because we know that the attackers are going to find a route. Because as you said, going back to the virus analogy, the cost is so low. Right? And so, you're never going to be able to stamp out all possible attacks, you're never going to be able to stamp out all possible viruses. We cannot prevent the next COVID from finding its way into a public market somewhere. We can, and it is an imperative, that we prevent COVIDs in the future from becoming a pandemic. And that's where the data come in, that's where the models and the predictions come in. And I, again, expect that it's not really just an analogy on the security side, it is very, very similar in the sense that you're never going to be able to stop all attackers from trying to attack or plug all vulnerabilities. But you can certainly prevent entire... Or work very hard to prevent entire organizations from going down for weeks, because they let an attack get out of hand.
Michael: That's right. You cannot fix all vulnerabilities, but you can certainly strive to fix the ones that are actually causing risk to an organization. If you have the right models, and if you're updating them fast enough.
Dan: Absolutely. Well, I think this has been awesome, but I figured we can start to wrap it up there. In closing, I definitely recommend people follow Sam on Twitter. I'll link it on the podcast page, along with the blog that we use to promote this as well. I will also link Sam's Google Scholar page, just so you can see how much smarter he is than I am. And then, he also has a GitHub page, because apparently that's what the cool kids use these days. So, we'll link all that so you can follow Sam and see what he's up to. Other than that, check out Kenna Security and kennaresearch.com to follow the podcast. And if you need anything, feel free to reach out to any of us on Twitter. Thanks guys for joining us today.
Samuel Scarpino: Awesome. Thanks so much for having me, this was great.
Michael: Thanks, Sam.