SOC Metrics, Part III: Measures of Effectiveness

By Zac Szewczyk on 2022/05/07 10:27:01 EST in Cybersecurity

In part one of the SOC Metrics series, I introduced the idea that success requires the right person doing the right things in the right ways. That article also described several foundational metrics, ways to measure the SOC’s ability to produce meaningful results. Part two then focused on measures of performance (MOPs), which assess whether or not (and to what degree) the SOC is doing the right things. This article delves into measures of effectiveness, the last step in defining useful SOC metrics.

As in the last article, I based my selection of metrics on extensive research and on personal operational experience. I also include several sources for other metrics at the end of this article in the resources section.

Measures of Effectiveness #

Again, recall that definitions of MOPs and MOEs from JP 5-0: Joint Planning underlie this series. JP 5-0 defines a measure of performance as “an indicator used to measure a friendly action that is tied to measuring task accomplishment”, and a measure of effectiveness as “an indicator used to measure a current system state, with change indicated by comparing multiple observations over time.” MOPs concern themselves with friendly action (the right things), while MOEs concern themselves with those actions’ ability to change the system (the right things done in the right ways). This article describes several SOC-specific measures of effectiveness, selected for their ability to help answer the question, “Is the SOC doing the right things in the right ways?” Critically, while efficacy tests how well something works if done perfectly, effectiveness tests how well something works based on how we’re told to do it. These measures of effectiveness seek to assess how well security operations work based on how we do them, not necessarily to uncover the optimal state.

Recall also that “right”, here, depends on the SOC’s purpose, and that this series assumes that the SOC’s purpose is to efficiently detect, thoroughly investigate, and effectively remediate malicious activity. MOPs help ensure friendly actions support that goal, and MOEs help ensure those actions actually achieve it. It is not enough that the SOC “go through the motions”, and these measures of effectiveness serve as a guardrail against that.

As in part two, in the interest of time and since many of these measures require less explanation than the ones I described in part one, I erred on the side of brevity here. Subsequent articles may cover some of these in greater detail.

Alert Resolution #

Tracking alert volume over time helps understand the SOC’s allocation of resources, how its personnel spend their time, but that practice does not capture how effectively those analysts process those alerts. Once, during an incident response, I discovered that the local administrator had received so many alerts for Mimikatz that he marked them all as false positives, cleared all the alerts, and then disabled the rule. In order to be effective, a SOC should also track alert resolution using categories such as those described in Desiree Sacher’s paper Improving Security Incident Quality in SOCs with Resolution Categories. In the case of that administrator, the tens of alerts he classified as “false positives” would have caused an unusual spike in that category, which his manager could have then investigated.
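As a minimal sketch of what that manager's check might look like, the Python below counts closed alerts per period and resolution category, then flags any category whose count jumps well above its own history. The function names, the `factor` and `min_count` thresholds, and the category labels are all illustrative assumptions; they are not taken from Sacher's paper or from any particular SIEM.

```python
from collections import Counter, defaultdict
from statistics import median


def resolution_counts(alerts):
    """Count closed alerts per (period, category).

    `alerts` is an iterable of (period, category) tuples, for example
    ("2022-W17", "false_positive"). Both labels are placeholders.
    """
    counts = defaultdict(Counter)
    for period, category in alerts:
        counts[period][category] += 1
    return counts


def find_spikes(counts, factor=3.0, min_count=10):
    """Flag categories whose count is at least `factor` times the median
    of that category's counts in all earlier periods.

    Periods with no history, or with fewer than `min_count` alerts in the
    category, are skipped to avoid flagging noise.
    """
    spikes = []
    periods = sorted(counts)
    for i, period in enumerate(periods):
        for category, n in counts[period].items():
            history = [counts[p][category] for p in periods[:i]]
            if not history or n < min_count:
                continue
            baseline = median(history)
            # Use a floor of 1 so a zero baseline does not flag everything.
            if n >= factor * max(baseline, 1):
                spikes.append((period, category, n, baseline))
    return spikes
```

A median baseline is only one choice; a rolling mean or a percentage-of-total threshold would work as well. The point is that the spike becomes visible to someone other than the person closing the alerts.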

Investigation Resolution #

For the same reason it is important to track alert resolution by category, it is important to track investigation resolution by category as well. CJCSM 6510.01B: Cyber Incident Handling Program (https://www.jcs.mil/Library/CJCS-Manuals/) details a basic categorization scheme on page B-A-1 (page 51). Again, look for conspicuous concentrations here.

I also find it helpful to track the number of investigations by category over time. Some organizations have a tendency to use “Category 8: Investigating” as the catch-all bin for anything without a clear answer, which means it only grows over time. Alternatively, an inexperienced analyst might have a tendency to close investigations as “Category 5: Unsuccessful Activity Attempt” without performing the due diligence to reasonably support that conclusion.
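Both failure modes can be checked with very little code. The sketch below is a rough illustration under assumed inputs: `category_shares` shows how concentrated closures are in each category, and `only_grows` tests whether a series of open counts for a single category (such as a catch-all bin) ever shrinks. The function names and input shapes are hypothetical.

```python
from collections import Counter


def category_shares(categories):
    """Return each category's share of all investigations, as a fraction.

    `categories` is an iterable of category labels, one per investigation.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


def only_grows(open_counts):
    """Return True if a series of per-period open counts never shrinks
    and ends higher than it started.

    `open_counts` is a list of integers ordered oldest to newest.
    """
    if len(open_counts) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(open_counts, open_counts[1:]))
    return never_shrinks and open_counts[-1] > open_counts[0]
```

A category that holds most of one analyst's closures, or a bin for which `only_grows` stays true quarter after quarter, is a prompt for a conversation rather than proof of a problem.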

Classification #

In this context, classification is the process by which investigations are arrayed into one of three categories:

True positives: the investigation was based on malicious activity.
False positives: the investigation was based on benign activity.
False negatives: the investigation failed to identify malicious activity despite its presence.

Although a fourth category exists, true negatives, it is infeasible to expect SOC analysts to prove the innocence of any activity in a sufficiently rigorous manner — especially at scale. The amorphous nature of the fifth domain, the level of abstraction inherent in the data collected from it, and the general lack of context necessary to support such a classification make this a poor use of limited SOC resources. As Matt Graeber pointed out years ago on Twitter, “A sufficiently advanced threat actor is indistinguishable from a competent system administrator.” To task an analyst to make that distinction would be to task them to do the impossible.

Classifying investigations into one of these three categories supports the SOC’s mission to thoroughly investigate malicious activity: it demonstrates rigor, to ensure that the SOC is not just processing events, alerts, and investigations, but that it is doing so well.

True Positives #

The percentage of investigations based on malicious activity. The SOC should strive for a high true positive rate to establish and maintain credibility.

False Positives #

The percentage of investigations based on benign activity. Although many consider false positives a waste of time, it is important not to try to eliminate false positives. That would mean (implicitly) accepting an irresponsibly high false negative rate. The SOC should strive to reduce false positives, but it must not reduce them to such an extent that it biases analysts away from investigating suspicious activity and in doing so increases false negatives. As Jared Atkinson explained in episode twenty-one of the Detecting Challenging Paradigms podcast, the danger of false negatives is too high:

“There’s no cap to the impact a false negative can have. The cost of a false positive is the equivalent of the amount of time an analyst spends looking into it. And so there’s a finite cost to false positives. Now, the criticism of that approach is, while the impact of false negatives is exponential and the impact of false positives is linear, the occurrence of false negatives is linear and the occurrence of false positives is exponential.” - 102:20:00.

False Negatives #

The percentage of investigations that failed to identify malicious activity despite its presence. These errors typically stem from a far too narrow rule in the case of alerts¹, or from inexperienced and unsupervised analysts in the case of investigations. Unlike false positives, the SOC should strive to eliminate false negatives through training and education.

In a Twitter thread on measuring SOC quality, Jon Hencinski recommended finding defects in the investigative process by assigning a third party (an analyst not involved in the initial classification) to review investigations using a standard, transparent, and well-reasoned process. The investigations themselves should identify true and false positives, but only this practice will identify false negatives.
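To make the relationship between peer review and these three rates concrete, here is a small sketch. It assumes each reviewed investigation records two verdicts, the original analyst's and the independent reviewer's, and it treats the reviewer's verdict as ground truth. That is a simplifying assumption, and the `Investigation` class, its field names, and the verdict labels are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Investigation:
    analyst_verdict: str   # "malicious" or "benign", from the original analyst
    reviewer_verdict: str  # "malicious" or "benign", from the independent reviewer


def classify(inv):
    """Map one reviewed investigation to a classification.

    The reviewer's verdict stands in for ground truth (an assumption).
    """
    if inv.reviewer_verdict == "malicious":
        if inv.analyst_verdict == "malicious":
            return "true_positive"
        # Malicious activity was present but the analyst missed it.
        return "false_negative"
    # The investigation was based on benign activity.
    return "false_positive"


def classification_rates(investigations):
    """Return the percentage of reviewed investigations in each class."""
    labels = Counter(classify(inv) for inv in investigations)
    total = sum(labels.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in labels.items()}
```

Note that the false negative rate here only covers investigations that were actually reviewed, so the sampling strategy for review matters as much as the arithmetic.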

Root Cause Remediation (RCR) #

In an as-yet-unpublished paper, Understanding the Enemy: Techniques for Mapping Adversary Infrastructure, I explain the importance of identifying the root of a compromise in order to effectively remediate it:

In order to be effective, an incident response methodology must deny the adversary use of all avenues of approach: not just the domains and IP addresses the analysts initially identified, but also backup persistence mechanisms involving secondary command and control servers; in order to be effective, that response must focus not only on the original hosts, but other endpoints in the environment to which the adversary may have laterally moved as well. In MITRE’s technical report TTP-Based Hunting, authors Daszczyszak et al. called this “pulling the thread”: “To pursue a malicious hit, the hunt team should ‘pull the thread’ both backwards and forwards to find the activity which caused the hit (ideally back to the initial infection), as well as subsequent activity to determine the scope and scale of the adversary’s actions.” Compared to the traditional incident response process in which administrators block individual IP addresses, re-image hosts, and then move on, this methodology may actually enable an effective response.

Every single metric in this series, from foundational metrics to measures of performance to the measures of effectiveness described above, was chosen to support effective remediation. Measures of performance help ensure the SOC is doing the right things, measures of effectiveness help ensure the SOC is doing them in the right ways, and foundational metrics ensure those two measures mean something, and they all exist to ensure the SOC efficiently detects, thoroughly investigates, and effectively remediates malicious activity. Effective remediation depends on efficient detection and thorough investigation, which the rest of the metrics in this series assess; this final metric measures the percentage of investigations in which the root cause of the compromise was remediated.

RCR = (Investigations w/ root cause remediated) / ((Investigations w/ root cause remediated) + (Investigations w/out root cause remediated))

At first, this may seem like an impossible metric. Many organizations lack the collection, retention, and visibility to trace a compromise back in time and throughout its environment — but that is the point. Every single metric in this series was chosen to support effective remediation, the cornerstone upon which successful security programs rest; if any of them fail, so will this one.
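For illustration, the metric as defined in prose above (the percentage of investigations in which the root cause of the compromise was remediated) reduces to a few lines of Python. The function name and the boolean-per-investigation input are assumptions made for this sketch; how an organization decides that a root cause was remediated is the hard part, and the code does not address it.

```python
def root_cause_remediation(remediated_flags):
    """Return RCR as a percentage, or None when there is nothing to measure.

    `remediated_flags` is an iterable of booleans, one per closed
    investigation: True when the root cause was identified and remediated,
    False when it was not.
    """
    flags = list(remediated_flags)
    if not flags:
        return None
    # True counts as 1 and False as 0, so sum() counts remediated cases.
    return 100.0 * sum(flags) / len(flags)
```

Returning `None` for an empty period, rather than 0 or 100, keeps a quiet month from reading as either a failure or a success.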

SOC Metrics #

This series described a collection of SOC-specific metrics. For ease of reference, this list contains each of those metrics along with a brief description.

Part I: Foundational Metrics

Data coverage: the percentage of the environment the SOC can observe. Monitored / (Monitored + Unmonitored + Shadow IT + Rogue devices)
Data quality: a measure of the SOC's ability to detect malicious activity occurring within its environment.

MITRE ATT&CK coverage: the percentage of techniques the SOC has the ability to detect
presence: a binary, whether or not the SOC is receiving a particular data feed at any given time
latency: a leading indicator that would highlight bottlenecks before they become catastrophic failures
volume: a basic count of events.
constitution: a count of unique sources, such as systems for endpoint events or sensors for network events.

Part II: Measures of Performance

Events: a raw count of events, typically expressed over time.

Events per Feed: a raw count of events per feed.

Alerts: a raw count of alerts, typically expressed over time.

Alerts per Feed: a raw count of alerts per feed.
Alert Latency: the amount of time from alert to escalation as an incident or closure as a false positive.
Time Claimed: the amount of time claimed per alert.

Incidents: a raw count of incidents, typically expressed over time.

Incidents per feed: a raw count of incidents per feed.

Investigations: a raw count of investigations, typically expressed over time.

Investigations per feed: a raw count of investigations per feed.

Remediations: a raw count of remediations, typically expressed over time.
Time to Detection (TTD): the amount of time from the earliest evidence of related activity to the start of an investigation; also called the adversary's "dwell time".
Time to Investigate (TTI): the amount of time from the start of an investigation to its conclusion.
Time to Remediate (TTRem): the amount of time from the end of an investigation, which should automatically trigger some sort of remediation procedure, to fully remediated.
Time to Resolution (TTRes): the amount of time from the earliest evidence of related activity to fully remediated. TTRes = TTD + TTI + TTRem.
Time to Assess Exposure: The amount of time to sweep an environment for a specific vulnerability or particular configuration that exposes the organization.
Time to Onboard: The amount of time to onboard a new data feed into the SOC's SIEM.

Part III: Measures of Effectiveness

Alert Resolution: alert resolution by category over time.
Investigation Resolution: investigation resolution by category over time.
Classification:

True Positives: the percentage of investigations based on malicious activity, also called incidents.
False Positives: the percentage of investigations based on benign activity.
False Negatives: the percentage of investigations that failed to identify malicious activity despite its presence.

Root Cause Remediation (RCR): the percentage of investigations in which the root cause of the compromise was remediated.

Mission command is a powerful tool. Like many military concepts, it translates well to the private sector. Empowerment, however, does not absolve decision makers of their duty to steward initiatives through appropriate guidance. In fact, it makes that guidance even more important. MOPs and MOEs are the guardrails against failure — without them, success becomes a game of chance. SOC metrics are hard, though, and so for years decision makers both in and out of the military have abdicated their responsibility to define them. At least now, nearly ten thousand words later, they will have somewhere to start.

Resources #

This section lists several resources for SOC metrics, some of which were cited throughout this article. This post contained the metrics that would have been most effective in my organization as judged by my personal experience working in a SOC, but you may find these articles helpful as well.

For his GIAC Gold Certification, Kirill Filatov wrote a good paper on information security metrics titled Metrics-driven information security framework as part of information security management. While that paper focuses more on the importance of metrics than on recommending specific metrics, especially for a SOC, his arguments are sound.

¹ Another way to identify false negatives is through threat hunting, the act of searching for evidence of malicious activity in an environment. The goal of threat hunting is to uncover the malicious activity that the extant detections did not identify either due to a gap or an incomplete rule; in other words, to identify false negatives.