Individual versus Program SOC Metrics
The other day, a member of SANS’s GIAC Advisory Board asked a question about Mean Time to Detection, or MTTD. In the ensuing discussion, another member cautioned against confusing key performance indicators used to assess individual efficiency, which then serve as the basis for penalties and rewards, with measures of program efficiency. Holding individuals accountable for MTTD, for example, makes little sense when detection relies on factors outside of their control, such as effective data collection and efficient aggregation into a centralized platform, each of which is likely managed by a separate department. I alluded to the challenge of implementing metrics in part one, but did not deal with this issue specifically. This bonus article provides specific recommendations for applying SOC metrics to individuals versus an entire program.
In part one of the SOC Metrics series, I explained the dangers of introducing metrics and argued that the true problem is not the metrics themselves, but rather poor management:
“... few use metrics in general — and SOC metrics in particular — well. This series does not ignore the potential downsides of introducing metrics into an organization not ready for them. As Mick Douglas alluded to in Rapid-er Incident Response: How Fast Should You Go?, organizations that prioritize speed over effectiveness just rush to failure. The root of this problem, however, lies in poor management. Any discussion of SOC metrics must separate their value as enablers of the positives I described in [part one] from the downsides that result from their improper implementation.”
In more specific terms, when a SOC manager sees a high Time to Investigate and above-average false negatives, does he or she see a lazy, incompetent analyst? Or does that manager see someone who could benefit from training (were they hired or promoted into a role for which they lacked sufficient knowledge and experience?) or a one-on-one (are outside factors, such as trouble at home, affecting their work?)? Those in the former camp will struggle to run an effective SOC regardless of the metrics in use, but even those in the latter camp will struggle when they apply program-level metrics to individual analysts. Even the best SOC metrics cannot mitigate the impact of bad management, and even good management will struggle to overcome inappropriate metrics. The list below, taken from part three, now includes tags denoting whether each metric is suitable for measuring individual performance, program performance, or both.
- Part I: Foundational Metrics
- Program: Data coverage: the percentage of the environment the SOC can observe: `Monitored / (Monitored + Unmonitored + Shadow IT + Rogue devices)` (see the first sketch after this list)
- Program: Data quality: a measure of the SOC's ability to detect malicious activity occurring within its environment.
- Program: MITRE ATT&CK coverage: the percentage of techniques the SOC has the ability to detect.
- Program: Presence: a binary, whether or not the SOC is receiving a particular data feed at any given time.
- Program: Latency: a leading indicator that highlights bottlenecks before they become catastrophic failures.
- Program: Volume: a basic count of events.
- Program: Constitution: a count of unique sources, such as systems for endpoint events or sensors for network events.
- Part II: Measures of Performance
- Program: Events: a raw count of events, typically expressed over time.
- Program: Events per Feed: a raw count of events per feed.
- Program: Alerts: a raw count of alerts, typically expressed over time.
- Program: Alerts per Feed: a raw count of alerts per feed.
- Program & Individual: Alert Latency: the amount of time from alert to escalation as an incident or closure as a false positive.
- Program: Time Claimed: the amount of time claimed per alert.
- Program: Incidents: a raw count of incidents, typically expressed over time.
- Program: Incidents per feed: a raw count of incidents per feed.
- Program: Investigations: a raw count of investigations, typically expressed over time.
- Program: Investigations per feed: a raw count of investigations per feed.
- Program: Remediations: a raw count of remediations, typically expressed over time.
- Program: Time to Detection (TTD): the amount of time from the earliest evidence of related activity to the start of an investigation; also called the adversary's "dwell time".
- Program & Individual: Time to Investigate (TTI): the amount of time from the start of an investigation to its conclusion.
- Program & Individual: Time to Remediate (TTRem): the amount of time from the end of an investigation, which should automatically trigger some sort of remediation procedure, to fully remediated.
- Program: Time to Resolution (TTRes): the amount of time from the earliest evidence of related activity to fully remediated; `TTRes = TTD + TTI + TTRem` (see the timing sketch after this list).
- Program: Time to Assess Exposure: the amount of time to sweep an environment for a specific vulnerability or particular configuration that exposes the organization.
- Program: Time to Onboard: the amount of time to onboard a new data feed into the SOC's SIEM.
- Part III: Measures of Effectiveness
- Individual: Alert Resolution: alert resolution by category over time.
- Program: Investigation Resolution: investigation resolution by category over time.
- Individual: Classification (see the classification sketch after this list):
- True Positives: the percentage of alerts that correctly identified malicious activity, also called an incident.
- False Positives: the percentage of alerts that flagged benign activity as malicious.
- False Negatives: the percentage of malicious activity that alerts failed to flag.
- Program: Root Cause Remediation (RCR): the percentage of investigations in which the root cause of the compromise was remediated.
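To make the data coverage formula concrete, here is a minimal sketch in Python. The asset counts are hypothetical; in practice they would come from an asset inventory, the SIEM's source list, and discovery scans.

```python
# Minimal sketch of the data coverage calculation from Part I.
# The counts below are invented for illustration.
monitored = 1450    # assets forwarding telemetry to the SIEM
unmonitored = 230   # known assets with no data feed
shadow_it = 60      # assets discovered outside the sanctioned inventory
rogue = 10          # devices that should not be on the network at all

data_coverage = monitored / (monitored + unmonitored + shadow_it + rogue)
print(f"Data coverage: {data_coverage:.1%}")  # -> Data coverage: 82.9%
```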
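The time-based measures of performance from Part II compose in the same way: TTRes is simply the sum of TTD, TTI, and TTRem. The sketch below illustrates that decomposition for a single, hypothetical incident; the timestamps and variable names are illustrative, not drawn from any particular SIEM or case-management tool.

```python
# Minimal sketch of the Part II timing decomposition for one incident.
from datetime import datetime

earliest_evidence   = datetime(2023, 3, 1, 8, 15)   # first related activity observed
investigation_start = datetime(2023, 3, 1, 14, 5)   # alert escalated, analyst begins work
investigation_end   = datetime(2023, 3, 1, 16, 40)  # investigation concluded
fully_remediated    = datetime(2023, 3, 2, 9, 30)   # remediation procedure finished

ttd   = investigation_start - earliest_evidence     # Time to Detection (dwell time)
tti   = investigation_end - investigation_start     # Time to Investigate
ttrem = fully_remediated - investigation_end        # Time to Remediate
ttres = ttd + tti + ttrem                           # Time to Resolution

# Because the intervals are contiguous, TTRes equals the end-to-end interval.
assert ttres == fully_remediated - earliest_evidence
print(f"TTD={ttd}, TTI={tti}, TTRem={ttrem}, TTRes={ttres}")
```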
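Finally, the classification percentages from Part III can be computed per analyst from alert dispositions. The sketch below assumes each closed alert is labeled as a true positive, false positive, or false negative (here treated as an alert dismissed as benign that later proved malicious); the analysts and dispositions are invented for illustration.

```python
# Minimal sketch of per-analyst classification rates from Part III.
from collections import Counter

# Hypothetical (analyst, outcome) pairs for closed alerts.
dispositions = [
    ("alice", "tp"), ("alice", "fp"), ("alice", "tp"), ("alice", "fn"),
    ("bob",   "fp"), ("bob",   "fp"), ("bob",   "tp"),
]

per_analyst: dict[str, Counter] = {}
for analyst, outcome in dispositions:
    per_analyst.setdefault(analyst, Counter())[outcome] += 1

for analyst, counts in per_analyst.items():
    total = sum(counts.values())
    rates = {outcome: counts[outcome] / total for outcome in ("tp", "fp", "fn")}
    print(analyst, {k: f"{v:.0%}" for k, v in rates.items()})
# alice {'tp': '50%', 'fp': '25%', 'fn': '25%'}
# bob   {'tp': '33%', 'fp': '67%', 'fn': '0%'}
```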
Most of the measures presented in the SOC Metrics series, then, should be applied to the program rather than the individual. Alert latency, time to investigate, time to remediate, alert resolution, and classification may be used to assess individual performance, but the majority seek to arm decision makers with an understanding of the program’s performance.