Analytic Methods

Time Series Searches

This method of anomaly detection tracks numeric values over time and looks for spikes. Using the standard deviation functions in the stats command, you can find data samples many standard deviations away from the average, allowing you to identify outliers over time. For example, use a time series analysis to identify spikes in the number of pages printed per user, the number of interactive logon sessions per account, or any other statistic where a spike would indicate suspicious behavior.

The time series analysis is also performed on a per-entity basis (e.g., per user, per system, per file hash), leading to more accurate alerts. It is more helpful to know that a user printed more than three standard deviations above their personal average than to alert whenever anyone prints more than 150 pages. Using time series analysis with the Splunk platform, you can detect anomalies accurately.

The time series searches address use cases that detect spikes, such as “Increase in Pages Printed” or “Healthcare Worker Opening More Patient Records Than Normal” or any other use case you might describe with the word “Increase.”
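As a minimal sketch of the stats-based approach described above (the index, sourcetype, and field names are placeholder assumptions, not the app's shipped searches), a spike in pages printed per user could be detected along these lines:

    index=printer sourcetype=print_logs earliest=-30d@d
    | bin _time span=1d
    | stats sum(pages) as pages_printed by user, _time
    | stats avg(pages_printed) as avg_pages, stdev(pages_printed) as stdev_pages, latest(pages_printed) as latest_pages by user
    | where latest_pages > avg_pages + 3 * stdev_pages

The final where clause flags only users whose most recent daily total is more than three standard deviations above their own average, matching the per-entity approach described above.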

Large Scale Version of Time Series Searches

In a large-scale environment, use summary indexing for searches of this type. The app allows you to save any time series use case in two ways: a standard version that runs directly against the raw events, and a High Scale / High Cardinality version that uses summary indexing.

For the High Scale / High Cardinality versions, the app schedules two searches. One search aggregates activity every day and stores that daily summary in a summary index. The second search performs the actual anomaly detection, but rather than reviewing every single raw event, it reviews the summary indexed data. This allows the detection search to analyze more data (such as terabytes instead of gigabytes) and a greater number of values (such as 300k usernames rather than 3k usernames).

For example, the small-scale version of the “Healthcare Worker Opening More Patient Records Than Normal” search runs across the entire time range, reviews the raw events to pull the number of unique patient records each healthcare worker opened per day, and then calculates the average and standard deviation, all in a single search. With the large-scale version, the first search runs every day to calculate how many patient records each worker viewed yesterday, and outputs one record per worker (username, timestamp, and number of patient records viewed) to a summary index. The second search then runs against the summary indexed data to calculate the average, standard deviation, and most recent value.
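As a hedged sketch of that two-search pattern (the summary index name, index, and field names such as patient_id are illustrative assumptions), the daily aggregation search might look like this, scheduled to run once per day over the previous day and writing its results to a summary index with the collect command:

    index=ehr_audit earliest=-1d@d latest=@d
    | stats dc(patient_id) as records_viewed by user
    | eval _time=relative_time(now(), "-1d@d")
    | collect index=summary_patient_access

In practice you can also enable summary indexing on the scheduled report itself instead of calling collect explicitly. The detection search then runs against the much smaller summary data rather than the raw events:

    index=summary_patient_access earliest=-30d@d
    | stats avg(records_viewed) as avg_viewed, stdev(records_viewed) as stdev_viewed, latest(records_viewed) as latest_viewed by user
    | where latest_viewed > avg_viewed + 3 * stdev_viewed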

Considerations for implementing the large-scale version

With lower cardinality to manage in the dataset and fewer raw records to retrieve each day, the amount of data that the Splunk platform has to store in memory is reduced, leading to better search performance and reduced indexer load.

However, summary indexing means that you have to manage two scheduled searches instead of just one. In addition, the integrity of the summary index relies on the summary-populating search never being skipped. Summary indexed data also takes up storage space on your indexers, though generally not very much, and it does not count against your Splunk license.

For more on how to use summary indexing to improve performance, see http://www.davidveuve.com/tech/how-i-do-summary-indexing-in-splunk/.

First Time Seen Searches

First time analysis detects the first time that an action is performed. This helps you identify out-of-the-ordinary behavior that could indicate suspicious or malicious activity. For example, service accounts typically log in to the same set of servers. If a service account logs in to a new device one day, or logs in interactively, that new behavior could indicate malicious activity. You typically want to alert on first time behavior only if the first time that activity was seen falls within the last 24 hours.

You can also perform first time analysis based on a peer group with this app. Filter out activity that is new for a particular person, but not for the people in their group or department. For example, if John Seyoto hasn’t checked out code from a particular git repo before, but John’s teammate Bob regularly checks out code from that repo, that first time activity might not be suspicious.

Detect first time behavior with the stats command and its first() and last() functions. Integrate peer group first-seen activity using eventstats. In the app, the demo data compares against the most recent value from latest() rather than against “now,” because events do not flow into the demo data in real time, so there is no meaningful value for “now.”
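A minimal sketch of both ideas, assuming interactive service account logons in Windows event data and a hypothetical user_to_department lookup (the index, field names, and lookup are illustrative, and earliest()/latest() are used here as the time-based equivalents of first() and last()):

    index=wineventlog EventCode=4624 Logon_Type=2 user=svc_* earliest=-30d@d
    | stats earliest(_time) as first_seen, latest(_time) as last_seen by user, dest
    | lookup user_to_department user OUTPUT department
    | eventstats min(first_seen) as peer_first_seen by department, dest
    | where first_seen >= relative_time(now(), "-1d@d") AND peer_first_seen >= relative_time(now(), "-1d@d")

The eventstats line adds the earliest time anyone in the same department logged on to that destination, so the final where clause keeps only activity that is new for the user and also new for their peer group.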

The ability to detect first time seen behavior is a major feature of many security data science tools on the market, and you can replicate it with these searches in the Splunk platform out of the box, for free.

The first time seen searches address use cases that detect new values, such as the “First Logon to New Server” or “New Interactive Logon from Service Account” or any other search with “New” or “First” in the name.

Large Scale Version of First Time Seen Searches

In a large-scale deployment, use caching with a lookup for searches of this type. If you select a lookup from the “(Optional) Lookup to Cache Results” dropdown, the app automatically configures the search to use that lookup to cache the data. If you leave the value set to “No Lookup Cache,” the search runs over the raw data.

For example, to detect new interactive logons by service account, you would need to run a search against raw events with a time window of 30, 45, or even 100 days. The search might run against several tens of millions of events, and depending on the performance you expect from the search, it might make sense to cache the data locally.

The more performant version of these searches relies on a lookup to cache the historical data. The search then runs over just the last 24 hours, adds the historical data from the lookup, recomputes the earliest and latest times for each combination, updates the cache in the lookup, and finds the new values.
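A minimal sketch of that caching flow, assuming a hypothetical lookup file named interactive_logon_cache.csv and the same illustrative field names as above:

    index=wineventlog EventCode=4624 Logon_Type=2 user=svc_* earliest=-1d@d
    | stats earliest(_time) as earliest, latest(_time) as latest by user, dest
    | inputlookup append=true interactive_logon_cache.csv
    | stats min(earliest) as earliest, max(latest) as latest by user, dest
    | outputlookup interactive_logon_cache.csv
    | where earliest >= relative_time(now(), "-1d@d")

Only the last 24 hours of raw events are searched; inputlookup append=true merges in the cached history, the second stats recomputes the true earliest and latest times per combination, outputlookup writes the refreshed cache back, and the final where clause keeps only combinations first seen in the last day. Note that the lookup file needs to exist (even empty) before the first scheduled run, or the inputlookup step adjusted for that first run.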

Considerations for implementing the large-scale version

Implementing historical data caching can improve performance. For a baseline comparison of 100 days of data, and assuming that some of that data is in cold storage, historical data caching could improve performance by up to 100 times.

Relying on a cache also means storing a cache. The caches are stored in CSV lookup files on the search head. The more unique combinations of data that need to be stored, the more space needed on the search head. If a lookup has 300 million combinations to store, that lookup file can take up 230MB of space. If you implement the large-scale version of the searches, ensure that there is enough available storage on the search head for the lookups that provide historical data caching for these searches.

Lookups in this app are excluded from bundle replication to your indexers. This prevents your bundles from getting too large and helps maintain the reliability of your Splunk deployment. However, if you move the searches or lookups to a different app, the configurations in this app no longer protect them. In that case, make sure you replicate the settings in distsearch.conf so that those lookups are not distributed to the indexers. The risk of large bundles is that changes take longer to go into effect, and in extreme cases indexers can even be taken offline because their bundles are too far out of date.
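For reference, excluding a large cache lookup from bundle replication on the search head looks roughly like the following distsearch.conf stanza; the app directory and file name are placeholders, and the pattern is matched against paths relative to $SPLUNK_HOME/etc:

    [replicationBlacklist]
    exclude_detection_caches = apps/my_detection_app/lookups/*_cache.csv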

General Splunk Searches

The remainder of the searches in the app are straightforward Splunk searches. The searches rely on capabilities readily available on the Splunk platform, such as the URL Toolbox app to calculate the Shannon entropy of URLs, Levenshtein distance to identify filename mismatches, the transaction command to group related events, and more. They typically don’t require a historical baseline of data, so they can easily run over just the last half hour of data. You can get the most value from these searches by copying the raw search strings into your Splunk deployment and starting to use them.
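For instance, assuming the free URL Toolbox app is installed (its ut_shannon macro and the entropy threshold below are assumptions for illustration), a high-entropy URL search can be as simple as:

    index=proxy sourcetype=web_proxy url=*
    | `ut_shannon(url)`
    | where ut_shannon > 4.5
    | table _time, src, url, ut_shannon

Searches like this need no baseline, so you can run them over a short, recent window and tune the threshold to your own data.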