Increase in Source Code (Git) Downloads

Description

Find users who have downloaded more files from Git than is normal for them.


Use Case

Insider Threat, Advanced Threat Detection

Category

Data Exfiltration

Security Impact

As with some of the other examples, such as Increase in Pages Printed, the behavior of users with access to sensitive intellectual property like source code should be monitored for patterns of data exfiltration. Developers will always interact with source code repositories like Git, but a statistically significant increase in their access may indicate that source code is being exfiltrated. It is particularly useful to correlate this behavior with a watchlist containing the user IDs of personnel who are considered higher risk: contractors, new employees, employees who never go on vacation, and employees with access to particularly sensitive source code.
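The watchlist correlation described above can be sketched in a few lines of Python. This is only an illustration, not part of the search itself: the function name `prioritize`, the shape of the watchlist (a mapping of user ID to a risk reason), and the sample user IDs are all assumptions made for the example.

```python
def prioritize(spiking_users, watchlist):
    """Annotate flagged users with watchlist context and list watchlisted users first.

    spiking_users: user IDs flagged by the spike search (hypothetical input).
    watchlist: dict mapping user ID -> risk reason (hypothetical format).
    """
    annotated = [
        {
            "user": user,
            "on_watchlist": user in watchlist,
            "reason": watchlist.get(user),  # None when the user is not listed
        }
        for user in spiking_users
    ]
    # sorted() is stable, so the original order is kept within each group.
    # False sorts before True, hence the negation to put watchlisted users first.
    return sorted(annotated, key=lambda row: not row["on_watchlist"])


if __name__ == "__main__":
    flagged = ["dev1", "contractor7", "dev2"]
    for row in prioritize(flagged, {"contractor7": "contractor"}):
        print(row)
```

Sorting rather than filtering keeps every flagged user visible while drawing attention to the higher-risk accounts first.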

Alert Volume

High

SPL Difficulty

Hard

Journey

Stage 3

MITRE ATT&CK Tactics

Collection

MITRE ATT&CK Techniques

Data from Information Repositories

MITRE Threat Groups

APT28
Ke3chang

Kill Chain Phases

Actions on Objectives

Data Sources

Web Server

   How to Implement

Implementation of this example (or any of the Time Series Spike / Standard Deviation examples) is generally straightforward.

  • Validate that you have the right data onboarded, and that the fields you want to monitor are properly extracted. If the base search you see in the box below returns results, the data is in place.
  • Save the search to run over a long period of time (recommended: at least 30 days).

For most environments, these searches can be run once a day, often overnight, so a slow search is rarely a concern. If you wish to run this search more frequently, or if it is too slow for your environment, we recommend using a summary index that first aggregates the data. We will have documentation for this process shortly; for now, you can consult existing descriptions of Summary Indexing.
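The idea behind aggregating first can be illustrated outside Splunk with a small Python sketch: each day's raw events are collapsed once into per-user counts, and later analysis reads only that compact roll-up. This is not how a Splunk summary index is configured; the function names, the `{day: {user: count}}` layout, and the `(user, timestamp)` event shape are assumptions made for the example.

```python
from collections import Counter
from datetime import date, datetime


def summarize_day(raw_events, day):
    """Collapse one day's raw (user, timestamp) events into per-user counts."""
    return Counter(user for user, ts in raw_events if ts.date() == day)


def update_summary(summary, raw_events, day):
    """Store the roll-up for `day` in `summary`, laid out as {day: {user: count}}.

    Re-running for the same day overwrites the entry instead of double counting.
    """
    summary[day] = dict(summarize_day(raw_events, day))
    return summary


def user_history(summary, user):
    """Return one user's daily counts, oldest first, reading only the summary."""
    return [(day, summary[day].get(user, 0)) for day in sorted(summary)]


if __name__ == "__main__":
    raw = [
        ("alice", datetime(2024, 1, 1, 9)),
        ("alice", datetime(2024, 1, 1, 10)),
        ("bob", datetime(2024, 1, 1, 11)),
        ("alice", datetime(2024, 1, 2, 9)),
    ]
    summary = {}
    update_summary(summary, raw, date(2024, 1, 1))
    update_summary(summary, raw, date(2024, 1, 2))
    print(user_history(summary, "alice"))
```

Because the expensive pass over raw events happens once per day, the statistics can then be recomputed as often as needed over a much smaller dataset.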

   Known False Positives

This is a strictly behavioral search, so we define "false positive" slightly differently. Every time this search fires, it accurately reflects a spike in the number we're monitoring; it's nearly impossible for the math to lie. But while there are really no "false positives" in the traditional sense, there is definitely a lot of noise.

This search will generate a large amount of noise because source code access is bursty by nature. When someone first clones a repository, for example, their download count will far exceed their usual baseline. This search provides contextual data to record when these big bursts of activity occur.

   How to Respond

When this search returns values, initiate your incident response process and validate the user account that accessed the specific repos. Contact the user and their manager to determine whether the activity was authorized, and record who authorized it. If it was not authorized, the user's credentials may have been used by another party, and additional investigation is warranted because repositories hold sensitive source code.

Note: We include an accelerated version to show how this would work, but there is no data model for this out of the box, so you would need to build one yourself.

   Help

Increase in Source Code (Git) Downloads Help

This example leverages the Detect Spikes (standard deviation) search assistant. Our demo dataset is an anonymized collection of source code checkout logs from a Git server. For this analysis, we track the total number of files each user has downloaded per day ('count by user _time'). We then calculate the average, the standard deviation, and the most recent value, and filter out any users whose most recent value is within a configurable number of standard deviations of the average.
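The calculation described above can be sketched in Python to make the arithmetic concrete. This is a minimal illustration under stated assumptions, not the SPL that the search assistant generates: the function name `detect_spikes`, the `(user, timestamp)` event shape, the use of the sample standard deviation, the exclusion of the most recent day from the baseline, and the default threshold of 3 are all choices made for the example. Days on which a user downloaded nothing are simply absent from that user's baseline.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def detect_spikes(events, threshold=3.0):
    """Return {user: (latest, avg, sd)} for users whose latest daily count spikes.

    events: iterable of (user, timestamp) download records (hypothetical shape).
    threshold: number of standard deviations above the average that counts as a spike.
    """
    # Step 1: count downloads per user per calendar day.
    daily = defaultdict(lambda: defaultdict(int))
    for user, ts in events:
        daily[user][ts.date()] += 1

    spikes = {}
    for user, counts in daily.items():
        days = sorted(counts)
        if len(days) < 3:
            continue  # need at least two baseline days to compute a sample stdev
        # Step 2: split the most recent day from the historical baseline.
        latest = counts[days[-1]]
        history = [counts[day] for day in days[:-1]]
        avg = mean(history)
        sd = stdev(history)
        # Step 3: keep only users whose latest value exceeds the upper bound.
        if latest > avg + threshold * sd:
            spikes[user] = (latest, avg, sd)
    return spikes


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    sample = []
    for offset, n in enumerate([4, 5, 6, 5, 50]):
        sample += [("alice", start + timedelta(days=offset, minutes=i)) for i in range(n)]
    print(detect_spikes(sample))
```

Here "most recent" means each user's own latest active day, which is a simplification; a scheduled search would more likely compare against the latest day in the search window.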

SPL for Increase in Source Code (Git) Downloads

Demo Data

First we pull in our demo dataset.
Bucket (aliased to bin) allows us to group events based on _time, effectively flattening the actual _time value to the same day.
Next, we count and aggregate per user, per day.
Calculate the mean, standard deviation, and most recent value.
Calculate the bounds as a multiple of the standard deviation.

Live Data

First we pull in our Atlassian Git dataset.
Bucket (aliased to bin) allows us to group events based on _time, effectively flattening the actual _time value to the same day.
Next, we count and aggregate per user, per day.
Calculate the mean, standard deviation, and most recent value.
Calculate the bounds as a multiple of the standard deviation.

Accelerated with Data Models

Here, tstats pulls a very fast count per user, per day, in a single command.
(self-explanatory)
Calculate the mean, standard deviation, and most recent value.
Calculate the bounds as a multiple of the standard deviation.