SQL to Pandas with Windows Events 101¶
Author: Jose Rodriguez (@Cyb3rPandah)
Project: Infosec Jupyter Book
Public Organization: Open Threat Research
License: Creative Commons Attribution-ShareAlike 4.0 International
Reference: https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html
Description¶
This notebook was created as part of the Open Threat Research (OTR) initiative to empower others in the InfoSec community around the world.
With more security analysts getting into data analysis with Python and Jupyter Notebooks, I figured it would be a good idea to create a notebook and share some of my experience with Pandas. Since many analysts in our community are familiar with SQL syntax to craft detection rules, I used it in this notebook to show how easy it is to translate SQL logic to pandas statements. I leveraged this doc “Comparison with SQL” to organize the notebook and make sure I align my queries and descriptions to official pandas resources.
Pre-requisites - Reading¶
Importing Libraries¶
Pre-requisites:
pip install pandas
pip install openhunt
import pandas as pd
import numpy as np
from openhunt import mordorutils as mu
# Do not truncate Pandas output
pd.set_option('display.max_colwidth', None)
Importing Mordor Demo Dataset¶
Every time I practice or learn something new, I make sure I work on examples that are related to the field where I would implement what I learn to. Therefore, for this notebook, I used a dataset from the Mordor Project to show you how you can use Pandas to explore a few techniques that were emulated following the ATT&CK Evals Emulation Plans.
This dataset is a subset of the Mordor ATT&CK Evals APT29 - Day 1. It includes only a few events for the purpose of this notebook. Feel free to download the full dataset on your own. Make sure you allocate enough memory to your local Jupyter Notebooks server since we are using pandas to read the JSON file (dataset).
Let’s start by using the getMordorZipFile
method from the mordorutils
module to download and decompress the dataset:
mordorUrl = 'https://github.com/OTRF/infosec-jupyter-book/blob/master/datasets/apt29_evals_day1_manual_demo.zip'
mordorFilePath = mu.getMordorZipFile(mordorUrl)
Reading JSON Mordor File¶
Next, we use the read_json
method from pandas
to read the JSON file (Decompressed dataset) that returns a DataFrame. We are going to use that variable throughout the notebook.
apt29 = pd.read_json(mordorFilePath)
apt29.head(1)
EventTime | port | Message | EventID | SourceModuleName | tags | @version | SourceName | AccountType | host | ... | MandatoryLabel | ParentProcessName | TokenElevationType | NewProcessId | ActivityID | ServiceName | ServiceFileName | ServiceAccount | ServiceStartType | ServiceType | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-05-01 22:55:29 | 60737 | Registry value set:\r\nRuleName: -\r\nEventType: SetValue\r\nUtcTime: 2020-05-02 02:55:29.635\r\nProcessGuid: {5aa8ec29-cae0-5eac-4c00-000000000400}\r\nProcessId: 3340\r\nImage: C:\windows\system32\svchost.exe\r\nTargetObject: HKLM\System\CurrentControlSet\Services\W32Time\Config\LastKnownGoodTime\r\nDetails: QWORD (0x01d6202d-0x23915fdf) | 13 | eventlog | [mordorDataset] | 1 | Microsoft-Windows-Sysmon | User | wec.internal.cloudapp.net | ... | None | None | None | None | None | None | None | None | NaN | None |
1 rows × 91 columns
Leveraging Pandas for Security from a SQL perspective¶
SELECT¶
In SQL, selection is done using a comma-separated list of columns you’d like to select. You can use the *
character to select all the available columnsin the SQL table.
SELECT Hostname, Channel, EventTime, EventID
FROM apt29
LIMIT 5;
With pandas, column selection is done by passing a list of column names to your DataFrame. We can select a few columns from the APT29 dataset as shown below:
(
apt29[['Hostname','Channel','EventTime','EventID']]
.head(5)
)
Hostname | Channel | EventTime | EventID | |
---|---|---|---|---|
0 | NASHUA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:29 | 13 |
1 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:30 | 13 |
2 | UTICA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:30 | 12 |
3 | UTICA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:31 | 13 |
4 | NASHUA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:39 | 13 |
Calling the DataFrame without the list of column names would display all columns (Similar to SQL’s *).
Calculated Column¶
In SQL, you can add a calculated column or computed column. A computed column is a virtual column that is not physically stored in the table and can use data from other columns to calculate a new value.
SELECT Hostname, Channel, EventTime, EventID,
len(Hostname) as Hostname_Length
FROM apt29
LIMIT 5;
With pandas, you can use the DataFrame.assign() method of a DataFrame to append a new column. As a proof of concept, we can add a new column named Hostname_Length to the right of our DataFrame to calculate the number of characters in the Hostname field. We are going to show another example that could provide more context to our data analysis.
(
apt29[['Hostname','Channel','EventTime','EventID']]
.assign(Hostname_Length = apt29['Hostname'].str.len())
.head(5)
)
Hostname | Channel | EventTime | EventID | Hostname_Length | |
---|---|---|---|---|---|
0 | NASHUA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:29 | 13 | 20 |
1 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:30 | 13 | 22 |
2 | UTICA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:30 | 12 | 19 |
3 | UTICA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:31 | 13 | 19 |
4 | NASHUA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:39 | 13 | 20 |
WHERE¶
Filtering in SQL is done via a WHERE clause.
SELECT Hostname, Channel, EventTime, EventID,
len(Hostname) as Hostname_Length
FROM apt29
WHERE Channel = 'Windows PowerShell'
LIMIT 5;
DataFrames can be filtered in multiple ways. According to pandas docs, the most intuitive way is by using boolean indexing. We can filter our APT29 DataFrame and show only Windows events from the Microsoft-Windows-Sysmon/Operational provider.
(
apt29[['Hostname','Channel','EventTime','EventID']]
.assign(Hostname_Length = apt29['Hostname'].str.len())
[apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational']
.head(5)
)
Hostname | Channel | EventTime | EventID | Hostname_Length | |
---|---|---|---|---|---|
0 | NASHUA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:29 | 13 | 20 |
1 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:30 | 13 | 22 |
2 | UTICA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:30 | 12 | 19 |
3 | UTICA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:31 | 13 | 19 |
4 | NASHUA.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:39 | 13 | 20 |
OR & AND Boolean Operators¶
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND). We can use an AND operator in our APT29 DataFrame and only show Sysmon events of ID 1. In this example, I also show you how to use the calculated columns concept, that we learnerd earlier, to the CommandLine field to calculate the lenght of the command line used by the new process created.
SELECT Hostname, Channel, EventTime, EventID, CommandLine
len(CommandLine) as CommandLineLength
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1
LIMIT 5;
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine']]
.assign(CommandLineLength = apt29['CommandLine'].str.len())
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
.head(5)
)
Hostname | Channel | EventTime | EventID | CommandLine | CommandLineLength | |
---|---|---|---|---|---|---|
44 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:56 | 1 | "C:\ProgramData\victim\‮cod.3aka3.scr" /S | 43.0 |
71 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:56:04 | 1 | \\?\C:\windows\system32\conhost.exe --headless --width 80 --height 25 --signal 0x54c --server 0x540 | 99.0 |
73 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:56:04 | 1 | "C:\windows\system32\cmd.exe" | 29.0 |
91 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:56:14 | 1 | powershell | 10.0 |
1222 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:57:01 | 1 | "C:\windows\system32\SearchProtocolHost.exe" Global\UsGthrFltPipeMssGthrPipe6_ Global\UsGthrCtrlFltPipeMssGthrPipe6 1 -2147483646 "Software\Microsoft\Windows Search" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)" "C:\ProgramData\Microsoft\Search\Data\Temp\usgthrsvc" "DownLevelDaemon" | 308.0 |
We can practice also combinations of AND and OR boolean operators with Sysmon registry events. Let’s use Event ID 12 (Object create and delete) and 13 (Value Set).
SELECT Channel, EventID, ProcessGuid, ProcessId, Image, TargetObject
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND (EventID = 12 OR eventID = 13)
LIMIT 5;
(
apt29[['Channel','EventID','ProcessGuid','ProcessId','Image','TargetObject']]
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational')
& ((apt29['EventID'] == 12) | (apt29['EventID'] == 13))]
.head()
)
Channel | EventID | ProcessGuid | ProcessId | Image | TargetObject | |
---|---|---|---|---|---|---|
0 | Microsoft-Windows-Sysmon/Operational | 13 | {5aa8ec29-cae0-5eac-4c00-000000000400} | 3340 | C:\windows\system32\svchost.exe | HKLM\System\CurrentControlSet\Services\W32Time\Config\LastKnownGoodTime |
1 | Microsoft-Windows-Sysmon/Operational | 13 | {47ab858c-cae1-5eac-4b00-000000000400} | 3292 | C:\windows\system32\svchost.exe | HKLM\System\CurrentControlSet\Services\W32Time\Config\LastKnownGoodTime |
2 | Microsoft-Windows-Sysmon/Operational | 12 | {6bbf237a-cb01-5eac-4d00-000000000400} | 3356 | C:\WindowsAzure\Packages\GuestAgent\WindowsAzureGuestAgent.exe | HKU\.DEFAULT\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections |
3 | Microsoft-Windows-Sysmon/Operational | 13 | {6bbf237a-cb01-5eac-4700-000000000400} | 3280 | C:\windows\system32\svchost.exe | HKLM\System\CurrentControlSet\Services\W32Time\Config\LastKnownGoodTime |
4 | Microsoft-Windows-Sysmon/Operational | 13 | {5aa8ec29-cad7-5eac-0c00-000000000400} | 728 | C:\windows\system32\lsass.exe | HKLM\System\CurrentControlSet\Services\W32Time\SecureTimeLimits\SecureTimeHigh |
Looking for new applications being executed for the first time¶
We can use what we have learned so far and look for new application in the AppCompat registery keys. This is an indicator of applications executing for the first time.
(
apt29[['EventID','ProcessId','Image','TargetObject']]
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational')
& ((apt29['EventID'] == 12) | (apt29['EventID'] == 13))
& ((apt29['TargetObject'].str.contains('.*AppCompatFlags\\\\Compatibility Assistant.*', regex=True)))]
.head(10)
)
EventID | ProcessId | Image | TargetObject | |
---|---|---|---|---|
45 | 12 | 1144 | C:\windows\system32\svchost.exe | HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store |
46 | 12 | 1144 | C:\windows\system32\svchost.exe | HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store |
14215 | 12 | 1144 | C:\windows\system32\svchost.exe | HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store |
14217 | 12 | 1144 | C:\windows\system32\svchost.exe | HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store |
14219 | 12 | 1144 | C:\windows\system32\svchost.exe | HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store |
14221 | 13 | 1144 | C:\windows\system32\svchost.exe | HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store\C:\ProgramData\victim\‮cod.3aka3.scr |
14427 | 12 | 1144 | C:\windows\system32\svchost.exe | HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store |
77148 | 12 | 8460 | C:\windows\system32\svchost.exe | HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant |
77161 | 12 | 8460 | C:\windows\system32\svchost.exe | HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store |
77162 | 13 | 8460 | C:\windows\system32\svchost.exe | HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store\C:\Windows\System32\hostui.exe |
NULL Checking¶
NULL checking is done using the notna() and isna() methods.
We can show records where CommandLine field IS NULL with the following query. This query is built on the top of previous ones.
SELECT Hostname, Channel, EventTime, EventID, CommandLine
len(CommandLine) as CommandLineLength
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1 AND CommandLine IS NULL
LIMIT 5;
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine']]
.assign(CommandLineLength = apt29['CommandLine'].str.len())
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational')
& (apt29['EventID'] == 1) & (apt29['CommandLine'].isna())]
.head(5)
)
Hostname | Channel | EventTime | EventID | CommandLine | CommandLineLength | |
---|---|---|---|---|---|---|
77736 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 23:21:29 | 1 | None | NaN |
Getting items where CommandLine field IS NOT NULL can be done with notna().
SELECT Hostname, Channel, EventTime, EventID, CommandLine
len(CommandLine) as CommandLineLength
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1 AND CommandLine IS NOT NULL
LIMIT 5;
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine']]
.assign(CommandLineLength = apt29['CommandLine'].str.len())
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational')
& (apt29['EventID'] == 1) & (apt29['CommandLine'].notna())]
.head(5)
)
Hostname | Channel | EventTime | EventID | CommandLine | CommandLineLength | |
---|---|---|---|---|---|---|
44 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:55:56 | 1 | "C:\ProgramData\victim\‮cod.3aka3.scr" /S | 43.0 |
71 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:56:04 | 1 | \\?\C:\windows\system32\conhost.exe --headless --width 80 --height 25 --signal 0x54c --server 0x540 | 99.0 |
73 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:56:04 | 1 | "C:\windows\system32\cmd.exe" | 29.0 |
91 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:56:14 | 1 | powershell | 10.0 |
1222 | SCRANTON.dmevals.local | Microsoft-Windows-Sysmon/Operational | 2020-05-01 22:57:01 | 1 | "C:\windows\system32\SearchProtocolHost.exe" Global\UsGthrFltPipeMssGthrPipe6_ Global\UsGthrCtrlFltPipeMssGthrPipe6 1 -2147483646 "Software\Microsoft\Windows Search" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)" "C:\ProgramData\Microsoft\Search\Data\Temp\usgthrsvc" "DownLevelDaemon" | 308.0 |
GROUP BY¶
In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method. groupby() typically refers to a process where we’d like to split a dataset into groups, apply some function (typically aggregation) , and then combine the groups together.
We can count and group our APT29 DataFrame by the Hostname field.
SELECT Hostname, Channel, EventTime, EventID, CommandLine, Image,
len(CommandLine) as CommandLineLength, count(*)
FROM apt29
GROUP BY Hostname
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1;
The pandas equvalent would be:
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]
.assign(CommandLineLength = apt29['CommandLine'].str.len())
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
.groupby('Hostname').size()
)
Hostname
NASHUA.dmevals.local 61
NEWYORK.dmevals.local 18
SCRANTON.dmevals.local 359
UTICA.dmevals.local 9
dtype: int64
We can use this technique and group Sysmon Event ID 1 events by the process name.
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]
.assign(CommandLineLength = apt29['CommandLine'].str.len())
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
.groupby('Image').size()
).to_frame()
0 | |
---|---|
Image | |
C:\Packages\Plugins\Microsoft.Azure.NetworkWatcher.NetworkWatcherAgentWindows\1.4.1421.1\NetworkWatcherAgent\NetworkWatcherAgent.exe | 4 |
C:\Packages\Plugins\Microsoft.Azure.Security.IaaSAntimalware\1.5.5.9\AntimalwareConfig.exe | 1 |
C:\Packages\Plugins\Microsoft.Compute.CustomScriptExtension\1.10.5\bin\CustomScriptHandler.exe | 1 |
C:\Packages\Plugins\Microsoft.Compute.JsonADDomainExtension\1.3.2\bin\JsonADDomainExtension.exe | 1 |
C:\Program Files (x86)\Google\Update\1.3.35.452\GoogleCrashHandler.exe | 2 |
... | ... |
C:\Windows\Temp\python.exe | 5 |
C:\Windows\Temp\sdelete64.exe | 3 |
C:\Windows\WinSxS\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_10.0.18362.710_none_5f52d84058d0677f\TiWorker.exe | 3 |
C:\Windows\explorer.exe | 1 |
C:\Windows\servicing\TrustedInstaller.exe | 3 |
106 rows × 1 columns
In the previous example, I used size() and not count(). This is because count() applies the function to every column and the result does not consider null records.
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]
.assign(CommandLineLength = apt29['CommandLine'].str.len())
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
.groupby('Hostname').count()
)
Channel | EventTime | EventID | CommandLine | Image | CommandLineLength | |
---|---|---|---|---|---|---|
Hostname | ||||||
NASHUA.dmevals.local | 61 | 61 | 61 | 61 | 61 | 61 |
NEWYORK.dmevals.local | 18 | 18 | 18 | 18 | 18 | 18 |
SCRANTON.dmevals.local | 359 | 359 | 359 | 358 | 358 | 358 |
UTICA.dmevals.local | 9 | 9 | 9 | 9 | 9 | 9 |
Alternatively, we can also use the count() method to a specific column. In the example below, we are doing it on the CommandLine column.
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]
.assign(CommandLineLength = apt29['CommandLine'].str.len())
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
.groupby('Hostname')['CommandLine'].count()
)
Hostname
NASHUA.dmevals.local 61
NEWYORK.dmevals.local 18
SCRANTON.dmevals.local 358
UTICA.dmevals.local 9
Name: CommandLine, dtype: int64
Multiple Functions¶
We can use multiple functions at once for a specific column. For instance, we can apply the len() and avg() functions to the field CommandLineLength. Then, we can group the results by the Hostname field.
SELECT Hostname, Channel, EventTime, EventID, CommandLine, Image,
len(CommandLine) as CommandLineLength, AVG(CommandLine_Length), COUNT(*)
FROM apt29
GROUP BY Hostname
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1;
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]
.assign(CommandLineLength = apt29['CommandLine'].str.len())
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
.groupby('Hostname').agg({'CommandLineLength': np.mean, 'Channel': np.size})
)
CommandLineLength | Channel | |
---|---|---|
Hostname | ||
NASHUA.dmevals.local | 60.360656 | 61 |
NEWYORK.dmevals.local | 50.888889 | 18 |
SCRANTON.dmevals.local | 88.296089 | 359 |
UTICA.dmevals.local | 51.111111 | 9 |
Grouping By more than one column¶
Grouping by more than one column is done by passing a list of columns to the groupby() method.
SELECT Hostname, Channel, EventTime, EventID, CommandLine, Image,
len(CommandLine) as CommandLineLength, COUNT(*), AVG(CommandLine_Length)
FROM apt29
GROUP BY Hostname, Image
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1
LIMIT 20;
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]
.assign(CommandLineLength = apt29['CommandLine'].str.len())
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
.groupby(['Hostname','Image']).agg({'CommandLineLength': [np.mean, np.size]})
.head(20)
)
CommandLineLength | |||
---|---|---|---|
mean | size | ||
Hostname | Image | ||
NASHUA.dmevals.local | C:\WindowsAzure\Packages\CollectGuestLogs.exe | 119.000000 | 1.0 |
C:\Windows\PSEXESVC.exe | 23.000000 | 4.0 | |
C:\Windows\System32\BackgroundTransferHost.exe | 65.000000 | 2.0 | |
C:\Windows\System32\RuntimeBroker.exe | 48.000000 | 4.0 | |
C:\Windows\System32\SearchFilterHost.exe | 66.000000 | 1.0 | |
C:\Windows\System32\SearchProtocolHost.exe | 308.000000 | 1.0 | |
C:\Windows\System32\UsoClient.exe | 43.000000 | 1.0 | |
C:\Windows\System32\WerFault.exe | 56.000000 | 2.0 | |
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe | 14.000000 | 1.0 | |
C:\Windows\System32\backgroundTaskHost.exe | 101.000000 | 7.0 | |
C:\Windows\System32\cmd.exe | 16.000000 | 2.0 | |
C:\Windows\System32\conhost.exe | 55.000000 | 8.0 | |
C:\Windows\System32\consent.exe | 37.000000 | 1.0 | |
C:\Windows\System32\dllhost.exe | 81.000000 | 1.0 | |
C:\Windows\System32\dsregcmd.exe | 56.000000 | 1.0 | |
C:\Windows\System32\gpupdate.exe | 29.000000 | 1.0 | |
C:\Windows\System32\rundll32.exe | 140.000000 | 1.0 | |
C:\Windows\System32\sppsvc.exe | 30.000000 | 1.0 | |
C:\Windows\System32\svchost.exe | 52.000000 | 3.0 | |
C:\Windows\System32\taskhostw.exe | 20.333333 | 3.0 |
JOIN¶
JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices if a type of join is not specified. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).
# Creating a dataframe with information of Security event 4624: An account was successfully logged on
Security4624 = apt29[(apt29['Channel'].str.lower() == 'security') & (apt29['EventID'] == 4624)].dropna(axis = 1, how = 'all')
Security4624.shape
(297, 55)
# Creating a dataframe with information of Security event 4688: A new process has been created
Security4688 = apt29[(apt29['Channel'].str.lower() == 'security') & (apt29['EventID'] == 4688)].dropna(axis = 1, how = 'all')
Security4688.shape
(460, 42)
# Creating a dataframe with information of Security event 4697: A service was installed in the system
Security4697 = apt29[(apt29['Channel'].str.lower() == 'security') & (apt29['EventID'] == 4697)].dropna(axis = 1, how = 'all')
Security4697.shape
(23, 37)
INNER JOIN¶
In the example below, we are looking for New Processes being created by accounts that were authenticated over the network (Logon Type 3). This query looks for potential Lateral Movement techniques. Without looking for “wsmprovhost.exe” which is an indication of PSRemoting, I was able to get there by focusing on the behavior of accounts authenticating over the network and creating new processes.
# merge performs an INNER JOIN by default
(
pd.merge(Security4688, Security4624[Security4624['LogonType'] == 3],
on = 'TargetLogonId', how = 'inner')
[['NewProcessId','NewProcessName','ProcessId_x','ParentProcessName','TargetUserName_y','IpAddress']]
)
NewProcessId | NewProcessName | ProcessId_x | ParentProcessName | TargetUserName_y | IpAddress | |
---|---|---|---|---|---|---|
0 | 0x1e28 | C:\Windows\System32\wsmprovhost.exe | 0x374 | C:\Windows\System32\svchost.exe | pbeesly | - |
When the column names where we are performing the join on are different, we need to use the left_on and right_on parameters as shown below. In this example, we are looking for New Services being installed by accounts that were authenticated over the network (Logon Type 3). This query looks for potential Lateral Movement techniques. I was able to identify the use pf PSEXEC in the dataset. Once again, I was not looking for the specific service name. I was focusing on the behavior of accounts that authenticate over the network and installing new services.
SELECT x.ServiceName, x.ServiceFileName, y.IpAddress
FROM Security4697 x
INNER JOIN (FROM Security4624 WHERE LogonType = 3) y
ON x.SubjectLogonId = y.TargetLogonId;
# merge performs an INNER JOIN by default
(
pd.merge(Security4697, Security4624[Security4624['LogonType'] == 3],
left_on = 'SubjectLogonId', right_on = 'TargetLogonId', how = 'inner')
[['ServiceName', 'ServiceFileName','IpAddress']]
)
ServiceName | ServiceFileName | IpAddress | |
---|---|---|---|
0 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
1 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
2 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
3 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
LEFT OUTER JOIN¶
Same query as before, but in the example below we use the LEFT join type for the how parameter. We can see a new service installed named javamtsup. If you look at the emulation plan for Day 1, we know tha the javamtsup service was installed locally and not over the network. That service was used for persistence.
(
pd.merge(Security4697, Security4624[Security4624['LogonType'] == 3],
left_on = 'SubjectLogonId', right_on = 'TargetLogonId', how = 'left')
[['ServiceName', 'ServiceFileName','IpAddress']]
)
ServiceName | ServiceFileName | IpAddress | |
---|---|---|---|
0 | javamtsup | C:\Windows\System32\javamtsup.exe | NaN |
1 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
2 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
3 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
4 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
5 | AarSvc_13619b | C:\windows\system32\svchost.exe -k AarSvcGroup -p | NaN |
6 | BcastDVRUserService_13619b | C:\windows\system32\svchost.exe -k BcastDVRUserService | NaN |
7 | BluetoothUserService_13619b | C:\windows\system32\svchost.exe -k BthAppGroup -p | NaN |
8 | CaptureService_13619b | C:\windows\system32\svchost.exe -k LocalService -p | NaN |
9 | cbdhsvc_13619b | C:\windows\system32\svchost.exe -k ClipboardSvcGroup -p | NaN |
10 | CDPUserSvc_13619b | C:\windows\system32\svchost.exe -k UnistackSvcGroup | NaN |
11 | ConsentUxUserSvc_13619b | C:\windows\system32\svchost.exe -k DevicesFlow | NaN |
12 | CredentialEnrollmentManagerUserSvc_13619b | C:\windows\system32\CredentialEnrollmentManager.exe | NaN |
13 | DeviceAssociationBrokerSvc_13619b | C:\windows\system32\svchost.exe -k DevicesFlow -p | NaN |
14 | DevicePickerUserSvc_13619b | C:\windows\system32\svchost.exe -k DevicesFlow | NaN |
15 | DevicesFlowUserSvc_13619b | C:\windows\system32\svchost.exe -k DevicesFlow | NaN |
16 | MessagingService_13619b | C:\windows\system32\svchost.exe -k UnistackSvcGroup | NaN |
17 | OneSyncSvc_13619b | C:\windows\system32\svchost.exe -k UnistackSvcGroup | NaN |
18 | PimIndexMaintenanceSvc_13619b | C:\windows\system32\svchost.exe -k UnistackSvcGroup | NaN |
19 | PrintWorkflowUserSvc_13619b | C:\windows\system32\svchost.exe -k PrintWorkflow | NaN |
20 | UnistoreSvc_13619b | C:\windows\System32\svchost.exe -k UnistackSvcGroup | NaN |
21 | UserDataSvc_13619b | C:\windows\system32\svchost.exe -k UnistackSvcGroup | NaN |
22 | WpnUserService_13619b | C:\windows\system32\svchost.exe -k UnistackSvcGroup | NaN |
RIGHT JOIN¶
Same query as before, but in the example below we use the RIGHT join type for the how parameter. We can see also a new service installing
(
pd.merge(Security4697, Security4624[Security4624['LogonType'] == 3],
left_on = 'SubjectLogonId', right_on = 'TargetLogonId', how = 'right')
[['ServiceName', 'ServiceFileName','IpAddress']]
)
ServiceName | ServiceFileName | IpAddress | |
---|---|---|---|
0 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
1 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
2 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
3 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
4 | NaN | NaN | ::1 |
... | ... | ... | ... |
213 | NaN | NaN | fe80::e40c:95b6:b0a7:6429 |
214 | NaN | NaN | fe80::e40c:95b6:b0a7:6429 |
215 | NaN | NaN | fe80::e40c:95b6:b0a7:6429 |
216 | NaN | NaN | ::1 |
217 | NaN | NaN | ::1 |
218 rows × 3 columns
FULL JOIN¶
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (MySQL).
(
pd.merge(Security4697, Security4624[Security4624['LogonType'] == 3],
left_on = 'SubjectLogonId', right_on = 'TargetLogonId', how = 'outer')
[['ServiceName', 'ServiceFileName','IpAddress']]
)
ServiceName | ServiceFileName | IpAddress | |
---|---|---|---|
0 | javamtsup | C:\Windows\System32\javamtsup.exe | NaN |
1 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
2 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
3 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
4 | PSEXESVC | %SystemRoot%\PSEXESVC.exe | 10.0.1.4 |
... | ... | ... | ... |
232 | NaN | NaN | fe80::e40c:95b6:b0a7:6429 |
233 | NaN | NaN | fe80::e40c:95b6:b0a7:6429 |
234 | NaN | NaN | fe80::e40c:95b6:b0a7:6429 |
235 | NaN | NaN | ::1 |
236 | NaN | NaN | ::1 |
237 rows × 3 columns