SQL to Pandas with Windows Events 101

Description

This notebook was created as part of the Open Threat Research (OTR) initiative to empower others in the InfoSec community around the world.

With more security analysts getting into data analysis with Python and Jupyter Notebooks, I figured it would be a good idea to create a notebook and share some of my experience with Pandas. Since many analysts in our community are familiar with SQL syntax to craft detection rules, I used it in this notebook to show how easy it is to translate SQL logic to pandas statements. I leveraged this doc “Comparison with SQL” to organize the notebook and make sure I align my queries and descriptions to official pandas resources.

Pre-requisites - Reading

Importing Libraries

Pre-requisites:

  • pip install pandas

  • pip install openhunt

import pandas as pd
import numpy as np

from openhunt import mordorutils as mu

# Do not truncate Pandas output
pd.set_option('display.max_colwidth', None)

Importing Mordor Demo Dataset

  • Every time I practice or learn something new, I make sure I work on examples that are related to the field where I would implement what I learn to. Therefore, for this notebook, I used a dataset from the Mordor Project to show you how you can use Pandas to explore a few techniques that were emulated following the ATT&CK Evals Emulation Plans.

  • This dataset is a subset of the Mordor ATT&CK Evals APT29 - Day 1. It includes only a few events for the purpose of this notebook. Feel free to download the full dataset on your own. Make sure you allocate enough memory to your local Jupyter Notebooks server since we are using pandas to read the JSON file (dataset).

Let’s start by using the getMordorZipFile method from the mordorutils module to download and decompress the dataset:

mordorUrl = 'https://github.com/OTRF/infosec-jupyter-book/blob/master/datasets/apt29_evals_day1_manual_demo.zip'
mordorFilePath = mu.getMordorZipFile(mordorUrl)

Reading JSON Mordor File

Next, we use the read_json method from pandas to read the JSON file (Decompressed dataset) that returns a DataFrame. We are going to use that variable throughout the notebook.

apt29 = pd.read_json(mordorFilePath)
apt29.head(1)
EventTime port Message EventID SourceModuleName tags @version SourceName AccountType host ... MandatoryLabel ParentProcessName TokenElevationType NewProcessId ActivityID ServiceName ServiceFileName ServiceAccount ServiceStartType ServiceType
0 2020-05-01 22:55:29 60737 Registry value set:\r\nRuleName: -\r\nEventType: SetValue\r\nUtcTime: 2020-05-02 02:55:29.635\r\nProcessGuid: {5aa8ec29-cae0-5eac-4c00-000000000400}\r\nProcessId: 3340\r\nImage: C:\windows\system32\svchost.exe\r\nTargetObject: HKLM\System\CurrentControlSet\Services\W32Time\Config\LastKnownGoodTime\r\nDetails: QWORD (0x01d6202d-0x23915fdf) 13 eventlog [mordorDataset] 1 Microsoft-Windows-Sysmon User wec.internal.cloudapp.net ... None None None None None None None None NaN None

1 rows × 91 columns

Leveraging Pandas for Security from a SQL perspective

SELECT

In SQL, selection is done using a comma-separated list of columns you’d like to select. You can use the * character to select all the available columnsin the SQL table.

SELECT Hostname, Channel, EventTime, EventID
FROM apt29
LIMIT 5;

With pandas, column selection is done by passing a list of column names to your DataFrame. We can select a few columns from the APT29 dataset as shown below:

(
apt29[['Hostname','Channel','EventTime','EventID']]
.head(5)
)
Hostname Channel EventTime EventID
0 NASHUA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:29 13
1 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:30 13
2 UTICA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:30 12
3 UTICA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:31 13
4 NASHUA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:39 13

Calling the DataFrame without the list of column names would display all columns (Similar to SQL’s *).

Calculated Column

In SQL, you can add a calculated column or computed column. A computed column is a virtual column that is not physically stored in the table and can use data from other columns to calculate a new value.

SELECT Hostname, Channel, EventTime, EventID,
       len(Hostname) as Hostname_Length
FROM apt29
LIMIT 5;

With pandas, you can use the DataFrame.assign() method of a DataFrame to append a new column. As a proof of concept, we can add a new column named Hostname_Length to the right of our DataFrame to calculate the number of characters in the Hostname field. We are going to show another example that could provide more context to our data analysis.

(
apt29[['Hostname','Channel','EventTime','EventID']]

.assign(Hostname_Length = apt29['Hostname'].str.len())

.head(5)
)
Hostname Channel EventTime EventID Hostname_Length
0 NASHUA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:29 13 20
1 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:30 13 22
2 UTICA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:30 12 19
3 UTICA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:31 13 19
4 NASHUA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:39 13 20

WHERE

Filtering in SQL is done via a WHERE clause.

SELECT Hostname, Channel, EventTime, EventID,
       len(Hostname) as Hostname_Length
FROM apt29
WHERE Channel = 'Windows PowerShell'
LIMIT 5;

DataFrames can be filtered in multiple ways. According to pandas docs, the most intuitive way is by using boolean indexing. We can filter our APT29 DataFrame and show only Windows events from the Microsoft-Windows-Sysmon/Operational provider.

(
apt29[['Hostname','Channel','EventTime','EventID']]

.assign(Hostname_Length = apt29['Hostname'].str.len())
    
[apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational']
    
.head(5)
)
Hostname Channel EventTime EventID Hostname_Length
0 NASHUA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:29 13 20
1 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:30 13 22
2 UTICA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:30 12 19
3 UTICA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:31 13 19
4 NASHUA.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:39 13 20

OR & AND Boolean Operators

Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND). We can use an AND operator in our APT29 DataFrame and only show Sysmon events of ID 1. In this example, I also show you how to use the calculated columns concept, that we learnerd earlier, to the CommandLine field to calculate the lenght of the command line used by the new process created.

SELECT Hostname, Channel, EventTime, EventID, CommandLine
       len(CommandLine) as CommandLineLength
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1
LIMIT 5;
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine']]

.assign(CommandLineLength = apt29['CommandLine'].str.len())
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]

.head(5)
)
Hostname Channel EventTime EventID CommandLine CommandLineLength
44 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:56 1 "C:\ProgramData\victim\‮cod.3aka3.scr" /S 43.0
71 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:56:04 1 \\?\C:\windows\system32\conhost.exe --headless --width 80 --height 25 --signal 0x54c --server 0x540 99.0
73 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:56:04 1 "C:\windows\system32\cmd.exe" 29.0
91 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:56:14 1 powershell 10.0
1222 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:57:01 1 "C:\windows\system32\SearchProtocolHost.exe" Global\UsGthrFltPipeMssGthrPipe6_ Global\UsGthrCtrlFltPipeMssGthrPipe6 1 -2147483646 "Software\Microsoft\Windows Search" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)" "C:\ProgramData\Microsoft\Search\Data\Temp\usgthrsvc" "DownLevelDaemon" 308.0

We can practice also combinations of AND and OR boolean operators with Sysmon registry events. Let’s use Event ID 12 (Object create and delete) and 13 (Value Set).

SELECT Channel, EventID, ProcessGuid, ProcessId, Image, TargetObject
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND (EventID = 12 OR eventID = 13)
LIMIT 5;
(
apt29[['Channel','EventID','ProcessGuid','ProcessId','Image','TargetObject']]
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') 
 
     & ((apt29['EventID'] == 12) | (apt29['EventID'] == 13))]

.head()
)
Channel EventID ProcessGuid ProcessId Image TargetObject
0 Microsoft-Windows-Sysmon/Operational 13 {5aa8ec29-cae0-5eac-4c00-000000000400} 3340 C:\windows\system32\svchost.exe HKLM\System\CurrentControlSet\Services\W32Time\Config\LastKnownGoodTime
1 Microsoft-Windows-Sysmon/Operational 13 {47ab858c-cae1-5eac-4b00-000000000400} 3292 C:\windows\system32\svchost.exe HKLM\System\CurrentControlSet\Services\W32Time\Config\LastKnownGoodTime
2 Microsoft-Windows-Sysmon/Operational 12 {6bbf237a-cb01-5eac-4d00-000000000400} 3356 C:\WindowsAzure\Packages\GuestAgent\WindowsAzureGuestAgent.exe HKU\.DEFAULT\Software\Microsoft\Windows\CurrentVersion\Internet Settings\Connections
3 Microsoft-Windows-Sysmon/Operational 13 {6bbf237a-cb01-5eac-4700-000000000400} 3280 C:\windows\system32\svchost.exe HKLM\System\CurrentControlSet\Services\W32Time\Config\LastKnownGoodTime
4 Microsoft-Windows-Sysmon/Operational 13 {5aa8ec29-cad7-5eac-0c00-000000000400} 728 C:\windows\system32\lsass.exe HKLM\System\CurrentControlSet\Services\W32Time\SecureTimeLimits\SecureTimeHigh

Looking for new applications being executed for the first time

We can use what we have learned so far and look for new application in the AppCompat registery keys. This is an indicator of applications executing for the first time.

(
apt29[['EventID','ProcessId','Image','TargetObject']]
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') 
 
     & ((apt29['EventID'] == 12) | (apt29['EventID'] == 13))
     
     & ((apt29['TargetObject'].str.contains('.*AppCompatFlags\\\\Compatibility Assistant.*', regex=True)))]

.head(10)
)
EventID ProcessId Image TargetObject
45 12 1144 C:\windows\system32\svchost.exe HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store
46 12 1144 C:\windows\system32\svchost.exe HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store
14215 12 1144 C:\windows\system32\svchost.exe HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store
14217 12 1144 C:\windows\system32\svchost.exe HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store
14219 12 1144 C:\windows\system32\svchost.exe HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store
14221 13 1144 C:\windows\system32\svchost.exe HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store\C:\ProgramData\victim\‮cod.3aka3.scr
14427 12 1144 C:\windows\system32\svchost.exe HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store
77148 12 8460 C:\windows\system32\svchost.exe HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant
77161 12 8460 C:\windows\system32\svchost.exe HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store
77162 13 8460 C:\windows\system32\svchost.exe HKU\S-1-5-21-1830255721-3727074217-2423397540-1107\Software\Microsoft\Windows NT\CurrentVersion\AppCompatFlags\Compatibility Assistant\Store\C:\Windows\System32\hostui.exe

NULL Checking

NULL checking is done using the notna() and isna() methods.

We can show records where CommandLine field IS NULL with the following query. This query is built on the top of previous ones.

SELECT Hostname, Channel, EventTime, EventID, CommandLine
       len(CommandLine) as CommandLineLength
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1 AND CommandLine IS NULL
LIMIT 5;
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine']]

.assign(CommandLineLength = apt29['CommandLine'].str.len())
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') 
     & (apt29['EventID'] == 1) & (apt29['CommandLine'].isna())]
    
.head(5)
)
Hostname Channel EventTime EventID CommandLine CommandLineLength
77736 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 23:21:29 1 None NaN

Getting items where CommandLine field IS NOT NULL can be done with notna().

SELECT Hostname, Channel, EventTime, EventID, CommandLine
       len(CommandLine) as CommandLineLength
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1 AND CommandLine IS NOT NULL
LIMIT 5;
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine']]

.assign(CommandLineLength = apt29['CommandLine'].str.len())
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') 
     & (apt29['EventID'] == 1) & (apt29['CommandLine'].notna())]
    
.head(5)
)
Hostname Channel EventTime EventID CommandLine CommandLineLength
44 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:55:56 1 "C:\ProgramData\victim\‮cod.3aka3.scr" /S 43.0
71 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:56:04 1 \\?\C:\windows\system32\conhost.exe --headless --width 80 --height 25 --signal 0x54c --server 0x540 99.0
73 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:56:04 1 "C:\windows\system32\cmd.exe" 29.0
91 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:56:14 1 powershell 10.0
1222 SCRANTON.dmevals.local Microsoft-Windows-Sysmon/Operational 2020-05-01 22:57:01 1 "C:\windows\system32\SearchProtocolHost.exe" Global\UsGthrFltPipeMssGthrPipe6_ Global\UsGthrCtrlFltPipeMssGthrPipe6 1 -2147483646 "Software\Microsoft\Windows Search" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT; MS Search 4.0 Robot)" "C:\ProgramData\Microsoft\Search\Data\Temp\usgthrsvc" "DownLevelDaemon" 308.0

GROUP BY

In pandas, SQL’s GROUP BY operations are performed using the similarly named groupby() method. groupby() typically refers to a process where we’d like to split a dataset into groups, apply some function (typically aggregation) , and then combine the groups together.

We can count and group our APT29 DataFrame by the Hostname field.

SELECT Hostname, Channel, EventTime, EventID, CommandLine, Image,
       len(CommandLine) as CommandLineLength, count(*)
FROM apt29
GROUP BY Hostname
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1;

The pandas equvalent would be:

(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]

.assign(CommandLineLength = apt29['CommandLine'].str.len())
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
    
.groupby('Hostname').size()
)
Hostname
NASHUA.dmevals.local       61
NEWYORK.dmevals.local      18
SCRANTON.dmevals.local    359
UTICA.dmevals.local         9
dtype: int64

We can use this technique and group Sysmon Event ID 1 events by the process name.

(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]

.assign(CommandLineLength = apt29['CommandLine'].str.len())
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
    
.groupby('Image').size()
).to_frame()
0
Image
C:\Packages\Plugins\Microsoft.Azure.NetworkWatcher.NetworkWatcherAgentWindows\1.4.1421.1\NetworkWatcherAgent\NetworkWatcherAgent.exe 4
C:\Packages\Plugins\Microsoft.Azure.Security.IaaSAntimalware\1.5.5.9\AntimalwareConfig.exe 1
C:\Packages\Plugins\Microsoft.Compute.CustomScriptExtension\1.10.5\bin\CustomScriptHandler.exe 1
C:\Packages\Plugins\Microsoft.Compute.JsonADDomainExtension\1.3.2\bin\JsonADDomainExtension.exe 1
C:\Program Files (x86)\Google\Update\1.3.35.452\GoogleCrashHandler.exe 2
... ...
C:\Windows\Temp\python.exe 5
C:\Windows\Temp\sdelete64.exe 3
C:\Windows\WinSxS\amd64_microsoft-windows-servicingstack_31bf3856ad364e35_10.0.18362.710_none_5f52d84058d0677f\TiWorker.exe 3
C:\Windows\explorer.exe 1
C:\Windows\servicing\TrustedInstaller.exe 3

106 rows × 1 columns

In the previous example, I used size() and not count(). This is because count() applies the function to every column and the result does not consider null records.

(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]

.assign(CommandLineLength = apt29['CommandLine'].str.len())
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
    
.groupby('Hostname').count()
)
Channel EventTime EventID CommandLine Image CommandLineLength
Hostname
NASHUA.dmevals.local 61 61 61 61 61 61
NEWYORK.dmevals.local 18 18 18 18 18 18
SCRANTON.dmevals.local 359 359 359 358 358 358
UTICA.dmevals.local 9 9 9 9 9 9

Alternatively, we can also use the count() method to a specific column. In the example below, we are doing it on the CommandLine column.

(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]

.assign(CommandLineLength = apt29['CommandLine'].str.len())
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
    
.groupby('Hostname')['CommandLine'].count()
)
Hostname
NASHUA.dmevals.local       61
NEWYORK.dmevals.local      18
SCRANTON.dmevals.local    358
UTICA.dmevals.local         9
Name: CommandLine, dtype: int64

Multiple Functions

We can use multiple functions at once for a specific column. For instance, we can apply the len() and avg() functions to the field CommandLineLength. Then, we can group the results by the Hostname field.

SELECT Hostname, Channel, EventTime, EventID, CommandLine, Image,
       len(CommandLine) as CommandLineLength, AVG(CommandLine_Length), COUNT(*)
FROM apt29
GROUP BY Hostname
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1;
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]

.assign(CommandLineLength = apt29['CommandLine'].str.len())
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
    
.groupby('Hostname').agg({'CommandLineLength': np.mean, 'Channel': np.size})
)
CommandLineLength Channel
Hostname
NASHUA.dmevals.local 60.360656 61
NEWYORK.dmevals.local 50.888889 18
SCRANTON.dmevals.local 88.296089 359
UTICA.dmevals.local 51.111111 9

Grouping By more than one column

Grouping by more than one column is done by passing a list of columns to the groupby() method.

SELECT Hostname, Channel, EventTime, EventID, CommandLine, Image,
       len(CommandLine) as CommandLineLength, COUNT(*), AVG(CommandLine_Length)
FROM apt29
GROUP BY Hostname, Image
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 1
LIMIT 20;
(
apt29[['Hostname','Channel','EventTime','EventID','CommandLine','Image']]

.assign(CommandLineLength = apt29['CommandLine'].str.len())
    
[(apt29['Channel'] == 'Microsoft-Windows-Sysmon/Operational') & (apt29['EventID'] == 1)]
    
.groupby(['Hostname','Image']).agg({'CommandLineLength': [np.mean, np.size]})

.head(20)
)
CommandLineLength
mean size
Hostname Image
NASHUA.dmevals.local C:\WindowsAzure\Packages\CollectGuestLogs.exe 119.000000 1.0
C:\Windows\PSEXESVC.exe 23.000000 4.0
C:\Windows\System32\BackgroundTransferHost.exe 65.000000 2.0
C:\Windows\System32\RuntimeBroker.exe 48.000000 4.0
C:\Windows\System32\SearchFilterHost.exe 66.000000 1.0
C:\Windows\System32\SearchProtocolHost.exe 308.000000 1.0
C:\Windows\System32\UsoClient.exe 43.000000 1.0
C:\Windows\System32\WerFault.exe 56.000000 2.0
C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe 14.000000 1.0
C:\Windows\System32\backgroundTaskHost.exe 101.000000 7.0
C:\Windows\System32\cmd.exe 16.000000 2.0
C:\Windows\System32\conhost.exe 55.000000 8.0
C:\Windows\System32\consent.exe 37.000000 1.0
C:\Windows\System32\dllhost.exe 81.000000 1.0
C:\Windows\System32\dsregcmd.exe 56.000000 1.0
C:\Windows\System32\gpupdate.exe 29.000000 1.0
C:\Windows\System32\rundll32.exe 140.000000 1.0
C:\Windows\System32\sppsvc.exe 30.000000 1.0
C:\Windows\System32\svchost.exe 52.000000 3.0
C:\Windows\System32\taskhostw.exe 20.333333 3.0

JOIN

JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices if a type of join is not specified. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).

# Creating a dataframe with information of Security event 4624: An account was successfully logged on
Security4624 = apt29[(apt29['Channel'].str.lower() == 'security') & (apt29['EventID'] == 4624)].dropna(axis = 1, how = 'all')
Security4624.shape
(297, 55)
# Creating a dataframe with information of Security event 4688: A new process has been created
Security4688 = apt29[(apt29['Channel'].str.lower() == 'security') & (apt29['EventID'] == 4688)].dropna(axis = 1, how = 'all')
Security4688.shape
(460, 42)
# Creating a dataframe with information of Security event 4697: A service was installed in the system
Security4697 = apt29[(apt29['Channel'].str.lower() == 'security') & (apt29['EventID'] == 4697)].dropna(axis = 1, how = 'all')
Security4697.shape
(23, 37)

INNER JOIN

In the example below, we are looking for New Processes being created by accounts that were authenticated over the network (Logon Type 3). This query looks for potential Lateral Movement techniques. Without looking for “wsmprovhost.exe” which is an indication of PSRemoting, I was able to get there by focusing on the behavior of accounts authenticating over the network and creating new processes.

# merge performs an INNER JOIN by default
(
pd.merge(Security4688, Security4624[Security4624['LogonType'] == 3],
         on = 'TargetLogonId', how = 'inner')
[['NewProcessId','NewProcessName','ProcessId_x','ParentProcessName','TargetUserName_y','IpAddress']]
)
NewProcessId NewProcessName ProcessId_x ParentProcessName TargetUserName_y IpAddress
0 0x1e28 C:\Windows\System32\wsmprovhost.exe 0x374 C:\Windows\System32\svchost.exe pbeesly -

When the column names where we are performing the join on are different, we need to use the left_on and right_on parameters as shown below. In this example, we are looking for New Services being installed by accounts that were authenticated over the network (Logon Type 3). This query looks for potential Lateral Movement techniques. I was able to identify the use pf PSEXEC in the dataset. Once again, I was not looking for the specific service name. I was focusing on the behavior of accounts that authenticate over the network and installing new services.

SELECT x.ServiceName, x.ServiceFileName, y.IpAddress
FROM Security4697 x
INNER JOIN (FROM Security4624 WHERE LogonType = 3) y
  ON x.SubjectLogonId = y.TargetLogonId;
# merge performs an INNER JOIN by default
(
pd.merge(Security4697, Security4624[Security4624['LogonType'] == 3],
         left_on = 'SubjectLogonId', right_on = 'TargetLogonId', how = 'inner')
[['ServiceName', 'ServiceFileName','IpAddress']]
)
ServiceName ServiceFileName IpAddress
0 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
1 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
2 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
3 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4

LEFT OUTER JOIN

Same query as before, but in the example below we use the LEFT join type for the how parameter. We can see a new service installed named javamtsup. If you look at the emulation plan for Day 1, we know tha the javamtsup service was installed locally and not over the network. That service was used for persistence.

(
pd.merge(Security4697, Security4624[Security4624['LogonType'] == 3],
         left_on = 'SubjectLogonId', right_on = 'TargetLogonId', how = 'left')
[['ServiceName', 'ServiceFileName','IpAddress']]
)
ServiceName ServiceFileName IpAddress
0 javamtsup C:\Windows\System32\javamtsup.exe NaN
1 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
2 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
3 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
4 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
5 AarSvc_13619b C:\windows\system32\svchost.exe -k AarSvcGroup -p NaN
6 BcastDVRUserService_13619b C:\windows\system32\svchost.exe -k BcastDVRUserService NaN
7 BluetoothUserService_13619b C:\windows\system32\svchost.exe -k BthAppGroup -p NaN
8 CaptureService_13619b C:\windows\system32\svchost.exe -k LocalService -p NaN
9 cbdhsvc_13619b C:\windows\system32\svchost.exe -k ClipboardSvcGroup -p NaN
10 CDPUserSvc_13619b C:\windows\system32\svchost.exe -k UnistackSvcGroup NaN
11 ConsentUxUserSvc_13619b C:\windows\system32\svchost.exe -k DevicesFlow NaN
12 CredentialEnrollmentManagerUserSvc_13619b C:\windows\system32\CredentialEnrollmentManager.exe NaN
13 DeviceAssociationBrokerSvc_13619b C:\windows\system32\svchost.exe -k DevicesFlow -p NaN
14 DevicePickerUserSvc_13619b C:\windows\system32\svchost.exe -k DevicesFlow NaN
15 DevicesFlowUserSvc_13619b C:\windows\system32\svchost.exe -k DevicesFlow NaN
16 MessagingService_13619b C:\windows\system32\svchost.exe -k UnistackSvcGroup NaN
17 OneSyncSvc_13619b C:\windows\system32\svchost.exe -k UnistackSvcGroup NaN
18 PimIndexMaintenanceSvc_13619b C:\windows\system32\svchost.exe -k UnistackSvcGroup NaN
19 PrintWorkflowUserSvc_13619b C:\windows\system32\svchost.exe -k PrintWorkflow NaN
20 UnistoreSvc_13619b C:\windows\System32\svchost.exe -k UnistackSvcGroup NaN
21 UserDataSvc_13619b C:\windows\system32\svchost.exe -k UnistackSvcGroup NaN
22 WpnUserService_13619b C:\windows\system32\svchost.exe -k UnistackSvcGroup NaN

RIGHT JOIN

Same query as before, but in the example below we use the RIGHT join type for the how parameter. We can see also a new service installing

(
pd.merge(Security4697, Security4624[Security4624['LogonType'] == 3],
         left_on = 'SubjectLogonId', right_on = 'TargetLogonId', how = 'right')
[['ServiceName', 'ServiceFileName','IpAddress']]
)
ServiceName ServiceFileName IpAddress
0 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
1 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
2 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
3 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
4 NaN NaN ::1
... ... ... ...
213 NaN NaN fe80::e40c:95b6:b0a7:6429
214 NaN NaN fe80::e40c:95b6:b0a7:6429
215 NaN NaN fe80::e40c:95b6:b0a7:6429
216 NaN NaN ::1
217 NaN NaN ::1

218 rows × 3 columns

FULL JOIN

pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a match. As of writing, FULL JOINs are not supported in all RDBMS (MySQL).

(
pd.merge(Security4697, Security4624[Security4624['LogonType'] == 3],
         left_on = 'SubjectLogonId', right_on = 'TargetLogonId', how = 'outer')
[['ServiceName', 'ServiceFileName','IpAddress']]
)
ServiceName ServiceFileName IpAddress
0 javamtsup C:\Windows\System32\javamtsup.exe NaN
1 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
2 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
3 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
4 PSEXESVC %SystemRoot%\PSEXESVC.exe 10.0.1.4
... ... ... ...
232 NaN NaN fe80::e40c:95b6:b0a7:6429
233 NaN NaN fe80::e40c:95b6:b0a7:6429
234 NaN NaN fe80::e40c:95b6:b0a7:6429
235 NaN NaN ::1
236 NaN NaN ::1

237 rows × 3 columns

Thank you! I hope you enjoyed it!