Data Analysis with Spark.SQL: Correlating¶
Author: Jose Rodriguez (@Cyb3rPandah)
Project: Infosec Jupyter Book
Public Organization: Open Threat Research
License: Creative Commons Attribution-ShareAlike 4.0 International
Reference: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
Creating SQL view from Mordor APT29 dataset¶
Create Spark session¶
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Spark_Data_Analysis") \
.config("spark.sql.caseSensitive","True") \
.getOrCreate()
Expose the dataframe as a SQL view¶
apt29Json = '../datasets/apt29_evals_day1_manual_2020-05-01225525.json'
apt29Df = spark.read.json(apt29Json)
apt29Df.createOrReplaceTempView('apt29')
Correlating data¶
Get new Processes created by an Account that Logged On over the network¶
lateralMovement = spark.sql(
'''
SELECT b.SubjectUserName, b.TargetUserName, b.NewProcessName, b.ParentProcessName, a.IpAddress
FROM apt29 b
INNER JOIN(
SELECT TargetLogonId, LogonType, IpAddress
FROM apt29
WHERE lower(Channel) LIKE '%security%'
AND EventID = 4624
AND LogonType = 3
)a
ON a.TargetLogonId = b.TargetLogonId
WHERE lower(b.Channel) LIKE '%security%'
AND b.EventID = 4688
''')
print('This dataframe has {} records!!'.format(lateralMovement.count()))
lateralMovement.show(truncate = False)
This dataframe has 1 records!!
+---------------+--------------+-----------------------------------+-------------------------------+---------+
|SubjectUserName|TargetUserName|NewProcessName |ParentProcessName |IpAddress|
+---------------+--------------+-----------------------------------+-------------------------------+---------+
|NASHUA$ |pbeesly |C:\Windows\System32\wsmprovhost.exe|C:\Windows\System32\svchost.exe|- |
+---------------+--------------+-----------------------------------+-------------------------------+---------+
Add context (Parent Process) to Network Connection events¶
parentProcess = spark.sql(
'''
SELECT b.Image, b.SourceIp, b.DestinationIp, a.ParentImage
FROM apt29 b
LEFT JOIN(
SELECT ProcessGuid, ParentImage
FROM apt29
WHERE lower(Channel) LIKE '%sysmon%'
AND EventID = 1
)a
ON a.ProcessGuid = b.ProcessGuid
WHERE lower(b.Channel) LIKE '%sysmon%'
AND b.EventID = 3
''')
print('This dataframe has {} records!!'.format(parentProcess.count()))
parentProcess.show(n = 5, truncate = 25)
This dataframe has 1229 records!!
+-------------------------+---------------+---------------+-----------------------+
| Image| SourceIp| DestinationIp| ParentImage|
+-------------------------+---------------+---------------+-----------------------+
|C:\Windows\System32\dn...| 10.0.0.4| 172.18.39.2| null|
|C:\Windows\ADWS\Micros...|0:0:0:0:0:0:0:1|0:0:0:0:0:0:0:1| null|
|C:\Windows\System32\ls...|0:0:0:0:0:0:0:1|0:0:0:0:0:0:0:1| null|
|C:\ProgramData\victim\...| 10.0.1.4| 192.168.0.5|C:\Windows\explorer.exe|
|C:\Windows\System32\sv...| 10.0.1.6| 10.0.0.4| null|
+-------------------------+---------------+---------------+-----------------------+
only showing top 5 rows
Add context (Parent Process) to Processes that made a Network Connection and modified a Registry Value¶
modifyRegistry = spark.sql(
'''
SELECT d.ParentImage, c.Image, c.SourceIp, c.DestinationIp, c.TargetObject
FROM apt29 d
RIGHT JOIN(
SELECT b.ProcessGuid, b.Image, b.SourceIp, b.DestinationIp, a.TargetObject
FROM apt29 b
INNER JOIN(
SELECT ProcessGuid, TargetObject
FROM apt29
WHERE lower(Channel) LIKE '%sysmon%'
AND EventID = 13
)a
ON b.ProcessGuid = a.ProcessGuid
WHERE lower(b.Channel) LIKE '%sysmon%'
AND b.EventID = 3
)c
ON d.ProcessGuid = c.ProcessGuid
WHERE lower(d.Channel) LIKE '%sysmon%'
AND d.EventID = 1
''')
print('This dataframe has {} records!!'.format(modifyRegistry.count()))
modifyRegistry.show(n = 1, vertical = True,truncate = False)
This dataframe has 3524 records!!
-RECORD 0----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ParentImage | C:\Windows\System32\control.exe
Image | C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
SourceIp | 10.0.1.4
DestinationIp | 192.168.0.5
TargetObject | HKLM\System\CurrentControlSet\Services\bam\State\UserSettings\S-1-5-21-1830255721-3727074217-2423397540-1107\\Device\HarddiskVolume2\Windows\System32\WindowsPowerShell\v1.0\powershell.exe
only showing top 1 row