Data Analysis with Spark.SQL: Correlating

Creating SQL view from Mordor APT29 dataset

Create Spark session

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark_Data_Analysis") \
    .config("spark.sql.caseSensitive","True") \
    .getOrCreate()

Expose the dataframe as a SQL view

apt29Json = '../datasets/apt29_evals_day1_manual_2020-05-01225525.json'

apt29Df = spark.read.json(apt29Json)

apt29Df.createOrReplaceTempView('apt29')

Correlating data

Get new Processes created by an Account that Logged On over the network

lateralMovement = spark.sql(
'''
SELECT b.SubjectUserName, b.TargetUserName, b.NewProcessName, b.ParentProcessName, a.IpAddress
FROM apt29 b
INNER JOIN(
    SELECT TargetLogonId, LogonType, IpAddress
    FROM apt29
    WHERE lower(Channel) LIKE '%security%'
        AND EventID = 4624
        AND LogonType = 3
    )a
ON a.TargetLogonId = b.TargetLogonId
WHERE lower(b.Channel) LIKE '%security%'
    AND b.EventID = 4688
''')

print('This dataframe has {} records!!'.format(lateralMovement.count()))
lateralMovement.show(truncate = False)
This dataframe has 1 records!!
+---------------+--------------+-----------------------------------+-------------------------------+---------+
|SubjectUserName|TargetUserName|NewProcessName                     |ParentProcessName              |IpAddress|
+---------------+--------------+-----------------------------------+-------------------------------+---------+
|NASHUA$        |pbeesly       |C:\Windows\System32\wsmprovhost.exe|C:\Windows\System32\svchost.exe|-        |
+---------------+--------------+-----------------------------------+-------------------------------+---------+

Add context (Parent Process) to Network Connection events

parentProcess = spark.sql(
'''
SELECT b.Image, b.SourceIp, b.DestinationIp, a.ParentImage
FROM apt29 b
LEFT JOIN(
    SELECT ProcessGuid, ParentImage
    FROM apt29
    WHERE lower(Channel) LIKE '%sysmon%'
        AND EventID = 1
    )a
ON a.ProcessGuid = b.ProcessGuid
WHERE lower(b.Channel) LIKE '%sysmon%'
    AND b.EventID = 3
''')

print('This dataframe has {} records!!'.format(parentProcess.count()))
parentProcess.show(n = 5, truncate = 25)
This dataframe has 1229 records!!
+-------------------------+---------------+---------------+-----------------------+
|                    Image|       SourceIp|  DestinationIp|            ParentImage|
+-------------------------+---------------+---------------+-----------------------+
|C:\Windows\System32\dn...|       10.0.0.4|    172.18.39.2|                   null|
|C:\Windows\ADWS\Micros...|0:0:0:0:0:0:0:1|0:0:0:0:0:0:0:1|                   null|
|C:\Windows\System32\ls...|0:0:0:0:0:0:0:1|0:0:0:0:0:0:0:1|                   null|
|C:\ProgramData\victim\...|       10.0.1.4|    192.168.0.5|C:\Windows\explorer.exe|
|C:\Windows\System32\sv...|       10.0.1.6|       10.0.0.4|                   null|
+-------------------------+---------------+---------------+-----------------------+
only showing top 5 rows

Add context (Parent Process) to Processes that made a Network Connection and modified a Registry Value

modifyRegistry = spark.sql(
'''
SELECT d.ParentImage, c.Image, c.SourceIp, c.DestinationIp, c.TargetObject
FROM apt29 d
RIGHT JOIN(
    SELECT b.ProcessGuid, b.Image, b.SourceIp, b.DestinationIp, a.TargetObject
    FROM apt29 b
    INNER JOIN(
        SELECT ProcessGuid, TargetObject
        FROM apt29
        WHERE lower(Channel) LIKE '%sysmon%'
            AND EventID = 13
        )a
    ON b.ProcessGuid = a.ProcessGuid
    WHERE lower(b.Channel) LIKE '%sysmon%'
        AND b.EventID = 3
)c
ON d.ProcessGuid = c.ProcessGuid
WHERE lower(d.Channel) LIKE '%sysmon%'
    AND d.EventID = 1
''')

print('This dataframe has {} records!!'.format(modifyRegistry.count()))
modifyRegistry.show(n = 1, vertical = True,truncate = False)
This dataframe has 3524 records!!
-RECORD 0----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 ParentImage   | C:\Windows\System32\control.exe                                                                                                                                                             
 Image         | C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe                                                                                                                                   
 SourceIp      | 10.0.1.4                                                                                                                                                                                    
 DestinationIp | 192.168.0.5                                                                                                                                                                                 
 TargetObject  | HKLM\System\CurrentControlSet\Services\bam\State\UserSettings\S-1-5-21-1830255721-3727074217-2423397540-1107\\Device\HarddiskVolume2\Windows\System32\WindowsPowerShell\v1.0\powershell.exe 
only showing top 1 row

Thank you! I hope you enjoyed it!