Data Analysis with Spark.SQL: Filtering & Summarizing

Creating SQL view from Mordor APT29 dataset

Create Spark session

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark_Data_Analysis") \
    .config("spark.sql.caseSensitive","True") \
    .getOrCreate()

Expose the dataframe as a SQL view

apt29Json = '../datasets/apt29_evals_day1_manual_2020-05-01225525.json'

apt29Df = spark.read.json(apt29Json)

apt29Df.createOrReplaceTempView('apt29')

Filtering & Summarizing data

Filter Sysmon event 8 (Create Remote Thread) data

sysmon8 = spark.sql(
'''
SELECT SourceImage, TargetImage, StartFunction
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 8
''')

print('This dataframe has {} records!!'.format(sysmon8.count()))
sysmon8.show(n = 5, truncate = False)
This dataframe has 95 records!!
+---------------------------------------------------------+-------------------------------+-------------+
|SourceImage                                              |TargetImage                    |StartFunction|
+---------------------------------------------------------+-------------------------------+-------------+
|C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe|C:\Windows\System32\lsass.exe  |-            |
|C:\Windows\System32\csrss.exe                            |C:\Windows\System32\svchost.exe|CtrlRoutine  |
|C:\Windows\System32\csrss.exe                            |C:\Windows\System32\svchost.exe|CtrlRoutine  |
|C:\Windows\System32\csrss.exe                            |C:\Windows\System32\svchost.exe|CtrlRoutine  |
|C:\Windows\System32\csrss.exe                            |C:\Windows\System32\svchost.exe|CtrlRoutine  |
+---------------------------------------------------------+-------------------------------+-------------+
only showing top 5 rows

Filter PowerShell processes within Sysmon event 8 (Create Remote Thread) data

sysmon8 = spark.sql(
'''
SELECT SourceImage, TargetImage, StartFunction
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational'
    AND EventID = 8
    AND SourceImage LIKE '%powershell.exe%'
''')

print('This dataframe has {} records!!'.format(sysmon8.count()))
sysmon8.show(truncate = False)
This dataframe has 1 records!!
+---------------------------------------------------------+-----------------------------+-------------+
|SourceImage                                              |TargetImage                  |StartFunction|
+---------------------------------------------------------+-----------------------------+-------------+
|C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe|C:\Windows\System32\lsass.exe|-            |
+---------------------------------------------------------+-----------------------------+-------------+

SUMMARIZING data

Stack Count event logs by source of data and event id

eventLogs = spark.sql(
'''
SELECT Channel, EventID, COUNT(*)
FROM apt29
GROUP BY Channel, EventID
ORDER BY COUNT(*) DESC
''')

print('This dataframe has {} records!!'.format(eventLogs.count()))
eventLogs.show(truncate = False)
This dataframe has 203 records!!
+----------------------------------------+-------+--------+
|Channel                                 |EventID|count(1)|
+----------------------------------------+-------+--------+
|Microsoft-Windows-Sysmon/Operational    |12     |61151   |
|Microsoft-Windows-Sysmon/Operational    |10     |39283   |
|Microsoft-Windows-Sysmon/Operational    |7      |20259   |
|Microsoft-Windows-Sysmon/Operational    |13     |17541   |
|Security                                |4658   |8561    |
|Windows PowerShell                      |800    |5113    |
|Microsoft-Windows-PowerShell/Operational|4103   |5080    |
|Security                                |4690   |4269    |
|Security                                |4656   |4260    |
|Security                                |4663   |4197    |
|Security                                |5156   |2679    |
|security                                |5447   |2579    |
|security                                |4658   |2412    |
|Microsoft-Windows-Sysmon/Operational    |11     |1649    |
|Security                                |5158   |1465    |
|security                                |4656   |1237    |
|Microsoft-Windows-Sysmon/Operational    |3      |1229    |
|security                                |4690   |1202    |
|security                                |4663   |1140    |
|Security                                |4703   |902     |
+----------------------------------------+-------+--------+
only showing top 20 rows

Filtering event logs groups with frequency less or equal to 500

eventLogsLess = spark.sql(
'''
SELECT Channel, EventID, COUNT(*) as Count
FROM apt29
GROUP BY Channel, EventID
HAVING Count <= 500
ORDER BY Count DESC
''')

print('This dataframe has {} records!!'.format(eventLogsLess.count()))
eventLogsLess.show(truncate = False)
This dataframe has 180 records!!
+----------------------------------------+-------+-----+
|Channel                                 |EventID|Count|
+----------------------------------------+-------+-----+
|security                                |5156   |484  |
|Microsoft-Windows-Sysmon/Operational    |1      |447  |
|security                                |5158   |431  |
|Microsoft-Windows-Sysmon/Operational    |23     |422  |
|Microsoft-Windows-PowerShell/Operational|4104   |414  |
|security                                |4673   |409  |
|Microsoft-Windows-Sysmon/Operational    |5      |401  |
|Microsoft-Windows-Sysmon/Operational    |18     |362  |
|security                                |5154   |362  |
|security                                |4688   |279  |
|Security                                |4689   |238  |
|Security                                |4627   |234  |
|Security                                |4624   |234  |
|Security                                |4634   |233  |
|Microsoft-Windows-Sysmon/Operational    |2      |209  |
|Security                                |4688   |181  |
|security                                |4945   |176  |
|security                                |4689   |160  |
|Security                                |4672   |154  |
|Windows PowerShell                      |600    |138  |
+----------------------------------------+-------+-----+
only showing top 20 rows

Thank you! I hope you enjoyed it!