Data Analysis with Spark SQL: Filtering & Summarizing¶
Author: Jose Rodriguez (@Cyb3rPandah)
Project: Infosec Jupyter Book
Public Organization: Open Threat Research
License: Creative Commons Attribution-ShareAlike 4.0 International
Reference: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html
Creating a SQL view from the Mordor APT29 dataset¶
Create Spark session¶
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Spark_Data_Analysis") \
    .config("spark.sql.caseSensitive", "True") \
    .getOrCreate()
Expose the dataframe as a SQL view¶
apt29Json = '../datasets/apt29_evals_day1_manual_2020-05-01225525.json'
apt29Df = spark.read.json(apt29Json)
apt29Df.createOrReplaceTempView('apt29')
Filtering & Summarizing data¶
Filter Sysmon event 8 (Create Remote Thread) data¶
sysmon8 = spark.sql(
'''
SELECT SourceImage, TargetImage, StartFunction
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational' AND EventID = 8
''')
print('This dataframe has {} records!!'.format(sysmon8.count()))
sysmon8.show(n = 5, truncate = False)
This dataframe has 95 records!!
+---------------------------------------------------------+-------------------------------+-------------+
|SourceImage |TargetImage |StartFunction|
+---------------------------------------------------------+-------------------------------+-------------+
|C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe|C:\Windows\System32\lsass.exe |- |
|C:\Windows\System32\csrss.exe |C:\Windows\System32\svchost.exe|CtrlRoutine |
|C:\Windows\System32\csrss.exe |C:\Windows\System32\svchost.exe|CtrlRoutine |
|C:\Windows\System32\csrss.exe |C:\Windows\System32\svchost.exe|CtrlRoutine |
|C:\Windows\System32\csrss.exe |C:\Windows\System32\svchost.exe|CtrlRoutine |
+---------------------------------------------------------+-------------------------------+-------------+
only showing top 5 rows
Filter PowerShell processes within Sysmon event 8 (Create Remote Thread) data¶
sysmon8 = spark.sql(
'''
SELECT SourceImage, TargetImage, StartFunction
FROM apt29
WHERE Channel = 'Microsoft-Windows-Sysmon/Operational'
AND EventID = 8
AND SourceImage LIKE '%powershell.exe%'
''')
print('This dataframe has {} record(s)!!'.format(sysmon8.count()))
sysmon8.show(truncate = False)
This dataframe has 1 record(s)!!
+---------------------------------------------------------+-----------------------------+-------------+
|SourceImage |TargetImage |StartFunction|
+---------------------------------------------------------+-----------------------------+-------------+
|C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe|C:\Windows\System32\lsass.exe|- |
+---------------------------------------------------------+-----------------------------+-------------+
Summarizing data¶
Stack count event logs by data source and event ID¶
eventLogs = spark.sql(
'''
SELECT Channel, EventID, COUNT(*)
FROM apt29
GROUP BY Channel, EventID
ORDER BY COUNT(*) DESC
''')
print('This dataframe has {} records!!'.format(eventLogs.count()))
eventLogs.show(truncate = False)
This dataframe has 203 records!!
+----------------------------------------+-------+--------+
|Channel |EventID|count(1)|
+----------------------------------------+-------+--------+
|Microsoft-Windows-Sysmon/Operational |12 |61151 |
|Microsoft-Windows-Sysmon/Operational |10 |39283 |
|Microsoft-Windows-Sysmon/Operational |7 |20259 |
|Microsoft-Windows-Sysmon/Operational |13 |17541 |
|Security |4658 |8561 |
|Windows PowerShell |800 |5113 |
|Microsoft-Windows-PowerShell/Operational|4103 |5080 |
|Security |4690 |4269 |
|Security |4656 |4260 |
|Security |4663 |4197 |
|Security |5156 |2679 |
|security |5447 |2579 |
|security |4658 |2412 |
|Microsoft-Windows-Sysmon/Operational |11 |1649 |
|Security |5158 |1465 |
|security |4656 |1237 |
|Microsoft-Windows-Sysmon/Operational |3 |1229 |
|security |4690 |1202 |
|security |4663 |1140 |
|Security |4703 |902 |
+----------------------------------------+-------+--------+
only showing top 20 rows

Note that Security and security show up as separate groups: string comparisons in Spark SQL are case-sensitive, so Channel values that differ only in casing are not merged.
Filter event log groups with frequency less than or equal to 500¶
eventLogsLess = spark.sql(
'''
SELECT Channel, EventID, COUNT(*) as Count
FROM apt29
GROUP BY Channel, EventID
HAVING Count <= 500
ORDER BY Count DESC
''')
print('This dataframe has {} records!!'.format(eventLogsLess.count()))
eventLogsLess.show(truncate = False)
This dataframe has 180 records!!
+----------------------------------------+-------+-----+
|Channel |EventID|Count|
+----------------------------------------+-------+-----+
|security |5156 |484 |
|Microsoft-Windows-Sysmon/Operational |1 |447 |
|security |5158 |431 |
|Microsoft-Windows-Sysmon/Operational |23 |422 |
|Microsoft-Windows-PowerShell/Operational|4104 |414 |
|security |4673 |409 |
|Microsoft-Windows-Sysmon/Operational |5 |401 |
|Microsoft-Windows-Sysmon/Operational |18 |362 |
|security |5154 |362 |
|security |4688 |279 |
|Security |4689 |238 |
|Security |4627 |234 |
|Security |4624 |234 |
|Security |4634 |233 |
|Microsoft-Windows-Sysmon/Operational |2 |209 |
|Security |4688 |181 |
|security |4945 |176 |
|security |4689 |160 |
|Security |4672 |154 |
|Windows PowerShell |600 |138 |
+----------------------------------------+-------+-----+
only showing top 20 rows