# Creating a Spark DataFrame

* **Author**: Jose Rodriguez (@Cyb3rPandah)
* **Project**: Infosec Jupyter Book
* **Public Organization**: [Open Threat Research](https://github.com/OTRF)
* **License**: [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/)
* **Reference**: https://mordordatasets.com/introduction.html

## Importing Spark libraries

In [1]:
from pyspark.sql import SparkSession

## Creating Spark session

In [2]:
spark = SparkSession \
    .builder \
    .appName("Spark_example") \
    .config("spark.sql.caseSensitive","True") \
    .getOrCreate()

In [3]:
spark

## Creating a Spark Sample DataFrame

### Create sample data

Each tuple below describes one security event: its channel, event ID, and description.

In [4]:
eventLogs = [('Sysmon',1,'Process creation'),
             ('Sysmon',2,'A process changed a file creation time'),
             ('Sysmon',3,'Network connection'),
             ('Sysmon',4,'Sysmon service state changed'),
             ('Sysmon',5,'Process terminated'),
             ('Security',4688,'A process has been created'),
             ('Security',4697,'A service was installed in the system')]

In [5]:
type(eventLogs)

list
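
Before handing the list to Spark, it can help to confirm that every tuple has the shape the data suggests: a string channel, an integer event ID, and a string description. The following is a minimal plain-Python sketch, not part of the original notebook; the helper name `check_rows` is hypothetical, and the sample list is repeated so the cell runs on its own.

```python
# Sample rows repeated from the notebook so this cell is self-contained.
eventLogs = [('Sysmon', 1, 'Process creation'),
             ('Sysmon', 2, 'A process changed a file creation time'),
             ('Sysmon', 3, 'Network connection'),
             ('Sysmon', 4, 'Sysmon service state changed'),
             ('Sysmon', 5, 'Process terminated'),
             ('Security', 4688, 'A process has been created'),
             ('Security', 4697, 'A service was installed in the system')]


def check_rows(rows):
    """Return the rows that are not (str, int, str) tuples (hypothetical helper)."""
    expected = (str, int, str)
    bad = []
    for row in rows:
        ok = (isinstance(row, tuple)
              and len(row) == len(expected)
              and all(isinstance(value, kind) for value, kind in zip(row, expected)))
        if not ok:
            bad.append(row)
    return bad


bad_rows = check_rows(eventLogs)
print(len(eventLogs), "rows,", len(bad_rows), "malformed")
```

An empty result means every row already matches the intended column types, so a mismatch is more likely to surface here than as a type error during DataFrame creation.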

### Define DataFrame schema

In [6]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

In [7]:
schema = StructType([
   StructField("Channel", StringType(), True),
   StructField("Event_Id", IntegerType(), True),
   StructField("Description", StringType(), True)])

### Create Spark DataFrame

In [8]:
eventLogsDf = spark.createDataFrame(eventLogs, schema)

In [9]:
eventLogsDf.show(truncate=False)

+--------+--------+--------------------------------------+
|Channel |Event_Id|Description                           |
+--------+--------+--------------------------------------+
|Sysmon  |1       |Process creation                      |
|Sysmon  |2       |A process changed a file creation time|
|Sysmon  |3       |Network connection                    |
|Sysmon  |4       |Sysmon service state changed          |
|Sysmon  |5       |Process terminated                    |
|Security|4688    |A process has been created            |
|Security|4697    |A service was installed in the system |
+--------+--------+--------------------------------------+



In [10]:
type(eventLogsDf)

pyspark.sql.dataframe.DataFrame

## Exposing Spark DataFrame as a SQL View

In [11]:
eventLogsDf.createOrReplaceTempView('eventLogs')

## Testing a SQL-like Query

Filter the view to keep only events from the **Sysmon** channel.

In [12]:
sysmonEvents = spark.sql(
'''
SELECT *
FROM eventLogs
WHERE Channel = 'Sysmon'
''')

In [13]:
sysmonEvents.show(truncate=False)

+-------+--------+--------------------------------------+
|Channel|Event_Id|Description                           |
+-------+--------+--------------------------------------+
|Sysmon |1       |Process creation                      |
|Sysmon |2       |A process changed a file creation time|
|Sysmon |3       |Network connection                    |
|Sysmon |4       |Sysmon service state changed          |
|Sysmon |5       |Process terminated                    |
+-------+--------+--------------------------------------+



## Thank you! I hope you enjoyed it!