Data Visualization with Pandas and Matplotlib
Author: Pete Bryan (@MSSPete)
Project: Infosec Jupyter Book
Public Organization: Microsoft
When working with data, visualizations can be a powerful tool to help you understand what you have. They are often much more effective than raw data analysis alone at helping you identify patterns, trends, and anomalies.
Python has a wide range of excellent, flexible, and powerful data visualization libraries. However, when working with data in Pandas, the built-in integration between Pandas and Matplotlib provides the fastest and easiest way to plot your data. If you are planning to create a complex, interactive visualization, you are better off using something like Bokeh, but while learning or experimenting, the features provided by Pandas and Matplotlib are likely to be enough.
pip install pandas
pip install matplotlib
import pandas as pd
import matplotlib.pyplot as plt
# Import our network data - originally from https://github.com/hunters-forge/mordor/tree/master/datasets/large/apt29/day1/zeek
plotting_df = pd.read_csv("../../../datasets/flow_logs.csv", index_col=0, parse_dates=True)
The simplest plot is a line chart. You can plot a line chart of all numerical columns in your Pandas DataFrame with df.plot(). Depending on your dataset, this may not produce a very interesting or usable visualization; often (as with this dataset) a line chart is not the best choice.
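As a minimal sketch of the df.plot() call described above, the following uses a small synthetic DataFrame. The name demo_df, its values, and its column names are assumptions made for illustration and are not taken from the flow log dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit this line in a notebook
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flow log data: one row per minute, numeric columns only
rng = np.random.default_rng(seed=0)
demo_df = pd.DataFrame(
    {
        "dst_bytes": rng.integers(0, 5000, size=60),
        "dst_packets": rng.integers(1, 50, size=60),
    },
    index=pd.date_range("2020-05-01", periods=60, freq="min"),
)

# With no arguments, plot() draws one line per numeric column against the index
ax = demo_df.plot(title="All numeric columns as lines")
```

Because both columns share one y-axis, a column with much larger values (here the byte counts) can flatten the others, which is one reason a plain line chart is often not the most useful view.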
You can plot data in vertical or horizontal bars. You can also plot the output of a range of data summarizations, such as value_counts(), which we use here.
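The bar charts described above might look something like the sketch below. It uses a short, made-up Series of destination ports (the name ports and its values are assumptions for illustration) and draws the value_counts() result as both vertical and horizontal bars.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical destination ports, standing in for plotting_df["dst_port"]
ports = pd.Series([53, 443, 443, 80, 53, 443, 3389], name="dst_port")

# value_counts() returns a Series indexed by port, sorted by frequency
counts = ports.value_counts()

# Draw the same summary as vertical and horizontal bars, side by side
fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(10, 4))
counts.plot.bar(ax=ax_v, title="Vertical bars")
counts.plot.barh(ax=ax_h, title="Horizontal bars")
```

Horizontal bars tend to be easier to read when there are many categories or long labels.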
Each chart type supports a number of customizations, such as showing a legend, which is useful for a pie chart.
plotting_df['dst_port'].value_counts().plot.pie(legend=True, title="Destination ports observed")
There are a number of other chart types available, some of which are shown below. You can select the columns to plot either before calling the plot function or by passing them as the x and y parameters. Available charts:
Kernel Density Estimate
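The two ways of choosing what to plot can be sketched as follows. The DataFrame demo_df and its values are invented for illustration, and the histogram and hexbin chart types are used here only as examples of the approach.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical flow records; column names mirror the dataset, values are made up
demo_df = pd.DataFrame(
    {
        "dst_port": [53, 53, 80, 443, 443, 443, 3389, 8080],
        "dst_bytes": [120, 95, 2300, 5100, 4800, 760, 15000, 330],
        "dst_packets": [1, 1, 12, 30, 28, 6, 90, 3],
    }
)

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 4))

# Option 1: select the column first, then call the plot method on the result
demo_df["dst_bytes"].plot.hist(bins=5, ax=ax_left, title="Column selected before plotting")

# Option 2: keep the whole DataFrame and name the columns with x and y
demo_df.plot.hexbin(
    x="dst_packets", y="dst_bytes", gridsize=10, ax=ax_right, title="Columns passed as x and y"
)
```

Selecting first suits single-variable charts such as histograms, while x and y are needed for charts that relate two columns to each other.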
plotting_df.plot.scatter(y="dst_port", x="dst_bytes", title="Bytes by destination port")
Some plots can also be done on a Pandas Series:
plotting_df[plotting_df["dst_port"] <1023]["dst_packets"].plot.box(title="Packet volume distribution")
You can also apply Pandas conditional statements prior to plotting data:
plotting_df[plotting_df['dst_port']==53]['dst_bytes'].plot.hist(title="DNS sent bytes distribution")
By default, plots include all columns in the provided DataFrame, overlaid in one chart. You can set the subplots=True parameter to split each column into a separate plot. You can also resize your plots with the figsize parameter.
plotting_df[['dst_bytes','dst_packets']].plot.area(figsize=(12, 5), subplots=True)
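With subplots=True, the plot call returns an array holding one Matplotlib Axes object per column, so each subplot can be adjusted afterwards with ordinary Matplotlib methods. The sketch below assumes a small invented DataFrame (demo_df) and made-up axis labels.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit this line in a notebook
import pandas as pd

# Hypothetical values standing in for the dst_bytes and dst_packets columns
demo_df = pd.DataFrame(
    {
        "dst_bytes": [120, 2300, 5100, 760, 15000, 330],
        "dst_packets": [1, 12, 30, 6, 90, 3],
    }
)

# One area chart per column, stacked vertically in a 12 x 5 inch figure
axes = demo_df.plot.area(subplots=True, figsize=(12, 5))

# Each element of the returned array is a regular Matplotlib Axes
for ax, unit in zip(axes, ["bytes", "packets"]):
    ax.set_ylabel(unit)
```

Splitting columns this way gives each its own y-axis scale, which helps when the columns differ by orders of magnitude.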
As you can see, Pandas and Matplotlib give you a very easy and simple way of creating a number of visualizations. As some of these visualizations make clear, certain chart types fit certain data better than others. When creating visualizations, it's important to consider what data you have and what analysis you are looking to get from the visualization. For more details on plotting data with Pandas, refer to the documentation.