Data Visualization with Pandas and Matplotlib

Description

When working with data visualziations can be a powerful tool to help you understand the data that you have, and they can be much more effective at helping you identify patterns, trends, and anomalies than raw data analysis alone can.

Python has a wide range of excellent, flexible, and powerful data visualization libraries however when working with data in Pandas the built in integration between Pandas and Matplotlib provides the fastest, and easiest way to simply plot your data. If you are planning on creating a complex, interactive visualization you are better placed using something like Bokeh but whilst learning or experimenting the features provided by pandas and matplotlib are likely to be enough.

Pre-requisites - Reading

Importing Libraries

Pre-requisites:

  • pip install pandas

  • pip install matplotlib

import pandas as pd
import matplotlib.pyplot as plt
# Import our network data - originally from https://github.com/hunters-forge/mordor/tree/master/datasets/large/apt29/day1/zeek
plotting_df = pd.read_csv("../../../datasets/flow_logs.csv", index_col=0, parse_dates=True)

Line Plot

The simplest plot is a line chart, you can plot a line chart of all numerical columns in your Pandas DataFrame with df.plot(). Depending on your dataset this may not produce a very interesting or usable visualization. Often (as with this dataset) this is not the best chart to plot.

plotting_df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1427df3bac8>
../../_images/01_data_viz_pandas_6_1.png

Bar Plots

You can plot data in veritcal or horizontal bars. You can also plot on a range of data summarizations such as value_counts() as we are using here.

plotting_df['src_ip'].value_counts().plot.bar(title="Source IPs")
<matplotlib.axes._subplots.AxesSubplot at 0x142000a6278>
../../_images/01_data_viz_pandas_8_1.png
plotting_df['src_ip'].value_counts().plot.barh(title="Source IPs")
<matplotlib.axes._subplots.AxesSubplot at 0x14200159518>
../../_images/01_data_viz_pandas_9_1.png

Pie Charts

There are a number of customizations for each chart such as showing a legend, which is useful for a pie chart.

plotting_df['dst_port'].value_counts().plot.pie(legend=True, title="Destination ports observed")
<matplotlib.axes._subplots.AxesSubplot at 0x1420386e2b0>
../../_images/01_data_viz_pandas_11_1.png

Other charts

There are a number of other chart types avalibale, some of which are shown below. You can select axis to plot either before calling the plot function or as x and y parameters
Avaliable charts:

  • Area

  • Bar

  • Horizontal Bar

  • Box

  • Density

  • Hexbin

  • Histogram

  • Line

  • Pie

  • Scatter

  • Box Plot

  • Kernel Density Estimate

plotting_df.plot.scatter(y="dst_port", x="dst_bytes", title="Bytes by destination port")
<matplotlib.axes._subplots.AxesSubplot at 0x142038094e0>
../../_images/01_data_viz_pandas_13_1.png

Some plots can also be done on a Pandas Series:

plotting_df[plotting_df["dst_port"] <1023]["dst_packets"].plot.box(title="Packet volume distribution")
<matplotlib.axes._subplots.AxesSubplot at 0x142037396a0>
../../_images/01_data_viz_pandas_15_1.png
plotting_df["dst_packets"].plot.kde(2)
<matplotlib.axes._subplots.AxesSubplot at 0x142004ac048>
../../_images/01_data_viz_pandas_16_1.png

You can also apply Pandas conditional statements prior to plotting data:

plotting_df[plotting_df['dst_port']==53]['dst_bytes'].plot.hist(title="DNS sent bytes distribution")
<matplotlib.axes._subplots.AxesSubplot at 0x1420379f630>
../../_images/01_data_viz_pandas_18_1.png

Subplots

By default if plots will include all columns in the DataFrame provided overlayed into one chart. You can set subplots=True parameter to split each column into a seperate plot. You can also resize your plots with the figsize parameter.

plotting_df[['dst_bytes','dst_packets']].plot.area(figsize=(12, 5), subplots=True)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x00000142024548D0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x00000142024886A0>],
      dtype=object)
../../_images/01_data_viz_pandas_20_1.png

Summary

As you can see Pandas and Matplotlib gives you a very easy and simple way of creating a number of visualizations. Also as is clear with some of these visualizations some fit certain data types better than others, when creating visualizations its important to consider what data you you have and what analysis you are looking to get from the visualization.
For more details on plotting data with Pandas refer to the documentation.