
The Basics of Data Visualization in Python
Explore the dynamic world of data visualization in Python, as we delve into fundamental techniques to transform raw data into compelling visual narratives for enhanced insights.
Data Visualization
In the vast world of data analysis, the ability to effectively present results is paramount. Enter the world of data visualization, where Python and its tools shine brightly! Whether you’re a data scientist, analyst, or enthusiast, mastering the art of visualization is essential. Here, I aim to guide you through fundamental techniques that lay the foundation for impactful visualization. Buckle up and put away the graph paper and rulers, it’s typing time!
First Things First
It’s important to mention that the type of data you’re visualizing should determine the visual you use. For example, categorical data can be best conveyed with the use of a bar chart, whereas a histogram can be the most useful to see the distribution of numerical data. Below, I will talk about more types and representations of data.
Three Steps
Here are the three basic steps to make visualizations in python:
-
Importing necessary packages
-
Importing the data
-
Create the appropriate visualizations
Step 1 - Import Packages
There’s a wide variety of packages that can assist you in your visualizations, but the primary packages include Pandas (for data processing), Matplotlib (for basic visuals), Numpy (for calculations), and Seaborn (for more advanced visuals).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.stats import norm
from matplotlib import style
style.use('ggplot')
Here, not only did I import the neccessary packages, but I set the graph style to ‘ggplot’, although there’s a variety of styles available.
Step 2 - Import the Data
For simplicity’s sake, I’m going to use two different data sets that comes loaded in the Seaborn library. In certain cases, you may need to take further steps to clean and arrange your data in a way that makes it more usable. This could include renaming columns or rows, getting rid of missing values, adding or removing columns, and other procesudes that makes the data more accessible.
iris = sns.load_dataset('iris')
titanic = sns.load_dataset('titanic')
Step 3 - Visualizations
Categorical Data
Categorical data is a type of data that can be divided into groups based on qualitative characteristics. Categorical observations fall into distinct categories, rarely overlapping.
#### Bar Chart
Bar charts are probably the most recognizable visual for categorical data. They are able to offer clear ways to display the relationships and differences in the included categories.
# categorical data - Iris
species_counts = iris['species']
sns.barplot(x=species_counts.index, y=species_counts.values, palette = 'viridis')
plt.title('Frequency of Iris Species')
plt.xlabel('Species')
plt.ylabel('Count')
plt.savefig('IrisBarchart.jpg')
plt.show()
As you can see, bar charts are easy ways to show qualitative characteristics within a data set.
#### Pie Chart
Another effective way of showing the frequency or count of a categorical variable is a pie chart. These charts can be tricky, because peoeple don’t always realize that the circle must add up to 100% - I have seen many, many instances where one ‘slice’ of the pie is ~60%, and another is ~20%, and a thid is ~35%, which is an issue! This chart is exclusively to show the break-up of one category within another. Below, I made a chart that displays what percentage from the survivors of the Titanic belonged to each class, either first, second, or third.
There are many more ways to display categorical data, such as tree maps, stacked bar charts, and violin plots, just to name a few.
Numerical Data
Histogram
Histograms provide visual representations of distributions of numerical data. Here, the data are divided into ‘bins’, which are the rectangles you see on the graph, and the height of each bar corresponds to the frequency (or sometimes, the probability), of data falling within that bin.
sepal_length_data = iris['sepal_length']
plt.hist(sepal_length_data, bins=20, color='skyblue', edgecolor='black', alpha=0.7, density=True)
mu, std = norm.fit(sepal_length_data)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
plt.title('Histogram with Normal Distribution Fit')
plt.xlabel('Sepal Length')
plt.ylabel('Frequency')
plt.show()
Box Plot
The box plot provides a visual representation of the five-number summary one can obtain from a data set. The statistics included are typically the median, upper and lower quartiles, the minimum and maximum observation, and any outliers. Box plots also offer insights into the shape of a data set.
plt.figure(figsize=(10, 6))
sns.boxplot(x='class', y='fare', data=titanic, palette='viridis')
plt.title('Boxplot of Fare Across Passenger Classes in Titanic')
plt.xlabel('Passenger Class')
plt.ylabel('Fare')
plt.show()
Scatterplot
The last visual that I will mention for now is the scatterplot. This is a display of individual data points over a space, with one variable on the x-axis and a different one on the y-axis. This is a great way to gauge the relationship between two variables by observing the pattern of the points in the scatterplot.
plt.figure(figsize=(10, 6))
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=iris, palette='Set2', s=80)
plt.title('Scatterplot of Sepal Length vs Sepal Width in Iris Dataset')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend()
plt.show()
Scatterplots are also great because you’re really not confined to just two variables; I added a third by color-coding the points by species.
Last Few Tips
- Make sure your data is clean!
- Understand your data before you try to visualize it
- Add descriptive titles and labels
Wrapping it Up!
I truly just scratched the surface with the three visualization methods I showed in this post. There are many more such as single or multiple line graphs, box plots, heat maps, pie charts, and so on. I hope that this was enough to pique your interest in data visualization tactics, and implore you to continue practicing and researching the methods that best suit your data’s needs!