Recipes for the Visualizations of Data Distributions

Oct 22, 2019

Histograms, KDE plots, box(en) plots and violin plots and more…

As a budding data scientist, I realized that the first piece of code in a project is almost always written to understand the distribution of one or more variables in the data set. Visualizing the distribution of a variable lets you immediately grasp valuable properties such as frequencies, peaks, skewness, center and modality, and see how values and outliers behave across the data range.

With the excitement of sharing knowledge, I created this blog post to summarize single-variable (univariate) distribution plots and to share what I learned from several articles and documentation pages. I will walk through the steps to draw each plot without going deep into theory, to keep the post simple.

I will start by explaining the functions to visualize data distributions with Python using the Matplotlib and Seaborn libraries. The code behind the visualizations can be found in this notebook.

For illustrations, I used Gapminder life expectancy data. The cleaned version can be found in this GitHub repository.

The data set shows 142 countries’ life expectancy at birth, population and GDP per capita between the years 1952 and 2007. I will plot the life expectancy at birth using:

  1. Histogram
  2. Kernel Density Estimation and Distribution Plot
  3. Box Plot
  4. Boxen Plot
  5. Violin Plot

Histograms are the simplest way to show how data is spread. Here is the recipe for making a histogram:

  • Create buckets (bins) by dividing your data range into equal-sized intervals. The number of intervals is the number of bins you have.
  • Record the count of the data points that fall into each bin.
  • Visualize each bucket side by side on the x-axis.
  • Show the counts on the y-axis, so the height of each bar tells you how many items are in that bin.

And you have a brand-new histogram!
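
The recipe above can be sketched in a few lines of NumPy. This is only an illustration of the steps, not the code behind the figures in this post: the function name manual_histogram and the sample values are made up, and the sketch assumes the data has at least two distinct values.

```python
import numpy as np

def manual_histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    # n_bins equal-width bins need n_bins + 1 edges
    edges = np.linspace(lo, hi, n_bins + 1)
    # bin index of each value; the maximum value is placed in the last bin
    idx = np.minimum(((values - lo) / (hi - lo) * n_bins).astype(int), n_bins - 1)
    # count the data points per bin
    counts = np.bincount(idx, minlength=n_bins)
    return counts, edges

# made-up sample values, not taken from the Gapminder data
sample = [30.0, 42.5, 43.1, 55.0, 58.2, 61.7, 70.3, 71.9, 72.4, 80.0]
counts, edges = manual_histogram(sample, n_bins=5)
print(counts)  # [1 2 2 1 4]
print(edges)   # bin edges from 30 to 80 in steps of 10
```

For this sample the counts agree with np.histogram(sample, bins=5), which is what you would normally call instead of writing the loop yourself.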

It is the easiest and most intuitive way. However, one drawback is that you have to decide on the number of bins.

In this graph, I settled on 25 bins, which seemed optimal after playing around with the bins parameter of the Matplotlib hist function.

import matplotlib.pyplot as plt

# set the histogram (df is the Gapminder data frame)
plt.hist(df.life_expectancy,
         range=(df.life_expectancy.min(),
                df.life_expectancy.max() + 1),
         bins=25,
         alpha=0.5)
# set title and labels
plt.xlabel("Life Expectancy")
plt.ylabel("Count")
plt.title("Histogram of Life Expectancy between 1952 and 2007 in the World")
plt.show()

A different number of bins can significantly change how your data distribution looks. Here is the same data distribution with 5 bins. It looks like a totally different data set, right?

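
If picking the bin count by trial and error feels arbitrary, NumPy can suggest one with a rule of thumb. The sketch below is not how the 25 bins above were chosen. It assumes synthetic, normally distributed values as a stand-in for the life expectancy column and compares three of the rules that np.histogram_bin_edges accepts.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for a life expectancy column (not the Gapminder data)
values = rng.normal(loc=60, scale=12, size=1000)

bin_counts = {}
for rule in ("sturges", "fd", "auto"):
    # each rule returns the bin edges; the number of bins is one less
    edges = np.histogram_bin_edges(values, bins=rule)
    bin_counts[rule] = len(edges) - 1
    print(rule, "->", bin_counts[rule], "bins")
```

Matplotlib's hist function accepts the same rule names through its bins parameter, so plt.hist(values, bins="auto") is a reasonable starting point before fine-tuning by eye.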

If you don’t want to be bothered with choosing the number of bins, then let’s jump to the kernel density estimation functions and distribution plots.

Kernel Density Estimation (KDE) plots save you from the hassle of deciding on the bin size by replacing the bars with a smooth curve. The smoothness is controlled by a bandwidth instead, which Seaborn picks for you by default. Follow the logic below to create a KDE plot:

  • Plot a Gaussian (normal) curve around each data point.
  • Sum the curves to create a density at each point.
  • Normalize the final curve so that the area under it equals 1, resulting in a probability density function.
  • You will find the range of the data on the x-axis and the probability density function of the random variable on the y-axis. The probability density function is described in this article by Will Koehrsen as follows:

You may think of the y-axis on a density plot as a value only for relative comparisons between different categories.
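
To make the three steps concrete, here is a minimal hand-rolled version in NumPy. It is a sketch under assumptions: the name manual_kde, the sample values and the bandwidth of 4.0 are all made up for illustration, and Seaborn picks its bandwidth with its own rule rather than a fixed number.

```python
import numpy as np

def manual_kde(data, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in grid."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # step 1: one Gaussian curve per data point, evaluated on the grid
    z = (grid[:, None] - data[None, :]) / bandwidth
    curves = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # step 2: sum the curves at each grid point
    # step 3: divide by the number of points so the total area is 1
    return curves.sum(axis=1) / data.size

data = [45.0, 52.0, 60.0, 68.0, 70.0, 71.0, 73.0]  # made-up values
grid = np.linspace(0, 120, 2401)
density = manual_kde(data, grid, bandwidth=4.0)

# approximate the area under the curve with a simple Riemann sum
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))  # 1.0
```

A smaller bandwidth makes the curve bumpier and a larger one makes it smoother, which is the KDE counterpart of choosing more or fewer bins.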

Luckily, you don’t have to remember and apply all these steps manually. Seaborn’s KDE plot function completes all of them for you. Just pass a column of your data frame or a NumPy array to it!

import seaborn as sns

# set KDE plot, title and labels
# (fill=True replaces the shade=True argument of older Seaborn releases)
ax = sns.kdeplot(df.life_expectancy, fill=True, color="b")
plt.title("KDE Plot of Life Expectancy between 1952 and 2007 in the World")
plt.ylabel("Density")

If you want to combine a histogram and a KDE plot, Seaborn has another cool option: the distribution plot. It draws a KDE plot and lets you turn the histogram on and off through the hist parameter.

# set distribution plot, title and labels
# (note: distplot is deprecated since Seaborn 0.11 in favor of histplot and displot)
ax = sns.distplot(df.life_expectancy, hist=True, color="b")
plt.title("Distribution Plot of Life Expectancy between 1952 and 2007 in the World")
plt.ylabel("Density")

KDE plots are also capable of showing distributions among different categories:

# create list of continents
continents = df["continent"].value_counts().index.tolist()
# set kde plot for each continent
for c in continents:
    subset = df[df["continent"] == c]
    sns.kdeplot(subset["life_expectancy"], label=c, linewidth=2)
# show the continent labels in a legend
plt.legend()
# set title, x and y labels
plt.title("KDE Plot of Life Expectancy Among Continents Between 1952 and 2007")
plt.ylabel("Density")
plt.xlabel("Life Expectancy")

Although KDE plots and distribution plots involve more computation and mathematics than histograms, a continuous line makes it easier to read the modality, symmetry, skewness and center of the distribution. One disadvantage is that they lack information about summary statistics.

If you wish to provide summary statistics of your distribution visually, then let’s move to the box plots.

Box plots show data distributions with the five-number summary: minimum, first quartile (Q1), median (the second quartile), third quartile (Q3) and maximum. Here are the steps to draw them:

  • Sort your data to determine the minimum, the quartiles (first, second and third) and the maximum.
  • Draw a box between the first and third quartiles, then draw a vertical line inside the box at the median.
  • Draw a horizontal line on each side of the box, at half its height, extending out toward the minimum and the maximum. These lines are your whiskers.
  • The ends of the whiskers mark the minimum and maximum of the data, not counting any points drawn separately as little diamonds beyond the whiskers, which are interpreted as “outliers”.
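
The numbers behind the box can be computed directly. This is a minimal sketch with a made-up helper name (five_number_summary) and made-up sample values. It relies on NumPy's default linear interpolation for the quartiles, and other tools may use slightly different quartile conventions.

```python
import numpy as np

def five_number_summary(values):
    """Return the minimum, Q1, median, Q3 and maximum of the values."""
    values = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "min": float(values[0]),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "max": float(values[-1]),
    }

sample = [40, 45, 50, 55, 60, 65, 70, 75, 80]  # made-up values
summary = five_number_summary(sample)
print(summary)
# {'min': 40.0, 'q1': 50.0, 'median': 60.0, 'q3': 70.0, 'max': 80.0}
```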

The steps to create a box plot manually are straightforward, but I prefer to get some support from Seaborn’s box plot function.

# set the box plot and title
sns.boxplot(x="life_expectancy", data=df, palette="Set3")
plt.title("Boxplot of Life Expectancy between 1952 and 2007 in the World")

There are several ways to calculate the length of the whiskers. By default, Seaborn’s box plot function extends them up to 1.5 times the interquartile range (IQR) beyond the first and third quartiles. Thus, any data point bigger than Q3+(1.5*IQR) or smaller than Q1-(1.5*IQR) is drawn as an outlier. You can change the calculation of the whiskers by adjusting the whis parameter.
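
The 1.5 times IQR rule is easy to check by hand. The sketch below uses a hypothetical helper, whiskers_and_outliers, and made-up values. It follows the rule as described above and is not Seaborn's internal code, whose quartile calculation may differ in small details.

```python
import numpy as np

def whiskers_and_outliers(values, whis=1.5):
    """Return the whisker ends and the points flagged as outliers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # whiskers stop at the most extreme points that are still inside the fences
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return float(inside.min()), float(inside.max()), outliers

sample = [10, 50, 52, 55, 58, 60, 62, 65, 68]  # made-up values
low, high, outliers = whiskers_and_outliers(sample)
print(low, high, outliers)  # 50.0 68.0 [10.]
```

Here Q1 is 52 and Q3 is 62, so the fences sit at 37 and 77. The value 10 falls below the lower fence and would be drawn as a diamond, while the whiskers end at 50 and 68.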

Like KDE plots, box plots are also suitable for visualizing the distributions among categories:

# set the box plot with the ordered continents and title
sns.boxplot(x="continent", y="life_expectancy", data=df,
            palette="Set3",
            order=["Africa", "Asia", "Americas", "Europe",
                   "Oceania"])
plt.title("Boxplot of Life Expectancy Among Continents Between 1952 and 2007")

Box plots tell the story of the statistics: the box shape and whiskers show where half of the data lies and what the whole range of the data is. On the other hand, you cannot see the story of the data outside the box. That is the reason why some scientists published a paper about boxen plots, known as extended box plots.

Boxen plots, or letter value plots or extended box plots, might be the least used method for data distribution visualizations, yet they convey more information on large data sets.

To create a boxen plot, let’s first understand what a letter value summary is. A letter value summary is built by repeatedly finding the middle value of sorted data.

First, determine the middle value of all the data and split the data into two slices. Then determine the median of each of those two slices, and repeat this process on the outer slices until a stopping criterion is reached or no more data is left to be separated.

The first middle value determined is the median. The middle values determined in the second iteration are called fourths, and the middle values determined in the third iteration are called eighths.
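
Here is a rough sketch of that halving process. It approximates the letter values with plain quantiles (1/2, then 1/4 and 3/4, then 1/8 and 7/8, and so on), which is a simplification of the depth-based rule used in the letter value literature and of whatever stopping criterion Seaborn applies. The function name letter_values and the sample are made up.

```python
import numpy as np

def letter_values(values, depth=3):
    """Approximate letter values as pairs of (lower, upper) quantiles."""
    values = np.asarray(values, dtype=float)
    names = ["median", "fourths", "eighths", "sixteenths"]
    summary = {}
    for k, name in enumerate(names[:depth], start=1):
        tail = 0.5 ** k  # 1/2, 1/4, 1/8, ...
        lower, upper = np.quantile(values, [tail, 1 - tail])
        summary[name] = (float(lower), float(upper))
    return summary

sample = np.arange(1, 18)  # the integers 1 to 17, made-up data
lv = letter_values(sample, depth=3)
print(lv)
# {'median': (9.0, 9.0), 'fourths': (5.0, 13.0), 'eighths': (3.0, 15.0)}
```

Each pair marks the edges of one box in a boxen plot: the fourths give the familiar box, and the eighths give the next, narrower box drawn outside it.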

Now let’s draw a box plot and visualize the letter value summaries outside the box instead of whiskers. In other words, plot a box plot with extended box edges corresponding to the middle values of the slices (eighths, sixteenths and so on).

# set boxen plot and title
sns.boxenplot(x="life_expectancy", data=df, palette="Set3")
plt.title("Boxenplot of Life Expectancy between 1952 and 2007 in the World")

They are also effective in telling the data story for different categories:

# set boxen plot with ordered continents and title
sns.boxenplot(x="continent", y="life_expectancy", data=df,
              palette="Set3",
              order=["Africa", "Asia", "Americas", "Europe",
                     "Oceania"])
plt.title("Boxenplot of Life Expectancy Among Continents Between 1952 and 2007")

Boxen plots emerged to visualize larger data sets more effectively. They show how data is spread outside the main box and put more emphasis on the outliers, because outliers and data outside the IQR carry more weight in larger data sets.

There are two perspectives that give clues about a data distribution: the shape of the distribution and the summary statistics. To explain a distribution from both perspectives at the same time, let’s learn to cook some violin plots.

Violin plots are the perfect combination of box plots and KDE plots. They deliver the summary statistics with the box plot inside and the shape of the distribution with the KDE plot on the sides.

It is my favorite plot because data is expressed with all the details it has. Do you remember the life expectancy distribution shape and summary statistics we plotted earlier? Seaborn violin plot function will blend it for us now.

Et voilà !

# set violin plot and title
sns.violinplot(x="life_expectancy", data=df, palette="Set3")
plt.title("Violinplot of Life Expectancy between 1952 and 2007 in the World")

You can observe the peak of the data around 70 by looking at the distribution on the sides, and see that half of the data points lie between 50 and 70 by noticing the slim box inside.

These beautiful violins can be used to visualize data with categories, and you can express summary statistics with dots, dashed lines or lines if you wish, by changing the inner parameter.

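
As a small illustration of the inner parameter, the sketch below draws the same violins three times with dots, dashed quartile lines and one line per observation. It is not the code behind the figures in this post: the df_demo data frame is synthetic and only mimics the continent and life_expectancy columns used here.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# synthetic stand-in for the Gapminder columns used in this post
df_demo = pd.DataFrame({
    "continent": np.repeat(["Africa", "Europe"], 200),
    "life_expectancy": np.concatenate([
        rng.normal(50, 9, 200),
        rng.normal(72, 5, 200),
    ]),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
# point = dots, quartile = dashed lines, stick = one line per observation
for ax, inner in zip(axes, ["point", "quartile", "stick"]):
    sns.violinplot(x="continent", y="life_expectancy", data=df_demo,
                   inner=inner, ax=ax)
    ax.set_title(f'inner="{inner}"')
# plt.show()  # uncomment when running interactively
```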

The advantage is obvious: Visualize the shape of the distribution and summary statistics simultaneously!

Bonus points for violin plots: by setting the scale parameter to count, you can also show how many data points you have in each category, thus emphasizing the importance of each category. When I changed scale, Africa and Asia expanded and Oceania shrank, showing that there are fewer data points in Oceania and more in Africa and Asia.

# set the violin plot with different scale, inner parameter and title
# (newer Seaborn releases, 0.13 and later, name the scale parameter density_norm)
sns.violinplot(x="continent", y="life_expectancy", data=df,
               palette="Set3",
               order=["Africa", "Asia", "Americas", "Europe",
                      "Oceania"],
               inner=None, scale="count")
plt.title("Violinplot of Life Expectancy Among Continents Between 1952 and 2007")

So, these recipes about visualizing distributions explained the core idea behind each plot. There are plenty of options to show single-variable, or univariate, distributions.

Histograms, KDE plots and distribution plots explain the shape of the data very well. Additionally, distribution plots can combine histograms and KDE plots.

Box plots and boxen plots are best for communicating summary statistics, boxen plots work better on large data sets, and violin plots do it all.

They are all effective communicators, and each of them can be built quickly with the Seaborn library in Python. Your visualization choice depends on your project (data set) and on what information you want to convey to your audience. If you are thrilled by this post and want to learn more, you can check the Seaborn and Matplotlib documentation.

Last but not least, this is my first contribution to Towards Data Science, and I hope you enjoyed reading it! I appreciate your constructive feedback and would like to hear your opinions about this blog post in the responses or on Twitter.

