Understanding Box Plots

Plotting And Interpreting A Box Plot

Himani Gulati
3 min read · May 26, 2022
Index of Contents
· Introduction
· Median
· Quartiles
· ‘Lowest’ & ‘Greatest’ data point
· Plotting
· Matplotlib
· Seaborn
· Conclusion
· References

Introduction

Box plots (also called box-and-whisker plots) are a compact way to summarize numerical data statistically and to illustrate measures of dispersion, position, and central tendency. In this tutorial, we will cover how to interpret a box plot and how to create one using pandas.

Box plot demonstration

They are especially useful when we need to compare distributions and look for any potential skew in the data. For example, we can easily compare how a class performed in each subject.

A single plot represents 5 major positions:

  • Median ‘M’
  • First Quartile ‘Q1’
  • Third Quartile ‘Q3’
  • The lowest data point ‘L’
  • The greatest data point ‘G’

Median

The median is a measure of central tendency. For a set of numbers, the median is the middle number of the set after the numbers have been arranged in increasing order. For example, the median of the numbers 3, 7, 10, 55, and 97 is 10.

When the size of the set is even, there’s no single middle number. In this case, the median is the average of the two middle numbers.
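
Both cases can be checked quickly with NumPy. This is only a small sketch: the five-number set is the one from the example above, while the four-number set is an assumed example added to illustrate the even case.

```python
import numpy as np

odd_sized = [97, 3, 55, 10, 7]   # the five numbers from the example, deliberately unsorted
even_sized = [3, 7, 10, 55]      # assumed four-value set for the even case

# np.median sorts internally, so the input order does not matter
print(np.median(odd_sized))    # 10.0
print(np.median(even_sized))   # 8.5, the average of the two middle numbers 7 and 10
```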

Quartiles

Quartiles are numbers that divide a dataset into 4 equal parts (or quarters). A dataset has three quartiles:

  • 1st Quartile or Q1: 25% of values lie below it i.e. it’s the 25th percentile
  • 2nd Quartile or Q2: 50% of values lie below it i.e. it’s the median or 50th percentile
  • 3rd Quartile or Q3: 75% of values lie below it i.e. it’s the 75th percentile
Quartile division
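
As a rough illustration, the three quartiles can be computed with NumPy's percentile function. The data below is made up, and NumPy's default linear interpolation is only one of several conventions for computing quartiles, so other tools may give slightly different values on small datasets.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])  # made-up data for illustration

# 25th, 50th and 75th percentiles; NumPy interpolates linearly by default
q1, q2, q3 = np.percentile(values, [25, 50, 75])
print(q1, q2, q3)  # 3.0 5.0 7.0

# Q2 is the same value as the median
print(q2 == np.median(values))  # True
```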

‘L’ and ‘G’:

These are the lowest and the greatest data points in the dataset. Subtracting the lowest from the greatest gives us the range ‘R’ of our dataset, i.e. R = G - L.
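
Putting the five positions together, here is a minimal sketch of a five-number summary using pandas. The helper name five_number_summary is hypothetical (it is not a pandas function), and the data is the small set from the median example.

```python
import pandas as pd

def five_number_summary(values):
    """Return L, Q1, M, Q3 and G for a sequence of numbers (hypothetical helper)."""
    s = pd.Series(values)
    return {
        "L": s.min(),             # lowest data point
        "Q1": s.quantile(0.25),   # first quartile
        "M": s.median(),          # median (second quartile)
        "Q3": s.quantile(0.75),   # third quartile
        "G": s.max(),             # greatest data point
    }

summary = five_number_summary([3, 7, 10, 55, 97])
data_range = summary["G"] - summary["L"]   # R = G - L
print(summary)
print(data_range)  # 94
```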

Plotting

To draw a box plot using pandas, we can directly call the plot.box method of a DataFrame. By default, pandas draws the box plots vertically; we can set the vert argument to False to make them horizontal.

import pandas as pd
import numpy as np
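
The dataset behind the plot is not reproduced here. A minimal sketch, assuming a hypothetical DataFrame of randomly generated marks (the subject name ‘English’ and all values are made up), might look like this:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import numpy as np
import pandas as pd

# Hypothetical data: random marks (0-100) for 50 students in 5 subjects
rng = np.random.default_rng(seed=0)
subjects = ["Maths", "Science", "Hindi", "English", "Political Science"]
marks = pd.DataFrame(rng.integers(0, 101, size=(50, 5)), columns=subjects)

# One box per column; vert=False draws the boxes horizontally
ax = marks.plot.box(vert=False)
ax.set_xlabel("Marks")
```

In a notebook the figure is displayed automatically; in a script, ax.figure.savefig("marks.png") would write it to a file instead.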

Box plot showing the distribution of marks in 5 subjects

Let’s break down the plot above:

  • For each separate box plot, the green line in the center is the median mark in that particular subject.
  • The two ends of each plot mark the lowest (L) and the greatest (G) marks obtained in that subject. Note that, by default, points lying more than 1.5 times the interquartile range beyond the box are drawn as separate outlier markers, in which case the whisker ends at the most extreme non-outlier value.
  • We can calculate the interquartile range by subtracting Q1 from Q3 (IQR = Q3 - Q1), which gives us the central spread of the scores obtained.
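
The interquartile range can also be computed directly from the data rather than read off the plot. A small sketch, assuming a made-up DataFrame with two subjects:

```python
import pandas as pd

# Hypothetical marks of seven students in two subjects
marks = pd.DataFrame({
    "Maths":   [45, 60, 65, 70, 75, 80, 95],
    "Science": [25, 40, 55, 60, 70, 85, 90],
})

q1 = marks.quantile(0.25)   # first quartile of each column
q3 = marks.quantile(0.75)   # third quartile of each column
iqr = q3 - q1               # IQR = Q3 - Q1, per subject
print(iqr)
```

With this data the IQR is 15.0 for ‘Maths’ and 30.0 for ‘Science’, so the middle half of the ‘Science’ scores is spread twice as wide.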

Without directly looking at the data, we can establish the following:

  • For subjects like ‘Hindi’, ‘Political Science’ and ‘Science’, most students obtained a score above 50: the first quartile Q1 lies above 50, and about 75% of the students’ scores fall above Q1.
  • If 30 is the passing score, at least one student has failed ‘Science’.
  • We can look at the interquartile range to see where the middle 50% of the scores in a particular subject lie. For example, in ‘Maths’, they lie between 60 and 80.
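
Observations like these can be verified against the data itself. The sketch below uses made-up marks (not the dataset behind the plot) and the passing score of 30 mentioned above:

```python
import pandas as pd

# Hypothetical marks of seven students in two subjects
marks = pd.DataFrame({
    "Maths":   [45, 60, 65, 70, 75, 80, 95],
    "Science": [25, 40, 55, 60, 70, 85, 90],
})
PASSING_SCORE = 30

# True for every subject in which at least one student scored below the passing score
has_failures = (marks < PASSING_SCORE).any()

# Share of students at or above Q1 in each subject (close to 75% by definition)
share_above_q1 = (marks >= marks.quantile(0.25)).mean()

print(has_failures)
print(share_above_q1)
```

With only seven students per subject, the share above Q1 comes out as 5/7 (about 71%) rather than exactly 75%.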

We can also draw box plots with other visualization libraries such as matplotlib and seaborn, which provide features to make our visualizations more appealing.

Matplotlib

To create box plots with matplotlib, we will use the boxplot() function from matplotlib's pyplot module.
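
The matplotlib listing is not reproduced here. A minimal sketch, assuming randomly generated marks for three subjects (all values are made up), could be:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: random marks (0-100) for 50 students in 3 subjects
rng = np.random.default_rng(seed=0)
subjects = ["Maths", "Science", "Hindi"]
marks = [rng.integers(0, 101, size=50) for _ in subjects]

# One box per array in the list; boxes are placed at x = 1, 2, 3
result = plt.boxplot(marks)
plt.xticks([1, 2, 3], subjects)   # label each box with its subject
plt.ylabel("Marks")
```

boxplot() returns a dictionary of the drawn artists (with keys such as "boxes", "medians" and "whiskers"), which is handy for styling individual elements afterwards.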

Seaborn

Conclusion:

This was a basic tutorial on reading and interpreting a box plot. There is a lot more to box plots mathematically, which we shall cover in future tutorials.

References
