Data Science
Understanding Box Plots
Plotting And Interpreting A Box Plot
Table of Contents
· Introduction
∘ Median
∘ Quartiles
∘ ‘Lowest’ & ‘Greatest’ data point
∘ Plotting
∘ Matplotlib
∘ Seaborn
∘ Conclusion
∘ References
Introduction
Box plots, also called box-and-whisker plots, are a compact way to summarize numerical data statistically and to illustrate its measures of dispersion, position, and central tendency. In this tutorial, we will cover how to interpret a box plot and how to create one using pandas.
They are especially useful when we need to compare distributions and look for any potential skew in the data. For example, we can easily compare the performance of a class in each subject.
A single plot represents 5 major positions:
- Median ‘M’
- First Quartile ‘Q1’
- Third Quartile ‘Q3’
- The lowest data point ‘L’
- The greatest data point ‘G’
Median
The median is a measure of central tendency. For a set of numbers, the median is the middle number of the set after the numbers have been arranged in increasing order. For example, the median of the numbers 3, 7, 10, 55, and 97 is 10.
When the size of the set is even, there’s no single middle number. In this case, the median is the average of the two middle numbers.
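Both cases can be sketched in a few lines of Python. The helper name `median_of` and the even-sized list are illustrative assumptions; the odd-sized list is the example from the text.

```python
def median_of(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd size: a single middle number exists.
        return ordered[mid]
    # Even size: average the two middle numbers.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_set = [3, 7, 10, 55, 97]   # example from the text
even_set = [3, 7, 10, 55]      # hypothetical even-sized set

print(median_of(odd_set))   # 10
print(median_of(even_set))  # 8.5
```

In practice, `statistics.median` from the standard library or `Series.median()` in pandas gives the same result.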
Quartiles
Quartiles are numbers that divide a dataset into 4 equal parts (or quarters). A dataset has three quartiles:
- 1st Quartile or Q1: 25% of values lie below it, i.e. it’s the 25th percentile
- 2nd Quartile or Q2: 50% of values lie below it, i.e. it’s the median or 50th percentile
- 3rd Quartile or Q3: 75% of values lie below it, i.e. it’s the 75th percentile
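As a rough illustration, the three quartiles can be computed with NumPy's `percentile` function. The `scores` array below is made-up data, and NumPy's default linear interpolation is only one of several conventions for computing quartiles, so other tools may report slightly different values on some datasets.

```python
import numpy as np

# Hypothetical scores, already sorted for readability.
scores = np.array([35, 42, 50, 58, 61, 67, 72, 80, 88])

# 25th, 50th and 75th percentiles (default linear interpolation).
q1, q2, q3 = np.percentile(scores, [25, 50, 75])

print(q1, q2, q3)  # 50.0 61.0 72.0
```

Here Q2 equals the median of the nine scores, as expected.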
‘L’ and ‘G’:
These are the lowest and the greatest data points in the dataset. Subtracting L from G gives us the range ‘R’ of our dataset, i.e. R = G - L.
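A small sketch of this arithmetic, using a made-up list of scores (the variable names are illustrative only):

```python
# Hypothetical scores for one subject.
scores = [35, 42, 50, 58, 61, 67, 72, 80, 88]

lowest = min(scores)            # L
greatest = max(scores)          # G
value_range = greatest - lowest  # R = G - L

print(lowest, greatest, value_range)  # 35 88 53
```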
Plotting
To plot a box plot using pandas, we can directly call the plot.box function. By default, pandas draws box plots vertically; we can set the vert argument to False to make our plots appear horizontal.
import pandas as pd
import numpy as np
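A minimal sketch of such a call is shown here. The DataFrame of marks is hypothetical: it is filled with random integers purely for illustration, and the subject names are taken from the discussion in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical data: random marks (20-100) for 40 students in four subjects.
rng = np.random.default_rng(seed=0)
subjects = ["Hindi", "Maths", "Political Science", "Science"]
marks = pd.DataFrame(rng.integers(20, 101, size=(40, 4)), columns=subjects)

# One box per column; the boxes are vertical by default.
ax = marks.plot.box(title="Marks by subject")
ax.set_ylabel("Marks")

# For horizontal boxes, pass vert=False instead:
# ax = marks.plot.box(vert=False, title="Marks by subject")
```

In a script, calling `matplotlib.pyplot.show()` afterwards displays the figure; in a notebook the plot usually renders inline.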
Let’s break down the plot above:
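- In each box, the green line in the center marks the median score for that particular subject.
- The end of the lower whisker marks the lowest score obtained in that subject (L) and, similarly, the end of the upper whisker marks the greatest score (G). Note that, by default, the whiskers stop at the furthest data points within 1.5 times the interquartile range of the box, and any points beyond that are drawn as individual outlier markers.
- We can calculate the interquartile range by subtracting Q1 from Q3 (IQR = Q3 - Q1), which gives us the spread of the central 50% of the scores obtained.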
Without directly looking at the data, we can establish the following:
- We now know that for subjects like ‘Hindi’, ‘Political Science’ and ‘Science’, most students have obtained a score above 50, because Q1 (the first quartile) is above 50 and 75% of the scores lie above Q1.
- We now know that if 30 is the passing score, at least one student has failed ‘Science’, because the lowest score in that subject is below 30.
- We can look at the interquartile range to see where the middle 50% of the scores in a particular subject lie. For example, in ‘Maths’, they lie roughly between 60 and 80.
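The same observations can be checked numerically. The sketch below builds a per-subject summary with pandas; the marks are random, hypothetical data, so the numbers will not match the plot discussed in this tutorial, and the passing score of 30 is the one assumed in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical data: random marks (20-100) for 40 students in four subjects.
rng = np.random.default_rng(seed=0)
subjects = ["Hindi", "Maths", "Political Science", "Science"]
marks = pd.DataFrame(rng.integers(20, 101, size=(40, 4)), columns=subjects)

q1 = marks.quantile(0.25)
q3 = marks.quantile(0.75)

# One row per subject: the five positions plus the interquartile range.
summary = pd.DataFrame({
    "L": marks.min(),
    "Q1": q1,
    "M": marks.median(),
    "Q3": q3,
    "G": marks.max(),
    "IQR": q3 - q1,
})

# A subject has at least one failing student if its lowest score is below 30.
passing_score = 30
summary["any_fail"] = summary["L"] < passing_score

print(summary)
```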
We can also draw box plots with other visualization libraries, such as matplotlib and seaborn, which provide features to make our visualizations more appealing.
Matplotlib
To create box plots with matplotlib, we will use the boxplot() function from matplotlib's pyplot module.
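A minimal sketch, assuming three hypothetical subjects with randomly generated marks:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: one array of random marks (20-100) per subject.
rng = np.random.default_rng(seed=0)
subjects = ["Hindi", "Maths", "Science"]
marks = [rng.integers(20, 101, size=40) for _ in subjects]

# boxplot() draws one box per array, at x positions 1, 2, 3, ...
parts = plt.boxplot(marks)

# Label each box with its subject name.
plt.xticks([1, 2, 3], subjects)
plt.ylabel("Marks")
plt.title("Marks by subject")
```

The returned dictionary (`parts` here) holds the drawn artists under keys such as `"boxes"`, `"medians"` and `"whiskers"`, which is useful for restyling individual elements. Calling `plt.show()` displays the figure.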
Seaborn
Conclusion
This was a basic tutorial on reading and interpreting a box plot. There is a lot more to box plots mathematically, which we shall cover in future tutorials.