How to Lie with Statistics - The Well-Chosen Average explained with Python

written by Guido Percú on May 17, 2020

Como-mentir-com-estatistica.png

Be careful when you read news about "average wage". As we learn from the classic book How to Lie with Statistics, there are different types of averages and their different meanings can be used to mislead and deceive.

When you read an announcement describing the average payment of people working on a company (or an industry), that number may mean something different than you think. If that average is a median, then it means that half the employees earn less than it and half earn more than that. But if that number is a mean, you may be getting nothing more relevant than the average of the CEO with his underpaid employees.

the-well-chosen-average.png

To better understand this concept, let's open our Python shell and experiment with the packages Numpy and Scipy. First, we import the necessary packages:

import numpy as np
from scipy import stats

Then we define the incomes that we're studying as a numpy array. I'm going to use the same example from the book.

income = np.array([45000, 15000, 10000, 10000, 5700, 5000, 5000, 5000, 3700, 3700, 3700, 3700,
                   3000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000])

Now we're going to get our different types of averages.

mean = np.mean(income)# arithmetic average
median = np.median(income) # half earns more, half earns less
mode = stats.mode(income) # most common income

Printing those values (on a locale base currency format) will show you their differences.

import locale
locale.setlocale( locale.LC_ALL, '' )

print("Mean: {}".format(locale.currency( mean, grouping=True )))
print("Median: {}".format(locale.currency(median, grouping=True)))
print("Mode: {}".format(locale.currency(mode.mode[0], grouping=True)))
Mean: R$ 5.700,00
Median: R$ 3.000,00
Mode: R$ 2.000,00

Learn more