COURSE 19 | SAMPLING IN PYTHON

Author: Lawal’s Note
Affiliation: Associate Data Science Course in Python by DataCamp Inc
Published: December 26, 2024

1 Chapter 1: Introduction to Sampling

Learn what sampling is and why it is so powerful. You’ll also learn about the problems caused by convenience sampling and the differences between true randomness and pseudo-randomness.

1.1 Chapter 1.1: Sampling and point estimates

Hi! Welcome to the course! I’m James, and I’ll be your host as we delve into the world of sampling data with Python. To start, let’s look at what sampling is and why it might be useful.

Estimating the population of France

Let’s consider the problem of counting how many people live in France. The standard approach is to take a census. This means contacting every household and asking how many people live there. Since there are millions of people in France, this is a really expensive process. Even with modern data collection technology, most countries will only conduct a census every five or ten years due to the cost.

Sampling households

In 1786, Pierre-Simon Laplace realized you could estimate the population with less effort. Rather than asking every household who lived there, he asked a small number of households and used statistics to estimate the number of people in the whole population. This technique of working with a subset of the whole population is called sampling.

Population vs. sample

Two definitions are important for this course. The population is the complete set of data that we are interested in. The previous example involved the literal population of France, but in statistics, it doesn’t have to refer to people. One thing to bear in mind is that there is usually no equivalent of the census, so typically, we won’t know what the whole population is like - more on this in a moment. The sample is the subset of data that we are working with.

Coffee rating dataset

Picture a dataset of professional ratings of coffees. Each row corresponds to one coffee, and there are thirteen hundred and thirty-eight rows in the dataset. The coffee is given a score from zero to one hundred, which is stored in the total_cup_points column. Other columns contain contextual information like the variety and country of origin and scores between zero and ten for attributes of the coffee such as aroma and body. These scores are averaged across all the reviewers for that particular coffee. It doesn’t contain every coffee in the world, so we don’t know exactly what the population of coffees is. However, there are enough here that we can think of it as our population of interest.

Points vs. flavor: population

Let’s consider the relationship between cup points and flavor by selecting those two columns. This dataset contains all thirteen hundred and thirty-eight rows from the original dataset.

Points vs. flavor: 10 row sample

The pandas .sample method returns a random subset of rows. Setting n to ten means ten random rows are returned. By default, rows from the original dataset can’t appear in the sample dataset multiple times, so we are guaranteed to have ten unique rows in our sample.

Python sampling for Series

The .sample method also works on pandas Series. Here, using square-bracket subsetting retrieves the total_cup_points column as a Series, and the n argument specifies how many random values to return.
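
As a minimal sketch of both calls, assuming the coffee ratings data is available as a feather file (the path below is hypothetical) and loaded into a DataFrame named coffee_ratings:

import pandas as pd

# Hypothetical path; substitute wherever the coffee ratings data lives
coffee_ratings = pd.read_feather("datasets/coffee_ratings.feather")

# DataFrame sampling: 10 random rows of the two columns of interest
pts_vs_flavor_samp = coffee_ratings[["total_cup_points", "flavor"]].sample(n=10)
print(pts_vs_flavor_samp)

# Series sampling: square-bracket subsetting returns a Series, then sample 10 values
cup_points_samp = coffee_ratings["total_cup_points"].sample(n=10)
print(cup_points_samp)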

Population parameters & point estimates

A population parameter is a calculation made on the population dataset. We aren’t limited to counting values either; here, we calculate the mean of the cup points using NumPy. By contrast, a point estimate, or sample statistic, is a calculation based on the sample dataset. Here, the mean of the total cup points is calculated on the sample. Notice that the means are very similar but not identical.

Point estimates with pandas

Working with pandas can be easier than working with NumPy. These mean calculations can be performed using the .mean pandas method.
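
A short sketch of the parameter versus point estimate comparison, reusing coffee_ratings and cup_points_samp from the sketch above:

import numpy as np

# Population parameter: mean cup points over the whole dataset
print(np.mean(coffee_ratings["total_cup_points"]))

# Point estimate (sample statistic): mean cup points over the 10-row sample
print(np.mean(cup_points_samp))

# The same calculations using the pandas .mean method
print(coffee_ratings["total_cup_points"].mean())
print(cup_points_samp.mean())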

1.2 Exercise 1.1.1

Simple sampling with pandas

Throughout this chapter, you’ll be exploring song data from Spotify. Each row of this population dataset represents a song, and there are over 40,000 rows. Columns include the song name, the artists who performed it, the release year, and attributes of the song like its duration, tempo, and danceability. You’ll start by looking at the durations.

Your first task is to sample the Spotify dataset and compare the mean duration of the population with the sample.

Instructions

  1. Sample 1000 rows from spotify, assigning to spotify_sample.
  2. Calculate the mean duration in minutes from spotify using pandas.
  3. Calculate the mean duration in minutes from spotify_sample using pandas.
Code
# Importing pandas
import pandas as pd

# Importing the course dataset
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")


# Sample 1000 rows from spotify
spotify_sample = spotify.sample(n=1000)

# Print the sample
print(spotify_sample)

# Calculate the mean duration in mins from spotify
mean_dur_pop = spotify['duration_minutes'].mean()

# Calculate the mean duration in mins from spotify_sample
mean_dur_samp = spotify_sample['duration_minutes'].mean()

# Print the means
print(mean_dur_pop)
print(mean_dur_samp)
       acousticness                                            artists  \
6301       0.847000  ['Johann Sebastian Bach', 'Karl Kaiser', 'Koln...   
4932       0.001090                                      ['The Shins']   
10011      0.567000                                ['Imagine Dragons']   
9734       0.012300                    ['Jhené Aiko', 'Vince Staples']   
7704       0.023600                                       ['Skrillex']   
...             ...                                                ...   
9330       0.000066                        ['Five Finger Death Punch']   
23603      0.027300                                     ['Steve Holy']   
6567       0.746000                                ['Sammy Davis Jr.']   
22665      0.676000                                     ['Luke Combs']   
31567      0.000038                                       ['Godsmack']   

       danceability  duration_ms  duration_minutes  energy  explicit  \
6301          0.515      90107.0          1.501783   0.269       0.0   
4932          0.565     221840.0          3.697333   0.881       0.0   
10011         0.462     261080.0          4.351333   0.387       0.0   
9734          0.612     210773.0          3.512883   0.494       0.0   
7704          0.658     237680.0          3.961333   0.930       0.0   
...             ...          ...               ...     ...       ...   
9330          0.378     174160.0          2.902667   0.898       1.0   
23603         0.669     220187.0          3.669783   0.861       0.0   
6567          0.536     343120.0          5.718667   0.182       0.0   
22665         0.552     193200.0          3.220000   0.402       0.0   
31567         0.310     247067.0          4.117783   0.973       1.0   

                           id  instrumentalness   key  liveness  loudness  \
6301   5rvxeH1d6EctGOog8UEvFx          0.089700  11.0    0.0757   -21.859   
4932   3P0fLlpjHwB6zGb7OT9dbJ          0.000022   0.0    0.1060    -4.475   
10011  4kDTvLhGF29gFsqceuxBSC          0.000000   0.0    0.1640    -7.401   
9734   7mnJgJhtuBEMRNoERn1OOa          0.000027  11.0    0.1020   -10.365   
7704   1RJZLVGpBG9nNZiHRQSWTp          0.000000   5.0    0.5780    -2.912   
...                       ...               ...   ...       ...       ...   
9330   0U3aP0QocitGhyqIZOfe4R          0.000082  11.0    0.2800    -3.855   
23603  4mpUaApNea2QhQshM4xyr4          0.000000   2.0    0.0586    -3.547   
6567   1DGaqkDcIYIBZSpW5vyoOL          0.000118   0.0    0.1110   -13.262   
22665  2rxQMGVafnNaRaXlRMWPde          0.000000  11.0    0.0928    -7.431   
31567  2Uilp8alSjAxV0IXorUk9l          0.000001   0.0    0.0973    -4.721   

       mode                                               name  popularity  \
6301    0.0  Orchestral Suite No. 2 in B Minor, BWV 1067: V...        48.0   
4932    1.0                                         Turn On Me        45.0   
10011   1.0                                          Not Today        68.0   
9734    0.0                                         The Vapors        50.0   
7704    0.0                                             Recess        59.0   
...     ...                                                ...         ...   
9330    0.0                                      Crossing Over        42.0   
23603   1.0                               Brand New Girlfriend        59.0   
6567    0.0                     Mr. Bojangles - Single Version        54.0   
22665   1.0                                    Beautiful Crazy        81.0   
31567   1.0                                 I Fucking Hate You        49.0   

      release_date  speechiness    tempo  valence    year  
6301    2000-12-02       0.0370  130.998   0.9640  2000.0  
4932    2007-01-23       0.0356  121.972   0.4270  2007.0  
10011   2016-06-03       0.0292  122.698   0.0428  2016.0  
9734    2013-01-01       0.0606  147.964   0.1470  2013.0  
7704    2014-03-14       0.1410  104.018   0.1320  2014.0  
...            ...          ...      ...      ...     ...  
9330    2009-09-22       0.0639  195.747   0.1280  2009.0  
23603   2006-08-08       0.0929  133.820   0.7200  2006.0  
6567    2002-01-01       0.0458  117.361   0.1320  2002.0  
22665   2018-06-01       0.0262  103.313   0.3820  2018.0  
31567   2003-04-08       0.1850  190.270   0.2260  2003.0  

[1000 rows x 20 columns]
3.8521519140900073
3.9113962
Note

Notice that the mean song duration in the sample is similar, but not identical to the mean song duration in the whole population.

1.3 Exercise 1.1.2

Simple sampling and calculating with NumPy

You can also use numpy to calculate parameters or statistics from a list or pandas Series.

You’ll be turning it up to eleven and looking at the loudness property of each song.

Instructions

  1. Create a pandas Series, loudness_pop, by subsetting the loudness column from spotify.
  2. Sample loudness_pop to get 100 random values, assigning to loudness_samp.
  3. Calculate the mean of loudness_pop using numpy.
  4. Calculate the mean of loudness_samp using numpy.
Code
# Importing pandas
import pandas as pd
import numpy as np

# Importing the course dataset
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")

# Create a pandas Series from the loudness column of spotify
loudness_pop = spotify['loudness']

# Sample 100 values of loudness_pop
loudness_samp = loudness_pop.sample(n=100)

print(loudness_samp)

# Calculate the mean of loudness_pop
mean_loudness_pop = np.mean(loudness_pop)

# Calculate the mean of loudness_samp
mean_loudness_samp = np.mean(loudness_samp)

print(mean_loudness_pop)
print(mean_loudness_samp)
17126    -2.535
24473    -4.464
31830    -8.839
7261     -6.206
36359    -5.428
          ...  
22905    -6.266
33343    -7.180
4955     -4.311
13845   -15.764
33324    -5.749
Name: loudness, Length: 100, dtype: float64
-7.366856851353947
-7.9239999999999995
Note

Again, notice that the calculated value (the mean) is close but not identical in each case.

1.4 Chapter 1.2: Convenience sampling

The point estimates you calculated in the previous exercises were very close to the population parameters that they were based on, but is this always the case?

The Literary Digest election prediction

In 1936, a newspaper called The Literary Digest ran an extensive poll to try to predict the next US presidential election. They phoned ten million voters and had over two million responses. About one-point-three million people said they would vote for Landon, and just under one million people said they would vote for Roosevelt. That is, Landon was predicted to get fifty-seven percent of the vote, and Roosevelt was predicted to get forty-three percent of the vote. Since the sample size was so large, it was presumed that this poll would be very accurate. However, in the election, Roosevelt won by a landslide with sixty-two percent of the vote. So what went wrong? Well, in 1936, telephones were a luxury, so the only people who had been contacted by The Literary Digest were relatively rich. The sample of voters was not representative of the whole population of voters, and so the poll suffered from sample bias. The data was collected by the easiest method, in this case, telephoning people. This is called convenience sampling and is often prone to sample bias. Before sampling, we need to think about our data collection process to avoid biased results.

Finding the mean age of French people

Let’s look at another example. While on vacation at Disneyland Paris, you start wondering about the mean age of French people. To get an answer, you ask ten people standing nearby about their ages. Their mean age is twenty-four-point-six years old. Do you think this will be a good estimate of the mean age of all French citizens?

How accurate was the survey?

On the left, you can see mean ages taken from the French census. Notice that the population has been gradually getting older as birth rates decrease and life expectancy increases. In 2015, the mean age was over forty, so our estimate of twenty-four-point-six is way off. The problem is that the family-friendly fun at Disneyland means that the sample ages weren’t representative of the general population. There are generally more eight-year-olds than eighty-year-olds riding rollercoasters.

Convenience sampling coffee ratings

Let’s return to the coffee ratings dataset and look at the mean cup points population parameter. The mean is about eighty-two. One form of convenience sampling would be to take the first ten rows, rather than the random rows we saw in the previous video. We can take the first 10 rows with the pandas head method. The mean cup points from this sample is higher at eighty-nine. The discrepancy suggests that coffees with higher cup points appear near the start of the dataset. Again, the convenience sample isn’t representative of the whole population.
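
A sketch of this convenience sample, again assuming the coffee_ratings DataFrame from the earlier sketch:

# Population parameter: mean cup points across all coffees (about 82)
print(coffee_ratings["total_cup_points"].mean())

# Convenience sample: simply take the first 10 rows
coffee_ratings_first10 = coffee_ratings.head(10)

# Point estimate from the convenience sample (noticeably higher, around 89)
print(coffee_ratings_first10["total_cup_points"].mean())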

Visualizing selection bias

Histograms are a great way to visualize the selection bias. We can create a histogram of the total cup points from the population, which contains values ranging from around 59 to around 91. The np.arange function can be used to create bins of width 2 from 59 to 91. Recall that the stop value in np.arange is exclusive, so we specify 93, not 91. Here’s the same code to generate a histogram for the convenience sample.
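
A sketch of the two histograms described here, reusing coffee_ratings and coffee_ratings_first10 from the previous sketch:

import numpy as np
import matplotlib.pyplot as plt

# Bins of width 2 from 59 to 91; np.arange's stop is exclusive, hence 93
bins = np.arange(59, 93, 2)

# Distribution of cup points in the population
coffee_ratings["total_cup_points"].hist(bins=bins)
plt.show()

# Distribution of cup points in the convenience sample
coffee_ratings_first10["total_cup_points"].hist(bins=bins)
plt.show()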

Distribution of a population and of a convenience sample

Comparing the two histograms, it is clear that the distribution of the sample is not the same as the population: all of the sample values are on the right-hand side of the plot.

Visualizing selection bias for a random sample

This time, we’ll compare the total_cup_points distribution of the population with a random sample of 10 coffees.

Distribution of a population and of a simple random sample

Notice how the shape of the distributions is more closely aligned when random sampling is used.

1.5 Exercise 1.2.1

Are findings from the sample generalizable?

You just saw how convenience sampling—collecting data using the easiest method—can result in samples that aren’t representative of the population. Equivalently, this means findings from the sample are not generalizable to the population. Visualizing the distributions of the population and the sample can help determine whether or not the sample is representative of the population.

The Spotify dataset contains an acousticness column, which is a confidence measure from zero to one of whether the track was made with instruments that aren’t plugged in. You’ll compare the acousticness distribution of the total population of songs with a sample of those songs.

Instructions

  1. Plot a histogram of the acousticness from spotify with bins of width 0.01 from 0 to 1 using pandas .hist().
  2. Update the histogram code to use the spotify_mysterious_sample dataset.
Code
# Importing pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the course dataset
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")

# Visualize the distribution of acousticness with a histogram
spotify['acousticness'].hist(bins=np.arange(0,1.01,0.01))
plt.show()

# Generate a convenience sample where acousticness is consistently higher
spotify_high_acousticness = spotify[(spotify['acousticness'] >= 0.85) & (spotify['acousticness'] <= 1.0)]

# Sample 1107 entries from the high acousticness subset
spotify_mysterious_sample = spotify_high_acousticness.sample(n=1107)

# Update the histogram to use spotify_mysterious_sample
spotify_mysterious_sample['acousticness'].hist(bins=np.arange(0, 1.01, 0.01))
plt.show()

1.6 Question

Compare the two histograms you drew. Are the acousticness values in the sample generalizable to the general population?

No. The acousticness samples are consistently higher than those in the general population.

The acousticness values in the sample are all greater than 0.85, whereas they range from 0 to 1 in the whole population.

1.7 Exercise 1.2.2

Are these findings generalizable?

Let’s look at another sample to see if it is representative of the population. This time, you’ll look at the duration_minutes column of the Spotify dataset, which contains the length of the song in minutes.

Instructions

  • Plot a histogram of duration_minutes from spotify with bins of width 0.5 from 0 to 15 using pandas .hist().
  • Update the histogram code to use the spotify_mysterious_sample2 dataset.
Code
# Importing pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the course dataset
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")

# Generate a convenience sample where duration_minutes is within the specified range
spotify_duration_range = spotify[(spotify['duration_minutes'] >= 0.8079999999) & (spotify['duration_minutes'] <= 9.822)]

# Sample 50 entries from the duration range subset to create spotify_mysterious_sample2
spotify_mysterious_sample2 = spotify_duration_range.sample(n=50)

# Visualize the distribution of duration_minutes in the population with a histogram
spotify['duration_minutes'].hist(bins=np.arange(0,15.5,0.5))
plt.show()

# Visualize the distribution of duration_minutes as a histogram
spotify_mysterious_sample2['duration_minutes'].hist(bins=np.arange(0, 15.5, 0.5))
plt.show()

1.8 Question

Compare the two histograms you drew. Are the duration values in the sample generalizable to the general population?

1.8.1 Answer

Yes. The sample selected is likely a random sample of all songs in the population.

The duration values in the sample show a similar distribution to those in the whole population, so the results are generalizable.

1.9 Chapter 1.3: Pseudo-random number generation

You previously saw how to use a random sample to get results similar to those in the population. But how does a computer actually do this random sampling?

What does random mean?

There are several meanings of random in English. This definition from Oxford Languages is the most interesting for us. If we want to choose data points at random from a population, we shouldn’t be able to predict which data points would be selected ahead of time in some systematic way.

True random numbers

To generate truly random numbers, we typically have to use a physical process like flipping coins or rolling dice. The Hotbits service generates numbers from radioactive decay, and RANDOM.ORG generates numbers from atmospheric noise, which are radio signals generated by lightning. Unfortunately, these processes are fairly slow and expensive for generating random numbers.

https://www.fourmilab.ch/hotbits

https://www.random.org

Pseudo-random number generation

For most use cases, pseudo-random number generation is better since it is cheap and fast. Pseudo-random means that although each value appears to be random, it is actually calculated from the previous random number. Since you have to start the calculations somewhere, the first random number is calculated from what is known as a seed value. The word random is in quotes to emphasize that this process isn’t really random. If we start from a particular seed value, all future numbers will be the same.

Pseudo-random number generation example

For example, suppose we have a function to generate pseudo-random values called calc_next_random. To begin, we pick a seed number, in this case, one. calc_next_random does some calculations and returns three. We then feed three into calc_next_random, and it does the same set of calculations and returns two. And if we can keep feeding in the last number, it will return something apparently random. Although the process is deterministic, the trick to a random number generator is to make it look like the values are random.
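
As an illustration only, here is a toy generator in the spirit of calc_next_random. The constants are textbook linear congruential values chosen for this sketch; real generators such as NumPy's are far more sophisticated and do not produce the 1, 3, 2 sequence from the slide.

def calc_next_random(previous, a=1103515245, c=12345, m=2**31):
    """Toy linear congruential step: each value is computed from the last one."""
    return (a * previous + c) % m

value = 1  # the seed
for _ in range(5):
    value = calc_next_random(value)
    # Scale into [0, 1) so the output looks like a 'random' fraction
    print(value / 2**31)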

Random number generating functions

NumPy has many functions for generating random numbers from statistical distributions. To use each of these, make sure to prepend each function name with numpy.random or np.random. Some of them, like .uniform and .normal, may be familiar. Others have more niche applications.

Visualizing random numbers

Let’s generate some pseudo-random numbers. The first arguments to each random number function specify distribution parameters. The size argument specifies how many numbers to generate, in this case, five thousand. We’ve chosen the beta distribution, and its parameters are named a and b. These random numbers come from a continuous distribution, so a great way to visualize them is with a histogram. Here, because the numbers were generated from the beta distribution, all the values are between zero and one.
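
A sketch of that workflow; the a and b values below are just illustrative choices:

import numpy as np
import matplotlib.pyplot as plt

# Generate 5000 pseudo-random numbers from a beta distribution
randoms = np.random.beta(a=2, b=2, size=5000)

# Beta values always lie between 0 and 1
plt.hist(randoms, bins=np.arange(0, 1.05, 0.05))
plt.show()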

Random numbers seeds

To set a random seed with NumPy, we use np.random.seed, which takes an integer for the seed number; this can be any number you like. np.random.normal generates pseudo-random numbers from the normal distribution. The loc and scale arguments set the mean and standard deviation of the distribution, and the size argument determines how many random numbers from that distribution will be returned. If we call .normal a second time, we get two different random numbers. If we reset the seed by calling np.random.seed with the same seed number, then call .normal again, we get the same numbers as before. This makes our code reproducible.
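
A sketch of that behaviour; the seed value here is arbitrary:

import numpy as np

np.random.seed(20000229)                            # set the seed
print(np.random.normal(loc=2, scale=1.5, size=2))   # first pair of numbers
print(np.random.normal(loc=2, scale=1.5, size=2))   # a different pair

np.random.seed(20000229)                            # reset the same seed
print(np.random.normal(loc=2, scale=1.5, size=2))   # identical to the first pair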

Using a different seed

Now let’s try a different seed. This time, calling .normal generates different numbers.

1.10 Exercise 1.3.1

Generating random numbers

You’ve used .sample() to generate pseudo-random numbers from a set of values in a DataFrame. A related task is to generate random numbers that follow a statistical distribution, like the uniform distribution or the normal distribution.

Each random number generation function has distribution-specific arguments and an argument for specifying the number of random numbers to generate.

Instructions

  1. Generate 5000 numbers from a uniform distribution, setting the parameters low to -3 and high to 3.
  2. Generate 5000 numbers from a normal distribution, setting the parameters loc to 5 and scale to 2.
  3. Plot a histogram of uniforms with bins of width 0.25 from -3 to 3 using plt.hist().
  4. Plot a histogram of normals with bins of width 0.5 from -2 to 13 using plt.hist().
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate random numbers from a Uniform(-3, 3)
uniforms = np.random.uniform(low=-3, high=3, size=5000)

# Print uniforms
print(uniforms)

# Generate random numbers from a Normal(5, 2)
normals = np.random.normal(loc=5, scale=2, size=5000)

# Print normals
print(normals)

# Plot a histogram of uniform values, binwidth 0.25
plt.hist(uniforms, bins=np.arange(-3,3.25,0.25))
plt.show()

# Plot a histogram of normal values, binwidth 0.5
plt.hist(normals, bins=np.arange(-2, 13.5, 0.5))
plt.show()
[ 1.73504282  0.72457959 -1.34150253 ... -1.0571522  -0.12274863
  0.83445873]
[5.12140432 4.87589927 2.53710178 ... 6.39845768 2.70745544 6.79054579]

1.11 Exercise 1.3.2

Understanding random seeds

While random numbers are important for many analyses, they create a problem: the results you get can vary slightly. This can cause awkward conversations with your boss when your script for calculating the sales forecast gives different answers each time.

Setting the seed for numpy’s random number generator helps avoid such problems by making the random number generation reproducible.

Question 1

Which statement about x and y is true?

import numpy as np
np.random.seed(123)
x = np.random.normal(size=5)
y = np.random.normal(size=5)

The values of x are different from those of y

Question 2

Which statement about x and y is true?

import numpy as np
np.random.seed(123)
x = np.random.normal(size=5)
np.random.seed(123)
y = np.random.normal(size=5)

x and y have identical values.

Question 3

Which statement about x and y is true?

import numpy as np
np.random.seed(123)
x = np.random.normal(size=5)
np.random.seed(456)
y = np.random.normal(size=5)

The values of x are different from those of y.

2 Chapter 2: Sampling Methods

It’s time to get hands-on and perform the four random sampling methods in Python: simple, systematic, stratified, and cluster.

2.1 Chapter 2.1: Simple random and systematic sampling

There are several methods of sampling from a population. In this video, we’ll look at simple random sampling and systematic random sampling.

Simple random sampling

Simple random sampling works like a raffle or lottery. We start with our population of raffle tickets or lottery balls and randomly pick them out one at a time.

Simple random sampling of coffees

In our coffee ratings dataset, instead of raffle tickets or lottery balls, the population consists of coffee varieties. To perform simple random sampling, we take some at random, one at a time. Each coffee has the same chance as any other of being picked. When using this technique, sometimes we might end up with two coffees that were next to each other in the dataset, and sometimes we might end up with large areas of the dataset that were not selected from at all.

Simple random sampling with pandas

We’ve already seen how to do simple random sampling with pandas. We call .sample and set n to the size of the sample. We can also set the seed using the random_state argument to generate reproducible results, just like we did with pseudo-random number generation. Previously, by not setting random_state when sampling, our code would generate a different random sample each time it was run.
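
A minimal sketch, assuming the coffee_ratings DataFrame from Chapter 1; the seed value is arbitrary:

# Simple random sample of 10 coffees, reproducible thanks to random_state
coffee_ratings_srs = coffee_ratings.sample(n=10, random_state=19000113)
print(coffee_ratings_srs)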

Systematic sampling

Another sampling method is known as systematic sampling. This samples the population at regular intervals. Here, looking from top to bottom and left to right within each row, every fifth coffee is sampled.

Systematic sampling - defining the interval

Systematic sampling with pandas is slightly trickier than simple random sampling. The tricky part is determining how big the interval between each row should be for a given sample size. Suppose we want a sample size of five coffees. The population size is the number of rows in the whole dataset, and in this case, it’s one thousand three hundred and thirty-eight. The interval is the population size divided by the sample size, but because we want the answer to be an integer, we perform integer division with two forward slashes. This is like standard division but discards any fractional part. One-three-three-eight divided by five is actually two hundred and sixty-seven-point-six, and discarding the fractional part leaves two hundred and sixty-seven. Thus, to get a systematic sample of five coffees, we will select every two hundred sixty-seventh coffee in the dataset.

Systematic sampling - selecting the rows

To select every two hundred and sixty-seventh row, we call dot-iloc on coffee_ratings and pass double-colons and the interval, which is 267 in this case. Double-colon interval tells pandas to select every two hundred and sixty-seventh row from zero to the end of the DataFrame.
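
A sketch of both steps, assuming the 1338-row coffee_ratings DataFrame:

sample_size = 5
pop_size = len(coffee_ratings)      # 1338

# Integer division discards the fractional part: 1338 // 5 == 267
interval = pop_size // sample_size

# Select every 267th row, starting from row 0
coffee_ratings_sys = coffee_ratings.iloc[::interval]
print(coffee_ratings_sys)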

The trouble with systematic sampling

There is a problem with systematic sampling, though. Suppose we are interested in statistics about the aftertaste attribute of the coffees. To examine this, first, we use reset_index to create a column of index values in our DataFrame that we can plot. Plotting aftertaste against index shows a pattern. Earlier rows generally have higher aftertaste scores than later rows. This introduces bias into the statistics that we calculate. In general, it is only safe to use systematic sampling if a plot like this has no pattern; that is, it just looks like noise.

Making systematic sampling safe

To ensure that systematic sampling is safe, we can randomize the row order before sampling. dot-sample has an argument named frac that lets us specify the proportion of the dataset to return in the sample, rather than the absolute number of rows that n specifies. Setting frac to one randomly samples the whole dataset. In effect, this randomly shuffles the rows. Next, the indices need to be reset so that they go in order from zero again. Specifying drop equals True clears the previous row indexes, and chaining to another reset_index call creates a column containing these new indexes. Redrawing the plot with the shuffled dataset shows no pattern between aftertaste and index. This is great, but note that once we’ve shuffled the rows, systematic sampling is essentially the same as simple random sampling.
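
A sketch of the shuffle-then-reset pattern, assuming an aftertaste column in coffee_ratings:

import matplotlib.pyplot as plt

# Shuffle all rows: frac=1 samples 100% of the data without replacement
shuffled = coffee_ratings.sample(frac=1)

# Clear the old row indexes, then add the new ones back as an 'index' column
shuffled = shuffled.reset_index(drop=True).reset_index()

# After shuffling, aftertaste shows no pattern against row position
shuffled.plot(x="index", y="aftertaste", kind="scatter")
plt.show()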

2.2 Exercise 2.1.1

Simple random sampling

The simplest method of sampling a population is the one you’ve seen already. It is known as simple random sampling (sometimes abbreviated to “SRS”), and involves picking rows at random, one at a time, where each row has the same chance of being picked as any other.

In this chapter, you’ll apply sampling methods to a synthetic (fictional) employee attrition dataset from IBM, where “attrition” in this context means leaving the company.

Instructions

  • Sample 70 rows from attrition using simple random sampling, setting the random seed to 18900217.
  • Print the sample dataset, attrition_samp. What do you notice about the indices?
Code
# Importing pandas
import pandas as pd

# Importing the course dataset
attrition = pd.read_feather("datasets/attrition.feather")

# Sample 70 rows using simple random sampling and set the seed
attrition_samp = attrition.sample(n=70, random_state=18900217)

# Print the sample
print(attrition_samp)
      Age  Attrition     BusinessTravel  DailyRate            Department  \
1134   35        0.0      Travel_Rarely        583  Research_Development   
1150   52        0.0         Non-Travel        585                 Sales   
531    33        0.0      Travel_Rarely        931  Research_Development   
395    31        0.0      Travel_Rarely       1332  Research_Development   
392    29        0.0      Travel_Rarely        942  Research_Development   
...   ...        ...                ...        ...                   ...   
361    27        0.0  Travel_Frequently       1410                 Sales   
1180   36        0.0      Travel_Rarely        530                 Sales   
230    26        0.0      Travel_Rarely       1443                 Sales   
211    29        0.0  Travel_Frequently        410  Research_Development   
890    30        0.0  Travel_Frequently       1312  Research_Development   

      DistanceFromHome      Education    EducationField  \
1134                25         Master           Medical   
1150                29         Master     Life_Sciences   
531                 14       Bachelor           Medical   
395                 11        College           Medical   
392                 15  Below_College     Life_Sciences   
...                ...            ...               ...   
361                  3  Below_College           Medical   
1180                 2         Master     Life_Sciences   
230                 23       Bachelor         Marketing   
211                  2  Below_College     Life_Sciences   
890                  2         Master  Technical_Degree   

     EnvironmentSatisfaction  Gender  ...  PerformanceRating  \
1134                    High  Female  ...          Excellent   
1150                     Low    Male  ...          Excellent   
531                Very_High  Female  ...          Excellent   
395                     High    Male  ...          Excellent   
392                   Medium  Female  ...          Excellent   
...                      ...     ...  ...                ...   
361                Very_High  Female  ...        Outstanding   
1180                    High  Female  ...          Excellent   
230                     High  Female  ...          Excellent   
211                Very_High  Female  ...          Excellent   
890                Very_High  Female  ...          Excellent   

     RelationshipSatisfaction  StockOptionLevel TotalWorkingYears  \
1134                     High                 1                16   
1150                   Medium                 2                16   
531                 Very_High                 1                 8   
395                 Very_High                 0                 6   
392                       Low                 1                 6   
...                       ...               ...               ...   
361                    Medium                 2                 6   
1180                     High                 0                17   
230                      High                 1                 5   
211                      High                 3                 4   
890                 Very_High                 0                10   

     TrainingTimesLastYear WorkLifeBalance  YearsAtCompany  \
1134                     3            Good              16   
1150                     3            Good               9   
531                      5          Better               8   
395                      2            Good               6   
392                      2            Good               5   
...                    ...             ...             ...   
361                      3          Better               6   
1180                     2            Good              13   
230                      2            Good               2   
211                      3          Better               3   
890                      2          Better               9   

      YearsInCurrentRole  YearsSinceLastPromotion YearsWithCurrManager  
1134                  10                       10                    1  
1150                   8                        0                    0  
531                    7                        1                    6  
395                    5                        0                    1  
392                    4                        1                    3  
...                  ...                      ...                  ...  
361                    5                        0                    4  
1180                   7                        6                    7  
230                    2                        0                    0  
211                    2                        0                    2  
890                    7                        0                    7  

[70 rows x 31 columns]

2.3 Exercise 2.1.2

Systematic sampling

One sampling method that avoids randomness is called systematic sampling. Here, you pick rows from the population at regular intervals.

For example, if the population dataset had one thousand rows, and you wanted a sample size of five, you could pick rows 0, 200, 400, 600, and 800.

Instructions

  1. Set the sample size to 70.
    • Calculate the population size from attrition.
    • Calculate the interval between the rows to be sampled.
  2. Systematically sample attrition to get the rows of the population at each interval, starting at 0, assigning the rows to attrition_sys_samp.
Code
# Importing pandas
import pandas as pd

# Importing the course dataset
attrition = pd.read_feather("datasets/attrition.feather")

# Set the sample size to 70
sample_size = 70

# Calculate the population size from attrition
pop_size = len(attrition)

# Calculate the interval
interval = pop_size//sample_size

# Systematically sample 70 rows
attrition_sys_samp = attrition.iloc[::interval]

# Print the sample
print(attrition_sys_samp)
      Age  Attrition BusinessTravel  DailyRate            Department  \
0      21        0.0  Travel_Rarely        391  Research_Development   
21     19        0.0  Travel_Rarely       1181  Research_Development   
42     45        0.0  Travel_Rarely        252  Research_Development   
63     23        0.0  Travel_Rarely        373  Research_Development   
84     30        1.0  Travel_Rarely        945                 Sales   
...   ...        ...            ...        ...                   ...   
1365   48        0.0  Travel_Rarely        715  Research_Development   
1386   48        0.0  Travel_Rarely       1355  Research_Development   
1407   50        0.0  Travel_Rarely        989  Research_Development   
1428   50        0.0     Non-Travel        881  Research_Development   
1449   52        0.0  Travel_Rarely        699  Research_Development   

      DistanceFromHome      Education EducationField EnvironmentSatisfaction  \
0                   15        College  Life_Sciences                    High   
21                   3  Below_College        Medical                  Medium   
42                   2       Bachelor  Life_Sciences                  Medium   
63                   1        College  Life_Sciences               Very_High   
84                   9       Bachelor        Medical                  Medium   
...                ...            ...            ...                     ...   
1365                 1       Bachelor  Life_Sciences               Very_High   
1386                 4         Master  Life_Sciences                    High   
1407                 7        College        Medical                  Medium   
1428                 2         Master  Life_Sciences                     Low   
1449                 1         Master  Life_Sciences                    High   

      Gender  ...  PerformanceRating RelationshipSatisfaction  \
0       Male  ...          Excellent                Very_High   
21    Female  ...          Excellent                Very_High   
42    Female  ...          Excellent                Very_High   
63      Male  ...        Outstanding                Very_High   
84      Male  ...          Excellent                     High   
...      ...  ...                ...                      ...   
1365    Male  ...          Excellent                     High   
1386    Male  ...          Excellent                   Medium   
1407  Female  ...          Excellent                Very_High   
1428    Male  ...          Excellent                Very_High   
1449    Male  ...          Excellent                      Low   

      StockOptionLevel TotalWorkingYears TrainingTimesLastYear  \
0                    0                 0                     6   
21                   0                 1                     3   
42                   0                 1                     3   
63                   1                 1                     2   
84                   0                 1                     3   
...                ...               ...                   ...   
1365                 0                25                     3   
1386                 0                27                     3   
1407                 1                29                     2   
1428                 1                31                     3   
1449                 1                34                     5   

     WorkLifeBalance  YearsAtCompany  YearsInCurrentRole  \
0             Better               0                   0   
21            Better               1                   0   
42            Better               1                   0   
63            Better               1                   0   
84              Good               1                   0   
...              ...             ...                 ...   
1365            Best               1                   0   
1386          Better              15                  11   
1407            Good              27                   3   
1428          Better              31                   6   
1449          Better              33                  18   

      YearsSinceLastPromotion YearsWithCurrManager  
0                           0                    0  
21                          0                    0  
42                          0                    0  
63                          0                    1  
84                          0                    0  
...                       ...                  ...  
1365                        0                    0  
1386                        4                    8  
1407                       13                    8  
1428                       14                    7  
1449                       11                    9  

[70 rows x 31 columns]

2.4 Exercise 2.1.3

Is systematic sampling OK?

Systematic sampling has a problem: if the data has been sorted, or there is some sort of pattern or meaning behind the row order, then the resulting sample may not be representative of the whole population. The problem can be solved by shuffling the rows, but then systematic sampling is equivalent to simple random sampling.

Here you’ll look at how to determine whether or not there is a problem.

Instructions

  1. Add an index column to attrition, assigning the result to attrition_id.
    • Create a scatter plot of YearsAtCompany versus index for attrition_id using pandas .plot().
  2. Randomly shuffle the rows of attrition.
    • Reset the row indexes, and add an index column to attrition.
    • Repeat the scatter plot of YearsAtCompany versus index, this time using attrition_shuffled.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the course dataset
attrition = pd.read_feather("datasets/attrition.feather")

# Add an index column to attrition
attrition_id = attrition.reset_index()

# Plot YearsAtCompany vs. index for attrition_id
attrition_id.plot(x="index", y="YearsAtCompany", kind="scatter")
plt.show()

# Shuffle the rows of attrition
attrition_shuffled = attrition.sample(frac=1)

# Reset the row indexes and create an index column
attrition_shuffled = attrition_shuffled.reset_index(drop=True).reset_index()

# Plot YearsAtCompany vs. index for attrition_shuffled
attrition_shuffled.plot(x="index", y="YearsAtCompany", kind="scatter")
plt.show()

Question

Does a systematic sample always produce a sample similar to a simple random sample?

No. Systematic sampling has problems when the data are sorted or contain a pattern. Shuffling the rows makes it equivalent to simple random sampling.

2.5 Chapter 2.2: Stratified and weighted random sampling

Stratified sampling is a technique that allows us to sample a population that contains subgroups.

Coffees by country

For example, we could group the coffee ratings by country. If we count the number of coffees by country using the value_counts method, we can see that these six countries have the most data.

Note: The dataset lists Hawaii and Taiwan as countries for convenience, as they are notable coffee-growing regions.

Filtering for 6 countries

To make it easier to think about sampling subgroups, let’s limit our analysis to these six countries. We can use the .isin method to filter the population and only return the rows corresponding to these six countries. This filtered dataset is stored as coffee_ratings_top.

Counts of a simple random sample

Let’s take a ten percent simple random sample of the dataset using .sample with frac set to 0.1. We also set the random_state argument to ensure reproducibility. As with the whole dataset, we can look at the counts for each country. To make comparisons easier, we set normalize to True to convert the counts into a proportion, which shows what proportion of coffees in the sample came from each country.
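
A sketch of the filtering and sampling steps. The country list and the country_of_origin column name are assumptions for this sketch; in practice you would take the six largest groups returned by value_counts.

# Six countries assumed to have the most rows (check with value_counts first)
top_counted_countries = ["Mexico", "Colombia", "Guatemala", "Brazil",
                         "Taiwan", "United States (Hawaii)"]

# Keep only the rows from those six countries
top_counted_subset = coffee_ratings["country_of_origin"].isin(top_counted_countries)
coffee_ratings_top = coffee_ratings[top_counted_subset]

# Ten percent simple random sample, seeded for reproducibility
coffee_ratings_samp = coffee_ratings_top.sample(frac=0.1, random_state=2021)

# Proportion of the sample coming from each country
print(coffee_ratings_samp["country_of_origin"].value_counts(normalize=True))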

Comparing proportions

Here are the proportions for the population and the ten percent sample side by side. Just by chance, in this sample, Taiwanese coffees form a disproportionately low percentage. The different makeup of the sample compared to the population could be a problem if we want to analyze the country of origin, for example.

Proportional stratified sampling

If we care about the proportions of each country in the sample closely matching those in the population, then we can group the data by country before taking the simple random sample. Note that we used the Python line continuation backslash here, which can be useful for breaking up longer chains of pandas code like this. Calling the .sample method after grouping takes a simple random sample within each country. Now the proportions of each country in the stratified sample are much closer to those in the population.

Equal counts stratified sampling

One variation of stratified sampling is to sample equal counts from each group, rather than an equal proportion. The code only has one change from before. This time, we use the n argument in .sample instead of frac to extract fifteen randomly-selected rows from each country. Here, the resulting sample has equal proportions of one-sixth from each country.
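
A sketch of both stratified variants, reusing coffee_ratings_top from the previous sketch:

# Proportional stratified sampling: a 10% simple random sample within each country
coffee_ratings_strat = coffee_ratings_top.groupby("country_of_origin")\
    .sample(frac=0.1, random_state=2021)
print(coffee_ratings_strat["country_of_origin"].value_counts(normalize=True))

# Equal counts stratified sampling: 15 rows from each country
coffee_ratings_eq = coffee_ratings_top.groupby("country_of_origin")\
    .sample(n=15, random_state=2021)
print(coffee_ratings_eq["country_of_origin"].value_counts(normalize=True))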

Weighted random sampling

A close relative of stratified sampling that provides even more flexibility is weighted random sampling. In this variant, we create a column of weights that adjust the relative probability of sampling each row. For example, suppose we thought that it was important to have a higher proportion of Taiwanese coffees in the sample than in the population. We create a condition, in this case, rows where the country of origin is Taiwan. Using the where function from NumPy, we can set a weight of two for rows that match the condition and a weight of one for rows that don’t match the condition. This means when each row is randomly sampled, Taiwanese coffees have two times the chance of being picked compared to other coffees. When we call .sample, we pass the column of weights to the weights argument.
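
A sketch of the weighting step, again reusing the assumed coffee_ratings_top and country_of_origin names:

import numpy as np

# Condition: rows where the coffee comes from Taiwan
condition = coffee_ratings_top["country_of_origin"] == "Taiwan"

# Weight of 2 for Taiwanese coffees, 1 for everything else
coffee_ratings_weight = coffee_ratings_top.copy()
coffee_ratings_weight["weight"] = np.where(condition, 2, 1)

# Weighted random sample: Taiwanese rows are twice as likely to be picked
coffee_ratings_weight_samp = coffee_ratings_weight.sample(frac=0.1, weights="weight")
print(coffee_ratings_weight_samp["country_of_origin"].value_counts(normalize=True))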

Weighted random sampling results

Here, we can see that Taiwan now contains seventeen percent of the sampled dataset, compared to eight-point-five percent in the population. This sort of weighted sampling is common in political polling, where we need to correct for under- or over-representation of demographic groups.

2.5.1 Exercise 2.2.1

Proportional stratified sampling

If you are interested in subgroups within the population, then you may need to carefully control the counts of each subgroup within the population. Proportional stratified sampling results in subgroup sizes within the sample that are representative of the subgroup sizes within the population. It is equivalent to performing a simple random sample on each subgroup.

Instructions

  1. Get the proportion of employees by Education level from attrition.
  2. Use proportional stratified sampling on attrition to sample 40% of each Education group, setting the seed to 2022.
  3. Get the proportion of employees by Education level from attrition_strat.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the course dataset
attrition = pd.read_feather("datasets/attrition.feather")

# Proportion of employees by Education level
education_counts_pop = attrition['Education'].value_counts(normalize=True)

# Print education_counts_pop
print(education_counts_pop)

# Proportional stratified sampling for 40% of each Education group
attrition_strat = attrition.groupby('Education')\
.sample(frac=0.4, random_state=2022)

# Print the sample
print(attrition_strat)

# Calculate the Education level proportions from attrition_strat
education_counts_strat = attrition_strat['Education'].value_counts(normalize=True)

# Print education_counts_strat
print(education_counts_strat)
Education
Bachelor         0.389116
Master           0.270748
College          0.191837
Below_College    0.115646
Doctor           0.032653
Name: proportion, dtype: float64
      Age  Attrition     BusinessTravel  DailyRate            Department  \
1191   53        0.0      Travel_Rarely        238                 Sales   
407    29        0.0  Travel_Frequently        995  Research_Development   
1233   59        0.0  Travel_Frequently       1225                 Sales   
366    37        0.0      Travel_Rarely        571  Research_Development   
702    31        0.0  Travel_Frequently        163  Research_Development   
...   ...        ...                ...        ...                   ...   
733    38        0.0  Travel_Frequently        653  Research_Development   
1061   44        0.0  Travel_Frequently        602       Human_Resources   
1307   41        0.0      Travel_Rarely       1276                 Sales   
1060   33        0.0      Travel_Rarely        516  Research_Development   
177    29        0.0      Travel_Rarely        738  Research_Development   

      DistanceFromHome      Education    EducationField  \
1191                 1  Below_College           Medical   
407                  2  Below_College     Life_Sciences   
1233                 1  Below_College     Life_Sciences   
366                 10  Below_College     Life_Sciences   
702                 24  Below_College  Technical_Degree   
...                ...            ...               ...   
733                 29         Doctor     Life_Sciences   
1061                 1         Doctor   Human_Resources   
1307                 2         Doctor     Life_Sciences   
1060                 8         Doctor     Life_Sciences   
177                  9         Doctor             Other   

     EnvironmentSatisfaction  Gender  ...  PerformanceRating  \
1191               Very_High  Female  ...        Outstanding   
407                      Low    Male  ...          Excellent   
1233                     Low  Female  ...          Excellent   
366                Very_High  Female  ...          Excellent   
702                Very_High  Female  ...        Outstanding   
...                      ...     ...  ...                ...   
733                Very_High  Female  ...          Excellent   
1061                     Low    Male  ...          Excellent   
1307                  Medium  Female  ...          Excellent   
1060               Very_High    Male  ...          Excellent   
177                   Medium    Male  ...          Excellent   

     RelationshipSatisfaction  StockOptionLevel TotalWorkingYears  \
1191                Very_High                 0                18   
407                 Very_High                 1                 6   
1233                Very_High                 0                20   
366                    Medium                 2                 6   
702                 Very_High                 0                 9   
...                       ...               ...               ...   
733                 Very_High                 0                10   
1061                     High                 0                14   
1307                   Medium                 1                22   
1060                      Low                 0                14   
177                      High                 0                 4   

     TrainingTimesLastYear WorkLifeBalance  YearsAtCompany  \
1191                     2            Best              14   
407                      0            Best               6   
1233                     2            Good               4   
366                      3            Good               5   
702                      3            Good               5   
...                    ...             ...             ...   
733                      2          Better              10   
1061                     3          Better              10   
1307                     2          Better              18   
1060                     6          Better               0   
177                      2          Better               3   

      YearsInCurrentRole  YearsSinceLastPromotion YearsWithCurrManager  
1191                   7                        8                   10  
407                    4                        1                    3  
1233                   3                        1                    3  
366                    3                        4                    3  
702                    4                        1                    4  
...                  ...                      ...                  ...  
733                    3                        9                    9  
1061                   7                        0                    2  
1307                  16                       11                    8  
1060                   0                        0                    0  
177                    2                        2                    2  

[588 rows x 31 columns]
Education
Bachelor         0.389456
Master           0.270408
College          0.192177
Below_College    0.115646
Doctor           0.032313
Name: proportion, dtype: float64
Note

By grouping then sampling, the size of each group in the sample is representative of the size of that group in the population.

2.6 Exercise 2.2.2

Equal counts stratified sampling

If one subgroup is larger than another subgroup in the population, but you don’t want to reflect that difference in your analysis, then you can use equal counts stratified sampling to generate samples where each subgroup has the same amount of data. For example, if you are analyzing blood types, O is the most common blood type worldwide, but you may wish to have equal amounts of O, A, B, and AB in your sample.

Instructions

  1. Use equal counts stratified sampling on attrition to get 30 employees from each Education group, setting the seed to 2022.
  2. Get the proportion of employees by Education level from attrition_eq.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the course dataset
attrition = pd.read_feather("datasets/attrition.feather")

# Proportion of employees by Education level
education_counts_pop = attrition['Education'].value_counts(normalize=True)

# Print education_counts_pop
print(education_counts_pop)

# Get 30 employees from each Education group
attrition_eq = attrition.groupby('Education')\
.sample(n=30, random_state=2022)

# Print the sample
print(attrition_eq)

# Get the proportions from attrition_eq
education_counts_eq = attrition_eq['Education'].value_counts(normalize=True)

# Print the results
print(education_counts_eq)
Education
Bachelor         0.389116
Master           0.270748
College          0.191837
Below_College    0.115646
Doctor           0.032653
Name: proportion, dtype: float64
      Age  Attrition     BusinessTravel  DailyRate            Department  \
1191   53        0.0      Travel_Rarely        238                 Sales   
407    29        0.0  Travel_Frequently        995  Research_Development   
1233   59        0.0  Travel_Frequently       1225                 Sales   
366    37        0.0      Travel_Rarely        571  Research_Development   
702    31        0.0  Travel_Frequently        163  Research_Development   
...   ...        ...                ...        ...                   ...   
774    33        0.0      Travel_Rarely        922  Research_Development   
869    45        0.0      Travel_Rarely       1015  Research_Development   
530    32        0.0      Travel_Rarely        120  Research_Development   
1049   48        0.0      Travel_Rarely        163                 Sales   
350    29        1.0      Travel_Rarely        408  Research_Development   

      DistanceFromHome      Education    EducationField  \
1191                 1  Below_College           Medical   
407                  2  Below_College     Life_Sciences   
1233                 1  Below_College     Life_Sciences   
366                 10  Below_College     Life_Sciences   
702                 24  Below_College  Technical_Degree   
...                ...            ...               ...   
774                  1         Doctor           Medical   
869                  5         Doctor           Medical   
530                  6         Doctor     Life_Sciences   
1049                 2         Doctor         Marketing   
350                 25         Doctor  Technical_Degree   

     EnvironmentSatisfaction  Gender  ...  PerformanceRating  \
1191               Very_High  Female  ...        Outstanding   
407                      Low    Male  ...          Excellent   
1233                     Low  Female  ...          Excellent   
366                Very_High  Female  ...          Excellent   
702                Very_High  Female  ...        Outstanding   
...                      ...     ...  ...                ...   
774                      Low  Female  ...          Excellent   
869                     High  Female  ...          Excellent   
530                     High    Male  ...        Outstanding   
1049                  Medium  Female  ...          Excellent   
350                     High  Female  ...          Excellent   

     RelationshipSatisfaction  StockOptionLevel TotalWorkingYears  \
1191                Very_High                 0                18   
407                 Very_High                 1                 6   
1233                Very_High                 0                20   
366                    Medium                 2                 6   
702                 Very_High                 0                 9   
...                       ...               ...               ...   
774                      High                 1                10   
869                       Low                 0                10   
530                       Low                 0                 8   
1049                      Low                 1                14   
350                    Medium                 0                 6   

     TrainingTimesLastYear WorkLifeBalance  YearsAtCompany  \
1191                     2            Best              14   
407                      0            Best               6   
1233                     2            Good               4   
366                      3            Good               5   
702                      3            Good               5   
...                    ...             ...             ...   
774                      2          Better               6   
869                      3          Better              10   
530                      2          Better               5   
1049                     2          Better               9   
350                      2            Best               2   

      YearsInCurrentRole  YearsSinceLastPromotion YearsWithCurrManager  
1191                   7                        8                   10  
407                    4                        1                    3  
1233                   3                        1                    3  
366                    3                        4                    3  
702                    4                        1                    4  
...                  ...                      ...                  ...  
774                    1                        0                    5  
869                    7                        1                    4  
530                    4                        1                    4  
1049                   7                        6                    7  
350                    2                        1                    1  

[150 rows x 31 columns]
Education
Below_College    0.2
College          0.2
Bachelor         0.2
Master           0.2
Doctor           0.2
Name: proportion, dtype: float64
Note

If you want each subgroup to have equal weight in your analysis, then equal counts stratified sampling is the appropriate technique.

2.7 Exercise 2.2.3

Weighted sampling

Stratified sampling provides rules about the probability of picking rows from your dataset at the subgroup level. A generalization of this is weighted sampling, which lets you specify rules about the probability of picking rows at the row level. The probability of picking any given row is proportional to the weight value for that row.

Instructions

  1. Plot YearsAtCompany from attrition as a histogram with bins of width 1 from 0 to 40.
  2. Sample 400 employees from attrition weighted by YearsAtCompany.
  3. Plot YearsAtCompany from attrition_weight as a histogram with bins of width 1 from 0 to 40.
  4. Which is higher? The mean YearsAtCompany from attrition or the mean YearsAtCompany from attrition_weight? Answer: The weighted sample mean is around 11, which is higher than the population mean of around 7. The fact that the two numbers are different means that the weighted simple random sample is biased.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing the course arrays
attrition = pd.read_feather("datasets/attrition.feather")

# Plot YearsAtCompany from attrition as a histogram
attrition['YearsAtCompany'].hist(bins=np.arange(0, 41, 1))
plt.show()

# Sample 400 employees weighted by YearsAtCompany
attrition_weight = attrition.sample(n=400, weights='YearsAtCompany')

# Print the sample
print(attrition_weight)

# Plot YearsAtCompany from attrition_weight as a histogram
attrition_weight['YearsAtCompany'].hist(bins=np.arange(0, 41, 1))
plt.show()

# The mean YearsAtCompany from attrition dataset 
print(attrition['YearsAtCompany'].mean())

# The mean YearsAtCompany from attrition_weight 
print(attrition_weight['YearsAtCompany'].mean())
      Age  Attrition     BusinessTravel  DailyRate            Department  \
1415   48        0.0      Travel_Rarely       1224  Research_Development   
1295   41        0.0      Travel_Rarely        582  Research_Development   
1176   36        0.0      Travel_Rarely        363  Research_Development   
1381   45        1.0      Travel_Rarely       1449                 Sales   
1359   43        0.0  Travel_Frequently        394                 Sales   
...   ...        ...                ...        ...                   ...   
1234   40        0.0  Travel_Frequently       1395  Research_Development   
894    43        0.0      Travel_Rarely        244       Human_Resources   
1020   31        0.0      Travel_Rarely        480  Research_Development   
180    42        0.0      Travel_Rarely        544       Human_Resources   
1231   37        0.0      Travel_Rarely       1239       Human_Resources   

      DistanceFromHome      Education    EducationField  \
1415                10       Bachelor     Life_Sciences   
1295                28         Master     Life_Sciences   
1176                 1       Bachelor  Technical_Degree   
1381                 2       Bachelor         Marketing   
1359                26        College     Life_Sciences   
...                ...            ...               ...   
1234                26       Bachelor           Medical   
894                  2       Bachelor     Life_Sciences   
1020                 7        College           Medical   
180                  2  Below_College  Technical_Degree   
1231                 8        College             Other   

     EnvironmentSatisfaction  Gender  ...  PerformanceRating  \
1415               Very_High    Male  ...          Excellent   
1295                     Low  Female  ...        Outstanding   
1176                    High  Female  ...        Outstanding   
1381                     Low  Female  ...          Excellent   
1359                    High    Male  ...          Excellent   
...                      ...     ...  ...                ...   
1234                  Medium  Female  ...          Excellent   
894                   Medium    Male  ...          Excellent   
1020                  Medium  Female  ...          Excellent   
180                     High    Male  ...          Excellent   
1231                    High    Male  ...          Excellent   

     RelationshipSatisfaction  StockOptionLevel TotalWorkingYears  \
1415                Very_High                 0                29   
1295                     High                 1                21   
1176                     High                 1                17   
1381                      Low                 0                26   
1359                Very_High                 2                25   
...                       ...               ...               ...   
1234                      Low                 1                20   
894                    Medium                 0                10   
1020                   Medium                 1                13   
180                      High                 1                 4   
1231                     High                 0                19   

     TrainingTimesLastYear WorkLifeBalance  YearsAtCompany  \
1415                     3          Better              22   
1295                     3          Better              20   
1176                     2          Better               7   
1381                     2          Better              24   
1359                     3            Best              25   
...                    ...             ...             ...   
1234                     2          Better              20   
894                      5          Better               9   
1020                     5             Bad              13   
180                      5          Better               3   
1231                     4            Good              10   

      YearsInCurrentRole  YearsSinceLastPromotion YearsWithCurrManager  
1415                  10                       12                    9  
1295                   7                        0                   10  
1176                   7                        7                    7  
1381                  10                        1                   11  
1359                  12                        4                   12  
...                  ...                      ...                  ...  
1234                   7                        2                   13  
894                    7                        1                    8  
1020                  10                        3                   12  
180                    2                        1                    0  
1231                   0                        4                    7  

[400 rows x 31 columns]

7.0081632653061225
11.22

2.8 Chapter 2.3: Cluster sampling

One problem with stratified sampling is that we need to collect data from every subgroup. When data collection is expensive, for example when we have to physically travel to a location to collect it, this can make the analysis prohibitively costly. There's a cheaper alternative called cluster sampling.

Stratified sampling vs. cluster sampling

The stratified sampling approach was to split the population into subgroups, then use simple random sampling on each of them. Cluster sampling means that we limit the number of subgroups in the analysis by picking a few of them with simple random sampling. We then perform simple random sampling on each subgroup as before.

Varieties of coffee

Let’s return to the coffee dataset and look at the varieties of coffee. In this image, each bean represents the whole subgroup rather than an individual coffee, and there are twenty-eight of them. To extract unique varieties, we use the .unique method. This returns an array, so wrapping it in the list function creates a list of unique varieties. Let’s suppose that it’s expensive to work with all of the different varieties. Enter cluster sampling.

Stage 1: sampling for subgroups

The first stage of cluster sampling is to randomly cut down the number of varieties, and we do this by randomly selecting them. Here, we’ve used the random.sample function from the random package to get three varieties, specified using the argument k.
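
Continuing the sketch above, stage one might look like the following (the seed value is arbitrary, chosen only so the result is reproducible):

import random

# Stage 1: randomly pick three varieties (clusters) from the population of varieties
random.seed(2021)
varieties_samp = random.sample(varieties_pop, k=3)
print(varieties_samp)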

Stage 2: sampling each group

The second stage of cluster sampling is to perform simple random sampling on each of the three varieties we randomly selected. We first filter the dataset for rows where the variety is one of the three selected, using the .isin method. The filtered variety column still remembers the category levels that now have zero rows, so we call the cat.remove_unused_categories method on it to drop them; if we skip this step, we might receive an error when sampling by variety level. The pandas code is then the same as for stratified sampling. Here, we've opted for equal counts sampling, with five rows from each remaining variety.
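
A sketch of stage two, continuing from the previous snippets (this assumes variety is stored as a pandas categorical, which is why .cat.remove_unused_categories is available):

# Stage 2: simple random sampling within each selected variety
variety_condition = coffee_ratings['variety'].isin(varieties_samp)
coffee_sampled = coffee_ratings[variety_condition].copy()

# Drop the category levels that now have zero rows, otherwise grouping by
# variety may keep empty groups or raise an error when sampling
coffee_sampled['variety'] = coffee_sampled['variety'].cat.remove_unused_categories()

# Equal counts: five rows from each of the three remaining varieties
coffee_cluster = coffee_sampled.groupby('variety')\
    .sample(n=5, random_state=2021)
print(coffee_cluster.shape)  # expect 15 rows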

Stage 2 output

Here are the first few columns of the result. Notice that there are fifteen rows, which is what we'd expect from sampling five rows from each of three varieties.

Multistage sampling

Note that we had two stages in the cluster sampling. We randomly sampled the subgroups to include, then we randomly sampled rows from those subgroups. Cluster sampling is a special case of multistage sampling. It’s possible to use more than two stages. A common example is national surveys, which can include several levels of administrative regions, like states, counties, cities, and neighborhoods.

2.9 Exercise 2.3.1

Performing cluster sampling

Now that you know when to use cluster sampling, it’s time to put it into action. In this exercise, you’ll explore the JobRole column of the attrition dataset. You can think of each job role as a subgroup of the whole population of employees.

Use a seed of 19790801 to set the seed with random.seed().

Instructions

  1. Create a list of unique JobRole values from attrition, and assign to job_roles_pop. Randomly sample four JobRole values from job_roles_pop.
  2. Subset attrition for the sampled job roles by filtering for rows where JobRole is in job_roles_samp.
  3. Remove any unused categories from JobRole. For each job role in the filtered dataset, take a random sample of ten rows, setting the seed to 2022.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course arrays
attrition = pd.read_feather("datasets/attrition.feather")

# Set the seed
random.seed(19790801)

# Create a list of unique JobRole values
job_roles_pop = list(attrition['JobRole'].unique())

# Randomly sample four JobRole values
job_roles_samp = random.sample(job_roles_pop, k=4)

# Print the result
print(job_roles_samp)

# Filter for rows where JobRole is in job_roles_samp
jobrole_condition = attrition['JobRole'].isin(job_roles_samp)
attrition_filtered = attrition[jobrole_condition]

# Print the result
print(attrition_filtered)

# Remove categories with no rows
attrition_filtered['JobRole'] = attrition_filtered['JobRole'].cat.remove_unused_categories()

# Randomly sample 10 employees from each sampled job role
attrition_clust = attrition_filtered.groupby('JobRole')\
.sample(n=10, random_state=2022)


# Print the sample
print(attrition_clust)
['Research_Director', 'Research_Scientist', 'Human_Resources', 'Manager']
      Age  Attrition BusinessTravel  DailyRate            Department  \
0      21        0.0  Travel_Rarely        391  Research_Development   
5      27        0.0     Non-Travel        443  Research_Development   
6      18        0.0     Non-Travel        287  Research_Development   
10     18        0.0     Non-Travel       1431  Research_Development   
17     31        0.0  Travel_Rarely       1082  Research_Development   
...   ...        ...            ...        ...                   ...   
1462   54        0.0  Travel_Rarely        584  Research_Development   
1464   55        0.0  Travel_Rarely        452  Research_Development   
1465   55        0.0  Travel_Rarely       1117                 Sales   
1466   58        0.0     Non-Travel        350                 Sales   
1469   58        1.0  Travel_Rarely        286  Research_Development   

      DistanceFromHome Education EducationField EnvironmentSatisfaction  \
0                   15   College  Life_Sciences                    High   
5                    3  Bachelor        Medical               Very_High   
6                    5   College  Life_Sciences                  Medium   
10                  14  Bachelor        Medical                  Medium   
17                   1    Master        Medical                    High   
...                ...       ...            ...                     ...   
1462                22    Doctor        Medical                  Medium   
1464                 1  Bachelor        Medical               Very_High   
1465                18    Doctor  Life_Sciences                     Low   
1466                 2  Bachelor        Medical                  Medium   
1469                 2    Master  Life_Sciences               Very_High   

      Gender  ...  PerformanceRating RelationshipSatisfaction  \
0       Male  ...          Excellent                Very_High   
5       Male  ...          Excellent                     High   
6       Male  ...          Excellent                Very_High   
10    Female  ...          Excellent                     High   
17      Male  ...          Excellent                   Medium   
...      ...  ...                ...                      ...   
1462  Female  ...        Outstanding                     High   
1464    Male  ...          Excellent                     High   
1465  Female  ...        Outstanding                Very_High   
1466    Male  ...        Outstanding                Very_High   
1469    Male  ...          Excellent                Very_High   

      StockOptionLevel TotalWorkingYears TrainingTimesLastYear  \
0                    0                 0                     6   
5                    3                 0                     6   
6                    0                 0                     2   
10                   0                 0                     4   
17                   0                 1                     4   
...                ...               ...                   ...   
1462                 1                36                     6   
1464                 0                37                     2   
1465                 0                37                     2   
1466                 1                37                     0   
1469                 0                40                     2   

     WorkLifeBalance  YearsAtCompany  YearsInCurrentRole  \
0             Better               0                   0   
5               Good               0                   0   
6             Better               0                   0   
10               Bad               0                   0   
17            Better               1                   1   
...              ...             ...                 ...   
1462          Better              10                   8   
1464          Better              36                  10   
1465          Better              10                   9   
1466            Good              16                   9   
1469          Better              31                  15   

      YearsSinceLastPromotion YearsWithCurrManager  
0                           0                    0  
5                           0                    0  
6                           0                    0  
10                          0                    0  
17                          1                    0  
...                       ...                  ...  
1462                        4                    7  
1464                        4                   13  
1465                        7                    7  
1466                       14                   14  
1469                       13                    8  

[526 rows x 31 columns]
      Age  Attrition     BusinessTravel  DailyRate            Department  \
1348   44        1.0      Travel_Rarely       1376       Human_Resources   
886    41        0.0         Non-Travel        552       Human_Resources   
983    39        0.0      Travel_Rarely        141       Human_Resources   
88     27        1.0  Travel_Frequently       1337       Human_Resources   
189    34        0.0      Travel_Rarely        829       Human_Resources   
160    24        0.0  Travel_Frequently        897       Human_Resources   
839    46        0.0      Travel_Rarely        991       Human_Resources   
966    30        0.0      Travel_Rarely       1240       Human_Resources   
162    28        0.0         Non-Travel        280       Human_Resources   
1231   37        0.0      Travel_Rarely       1239       Human_Resources   
1375   44        0.0      Travel_Rarely       1315  Research_Development   
1462   54        0.0      Travel_Rarely        584  Research_Development   
1316   45        0.0  Travel_Frequently        364  Research_Development   
1356   48        0.0  Travel_Frequently        117  Research_Development   
1387   48        0.0         Non-Travel       1262  Research_Development   
1321   54        0.0         Non-Travel        142       Human_Resources   
1266   50        0.0      Travel_Rarely       1452  Research_Development   
1330   46        0.0      Travel_Rarely        406                 Sales   
1052   59        0.0      Travel_Rarely       1089                 Sales   
1449   52        0.0      Travel_Rarely        699  Research_Development   
1439   58        0.0      Travel_Rarely       1055  Research_Development   
1339   58        0.0  Travel_Frequently       1216  Research_Development   
1426   49        0.0      Travel_Rarely       1245  Research_Development   
1415   48        0.0      Travel_Rarely       1224  Research_Development   
1322   51        0.0      Travel_Rarely        684  Research_Development   
1284   40        0.0      Travel_Rarely       1308  Research_Development   
1149   37        0.0      Travel_Rarely        161  Research_Development   
1126   42        0.0      Travel_Rarely        810  Research_Development   
1374   46        0.0      Travel_Rarely       1009  Research_Development   
1050   33        0.0      Travel_Rarely        213  Research_Development   
86     26        0.0      Travel_Rarely        482  Research_Development   
930    52        1.0      Travel_Rarely        723  Research_Development   
860    37        0.0      Travel_Rarely        674  Research_Development   
36     20        1.0      Travel_Rarely       1362  Research_Development   
997    32        0.0      Travel_Rarely        824  Research_Development   
1358   45        0.0      Travel_Rarely       1339  Research_Development   
993    41        0.0  Travel_Frequently       1200  Research_Development   
421    34        0.0      Travel_Rarely        181  Research_Development   
789    28        1.0      Travel_Rarely        654  Research_Development   
94     36        1.0      Travel_Rarely        318  Research_Development   

      DistanceFromHome      Education    EducationField  \
1348                 1        College           Medical   
886                  4       Bachelor   Human_Resources   
983                  3       Bachelor   Human_Resources   
88                  22       Bachelor   Human_Resources   
189                  3        College   Human_Resources   
160                 10       Bachelor           Medical   
839                  1        College     Life_Sciences   
966                  9       Bachelor   Human_Resources   
162                  1        College     Life_Sciences   
1231                 8        College             Other   
1375                 3         Master             Other   
1462                22         Doctor           Medical   
1316                25       Bachelor           Medical   
1356                22       Bachelor           Medical   
1387                 1         Master           Medical   
1321                26       Bachelor   Human_Resources   
1266                11       Bachelor     Life_Sciences   
1330                 3  Below_College         Marketing   
1052                 1        College  Technical_Degree   
1449                 1         Master     Life_Sciences   
1439                 1       Bachelor           Medical   
1339                15         Master     Life_Sciences   
1426                18         Master     Life_Sciences   
1415                10       Bachelor     Life_Sciences   
1322                 6       Bachelor     Life_Sciences   
1284                14       Bachelor           Medical   
1149                10       Bachelor     Life_Sciences   
1126                23         Doctor     Life_Sciences   
1374                 2       Bachelor     Life_Sciences   
1050                 7       Bachelor           Medical   
86                   1        College     Life_Sciences   
930                  8         Master           Medical   
860                 13       Bachelor           Medical   
36                  10  Below_College           Medical   
997                  5        College     Life_Sciences   
1358                 7       Bachelor     Life_Sciences   
993                 22       Bachelor     Life_Sciences   
421                  2         Master           Medical   
789                  1        College     Life_Sciences   
94                   9       Bachelor           Medical   

     EnvironmentSatisfaction  Gender  ...  PerformanceRating  \
1348                  Medium    Male  ...          Excellent   
886                     High    Male  ...          Excellent   
983                     High  Female  ...          Excellent   
88                       Low  Female  ...          Excellent   
189                     High    Male  ...          Excellent   
160                      Low    Male  ...          Excellent   
839                Very_High  Female  ...          Excellent   
966                     High    Male  ...          Excellent   
162                     High    Male  ...          Excellent   
1231                    High    Male  ...          Excellent   
1375               Very_High    Male  ...          Excellent   
1462                  Medium  Female  ...        Outstanding   
1316                  Medium  Female  ...        Outstanding   
1356               Very_High  Female  ...          Excellent   
1387                     Low    Male  ...        Outstanding   
1321               Very_High  Female  ...          Excellent   
1266                    High  Female  ...          Excellent   
1330                     Low    Male  ...          Excellent   
1052                  Medium    Male  ...          Excellent   
1449                    High    Male  ...          Excellent   
1439               Very_High  Female  ...        Outstanding   
1339                     Low    Male  ...          Excellent   
1426               Very_High    Male  ...          Excellent   
1415               Very_High    Male  ...          Excellent   
1322                     Low    Male  ...          Excellent   
1284                    High    Male  ...          Excellent   
1149                    High  Female  ...        Outstanding   
1126                     Low  Female  ...          Excellent   
1374                     Low    Male  ...          Excellent   
1050                    High    Male  ...          Excellent   
86                    Medium  Female  ...          Excellent   
930                     High    Male  ...          Excellent   
860                      Low    Male  ...          Excellent   
36                 Very_High    Male  ...          Excellent   
997                Very_High  Female  ...          Excellent   
1358                  Medium    Male  ...          Excellent   
993                Very_High  Female  ...          Excellent   
421                Very_High    Male  ...          Excellent   
789                      Low  Female  ...          Excellent   
94                 Very_High    Male  ...          Excellent   

     RelationshipSatisfaction  StockOptionLevel TotalWorkingYears  \
1348                Very_High                 1                24   
886                    Medium                 1                10   
983                      High                 1                12   
88                        Low                 0                 1   
189                      High                 1                 4   
160                 Very_High                 1                 3   
839                      High                 0                10   
966                 Very_High                 0                12   
162                    Medium                 1                 3   
1231                     High                 0                19   
1375                      Low                 1                26   
1462                     High                 1                36   
1316                     High                 0                22   
1356                   Medium                 1                24   
1387                     High                 0                27   
1321                     High                 0                23   
1266                   Medium                 0                21   
1330                Very_High                 1                23   
1052                     High                 1                14   
1449                      Low                 1                34   
1439                     High                 1                32   
1339                   Medium                 0                23   
1426                     High                 1                31   
1415                Very_High                 0                29   
1322                     High                 0                23   
1284                      Low                 0                21   
1149                      Low                 1                16   
1126                   Medium                 0                16   
1374                     High                 0                26   
1050                Very_High                 0                14   
86                       High                 1                 1   
930                       Low                 0                11   
860                       Low                 0                10   
36                  Very_High                 0                 1   
997                       Low                 1                12   
1358                     High                 1                25   
993                       Low                 2                12   
421                       Low                 3                 6   
789                 Very_High                 0                10   
94                        Low                 1                 2   

     TrainingTimesLastYear WorkLifeBalance  YearsAtCompany  \
1348                     1          Better              20   
886                      4          Better               3   
983                      3             Bad               8   
88                       2          Better               1   
189                      1             Bad               3   
160                      2          Better               2   
839                      3            Best               7   
966                      2             Bad              11   
162                      2          Better               3   
1231                     4            Good              10   
1375                     2            Best               2   
1462                     6          Better              10   
1316                     4          Better               0   
1356                     3          Better              22   
1387                     3            Good               5   
1321                     3          Better               5   
1266                     5          Better               5   
1330                     3          Better              12   
1052                     1             Bad               6   
1449                     5          Better              33   
1439                     3          Better               9   
1339                     3          Better               2   
1426                     5          Better              31   
1415                     3          Better              22   
1322                     5          Better              20   
1284                     2            Best              20   
1149                     2          Better              16   
1126                     2          Better               1   
1374                     2             Bad               3   
1050                     3            Best              13   
86                       3            Good               1   
930                      3            Good               8   
860                      2          Better              10   
36                       5          Better               1   
997                      2          Better               7   
1358                     2          Better               1   
993                      4            Good               6   
421                      3          Better               5   
789                      4          Better               7   
94                       0            Good               1   

      YearsInCurrentRole  YearsSinceLastPromotion YearsWithCurrManager  
1348                   6                        3                    6  
886                    2                        1                    2  
983                    3                        3                    6  
88                     0                        0                    0  
189                    2                        0                    2  
160                    2                        2                    1  
839                    6                        5                    7  
966                    9                        4                    7  
162                    2                        2                    2  
1231                   0                        4                    7  
1375                   2                        0                    1  
1462                   8                        4                    7  
1316                   0                        0                    0  
1356                  17                        4                    7  
1387                   4                        2                    1  
1321                   3                        4                    4  
1266                   4                        4                    4  
1330                   9                        4                    9  
1052                   4                        0                    4  
1449                  18                       11                    9  
1439                   8                        1                    5  
1339                   2                        2                    2  
1426                   9                        0                    9  
1415                  10                       12                    9  
1322                  18                       15                   15  
1284                   7                        4                    9  
1149                  11                        6                    8  
1126                   0                        0                    0  
1374                   2                        0                    1  
1050                   9                        3                    7  
86                     0                        1                    0  
930                    2                        7                    7  
860                    8                        3                    7  
36                     0                        1                    1  
997                    1                        2                    5  
1358                   0                        0                    0  
993                    2                        3                    3  
421                    0                        1                    2  
789                    7                        3                    7  
94                     0                        0                    0  

[40 rows x 31 columns]

2.10 Chapter 2.4: Comparing sampling methods

Let’s review the various sampling techniques we learned about.

Review of sampling techniques - setup

For convenience, we’ll stick to the six countries with the most coffee varieties that we used before. This corresponds to eight hundred and eighty rows and eight columns, retrieved using the .shape attribute.

Review of simple random sampling

Simple random sampling uses .sample with either n or frac set to determine how many rows to pseudo-randomly choose, given a seed value set with random_state. The simple random sample returns two hundred and ninety-three rows because we specified frac as one-third, and one-third of eight hundred and eighty is just over two hundred and ninety-three.
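
A minimal sketch of this step, assuming the six-country subset is stored in a DataFrame named coffee_ratings_top (the name and seed are assumptions):

# Simple random sample of one third of the rows
coffee_ratings_srs = coffee_ratings_top.sample(frac=1/3, random_state=2021)
print(coffee_ratings_srs.shape)  # roughly one third of 880 rows, i.e. 293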

Review of stratified sampling

Stratified sampling groups by the country subgroup before performing simple random sampling on each subgroup. Given each of these top countries have quite a few rows, stratifying produces the same number of rows as the simple random sample.
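
Continuing the sketch, stratified sampling only adds a groupby before the same .sample call:

# Stratified sample: one third of each country's rows
coffee_ratings_strat = coffee_ratings_top.groupby('country_of_origin')\
    .sample(frac=1/3, random_state=2021)
print(coffee_ratings_strat.shape)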

Review of cluster sampling

In the cluster sample, we’ve used two out of six countries to roughly mimic frac equals one-third from the other sample types. Setting n equal to one-sixth of the total number of rows gives roughly equal sample sizes in each of the two subgroups. Using .shape again, we see that this cluster sample has close to the same number of rows: two-hundred-ninety-two compared to two-hundred-ninety-three for the other sample types.
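
A sketch of the cluster sample, continuing from above (it assumes country_of_origin is stored as a categorical, as with variety earlier):

import random

# Stage 1: pick two of the six countries
top_countries = list(coffee_ratings_top['country_of_origin'].unique())
countries_samp = random.sample(top_countries, k=2)

# Stage 2: filter to those countries, drop unused category levels, then sample
country_condition = coffee_ratings_top['country_of_origin'].isin(countries_samp)
coffee_ratings_filtered = coffee_ratings_top[country_condition].copy()
coffee_ratings_filtered['country_of_origin'] = \
    coffee_ratings_filtered['country_of_origin'].cat.remove_unused_categories()

# n is one sixth of the total, so two clusters give roughly a third of the rows
coffee_ratings_clust = coffee_ratings_filtered.groupby('country_of_origin')\
    .sample(n=len(coffee_ratings_top) // 6, random_state=2021)
print(coffee_ratings_clust.shape)  # about 292 rows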

Calculating mean cup points

Let’s calculate a population parameter, the mean of the total cup points. When the population parameter is the mean of a field, it’s often called the population mean. Remember that in real-life scenarios, we typically wouldn’t know what the population mean is. Since we have it here, though, we can use this value of eighty-one-point-nine as a gold standard to measure against. Now we’ll calculate the same value using each of the sampling techniques we’ve discussed. These are point estimates of the mean, often called sample means. The simple and stratified sample means are really close to the population mean. Cluster sampling isn’t quite as close, but that’s typical. Cluster sampling is designed to give us an answer that’s almost as good while using less data.
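
Using the three samples from the sketches above, the comparison boils down to four means:

# Population parameter (the gold standard) versus each point estimate
print(coffee_ratings_top['total_cup_points'].mean())    # population mean
print(coffee_ratings_srs['total_cup_points'].mean())    # simple random sample mean
print(coffee_ratings_strat['total_cup_points'].mean())  # stratified sample mean
print(coffee_ratings_clust['total_cup_points'].mean())  # cluster sample mean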

Mean cup points by country: simple random

Here’s a slightly more complicated calculation of the mean total cup points for each country. We group by country before calculating the mean to return six numbers. So how do the numbers from the simple random sample compare? The sample means are pretty close to the population means.
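
Sketched with the same assumed names, the per-country comparison is a grouped mean:

# Mean cup points by country: population versus simple random sample
print(coffee_ratings_top.groupby('country_of_origin')['total_cup_points'].mean())
print(coffee_ratings_srs.groupby('country_of_origin')['total_cup_points'].mean())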

Mean cup points by country: stratified

The same is true of the sample means from the stratified technique. Each sample mean is relatively close to the population mean.

Mean cup points by country: cluster

With cluster sampling, while the sample means are pretty close to the population means, the obvious limitation is that we only get values for the two countries that were included in the sample. If the mean cup points for each country is an important metric in our analysis, cluster sampling would be a bad idea.

2.11 Exercise 2.4.1

3 kinds of sampling

You’re going to compare the performance of point estimates using simple, stratified, and cluster sampling. Before doing that, you’ll have to set up the samples.

You’ll use the RelationshipSatisfaction column of the attrition dataset, which categorizes the employee’s relationship with the company. It has four levels: Low, Medium, High, and Very_High.

Instructions

  1. Perform simple random sampling on attrition to get one-quarter of the population, setting the seed to 2022.
  2. Perform stratified sampling on attrition to sample one-quarter of each RelationshipSatisfaction group, setting the seed to 2022.
  3. Create a list of unique values from attrition’s RelationshipSatisfaction column. Randomly sample satisfaction_unique to get two values. Subset the population for rows where RelationshipSatisfaction is in satisfaction_samp and clear any unused categories from RelationshipSatisfaction; assign to attrition_clust_prep. Perform cluster sampling on the selected satisfaction groups, sampling one quarter of the population and setting the seed to 2022.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course arrays
attrition = pd.read_feather("datasets/attrition.feather")

# Perform simple random sampling to get 0.25 of the population
attrition_srs = attrition.sample(frac=1/4, random_state=2022)

# Perform stratified sampling to get 0.25 of each relationship group
attrition_strat = attrition.groupby('RelationshipSatisfaction')\
.sample(frac=1/4, random_state=2022)

# Create a list of unique RelationshipSatisfaction values
satisfaction_unique = list(attrition['RelationshipSatisfaction'].unique())

# Randomly sample 2 unique satisfaction values
satisfaction_samp = random.sample(satisfaction_unique, k=2)

# Filter for satisfaction_samp and clear unused categories from RelationshipSatisfaction
satis_condition = attrition['RelationshipSatisfaction'].isin(satisfaction_samp)
attrition_clust_prep = attrition[satis_condition]
attrition_clust_prep['RelationshipSatisfaction'] = attrition_clust_prep['RelationshipSatisfaction'].cat.remove_unused_categories()

# Perform cluster sampling on the selected groups, sampling len(attrition) // 6 rows from each
attrition_clust = attrition_clust_prep.groupby("RelationshipSatisfaction")\
.sample(n=len(attrition) // 6, random_state=2022)

print(attrition_clust)
      Age  Attrition     BusinessTravel  DailyRate            Department  \
1381   45        1.0      Travel_Rarely       1449                 Sales   
1357   42        0.0      Travel_Rarely        300  Research_Development   
924    30        0.0      Travel_Rarely        288  Research_Development   
1224   46        0.0      Travel_Rarely       1003  Research_Development   
1277   48        0.0      Travel_Rarely       1236  Research_Development   
...   ...        ...                ...        ...                   ...   
357    27        0.0      Travel_Rarely        798  Research_Development   
424    44        1.0  Travel_Frequently        429  Research_Development   
1182   36        0.0  Travel_Frequently        884  Research_Development   
1055   34        0.0  Travel_Frequently        669  Research_Development   
962    34        0.0      Travel_Rarely       1031  Research_Development   

      DistanceFromHome Education EducationField EnvironmentSatisfaction  \
1381                 2  Bachelor      Marketing                     Low   
1357                 2  Bachelor  Life_Sciences                     Low   
924                  2  Bachelor  Life_Sciences                    High   
1224                 8    Master  Life_Sciences               Very_High   
1277                 1    Master  Life_Sciences               Very_High   
...                ...       ...            ...                     ...   
357                  6    Master        Medical                     Low   
424                  1   College        Medical                    High   
1182                23   College        Medical                    High   
1055                 1  Bachelor        Medical               Very_High   
962                  6    Master  Life_Sciences                    High   

      Gender  ...  PerformanceRating RelationshipSatisfaction  \
1381  Female  ...          Excellent                      Low   
1357    Male  ...          Excellent                      Low   
924     Male  ...          Excellent                      Low   
1224  Female  ...        Outstanding                      Low   
1277  Female  ...          Excellent                      Low   
...      ...  ...                ...                      ...   
357   Female  ...          Excellent                     High   
424     Male  ...          Excellent                     High   
1182    Male  ...          Excellent                     High   
1055    Male  ...        Outstanding                     High   
962   Female  ...          Excellent                     High   

      StockOptionLevel TotalWorkingYears TrainingTimesLastYear  \
1381                 0                26                     2   
1357                 0                24                     2   
924                  3                11                     3   
1224                 3                19                     2   
1277                 1                21                     3   
...                ...               ...                   ...   
357                  2                 6                     5   
424                  3                 6                     2   
1182                 1                17                     3   
1055                 0                14                     3   
962                  1                12                     3   

     WorkLifeBalance  YearsAtCompany  YearsInCurrentRole  \
1381          Better              24                  10   
1357            Good              22                   6   
924           Better              11                  10   
1224          Better              16                  13   
1277             Bad               3                   2   
...              ...             ...                 ...   
357             Good               5                   3   
424             Good               5                   3   
1182          Better               5                   2   
1055          Better              13                   9   
962           Better               1                   0   

      YearsSinceLastPromotion YearsWithCurrManager  
1381                        1                   11  
1357                        4                   14  
924                        10                    8  
1224                        1                    7  
1277                        0                    2  
...                       ...                  ...  
357                         0                    3  
424                         2                    3  
1182                        0                    3  
1055                        4                    9  
962                         0                    0  

[490 rows x 31 columns]

2.12 Exercise 2.4.4

Comparing point estimates

Now that you have three types of sample (simple, stratified, and cluster), you can compare point estimates from each sample to the population parameter. That is, you can calculate the same summary statistic on each sample and see how it compares to the summary statistic for the population.

Here, we’ll look at how satisfaction with the company affects whether or not the employee leaves the company. That is, you’ll calculate the proportion of employees who left the company (they have an Attrition value of 1) for each value of RelationshipSatisfaction.

Instructions

  1. Group attrition by RelationshipSatisfaction levels and calculate the mean of Attrition for each level.
  2. Calculate the proportion of employee attrition for each relationship satisfaction group, this time on the simple random sample, attrition_srs.
  3. Calculate the proportion of employee attrition for each relationship satisfaction group, this time on the stratified sample, attrition_strat.
  4. Calculate the proportion of employee attrition for each relationship satisfaction group, this time on the cluster sample, attrition_clust.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course arrays
attrition = pd.read_feather("datasets/attrition.feather")

# Perform simple random sampling to get 0.25 of the population
attrition_srs = attrition.sample(frac=1/4, random_state=2022)

# Perform stratified sampling to get 0.25 of each relationship group
attrition_strat = attrition.groupby('RelationshipSatisfaction')\
.sample(frac=1/4, random_state=2022)

# Mean Attrition by RelationshipSatisfaction group
mean_attrition_pop = attrition.groupby('RelationshipSatisfaction')\
['Attrition'].mean()

# Print the result
print(mean_attrition_pop)

# Calculate the same thing for the simple random sample 
mean_attrition_srs = attrition_srs.groupby('RelationshipSatisfaction')\
['Attrition'].mean()


# Print the result
print(mean_attrition_srs)

# Calculate the same thing for the stratified sample 
mean_attrition_strat = attrition_strat.groupby('RelationshipSatisfaction')\
['Attrition'].mean()


# Print the result
print(mean_attrition_strat)

# Calculate the same thing for the cluster sample 
mean_attrition_clust = attrition_clust.groupby('RelationshipSatisfaction')\
['Attrition'].mean()

# Print the result
print(mean_attrition_clust)
RelationshipSatisfaction
Low          0.206522
Medium       0.148515
High         0.154684
Very_High    0.148148
Name: Attrition, dtype: float64

RelationshipSatisfaction
Low          0.134328
Medium       0.164179
High         0.160000
Very_High    0.155963
Name: Attrition, dtype: float64
RelationshipSatisfaction
Low          0.144928
Medium       0.078947
High         0.165217
Very_High    0.129630
Name: Attrition, dtype: float64
RelationshipSatisfaction
Low     0.191837
High    0.134694
Name: Attrition, dtype: float64

3 CHAPTER 3: Sampling Distributions

Let’s test your sampling. In this chapter, you’ll discover how to quantify the accuracy of sample statistics using relative errors, and measure variation in your estimates by generating sampling distributions.

3.1 Chapter 3.1: Relative error of point estimates

Let’s see how the size of the sample affects the accuracy of the point estimates we calculate.

Sample size is number of rows

The sample size, calculated here with the len function, is the number of observations, that is, the number of rows in the sample. That’s true whichever method we use to create the sample. We’ll stick to looking at simple random sampling since it works well in most cases and it’s easier to reason about.

Various sample sizes

Let’s calculate a population parameter, the mean cup points of the coffees. It’s around eighty-two-point-one-five. This is our gold standard to compare against. If we take a sample size of ten, the point estimate of this parameter is wrong by about point-eight-eight. Increasing the sample size to one hundred gets us closer; the estimate is only wrong by about point-three-four. Increasing the sample size further to one thousand brings the estimate to about point-zero-three away from the population parameter. In general, larger sample sizes will give us more accurate results.
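
A rough sketch of this comparison, assuming the full ratings are in a DataFrame named coffee_ratings (exact error values vary from run to run because no seed is set):

# Absolute error of the sample mean at increasing sample sizes
population_mean = coffee_ratings['total_cup_points'].mean()

for n in [10, 100, 1000]:
    sample_mean = coffee_ratings.sample(n=n)['total_cup_points'].mean()
    print(n, abs(population_mean - sample_mean))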

Relative errors

For any of these sample sizes, we want to compare the population mean to the sample mean. This is the same code we just saw, but with the numerical sample size replaced with a variable named sample_size. The most common metric for assessing the difference between the population mean and a sample mean is the relative error. The relative error is the absolute difference between the two numbers (that is, ignoring any minus signs) divided by the population mean. Here, we also multiply by one hundred to make it a percentage.
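
In code, reusing population_mean from the previous sketch, the relative error for a given sample_size might look like this:

# Relative error of the sample mean, as a percentage
sample_size = 100
sample_mean = coffee_ratings.sample(n=sample_size)['total_cup_points'].mean()
rel_error_pct = 100 * abs(population_mean - sample_mean) / population_mean
print(rel_error_pct)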

Relative error vs. sample size

Here's a line plot of relative error versus sample size. We see that the relative error decreases as the sample size increases, and beyond that, the plot has other important properties. Firstly, the blue line is really noisy, particularly for small sample sizes. If our sample size is small, the sample mean can change wildly when just one or two more random rows are added to the sample. Secondly, the line is quite steep to begin with. When we have a small sample size, adding just a few more rows can give us much better accuracy. Further to the right of the plot, the line is less steep: if we already have a large sample size, adding a few more rows to the sample doesn't bring as much benefit. Finally, at the far right of the plot, where the sample size is the whole population, the relative error decreases to zero.
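
A sketch of how such a plot could be generated, continuing from the snippets above (the course's exact plotting code may differ):

import matplotlib.pyplot as plt

# Relative error for every sample size from 1 up to the whole population
sizes = range(1, len(coffee_ratings) + 1)
errors = []
for n in sizes:
    sample_mean = coffee_ratings.sample(n=n)['total_cup_points'].mean()
    errors.append(100 * abs(population_mean - sample_mean) / population_mean)

plt.plot(list(sizes), errors)
plt.xlabel("Sample size")
plt.ylabel("Relative error (%)")
plt.show()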

3.2 Exercise 3.1.1

Calculating relative errors

The size of the sample you take affects how accurately the point estimates reflect the corresponding population parameter. For example, when you calculate a sample mean, you want it to be close to the population mean. However, if your sample is too small, this might not be the case.

The most common metric for assessing accuracy is relative error. This is the absolute difference between the population parameter and the point estimate, all divided by the population parameter. It is sometimes expressed as a percentage.

Instructions

  1. Generate a simple random sample from attrition of fifty rows, setting the seed to 2022. Calculate the mean employee Attrition in the sample, then calculate the relative error between mean_attrition_srs50 and mean_attrition_pop as a percentage.
  2. Calculate the relative error percentage again. This time, use a simple random sample of one hundred rows of attrition.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course arrays
attrition = pd.read_feather("datasets/attrition.feather")

# Population Attrition mean
mean_attrition_pop = attrition['Attrition'].mean()

# Print the result
print(mean_attrition_pop)

# Generate a simple random sample of 50 rows, with seed 2022
attrition_srs50 = attrition.sample(n=50, random_state = 2022)

# Calculate the mean employee attrition in the sample
mean_attrition_srs50 = attrition_srs50['Attrition'].mean()

# Calculate the relative error percentage
rel_error_pct50 = 100 * abs(mean_attrition_pop - mean_attrition_srs50)/mean_attrition_pop

# Print rel_error_pct50
print(rel_error_pct50)

# Generate a simple random sample of 100 rows, with seed 2022
attrition_srs100 = attrition.sample(n=100, random_state = 2022)

# Calculate the mean employee attrition in the sample
mean_attrition_srs100 = attrition_srs100['Attrition'].mean()

# Calculate the relative error percentage
rel_error_pct100 = 100 * abs(mean_attrition_pop - mean_attrition_srs100)/mean_attrition_pop

# Print rel_error_pct100
print(rel_error_pct100)
0.16122448979591836
62.78481012658227
6.962025316455695

3.3 Chapter 3.2: Creating a sampling distribution

We just saw how point estimates like the sample mean will vary depending on which rows end up in the sample.

Same code, different answer

For example, this same code to calculate the mean cup points from a simple random sample of thirty coffees gives a slightly different answer each time. Let’s try to visualize and quantify this variation.

Same code, 1000 times

A for loop lets us run the same code many times. It’s especially useful for situations like this where the result contains some randomness. We start by creating an empty list to store the means. Then, we set up the for loop to repeatedly sample 30 coffees from coffee_ratings a total of 1000 times, calculating the mean cup points each time. After each calculation, we append the result, also called a replicate, to the list. Each time the code is run, we get one sample mean, so running the code a thousand times generates a list of one thousand sample means.
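
A minimal sketch of that loop on the coffee data, again assuming a coffee_ratings DataFrame (the bin count in the histogram is an arbitrary choice):

import matplotlib.pyplot as plt

# 1000 replicates of the mean cup points from samples of size 30
mean_cup_points_1000 = []
for i in range(1000):
    mean_cup_points_1000.append(
        coffee_ratings.sample(n=30)['total_cup_points'].mean()
    )

print(mean_cup_points_1000[0:5])
plt.hist(mean_cup_points_1000, bins=30)
plt.show()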

Distribution of sample means for size 30

The one thousand sample means form a distribution of sample means. To visualize a distribution, the best plot is often a histogram. Here we can see that most of the results lie between eighty-one and eighty-three, and they roughly follow a bell-shaped curve, like a normal distribution. There’s an important piece of jargon we need to know here. A distribution of replicates of sample means, or other point estimates, is known as a sampling distribution.

Different sample sizes

Here are histograms from running the same code again with different sample sizes. When we decrease the original sample size of thirty to six, we can see from the x-values that the range of the results is broader. The bulk of the results now lie between eighty and eighty-four. On the other hand, increasing the sample size to one hundred and fifty results in a much narrower range. Now most of the results are between eighty-one-point-eight and eighty-two-point-six. As we saw previously, bigger sample sizes give us more accurate results. By replicating the sampling many times, as we’ve done here, we can quantify that accuracy.

3.4 Exercise 3.2.1

Replicating samples

When you calculate a point estimate such as a sample mean, the value you calculate depends on the rows that were included in the sample. That means that there is some randomness in the answer. In order to quantify the variation caused by this randomness, you can create many samples and calculate the sample mean (or another statistic) for each sample.

Instructions

  1. Replicate the provided code so that it runs 500 times. Assign the resulting list of sample means to mean_attritions.
  2. Draw a histogram of the mean_attritions list with 16 bins.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course arrays
attrition = pd.read_feather("datasets/attrition.feather")

# Create an empty list
mean_attritions = []
# Loop 500 times to create 500 sample means
for i in range(500):
    mean_attritions.append(
        attrition.sample(n=60)['Attrition'].mean()
    )
  
# Print out the first few entries of the list
print(mean_attritions[0:5])

# Create a histogram of the 500 sample means
plt.hist(mean_attritions, bins=16)
plt.show()
[0.16666666666666666, 0.18333333333333332, 0.1, 0.1, 0.1]

3.5 Chapter 3.3: Approximate sampling distributions

In the last exercise, we saw that while increasing the number of replicates didn’t affect the relative error of the sample means, it did result in a more consistent shape to the distribution.

4 dice

Let’s consider the case of four six-sided dice rolls. We can generate all possible combinations of rolls using the expand_grid function, which is defined in the pandas documentation and uses the itertools package. There are six to the power of four, or one thousand two hundred and ninety-six, possible dice roll combinations.
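A sketch of that setup is below; the expand_grid body is an assumption based on the itertools.product recipe from the pandas cookbook.

from itertools import product
import pandas as pd

# All combinations of the dictionary's values, one column per key
def expand_grid(dictionary):
    return pd.DataFrame(
        [row for row in product(*dictionary.values())],
        columns=dictionary.keys()
    )

dice = expand_grid(
    {'die1': range(1, 7),
     'die2': range(1, 7),
     'die3': range(1, 7),
     'die4': range(1, 7)}
)

print(len(dice))  # 6 ** 4 = 1296 combinations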

Mean roll

Let’s consider the mean of the four rolls by adding a column to our DataFrame called mean_roll. mean_roll ranges from 1, when four ones are rolled, to 6, when four sixes are rolled.

Exact sampling distribution

Since the mean roll takes discrete values instead of continuous values, the best way to see the distribution of mean_roll is to draw a bar plot. First, we convert mean_roll to a categorical by setting its type to category. We are interested in the counts of each value, so we use dot-value_counts, passing the sort equals False argument. This ensures the x-axis ranges from one to six instead of sorting the bars by frequency. Chaining .plot to value_counts, and setting kind to "bar", produces a bar plot of the mean roll distribution. This is the exact sampling distribution of the mean roll because it contains every single combination of die rolls.
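Continuing the dice sketch above, the mean roll and its exact sampling distribution could be produced like this:

import matplotlib.pyplot as plt

# Mean of the four rolls for every combination
dice['mean_roll'] = (dice['die1'] + dice['die2'] +
                     dice['die3'] + dice['die4']) / 4

# Convert to a categorical so the bars stay in order from 1 to 6
dice['mean_roll'] = dice['mean_roll'].astype('category')

# Exact sampling distribution of the mean roll
dice['mean_roll'].value_counts(sort=False).plot(kind='bar')
plt.show()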

The number of outcomes increases fast

If we increase the number of dice in our scenario, the number of possible outcomes increases by a factor of six each time. These values can be shown by creating a DataFrame with two columns: n_dice, ranging from 1 to 100, and n_outcomes, which is the number of possible outcomes, calculated using six to the power of the number of dice. With just one hundred dice, the number of outcomes is about the same as the number of atoms in the universe: six-point-five times ten to the seventy-seventh power. Long before you start dealing with big datasets, it becomes computationally impossible to calculate the exact sampling distribution. That means we need to rely on approximations.
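A sketch of that calculation, continuing the pandas import above (the DataFrame name is an assumption):

# Number of outcomes grows as 6 to the power of the number of dice
n_dice = list(range(1, 101))
outcomes = pd.DataFrame(
    {'n_dice': n_dice,
     'n_outcomes': [6 ** n for n in n_dice]}
)

print(outcomes.tail(1))  # roughly 6.5e77 outcomes for 100 dice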

Simulating the mean of four dice rolls

We can generate a sample mean of four dice rolls using NumPy’s random.choice method, specifying size as four. This will randomly choose values from a specified list, in this case, four values from the numbers one to six, which is created using a range from one to seven wrapped in the list function. Notice that we set replace equals True because we can roll the same number several times.

Simulating the mean of four dice rolls

Then we use a for loop to generate lots of sample means, in this case, one thousand. We again use the .append method to populate the sample means list with our simulated sample means. The output contains a sampling of many of the same values we saw with the exact sampling distribution.
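A sketch of the simulation (variable names assumed):

import numpy as np

# One sample mean: roll four dice with replacement and average them
four_rolls = np.random.choice(list(range(1, 7)), size=4, replace=True)
print(four_rolls.mean())

# Repeat 1000 times to build the approximate sampling distribution
sample_means_1000 = []
for i in range(1000):
    sample_means_1000.append(
        np.random.choice(list(range(1, 7)), size=4, replace=True).mean()
    )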

Approximate sampling distribution

Here’s a histogram of the approximate sampling distribution of mean rolls. This time, it uses the simulated rather than the exact values. It’s known as an approximate sampling distribution. Notice that although it isn’t perfect, it’s pretty close to the exact sampling distribution. Usually, we don’t have access to the whole population, so we can’t calculate the exact sampling distribution. However, we can feel relatively confident that using an approximation will provide a good guess as to how the sampling distribution will behave.

3.6 Exercise 3.3.1

Exact sampling distribution

To quantify how the point estimate (sample statistic) you are interested in varies, you need to know all the possible values it can take and how often. That is, you need to know its distribution.

The distribution of a sample statistic is called the sampling distribution. When we can calculate this exactly, rather than using an approximation, it is known as the exact sampling distribution.

Let’s take another look at the sampling distribution of dice rolls. This time, we’ll look at five eight-sided dice. (These have the numbers one to eight.)

Instructions

  1. Expand a grid representing 5 8-sided dice. That is, create a DataFrame with five columns from a dictionary, named die1 to die5. The rows should contain all possibilities for throwing five dice, each numbered 1 to 8.
  2. Add a column, mean_roll, to dice, that contains the mean of the five rolls as a categorical.
  3. Create a bar plot of the mean_roll categorical column, so it displays the count of each mean_roll in increasing order from 1.0 to 8.0.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Function to create a grid of all possible combinations
def expand_grid(dictionary):
    from itertools import product
    return pd.DataFrame([row for row in product(*dictionary.values())], columns=dictionary.keys())

# Expand a grid representing 5 8-sided dice
dice = expand_grid(
    {'die1': range(1, 9),
     'die2': range(1, 9),
     'die3': range(1, 9),
     'die4': range(1, 9),
     'die5': range(1, 9)}
)

# Print the result
print(dice)

# Add a column of mean rolls and convert to a categorical
dice['mean_roll'] = (dice['die1'] + dice['die2'] + dice['die3'] +
                     dice['die4'] + dice['die5']) / 5
dice['mean_roll'] = dice['mean_roll'].astype('category')

# Print result
print(dice)

# Draw a bar plot of mean_roll
dice['mean_roll'].value_counts(sort=False).plot(kind='bar')
plt.show()
       die1  die2  die3  die4  die5
0         1     1     1     1     1
1         1     1     1     1     2
2         1     1     1     1     3
3         1     1     1     1     4
4         1     1     1     1     5
...     ...   ...   ...   ...   ...
32763     8     8     8     8     4
32764     8     8     8     8     5
32765     8     8     8     8     6
32766     8     8     8     8     7
32767     8     8     8     8     8

[32768 rows x 5 columns]
       die1  die2  die3  die4  die5 mean_roll
0         1     1     1     1     1       1.0
1         1     1     1     1     2       1.2
2         1     1     1     1     3       1.4
3         1     1     1     1     4       1.6
4         1     1     1     1     5       1.8
...     ...   ...   ...   ...   ...       ...
32763     8     8     8     8     4       7.2
32764     8     8     8     8     5       7.4
32765     8     8     8     8     6       7.6
32766     8     8     8     8     7       7.8
32767     8     8     8     8     8       8.0

[32768 rows x 6 columns]

3.7 Exercise 3.3.2

Generating an approximate sampling distribution

Calculating the exact sampling distribution is only possible in very simple situations. With just five eight-sided dice, the number of possible rolls is 8**5, which is over thirty thousand. When the dataset is more complicated, for example, where a variable has hundreds or thousands of categories, the number of possible outcomes becomes too difficult to compute exactly.

In this situation, you can calculate an approximate sampling distribution by simulating the exact sampling distribution. That is, you can repeat a procedure over and over again to simulate both the sampling process and the sample statistic calculation process.

Instructions

  1. Sample one to eight, five times, with replacement. Assign to five_rolls.
  • Calculate the mean of five_rolls.
  2. Replicate the sampling code 1000 times, assigning each result to the list sample_means_1000.
  3. Plot sample_means_1000 as a histogram with 20 bins.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Sample one to eight, five times, with replacement
five_rolls = np.random.choice(list(range(1, 9)), size=5, replace=True)

# Print the mean of five_rolls
print(five_rolls.mean())

# Replicate the sampling code 1000 times
sample_means_1000 = []
for i in range(1000):
    sample_means_1000.append(
        np.random.choice(list(range(1, 9)), size=5, replace=True).mean()
    )
    
# Print the first 10 entries of the result
print(sample_means_1000[0:10])

# Draw a histogram of sample_means_1000 with 20 bins
plt.hist(sample_means_1000, bins=20)
plt.show()
3.4
[3.8, 3.6, 4.4, 3.6, 2.8, 4.6, 4.6, 4.6, 3.6, 5.4]

3.8 Chapter 3.4: Standard errors and the Central Limit Theorem

The Gaussian distribution (also known as the normal distribution) plays an important role in statistics. Its distinctive bell-shaped curve has been cropping up throughout this course.

Sampling distribution of mean cup points

Here are approximate sampling distributions of the mean cup points from the coffee dataset. Each histogram shows five thousand replicates, with different sample sizes in each case. Look at the x-axis labels. We already saw how increasing the sample size results in greater accuracy in our estimates of the population parameter, so the width of the distribution shrinks as the sample size increases. When the sample size is five, the x-axis ranges from seventy-six to eighty-six, whereas, for a sample size of three hundred and twenty, the range is from eighty-one-point-six to eighty-two-point-six. Now, look at the shape of each distribution. As the sample size increases, we can see that the shape of the curve gets closer and closer to being a normal distribution. At sample size five, the curve is only a very loose approximation since it isn’t very symmetric. By sample size eighty, it is a very reasonable approximation.

Consequences of the central limit theorem

What we just saw is, in essence, what the central limit theorem tells us. The means of independent samples have normal distributions. Then, as the sample size increases, we see two things. The distribution of these averages gets closer to being normal, and the width of this sampling distribution gets narrower.

Population & sampling distribution means

Recall the population parameter of the mean cup points. We’ve seen this calculation before, and its value is eighty-two-point-one-five. We can also calculate summary statistics on our sampling distributions to see how they compare. For each of our four sampling distributions, if we take the mean of our sample means, we can see that we get values that are pretty close to the population parameter that the sampling distributions are trying to estimate.

Population & sampling distribution standard deviations

Now let’s consider the standard deviation of the population cup points. It’s about two-point-seven. By comparison, if we take the standard deviation of the sample means from each of the sampling distributions using NumPy, we get much smaller numbers, and they decrease as the sample size increases. Note that when we are calculating a population standard deviation with pandas .std, we must specify ddof equals zero, as .std calculates a sample standard deviation by default. When we are calculating a standard deviation on a sample of the population using NumPy’s std function, like in these calculations on the sampling distribution, we must specify a ddof of one. So what are these smaller standard deviation values?
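As a rough sketch of those two calculations (assuming coffee_ratings and a list of sample means such as the mean_cup_points_1000 list from the earlier sketch):

import numpy as np

# Population standard deviation: ddof=0 with pandas .std
sd_pop = coffee_ratings['total_cup_points'].std(ddof=0)

# Standard deviation of a sampling distribution of sample means: ddof=1 with NumPy
sd_of_means_30 = np.std(mean_cup_points_1000, ddof=1)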

Population mean over square root sample size

One other consequence of the central limit theorem is that if we divide the population standard deviation, in this case around 2.7, by the square root of the sample size, we get an estimate of the standard deviation of the sampling distribution for that sample size. It isn’t exact because of the randomness involved in the sampling process, but it’s pretty close.
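Continuing the names above, that relationship looks like this:

# Central limit theorem approximation: population sd / sqrt(sample size)
# is close to the standard deviation of the sampling distribution
approx_sd_of_means_30 = sd_pop / np.sqrt(30)
print(approx_sd_of_means_30, sd_of_means_30)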

Standard error

We just saw the impact of the sample size on the standard deviation of the sampling distribution. This standard deviation of the sampling distribution has a special name: the standard error. It is useful in a variety of contexts, from estimating population standard deviation to setting expectations on what level of variability we would expect from the sampling process.

3.9 Exercise 3.4.1

Population & sampling distribution means

One of the useful features of sampling distributions is that you can quantify them. Specifically, you can calculate summary statistics on them. Here, you’ll look at the relationship between the mean of the sampling distribution and the population parameter’s mean.

Three sampling distributions are provided. For each, the employee attrition dataset was sampled using simple random sampling, then the mean attrition was calculated. This was done 1000 times to get a sampling distribution of mean attritions. One sampling distribution used a sample size of 5 for each replicate, one used 50, and one used 500.

Instructions

  1. Calculate the mean of sampling_distribution_5, sampling_distribution_50, and sampling_distribution_500 (a mean of sample means).
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course arrays
attrition = pd.read_feather("datasets/attrition.feather")

# Set a seed for reproducibility
random_seed = 2021

# Create three empty lists to hold the sampling distributions
sampling_distribution_5 = []   # Sample size of 5
sampling_distribution_50 = []  # Sample size of 50
sampling_distribution_500 = [] # Sample size of 500

# Perform simple random sampling and calculate mean attrition 1000 times for each sample size
for i in range(1000):
    # Sample size = 5
    sampling_distribution_5.append(
        attrition.sample(n=5, random_state=random_seed + i)['Attrition'].mean()
    )

    # Sample size = 50
    sampling_distribution_50.append(
        attrition.sample(n=50, random_state=random_seed + i)['Attrition'].mean()
    )

    # Sample size = 500
    sampling_distribution_500.append(
        attrition.sample(n=500, random_state=random_seed + i)['Attrition'].mean()
    )

# Optional: Convert the sampling distributions to DataFrame for analysis
sampling_df = pd.DataFrame({
    'Sample_Size_5': sampling_distribution_5,
    'Sample_Size_50': sampling_distribution_50,
    'Sample_Size_500': sampling_distribution_500
})

# Calculate the mean of the mean attritions for each sampling distribution
mean_of_means_5 = np.mean(sampling_distribution_5)
mean_of_means_50 = np.mean(sampling_distribution_50)
mean_of_means_500 = np.mean(sampling_distribution_500)

# Print the results
print(mean_of_means_5)
print(mean_of_means_50)
print(mean_of_means_500)
0.155
0.15998
0.160622
Note

Even for small sample sizes, the mean of the sampling distribution is a good approximation of the population mean.

3.10 Exercise 3.4.2

Population & sampling distribution variation

You just calculated the mean of the sampling distribution and saw how it is an estimate of the corresponding population parameter. Similarly, as a result of the central limit theorem, the standard deviation of the sampling distribution has an interesting relationship with the population parameter’s standard deviation and the sample size.

Instructions

  1. Calculate the standard deviation of sampling_distribution_5, sampling_distribution_50, and sampling_distribution_500 (a standard deviation of sample means).
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course arrays
attrition = pd.read_feather("datasets/attrition.feather")

# Set a seed for reproducibility
random_seed = 2021

# Create three empty lists to hold the sampling distributions
sampling_distribution_5 = []   # Sample size of 5
sampling_distribution_50 = []  # Sample size of 50
sampling_distribution_500 = [] # Sample size of 500

# Perform simple random sampling and calculate mean attrition 1000 times for each sample size
for i in range(1000):
    # Sample size = 5
    sampling_distribution_5.append(
        attrition.sample(n=5, random_state=random_seed + i)['Attrition'].mean()
    )

    # Sample size = 50
    sampling_distribution_50.append(
        attrition.sample(n=50, random_state=random_seed + i)['Attrition'].mean()
    )

    # Sample size = 500
    sampling_distribution_500.append(
        attrition.sample(n=500, random_state=random_seed + i)['Attrition'].mean()
    )

# Optional: Convert the sampling distributions to DataFrame for analysis
sampling_df = pd.DataFrame({
    'Sample_Size_5': sampling_distribution_5,
    'Sample_Size_50': sampling_distribution_50,
    'Sample_Size_500': sampling_distribution_500
})

# Calculate the std. dev. of the mean attritions for each sampling distribution
sd_of_means_5 = np.std(sampling_distribution_5, ddof = 1)
sd_of_means_50 = np.std(sampling_distribution_50, ddof = 1)
sd_of_means_500 = np.std(sampling_distribution_500, ddof = 1)

# Print the results
print(sd_of_means_5)
print(sd_of_means_50)
print(sd_of_means_500)
0.15244093360458746
0.04970785119546479
0.014243454356018837
Note

The amount of variation in the sampling distribution is related to the amount of variation in the population and the sample size. This is another consequence of the Central Limit Theorem.

4 CHAPTER 4: Bootstrap Distributions

You’ll get to grips with resampling to perform bootstrapping and estimate variation in an unknown population. You’ll learn the difference between sampling distributions and bootstrap distributions using resampling.

4.1 Chapter 4.1: Introduction to bootstrapping

So far, we’ve mostly focused on the idea of sampling without replacement.

With or without

Sampling without replacement is like dealing a pack of cards. When we deal the ace of spades to one player, we can’t then deal the ace of spades to another player. Sampling with replacement is like rolling dice. If we roll a six, we can still get a six on the next roll. Sampling with replacement is sometimes called resampling. We’ll use the terms interchangeably.

Simple random sampling without replacement

If we take a simple random sample without replacement, each row of the dataset, or each type of coffee, can only appear once in the sample.

Simple random sampling with replacement

If we sample with replacement, it means that each row of the dataset, or each coffee, can be sampled multiple times.

Why sample with replacement?

So far, we’ve been treating the coffee_ratings dataset as the population of all coffees. Of course, it doesn’t include every coffee in the world, so we could treat the coffee dataset as just being a big sample of coffees. To imagine what the whole population is like, we need to approximate the other coffees that aren’t in the dataset. Each of the coffees in the sample dataset will have properties that are representative of the coffees that we don’t have. Resampling lets us use the existing coffees to approximate those other theoretical coffees.

Coffee data preparation

To keep it simple, let’s focus on three columns of the coffee dataset. To make it easier to see which rows ended up in the sample, we’ll add a row index column called index using the reset_index method.

Resampling with .sample()

To sample with replacement, we call sample as usual but set the replace argument to True. Setting frac to 1 produces a sample of the same size as the original dataset.
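A minimal sketch of the preparation and the resample (column names are assumed to match the coffee dataset used in the course):

# Keep three columns and add a row index column called 'index'
coffee_focus = coffee_ratings[['variety', 'country_of_origin', 'flavor']]
coffee_focus = coffee_focus.reset_index()

# Resample with replacement, same number of rows as the original
coffee_resamp = coffee_focus.sample(frac=1, replace=True)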

Repeated coffees

Counting the values of the index column shows how many times each coffee ended up in the resampled dataset. Some coffees were sampled four or five times.

Missing coffees

That means that some coffees didn’t end up in the resample. By taking the number of distinct index values in the resampled dataset, using len on drop_duplicates, we see that eight hundred and sixty-eight different coffees were included. By comparing this number with the total number of coffees, we can see that four hundred and seventy coffees weren’t included in the resample.
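Continuing that sketch, the repeated and missing coffees could be counted like this:

# How many times each coffee appears in the resample
print(coffee_resamp['index'].value_counts())

# Distinct coffees included vs. coffees left out of the resample
n_unique = len(coffee_resamp.drop_duplicates(subset='index'))
print(n_unique)
print(len(coffee_focus) - n_unique)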

Bootstrapping

We’re going to use resampling for a technique called bootstrapping. In some sense, bootstrapping is the opposite of sampling from a population. With sampling, we treat the dataset as the population and move to a smaller sample. With bootstrapping, we treat the dataset as a sample and use it to build up a theoretical population. A use case of bootstrapping is to try to understand the variability due to sampling. This is important in cases where we aren’t able to sample the population multiple times to create a sampling distribution.

Bootstrapping process

The bootstrapping process has three steps. First, randomly sample with replacement to get a resample the same size as the original dataset. Then, calculate a statistic, such as a mean of one of the columns. Note that the mean isn’t the only choice here; bootstrapping allows more complex statistics to be computed, too. Then, replicate this many times to get lots of these bootstrap statistics. Earlier in the course, we did something similar: we took a simple random sample, then calculated a summary statistic, then repeated those two steps to form a sampling distribution. This time, because we’ve used resampling instead of sampling, we get a bootstrap distribution.

Bootstrapping coffee mean flavor

The resampling step uses the code we just saw: calling sample with frac set to one and replace set to True. Calculating a bootstrap statistic can be done with mean from NumPy. In this case, we’re calculating the mean flavor score. To repeat steps one and two one thousand times, we can wrap the code in a for loop and append the statistics to a list.
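A sketch of those steps, continuing the coffee_focus names above (the list name is an assumption):

import numpy as np
import matplotlib.pyplot as plt

mean_flavors_1000 = []
for i in range(1000):
    mean_flavors_1000.append(
        # Step 1: resample with replacement; step 2: calculate the statistic
        np.mean(coffee_focus.sample(frac=1, replace=True)['flavor'])
    )

# Repeating steps 1 and 2 gives the bootstrap distribution
plt.hist(mean_flavors_1000)
plt.show()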

Bootstrap distribution histogram

Here’s a histogram of the bootstrap distribution of the sample mean. Notice that it is close to following a normal distribution.

4.2 Exercise 4.1.1

Generating a bootstrap distribution

The process for generating a bootstrap distribution is similar to the process for generating a sampling distribution; only the first step is different.

To make a sampling distribution, you start with the population and sample without replacement. To make a bootstrap distribution, you start with a sample and sample that with replacement. After that, the steps are the same: calculate the summary statistic that you are interested in on that sample/resample, then replicate the process many times. In each case, you can visualize the distribution with a histogram.

Here, spotify_sample is a subset of the spotify dataset. To make it easier to see how resampling works, a row index column called 'index' has been added, and only the artist name, song name, and danceability columns have been included.

Instructions

  1. Generate a single bootstrap resample from spotify_sample.
  2. Calculate the mean of the danceability column of spotify_1_resample using numpy.
  3. Replicate the expression provided 1000 times.
  4. Create a bootstrap distribution by drawing a histogram of mean_danceability_1000.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course array
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")

# Subset of spotify sample to use
spotify_sample = spotify.sample(n=41656)[['artists', 'name', 'danceability']]
spotify_sample['index'] = spotify_sample.index

# Reorder columns to make 'index' the first column
spotify_sample = spotify_sample[['index', 'artists', 'name', 'danceability']]

# Generate 1 bootstrap resample
spotify_1_resample = spotify_sample.sample(frac=1, replace = True)

# Print the resample
print(spotify_1_resample)

# Calculate the mean of the danceability column of spotify_1_resample
mean_danceability_1 = np.mean(spotify_1_resample['danceability'])

# Print the result
print(mean_danceability_1)

# Replicate this 1000 times
mean_danceability_1000 = []
for i in range(1000):
    mean_danceability_1000.append(
        np.mean(spotify_sample.sample(frac=1, replace=True)['danceability'])
    )
  
# Print the result
print(mean_danceability_1000)

# Draw a histogram of the resample means
plt.hist(mean_danceability_1000)
plt.show()
       index                                   artists  \
40112  40112                  ['Peter, Paul and Mary']   
13844  13844                             ['Cat Power']   
19675  19675                                ['Thalía']   
7595    7595                              ['¡MAYDAY!']   
29531  29531                               ['50 Cent']   
...      ...                                       ...   
1695    1695  ['Nevada', 'Mark Morrison', 'Fetty Wap']   
24818  24818                        ['Eli Young Band']   
30099  30099                     ['Portugal. The Man']   
14328  14328                            ['Dan + Shay']   
3933    3933                           ['beabadoobee']   

                                    name  danceability  
40112  If I Had a Hammer - 2004 Remaster         0.628  
13844                          Manhattan         0.711  
19675  El Próximo Viernes - Live Version         0.755  
7595                            Badlands         0.730  
29531                               Heat         0.519  
...                                  ...           ...  
1695                            The Mack         0.711  
24818                         Love Ain't         0.600  
30099                         People Say         0.543  
14328                When I Pray for You         0.382  
3933                      If You Want To         0.666  

[41656 rows x 4 columns]
0.591741393796812
[0.5926616429806031, 0.5912292370846938, 0.5909229594776263, 0.5927271509506433, 0.5903040522373727, 0.5904833997503361, 0.5917418835221816, 0.5922994814672555, 0.5914843479930862, 0.5911015628000768, 0.5910246807182639, 0.5908936575763394, 0.5916269060879585, 0.5900065416746687, 0.5917592447666603, 0.5908108027655079, 0.5914857355483004, 0.5907100729786825, 0.5920515892068371, 0.5908976978106396, 0.5915547940272711, 0.5913858267716536, 0.5902494694641828, 0.5917825787401575, 0.5907649030151719, 0.5898005377376608, 0.5917094008066064, 0.591363736316497, 0.5930069281736124, 0.5914717639715767, 0.5915227866333782, 0.5912849193393509, 0.5916315728826579, 0.5912630929517957, 0.5917036537353564, 0.5913172484155945, 0.5920614677357404, 0.5918894973113117, 0.5914229018628769, 0.5909182566737086, 0.5912621135010563, 0.5932023886114846, 0.5913446466295372, 0.5914125432110621, 0.5915523838102553, 0.5906654263491453, 0.5916980434991357, 0.5903723353178414, 0.59163305406184, 0.5910800532936432, 0.5913561815824852, 0.5914952539850202, 0.5909793667178798, 0.5905758066064912, 0.5919920131553678, 0.5912781039946227, 0.5909638107355483, 0.5910683022853851, 0.5915094992318033, 0.5922963318609564, 0.591612788073747, 0.5917137915306319, 0.5907620150758595, 0.5920314144420973, 0.5923453164009986, 0.5909815032648358, 0.5905275854618782, 0.592712125504129, 0.5911578740157479, 0.5902561911849433, 0.591076615613597, 0.5912477698290762, 0.5928863140964086, 0.5911237876896486, 0.5917720736508546, 0.5907518172652199, 0.59126746207029, 0.5922028231227194, 0.5923195097945074, 0.5893636618974458, 0.5906233147685808, 0.589327114941425, 0.5907251656424046, 0.5908303245630883, 0.5918481035144997, 0.5904468071826388, 0.5913788313808335, 0.5913491717879777, 0.5928264739773381, 0.5911507898021894, 0.5917474337430383, 0.5907791674668716, 0.592098578836182, 0.5923731371231036, 0.5905206956980987, 0.5922919579412329, 0.5906497695410025, 0.592824565488765, 0.5894132273862108, 0.5910890892068369, 0.589665411945458, 0.590928648934127, 0.5908854858843864, 0.5906148814096408, 0.5911096168619167, 0.5914759530439793, 0.591913573074707, 0.5915902895141156, 0.5919703692145188, 0.5921623943729595, 0.5893560759554445, 0.5918974289418091, 0.5922629489149223, 0.5911910841175341, 0.5905550965047052, 0.5914129369118495, 0.5908735068177454, 0.5919226618014212, 0.5896804277895141, 0.5918910745150758, 0.5909926445169964, 0.5911779095448436, 0.5917400518532745, 0.5915469128096793, 0.5913175364893414, 0.5905761234876128, 0.5907693513539466, 0.5906233627808719, 0.5912526574803149, 0.5927125000000001, 0.5911204172268102, 0.591540640003841, 0.5917459309583253, 0.5929342711734203, 0.5907872023237949, 0.5913667538889956, 0.5913613477050124, 0.591543861628577, 0.5911535024966391, 0.5916594920299596, 0.5917597753024774, 0.5915537425580949, 0.5913673348377184, 0.5902052669483386, 0.5921307902823123, 0.5922133234107931, 0.59176450691377, 0.5908498079508354, 0.5921899246207029, 0.5902212502400616, 0.591075211254081, 0.5908238285000962, 0.5911779167466872, 0.591597575379297, 0.5911855266948339, 0.591486897445746, 0.5907305646245438, 0.5904625072018437, 0.5910472945073939, 0.5932251968503938, 0.5922508738236989, 0.5926040690416747, 0.5928815320722105, 0.592704957269061, 0.5919456188784329, 0.5913617774150182, 0.5920723593239869, 0.5910119646629537, 0.5911351858075667, 0.5908025206452852, 0.5905718023814096, 0.5913644805070098, 0.5918884698482811, 0.5920818753600922, 0.5896607163433839, 0.5931012171115806, 0.5906362204724409, 0.5907316857115422, 
0.5917356923372383, 0.589505547820242, 0.5899278327251776, 0.590222049644709, 0.5902936095640483, 0.5908735500288073, 0.590986719800269, 0.5914564504513155, 0.5898611220472441, 0.5917383714230843, 0.5906654359516036, 0.5919472536969463, 0.5912775494526599, 0.5910284760898791, 0.5934322474553485, 0.5896524822354522, 0.5910561695794123, 0.5908920875744191, 0.5902741117726137, 0.5905362276742847, 0.590882072210486, 0.590599875168043, 0.5908478994622623, 0.5913667442865375, 0.5918546980026886, 0.5904053701747647, 0.5917759626464375, 0.5895187800076819, 0.5908222512963319, 0.5918495222777032, 0.5900889595736508, 0.5915296956020741, 0.5913352434223161, 0.5898937079892452, 0.5923311407720376, 0.5915862036681391, 0.5940695506049549, 0.5917825115229497, 0.5930165882465911, 0.5913268460725946, 0.5910937223929326, 0.5900466463414634, 0.5914370534856923, 0.5902467711734204, 0.5910614149222201, 0.5923320698098714, 0.5910843311887843, 0.5907899102170155, 0.5909062920107548, 0.592194761859036, 0.5913332365085462, 0.591112581620895, 0.5909766900326484, 0.5915309703284041, 0.590623329172268, 0.5901133426157096, 0.5921993206260803, 0.5908828812175917, 0.5899316689072402, 0.5901478538505858, 0.5905907672364126, 0.5902499351834068, 0.5930716007297867, 0.5911384122335318, 0.5904675052813521, 0.5917777270981371, 0.5917923180334165, 0.5915852050124832, 0.5902159376800461, 0.590632266660265, 0.5921240565584789, 0.5914197426541195, 0.5921987540810447, 0.5910413049740735, 0.5910402102938352, 0.5912252400614557, 0.592017966199347, 0.5932416074515076, 0.5914617077971961, 0.5900291458613406, 0.5921380521413483, 0.5920603274438255, 0.591674582293067, 0.5902335293835222, 0.5910828188016133, 0.5902600609756098, 0.5917250552141349, 0.59129137699251, 0.5906637987324755, 0.5911635226618014, 0.5929964830996735, 0.5919752664682159, 0.5907731539274055, 0.5931615661609372, 0.5916405919915498, 0.5921774246207028, 0.5912177573458806, 0.5899589518916841, 0.5922920563664298, 0.5903987444785865, 0.5900557206644902, 0.5916777799116574, 0.5904387555214134, 0.591779863645093, 0.5909780295755712, 0.5911936599769541, 0.5912351906087958, 0.5913116093719992, 0.5907024846360668, 0.59152518964855, 0.5913170131553677, 0.5928027318993663, 0.5910584045515651, 0.5906748223545227, 0.5914975537737661, 0.5910603538505858, 0.5910368038217784, 0.5906125840215095, 0.590829880449395, 0.5898112324755137, 0.5915295947762628, 0.5911917658920683, 0.5905858747839446, 0.5899634506433646, 0.5914477866333782, 0.59188657816401, 0.5911678725753793, 0.5908139523718072, 0.5912910673132321, 0.5905924116573842, 0.5919264043595159, 0.5914105771077396, 0.5909259602458229, 0.5907988453043979, 0.5905330948722872, 0.59271008738237, 0.591034023910121, 0.5906882922028039, 0.5912970016324179, 0.5905414826195505, 0.5907892524486268, 0.5912277415018246, 0.5926230146917612, 0.5910144973113118, 0.5903959333589398, 0.5893279575571346, 0.5915921884002304, 0.5923037113501056, 0.590284820914154, 0.5921825979450739, 0.5918227986364508, 0.5920301829268293, 0.5909208637411179, 0.5921829940464759, 0.5918935639523717, 0.5918183550989052, 0.5911464686959861, 0.5909657384290378, 0.5914849025350489, 0.5918623727674285, 0.5912250480122913, 0.5908601930094104, 0.5905958829460342, 0.5919878240829652, 0.5903600825811407, 0.5906465767236413, 0.5911581284808911, 0.5908419531400039, 0.5926030535817169, 0.592132223449203, 0.5909092639715767, 0.5912387387171115, 0.591160910793163, 0.5905455972729019, 0.5913859876128289, 0.5898515772037642, 0.5914172652198963, 0.5921838222584981, 0.5907164346072594, 
0.5919652559055117, 0.5926609516036105, 0.5915412833685423, 0.5915394156904167, 0.5920688088150565, 0.5915686503744958, 0.5892230723065105, 0.590730298156328, 0.591971274246207, 0.5912631577683887, 0.5903619286537355, 0.5921993158248512, 0.5902068225465719, 0.5924656111964663, 0.5905889667754946, 0.5911808935087381, 0.592788443441521, 0.5904766564240446, 0.5916002664682158, 0.5903410024966391, 0.5911174692721337, 0.5915942409256769, 0.5931965839254849, 0.5914087310351451, 0.5903466655463798, 0.592073655655848, 0.5911923564432494, 0.5902879849241406, 0.5907995486844633, 0.5908140579988478, 0.5886947066449011, 0.5899172988285001, 0.5924709597657001, 0.5896644829076244, 0.5909232235452276, 0.5915582629153064, 0.5907816785096984, 0.5916415522373727, 0.5916577971960822, 0.5901446754369118, 0.5908445482043404, 0.5906149030151719, 0.59106637699251, 0.5916466607451507, 0.5913610476281927, 0.5910194905895909, 0.5902812367966198, 0.5906482667562896, 0.5905427309391204, 0.5909240685615518, 0.5900334357595544, 0.5922640603994622, 0.5923900470520453, 0.5896494166506625, 0.5919552045323603, 0.5905567817361245, 0.5910871062992125, 0.5910113549068562, 0.5905677045323603, 0.5903230939120415, 0.591066084117534, 0.5911983075667371, 0.5902467423660457, 0.5925808815056655, 0.5926247215287114, 0.591081844152103, 0.5904435183406952, 0.5909309271173421, 0.5910111628576916, 0.591658959093528, 0.5913717087574419, 0.5893621975225658, 0.5919720304397926, 0.5911797340119072, 0.5917493854426733, 0.591742798156328, 0.592457098617246, 0.5909810663529864, 0.5907371543115038, 0.5915628480891109, 0.5909732787593625, 0.5899263899558288, 0.5906783080468601, 0.5898577995966968, 0.5905892812560015, 0.5913739485308239, 0.5897054181870559, 0.5900391420203572, 0.5918797436143652, 0.5905917202803919, 0.5914559175148839, 0.5914782360284233, 0.5918666002496639, 0.5913876776454772, 0.5911210821970425, 0.5926675244862685, 0.591414446898406, 0.5904670899750336, 0.5895828620126752, 0.5911828164009987, 0.592114430094104, 0.5916178653735357, 0.5905210005761475, 0.5905137051085078, 0.5918047100057615, 0.5900443921643941, 0.5887437680046091, 0.590961599769541, 0.5896375528135203, 0.5915577083733435, 0.590345834933743, 0.5908644613020934, 0.5911053725753793, 0.5920369550604955, 0.5919685303437681, 0.5916705276550798, 0.5923448482811599, 0.5912085581908968, 0.5910077083733435, 0.5927318969656232, 0.5903771509506434, 0.5904429950067216, 0.5922689720568466, 0.5908619958709429, 0.5903033056462454, 0.5912089975033609, 0.5905264451699636, 0.5922565752832726, 0.5922172436143652, 0.589337456788938, 0.5918172244094487, 0.5918977098137124, 0.5896244766660265, 0.5896562752064528, 0.591152391012099, 0.5900982979642788, 0.5913992342039563, 0.5900886090839256, 0.5899317961398118, 0.5908641900326483, 0.5903901310735548, 0.5911480194929902, 0.5931159568849625, 0.5911266228154408, 0.5906838774726331, 0.5916445890147879, 0.5915871999231804, 0.5910617414057998, 0.5921722512963318, 0.5914262747263299, 0.5895554613981179, 0.5908564768580757, 0.5906258090071058, 0.5925458613405032, 0.5915739197234492, 0.5906524486268484, 0.5903650302477434, 0.5904123823698867, 0.5917875840215095, 0.5911104810831572, 0.5915711710197811, 0.5935077731899365, 0.590618026214711, 0.5917633666218552, 0.5915322258498176, 0.5907491717879777, 0.5919432950835414, 0.5927266204148262, 0.5917016924332629, 0.5914483819857884, 0.5914492630113309, 0.5924000096024582, 0.5913834309583254, 0.5897857427501442, 0.5905894348953332, 0.591994555406184, 0.5917281712118301, 0.591571567121183, 
0.5921743182254657, 0.5901465575187248, 0.5909690104666796, 0.5907872695410025, 0.5909780343768005, 0.5916456740925676, 0.5917466583445362, 0.590724289418091, 0.5896283032456309, 0.5906449419051277, 0.5918511162857693, 0.5897077323794891, 0.5920285865181486, 0.5909270789322066, 0.5910364557326675, 0.5909983219704243, 0.5918327059727291, 0.5910312704052237, 0.5907272733819857, 0.5929490061455732, 0.5912110596312656, 0.5925778567313232, 0.5907013395429229, 0.5913772229690801, 0.5901988429037833, 0.5913894541002496, 0.5913386402919147, 0.5910812680046094, 0.5913870678893796, 0.5917711950259266, 0.5917978274438256, 0.5921003432878817, 0.5925698458805454, 0.5905424884770502, 0.5911803869790667, 0.5910578764163626, 0.5913813736316498, 0.5913344152102938, 0.5920792898982139, 0.590987202323795, 0.5912458925484926, 0.5920510562704051, 0.5924995510850778, 0.5907250864221241, 0.5905292922988286, 0.5912207773189936, 0.5913054782024199, 0.5925251632417899, 0.5905603394468985, 0.5899143172652199, 0.5923736700595352, 0.5917944377760708, 0.5916957893220665, 0.5909236508546188, 0.5926740637603227, 0.5915126344344152, 0.5901454988477051, 0.5908931174380642, 0.5915016348185136, 0.5915574587094296, 0.5906221336662186, 0.5904921067793355, 0.5923366573842903, 0.5914849409448819, 0.593137228730555, 0.5931890387939313, 0.5914667826963702, 0.5908760010562705, 0.5911016492222009, 0.5910045995774919, 0.591552304589975, 0.5914536825427309, 0.5923033200499328, 0.5917603058382945, 0.5901439840599194, 0.5918508018052622, 0.5924167490877664, 0.591587982523526, 0.5922674548684463, 0.5922550845016323, 0.5929316064912619, 0.5919387819281736, 0.5919274558286921, 0.5901350633762243, 0.5903831236796621, 0.5908995630881505, 0.5912228082389092, 0.5912214758978299, 0.5918459141540234, 0.5894999951987709, 0.591283704628385, 0.5910272613789129, 0.589813465047052, 0.5903519180910313, 0.590896980026887, 0.5917108843864031, 0.5907662401574804, 0.5910690536777414, 0.5914731443249472, 0.5894100345688496, 0.5905653183214904, 0.5909030607835606, 0.5896165234299982, 0.5918423252352601, 0.5900764571730364, 0.5912617125984252, 0.5924135418667179, 0.5907782408296524, 0.5921932734780103, 0.5925110260226617, 0.5903616381793738, 0.590628346456693, 0.5913857451507587, 0.5911528063184175, 0.5902470352410217, 0.5894106707317073, 0.5909741165738429, 0.5907371807182639, 0.5908601738044941, 0.5918326579604378, 0.5914354642788555, 0.5902412641636259, 0.5931608147685807, 0.5910897805838294, 0.5903462214326868, 0.5920667010754753, 0.5915275590551181, 0.5902948122719416, 0.5902076267524486, 0.5923443945650086, 0.5912719128096793, 0.590834794507394, 0.5920066857115421, 0.5914438688304205, 0.5917880761474938, 0.5918856323218744, 0.589066230555022, 0.5912582413097754, 0.5920773405991934, 0.5903263011330901, 0.5913766372191281, 0.592531219992318, 0.5918475753792971, 0.5914500384098329, 0.5911696274246206, 0.5935998847705012, 0.5926084957749184, 0.5915739365277511, 0.5918752880737469, 0.5904889451699634, 0.5912049764739773, 0.5904391780295755, 0.5927635202611868, 0.5907452299788746, 0.5916699923180334, 0.5920747335317841, 0.5915456980987133, 0.5896649750336086, 0.5914672268100634, 0.5900466895525255, 0.5933143484732091, 0.5910936599769541, 0.5922225537737661, 0.5927912401574803, 0.5894851329940465, 0.5908105867101978, 0.5919073026694833, 0.5908564480507009, 0.592417337238333, 0.5911806654503552, 0.5909675028807374, 0.5907690656808143, 0.5907764331668908, 0.592030238140964, 0.5921125192049165, 0.5900756097560975, 0.5904101762051086, 0.5910435111388516, 
0.5908461230074898, 0.5920875432110619, 0.590618888035337, 0.5909370150758594, 0.5911077467831765, 0.5931547220088342, 0.5909817097176878, 0.5913940296715959, 0.5886069041674669, 0.5914294051277127, 0.5906930190128672, 0.5915716511426925, 0.5901891204148262, 0.5931667322834646, 0.5924378240829653, 0.5907444185711541, 0.5901002256577683, 0.591022028039178, 0.5901844272133666, 0.5911939000384099, 0.5908411297292107, 0.5922902895141157, 0.5920179565968888, 0.5914320001920491, 0.592058562992126, 0.5912873175532937, 0.5913952083733437, 0.5917587238333013, 0.5919691425004802, 0.5922314216439408, 0.5903801397157673, 0.5904707677165355, 0.5911409496831189, 0.5901993014211638, 0.5915161633378145, 0.5922194161705397, 0.5909040354330709, 0.5916686575763396, 0.5920118446322258, 0.5905843191857115, 0.5911155535817169, 0.5918460797964279, 0.5914337238333014, 0.5915415354330709, 0.5906859732091415, 0.5915891180142117, 0.5911964566929134, 0.5915119022469751, 0.5905008306126368, 0.5922994118494336, 0.5895777102938352, 0.5906694233723834, 0.590757960437872, 0.5904870750912233, 0.5917914298060303, 0.5903014307662762, 0.5909345616477819, 0.5919178077587863, 0.5920319929902055, 0.5910793595160361, 0.5895955348569233, 0.5907411081236796, 0.5901596432686766, 0.5919464230843097, 0.5905677333397351, 0.5929830972729019, 0.5919192577299789, 0.5911707437103898, 0.5908752400614558, 0.5909162137507202, 0.5914774582293066, 0.592510048972537, 0.5913340215095064, 0.5908185063376225, 0.5910790762435183, 0.5919565248703669, 0.5907861316497024, 0.5914483603802574, 0.5903393676781256, 0.5917439816593048, 0.5916352770309199, 0.5918654527559055, 0.5920566881121567, 0.591219853082389, 0.5915324683118879, 0.5919101449971192, 0.5926820794123296, 0.591250753792971, 0.5919444257729979, 0.589140471960822, 0.5904931270405224, 0.5905940512771269, 0.5919322210485883, 0.5908368422316113, 0.591454587574419, 0.5907810183406953, 0.5919966631457653, 0.591575878624928, 0.5908360980410985, 0.5910640147877857, 0.5913534472825043, 0.5915799380641446, 0.5908325691376992, 0.5902554469944306, 0.591631361628577, 0.5922005449395046, 0.5912278807374688, 0.5916780727866333, 0.5905427525446514, 0.5912972657000192, 0.59099926781256, 0.5913474865565584, 0.5910899846360669, 0.5936998391588246, 0.5908733435759554, 0.5906226497983483, 0.59051488621087, 0.5920130329364317, 0.5888686767812561, 0.591043866429806, 0.5910991165738428, 0.5922102482235453, 0.5914233219704244, 0.5916931198386787, 0.5917946394276936, 0.5913848569233724, 0.5902227602266181, 0.591087814480507, 0.5904899150182447, 0.5903164178029576, 0.5928820842135586, 0.5891299716727483, 0.591887348761283, 0.5917103250432111, 0.5915796307854811, 0.5922009914538122, 0.5917082797196082, 0.5917062608027656, 0.5922855074899174, 0.5918847873055503, 0.592393340695218, 0.5923265603994622, 0.591542565296716, 0.5907780511811024, 0.590535867582101, 0.5894348425196851, 0.5913348305166123, 0.5930217111580566, 0.590582516324179, 0.5904781616093719, 0.5919792586902247, 0.5907385250624161, 0.5907233651814865, 0.5923417490877665, 0.5904190272709814, 0.592173845304398, 0.590197467351642, 0.5909474361436527, 0.5910543427117342, 0.5902070962166315, 0.5907220736508547, 0.5921677909544844, 0.5930224601497984, 0.5914549812752063, 0.59139707125024, 0.5923126728442482, 0.5902207797196083, 0.590962651238717, 0.5904989437295947, 0.5918422364125216, 0.591248622047244, 0.5891576651622816, 0.5905306678509699, 0.5905225993854427, 0.5906712934511235, 0.5912723041098522, 0.5910382009794508, 0.5903470904551565, 0.5914191521029384, 
0.590758313328212, 0.5911435783560592, 0.5924365661609372, 0.5911546139811792, 0.5908434415210294, 0.5915033896677551, 0.5918223953332052, 0.5913173900518532, 0.5910597729018628, 0.5914045299596696, 0.59166672988285, 0.5919528879393124, 0.5914732379489149, 0.5902820001920491, 0.5910820265988093, 0.5922341559439217, 0.5913152150950645, 0.5905537281544077, 0.5912655031688113, 0.5914478730555023, 0.592068215863261, 0.5909442385250624, 0.5914872863453045, 0.5907959477626273, 0.5911647349721528, 0.5919397805838295, 0.5923229378720953, 0.5918574010946802, 0.5931670299596697, 0.5920471840791242, 0.5915165666410601, 0.5913484852122143, 0.5934646965623199, 0.5929283440560783, 0.5901001200307279, 0.5907375192049165, 0.5908793307086615, 0.5913857307470712, 0.5896550748991741, 0.5912437896101402, 0.5920363140964088, 0.5901828644132898, 0.5901846504705205, 0.5913092975801805, 0.5902259842519685, 0.592349577491838, 0.5905967351642021, 0.5903985164202036, 0.5898652487036681, 0.5920668667178797, 0.5904112732859612, 0.5909481851353947, 0.5909578164009986, 0.5906385706740925, 0.5915187487996927, 0.5911055694257729, 0.5900728370462838, 0.5911598737276744, 0.58958109996159, 0.5920260394661033, 0.5912480531015939, 0.5907358699827155, 0.5918171283848664, 0.591998449202996, 0.5909410241021702, 0.5910068849625505, 0.5913388347416939, 0.5915792754945265, 0.5912604354714807, 0.5911615325523334, 0.5915866669867487, 0.5914991117726138, 0.5895093888035338, 0.5908362348761282, 0.5901038961974265, 0.5917853346456694, 0.5895735980410987, 0.590800021605531, 0.5918660721144613, 0.5917803221624736, 0.5912428173612445, 0.5911495270789322, 0.5912164082004993, 0.5933099169387364, 0.5915245414826197, 0.5908806558478971, 0.5910030847897061, 0.5922708973497215, 0.5917013371423085, 0.5906103514499713, 0.592029105050893, 0.5910212646437488, 0.5930227626272326, 0.5915585773958134, 0.5931748751680429, 0.5891325019204917, 0.5911125720184367, 0.5910315128672939, 0.590771127808719, 0.5913357907624351, 0.591508627808719, 0.5920597729018628, 0.5896400158440561, 0.5907519132898023, 0.5910475153639332, 0.591450835413866, 0.5915562920107547, 0.5908322906664105]

4.3 Chapter 4.2: Comparing sampling and bootstrap distributions

Coffee focused subset

Previously, we took a focused subset of the coffee dataset. Here’s a five-hundred-row sample from it.

The bootstrap of mean coffee flavors

Here, we generate a bootstrap distribution of the mean coffee flavor scores from that sample. .sample generates a resample, np.mean calculates the statistic, and the for loop with .append repeats these steps to produce a distribution of bootstrap statistics.

Mean flavor bootstrap distribution

Observing the histogram of the bootstrap distribution, we see that it is close to a normal distribution.

Sample, bootstrap distribution, population means

Here’s the mean flavor score from the original sample. In the bootstrap distribution, each value is an estimate of the mean flavor score. Recall that each of these values corresponds to one potential sample mean from the theoretical population. If we take the mean of those means, we get our best guess of the population mean. The two values are really close. However, there’s a problem. The true population mean is actually a little different.

Interpreting the means

The behavior that you just saw is typical. The bootstrap distribution mean is usually almost identical to the original sample mean. However, that is not necessarily a good thing: if the original sample wasn’t closely representative of the population, then the bootstrap distribution mean won’t be a good estimate of the population mean. Bootstrapping cannot correct any potential biases due to differences between the sample and the population.

Sample sd vs. bootstrap distribution sd

While we do have that limitation in estimating the population mean, one great thing about distributions is that we can also quantify variation. The standard deviation of the sample flavors is around 0.354. Recall that pandas .std calculates a sample standard deviation by default. If we calculate the standard deviation of the bootstrap distribution, specifying a ddof of one, then we get a completely different number. So what’s going on here?

Sample, bootstrap dist’n, pop’n standard deviations

Remember that one goal of bootstrapping is to quantify the variability we might expect in our sample statistic as we go from one sample to another. Recall that this quantity is called the standard error, and it is measured by the standard deviation of the sampling distribution of that statistic. The standard deviation of the bootstrap means can be used to estimate this measure of uncertainty. If we multiply that standard error by the square root of the sample size, we get an estimate of the standard deviation in the original population. Our estimate of the standard deviation is around point-three-five-three. The true standard deviation is around point-three-four-one, so our estimate is pretty close. In fact, it is closer than just using the sample standard deviation alone.
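As a rough sketch of those calculations, assuming coffee_sample is the five-hundred-row sample and bootstrap_means holds the bootstrap replicate means built from it (both names are assumptions):

import numpy as np

# Standard error: standard deviation of the bootstrap distribution
standard_error = np.std(bootstrap_means, ddof=1)

# Estimate of the population standard deviation
estimated_pop_sd = standard_error * np.sqrt(len(coffee_sample))

# Compare with the plain sample standard deviation
print(estimated_pop_sd, coffee_sample['flavor'].std())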

Interpreting the standard errors

To recap, the estimated standard error is the standard deviation of the bootstrap distribution values for our statistic of interest. This estimated standard error times the square root of the sample size gives a really good estimate of the standard deviation of the population. That is, although bootstrapping was poor at estimating the population mean, it is generally great for estimating the population standard deviation.

4.4 Exercise 4.2.1

Sampling distribution vs. bootstrap distribution

The sampling distribution and bootstrap distribution are closely linked. In situations where you can repeatedly sample from a population (these occasions are rare), it’s helpful to generate both the sampling distribution and the bootstrap distribution, one after the other, to see how they are related.

Here, the statistic you are interested in is the mean popularity score of the songs.

Instructions

  1. Generate a sampling distribution of 2000 replicates.
  • Sample 500 rows of the population without replacement and calculate the mean popularity.
  2. Generate a bootstrap distribution of 2000 replicates.
  • Sample 500 rows of the sample with replacement and calculate the mean popularity.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course array
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")

spotify_sample = spotify.sample(n=500)

mean_popularity_2000_samp = []

# Generate a sampling distribution of 2000 replicates
for i in range(2000):
    mean_popularity_2000_samp.append(
        # Sample 500 rows and calculate the mean popularity 
        spotify.sample(n=500)['popularity'].mean()
    )

# Print the sampling distribution results
print(mean_popularity_2000_samp)

mean_popularity_2000_boot = []

# Generate a bootstrap distribution of 2000 replicates
for i in range(2000):
    mean_popularity_2000_boot.append(
        # Resample 500 rows and calculate the mean popularity     
        np.mean(spotify_sample.sample(frac=1, replace=True)['popularity'])
    )

# Print the bootstrap distribution results
print(mean_popularity_2000_boot)
[55.06, 55.014, 54.164, 54.772, 55.028, ..., 55.258, 54.486] (sampling distribution replicates of mean popularity; full output truncated for display)
[55.08, 54.902, 55.524, 55.756, 55.474, ..., 54.56, 54.878] (bootstrap distribution replicates of mean popularity; full output truncated for display)
Note

The sampling distribution and bootstrap distribution are closely related, and so is the code to generate them.

4.5 Exercise 4.2.2

Compare sampling and bootstrap means

To make the comparison easier, distributions similar to those calculated in the previous exercise are recreated in the code below.

spotify, spotify_sample, sampling_distribution, and bootstrap_distribution are available; pandas and numpy are loaded with their usual aliases.

Instructions

  1. Calculate the mean popularity in 4 ways:
  • Population: from spotify, take the mean of popularity.
  • Sample: from spotify_sample, take the mean of popularity.
  • Sampling distribution: from sampling_distribution, take its mean.
  • Bootstrap distribution: from bootstrap_distribution, take its mean.
Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course dataset
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")

spotify_sample = spotify.sample(n=500)

mean_popularity_2000_samp = []

# Generate a sampling distribution of 2000 replicates
for i in range(2000):
    mean_popularity_2000_samp.append(
        # Sample 500 rows and calculate the mean popularity 
        spotify.sample(n=500)['popularity'].mean()
    )

# The sampling distribution results
sampling_distribution = mean_popularity_2000_samp 

mean_popularity_2000_boot = []

# Generate a bootstrap distribution of 2000 replicates
for i in range(2000):
    mean_popularity_2000_boot.append(
        # Resample 500 rows and calculate the mean popularity     
        np.mean(spotify_sample.sample(frac=1, replace=True)['popularity'])
    )

# The bootstrap distribution results
bootstrap_distribution = mean_popularity_2000_boot

# Calculate the population mean popularity
pop_mean = spotify['popularity'].mean()

# Calculate the original sample mean popularity
samp_mean = spotify_sample['popularity'].mean()

# Calculate the sampling dist'n estimate of mean popularity
samp_distn_mean = np.mean(sampling_distribution)

# Calculate the bootstrap dist'n estimate of mean popularity
boot_distn_mean = np.mean(bootstrap_distribution)

# Print the means
print([pop_mean, samp_mean, samp_distn_mean, boot_distn_mean])
[54.837142308430955, 55.266, 54.833191, 55.25912]
Note

The sampling distribution mean can be used to estimate the population mean, but that is not the case with the bootstrap distribution, which is centered on the original sample mean rather than the population mean.

4.6 Exercise 4.2.3

Compare sampling and bootstrap standard deviations

In the same way that you looked at how the sampling distribution and bootstrap distribution could be used to estimate the population mean, you’ll now take a look at how they can be used to estimate variation, or more specifically, the standard deviation, in the population.

Recall that the sample size is 5000.
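
The square-root rescaling in the instructions below follows from the standard error relationship: the standard deviation of the sampling distribution of the mean is approximately the population standard deviation divided by the square root of the sample size, so multiplying it back up by that square root recovers an estimate of the population standard deviation. In symbols, with σ the population standard deviation and n = 5000:

SE(x̄) ≈ σ / √n,  so  σ ≈ SE(x̄) × √n = SE(x̄) × √5000 ≈ SE(x̄) × 70.7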

Instructions

Calculate the standard deviation of popularity in 4 ways:
  • Population: from spotify, take the standard deviation of popularity.
  • Original sample: from spotify_sample, take the standard deviation of popularity.
  • Sampling distribution: from sampling_distribution, take its standard deviation and multiply by the square root of the sample size (5000).
  • Bootstrap distribution: from bootstrap_distribution, take its standard deviation and multiply by the square root of the sample size.

Code
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Importing the course dataset
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")

spotify_sample = spotify.sample(n=5000, random_state=2022)

mean_popularity_2000_samp = []

# Generate a sampling distribution of 2000 replicates
for i in range(2000):
    mean_popularity_2000_samp.append(
        # Sample 5000 rows and calculate the mean popularity 
        spotify.sample(n=5000)['popularity'].mean()
    )

# The sampling distribution results
sampling_distribution = mean_popularity_2000_samp 

mean_popularity_2000_boot = []

# Generate a bootstrap distribution of 2000 replicates
for i in range(2000):
    mean_popularity_2000_boot.append(
        # Resample the 5000 rows with replacement and calculate the mean popularity
        np.mean(spotify_sample.sample(frac=1, replace=True)['popularity'])
    )

# The bootstrap distribution results
bootstrap_distribution = mean_popularity_2000_boot

# Calculate the population std dev popularity
pop_sd = spotify['popularity'].std(ddof=0)

# Calculate the original sample std dev popularity
samp_sd = spotify_sample['popularity'].std(ddof=1)

# Calculate the sampling dist'n estimate of std dev popularity
samp_distn_sd = np.std(sampling_distribution, ddof=1) * np.sqrt(5000)

# Calculate the bootstrap dist'n estimate of std dev popularity
boot_distn_sd = np.std(bootstrap_distribution, ddof=1) * np.sqrt(5000)

# Print the standard deviations
print([pop_sd, samp_sd, samp_distn_sd, boot_distn_sd])
[10.880065274257536, 10.975581356685552, 10.32099231645863, 10.648136574361413]

4.7 Chapter 4.3: Confidence intervals

In the last few exercises, you looked at relationships between the sampling distribution and the bootstrap distribution.

One way to quantify these distributions is the idea of “values within one standard deviation of the mean”, which gives a good sense of where most of the values in a distribution lie. In this final lesson, we’ll formalize the idea of values close to a statistic by defining the term “confidence interval”.

Predicting the weather

Consider meteorologists predicting weather in one of the world’s most unpredictable regions - the northern Great Plains of the US and Canada. Rapid City, South Dakota was ranked as the least predictable of the 120 US cities with a National Weather Service forecast office. Suppose we’ve taken a job as a meteorologist at a news station in Rapid City. Our job is to predict tomorrow’s high temperature.

Our weather prediction

We analyze the weather data using the best forecasting tools available to us and predict a high temperature of 47 degrees Fahrenheit. In this case, 47 degrees is our point estimate. Since the weather is variable, and many South Dakotans will plan their day tomorrow based on our forecast, we’d instead like to present a range of plausible values for the high temperature. On our weather show, we report that the high temperature will be between forty and fifty-four degrees tomorrow.

We just reported a confidence interval!

This prediction of forty to fifty-four degrees can be thought of as a confidence interval for the unknown quantity of tomorrow’s high temperature. Although we can’t be sure of the exact temperature, we are confident that it will be in that range. These results are often written as the point estimate followed by the confidence interval’s lower and upper bounds in parentheses or square brackets. When the confidence interval is symmetric around the point estimate, we can represent it as the point estimate plus or minus the margin of error, in this case, seven degrees.

Bootstrap distribution of mean flavor

Here’s the bootstrap distribution of the mean flavor from the coffee dataset.

Mean of the resamples

We can calculate the mean of these resampled mean flavors.

Mean plus or minus one standard deviation

If we create a confidence interval by adding and subtracting one standard deviation from the mean, we see that there are lots of values in the bootstrap distribution outside of this one standard deviation confidence interval.

Quantile method for confidence intervals

If we want to include ninety-five percent of the values in the confidence interval, we can use quantiles. Recall that quantiles split distributions into sections containing a particular proportion of the total data. To get the middle ninety-five percent of values, we go from the point-zero-two-five quantile to the point-nine-seven-five quantile since the difference between those two numbers is point-nine-five. To calculate the lower and upper bounds for this confidence interval, we call quantile from NumPy, passing the distribution values and the quantile values to use. The confidence interval is from around seven-point-four-eight to seven-point-five-four.
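
As a minimal sketch of the quantile method, assume coffee_boot_distn stands in for the bootstrap distribution of mean flavor; the array below is synthetic and generated purely for illustration, not taken from the course data.

Code
# Illustrative only: synthetic stand-in for the bootstrap distribution of mean flavor
import numpy as np

rng = np.random.default_rng(2024)
coffee_boot_distn = rng.normal(loc=7.51, scale=0.015, size=5000)

# Middle 95% of the bootstrap distribution via the 0.025 and 0.975 quantiles
lower = np.quantile(coffee_boot_distn, 0.025)
upper = np.quantile(coffee_boot_distn, 0.975)
print((lower, upper))  # roughly (7.48, 7.54)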

Inverse cumulative distribution function

There is a second method to calculate confidence intervals. To understand it, we need to be familiar with the normal distribution’s inverse cumulative distribution function. The bell curve we’ve seen before is the probability density function or PDF. Using calculus, if we integrate this, we get the cumulative distribution function or CDF. If we flip the x and y axes, we get the inverse CDF. We can use scipy.stats and call norm.ppf to get the inverse CDF. It takes a quantile between zero and one and returns the values of the normal distribution for that quantile. The parameters of loc and scale are set to 0 and 1 by default, corresponding to the standard normal distribution. Notice that the values corresponding to point-zero-two-five and point-nine-seven-five are about minus and plus two for the standard normal distribution.
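
For instance, calling norm.ppf with the default loc=0 and scale=1 (the standard normal distribution) returns roughly minus and plus 1.96 for those two quantiles:

Code
from scipy.stats import norm

# Inverse CDF (percent point function) of the standard normal distribution
print(norm.ppf(0.025))  # about -1.96
print(norm.ppf(0.975))  # about +1.96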

Standard error method for confidence interval

This second method for calculating a confidence interval is called the standard error method. First, we calculate the point estimate, which is the mean of the bootstrap distribution, and the standard error, which is estimated by the standard deviation of the bootstrap distribution. Then we call norm.ppf to get the inverse CDF of the normal distribution with the same mean and standard deviation as the bootstrap distribution. Again, the confidence interval is from seven-point-four-eight to seven-point-five-four, though the numbers differ slightly from last time since our bootstrap distribution isn’t perfectly normal.
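
Here is a minimal sketch of the standard error method, again using a synthetic stand-in for the bootstrap distribution of mean flavor (coffee_boot_distn is illustrative, not the course data).

Code
# Illustrative only: synthetic stand-in for the bootstrap distribution of mean flavor
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2024)
coffee_boot_distn = rng.normal(loc=7.51, scale=0.015, size=5000)

# Point estimate and standard error from the bootstrap distribution
point_estimate = np.mean(coffee_boot_distn)
standard_error = np.std(coffee_boot_distn, ddof=1)

# 95% interval from the normal inverse CDF with that mean and standard deviation
lower_se = norm.ppf(0.025, loc=point_estimate, scale=standard_error)
upper_se = norm.ppf(0.975, loc=point_estimate, scale=standard_error)
print((lower_se, upper_se))  # roughly (7.48, 7.54)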

4.8 Exercise 4.3.1

4.8.1 Calculating confidence intervals

You have learned about two methods for calculating confidence intervals: the quantile method and the standard error method. The standard error method involves using the inverse cumulative distribution function (inverse CDF) of the normal distribution to calculate confidence intervals. In this exercise, you’ll perform these two methods on the Spotify data.

4.8.2 Instructions

  1. Generate a 95% confidence interval using the quantile method on the bootstrap distribution, setting the 0.025 quantile as lower_quant and the 0.975 quantile as upper_quant.

  2. Generate a 95% confidence interval using the standard error method from the bootstrap distribution.

  • Calculate point_estimate as the mean of bootstrap_distribution, and standard_error as the standard deviation of bootstrap_distribution.
  • Calculate lower_se as the 0.025 quantile of an inv. CDF from a normal distribution with mean point_estimate and standard deviation standard_error.
  • Calculate upper_se as the 0.975 quantile of that same inv. CDF.
Code
# Importing libraries
import pandas as pd
import numpy as np
from scipy.stats import norm

# Importing the course dataset
spotify = pd.read_feather("datasets/spotify_2000_2020.feather")

spotify_sample = spotify.sample(n=5000, random_state=2022)

mean_popularity_2000_boot = []

# Generate a bootstrap distribution of 2000 replicates
for i in range(2000):
    mean_popularity_2000_boot.append(
        # Resample the 5000 rows with replacement and calculate the mean popularity
        np.mean(spotify_sample.sample(frac=1, replace=True)['popularity'])
    )

# The bootstrap distribution results
bootstrap_distribution = mean_popularity_2000_boot

# Generate a 95% confidence interval using the quantile method
lower_quant = np.quantile(bootstrap_distribution, 0.025)
upper_quant = np.quantile(bootstrap_distribution, 0.975)

# Print quantile method confidence interval
print((lower_quant, upper_quant))

# Find the mean and std dev of the bootstrap distribution
point_estimate = np.mean(bootstrap_distribution)
standard_error = np.std(bootstrap_distribution, ddof=1)

# Find the lower limit of the confidence interval
lower_se = norm.ppf(0.025, loc=point_estimate, scale=standard_error)

# Find the upper limit of the confidence interval
upper_se = norm.ppf(0.975, loc=point_estimate, scale=standard_error)

# Print standard error method confidence interval
print((lower_se, upper_se))
(54.47574, 55.07474499999999)
(54.48036899601746, 55.079173603982525)

5 Reference

Sampling in Python, an intermediate Python course in the Associate Data Scientist in Python Career Track at DataCamp Inc, by James Chapman.