Associate Data Science Course in Python by DataCamp Inc
Published
September 26, 2025
Chapter 1
Chapter 1.1: Data type classification
In this course, you will learn about two main types of data: numeric and categorical data.
Numeric variables can be classified as either discrete or continuous, and categorical variables can be classified as either nominal or ordinal. These characteristics of a variable determine which ways of summarizing your data will work best.
Measure of Center
In this lesson, we’ll begin to discuss summary statistics, some of which you may already be familiar with, like mean and median.
Mammal sleep data
In this lesson, we’ll look at data about different mammals’ sleep habits.
Histograms
Before we dive in, let’s remind ourselves how histograms work. A histogram takes a bunch of data points and separates them into bins, or ranges of values. Here, there’s a bin for 0 to 2 hours, 2 to 4 hours, and so on. The heights of the bars represent the number of data points that fall into that bin, so there’s one mammal in the dataset that sleeps between 0 and 2 hours, and nine mammals that sleep between 2 and 4 hours. Histograms are a great way to visually summarize the data, but we can use numerical summary statistics to summarize even further.
How long do mammals in this dataset typically sleep?
One way we could summarize the data is by answering the question, How long do mammals in this dataset typically sleep? To answer this, we need to figure out what the “typical” or “center” value of the data is. We’ll discuss three different definitions, or measures, of center: mean, median, and mode.
Measures of center: mean
The mean, often called the average, is one of the most common ways of summarizing data. To calculate the mean, we add up all the numbers of interest and divide by the total number of data points, which is 83 here. This gives us 10.43 hours of sleep. In Python, we can use numpy’s mean function, passing it the variable of interest.
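As a minimal sketch (the dataset name and file path here are hypothetical assumptions, not part of the course materials):

import numpy as np
import pandas as pd

# Hypothetical load of the mammal sleep data
msleep = pd.read_csv("datasets/msleep.csv")
print(np.mean(msleep['sleep_total']))   # same as msleep['sleep_total'].mean()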
Measures of center: median
Another measure of center is the median. The median is the value where 50% of the data is lower than it, and 50% of the data is higher. We can calculate this by sorting all the data points and taking the middle one, which is index 41 of the 83 sorted values (using zero-based indexing). This gives us a median of 10.1 hours of sleep. In Python, we can use np.median to do the calculations for us.
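Continuing with the hypothetical msleep DataFrame from the sketch above, the manual approach and np.median agree:

sorted_sleep = msleep['sleep_total'].sort_values()
print(sorted_sleep.iloc[len(sorted_sleep) // 2])  # middle value (for an odd number of points)
print(np.median(msleep['sleep_total']))           # same result in one call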
Measures of center: mode
The mode is the most frequent value in the data. If we count how many occurrences there are of each sleep_total and sort in descending order, there are 4 mammals that sleep for 12.5 hours, so this is the mode. The mode of the vore variable, which indicates the animal’s diet, is herbivore. We can also find the mode using the mode function from the statistics module. Mode is often used for categorical variables, since categorical variables can be unordered and often don’t have an inherent numerical representation.
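A sketch of the same idea, again assuming the hypothetical msleep DataFrame with sleep_total and vore columns:

import statistics

print(msleep['sleep_total'].value_counts().head())   # counts, sorted in descending order
print(statistics.mode(msleep['sleep_total']))        # most frequent sleep time
print(statistics.mode(msleep['vore']))               # most frequent diet category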
Adding an outlier
Now that we have lots of ways to measure center, how do we know which one to use? Let’s look at an example. Here, we have all of the insectivores in the dataset. We get a mean sleep time of 16.5 hours and a median sleep time of 18.9 hours. Now let’s say we’ve discovered a new mystery insectivore that never sleeps. If we take the mean and median again, we get different results. The mean went down by more than 3 hours, while the median changed by less than an hour. This is because the mean is much more sensitive to extreme values than the median.
Which measure to use?
Since the mean is more sensitive to extreme values, it works better for symmetrical data like this. Notice that the mean, in black, and median, in red, are quite close.
Skew
However, if the data is skewed, meaning it’s not symmetrical, like this, median is usually better to use. In this histogram, the data is piled up on the right, with a tail on the left. Data that looks like this is called left-skewed data. When data is piled up on the left with a tail on the right, it’s right-skewed.
Which measure to use?
When data is skewed, the mean and median are different. The mean is pulled in the direction of the skew, so it’s lower than the median on the left-skewed data, and higher than the median on the right-skewed data. Because the mean is pulled around by the extreme values, it’s better to use the median since it’s less affected by outliers.
Exercise 1.1
Only show the final grouped result:

result = be_and_usa.groupby('country')['consumption'].agg(['mean', 'median'])
result
Mean and median
In this chapter, you’ll be working with the 2018 Food Carbon Footprint Index from nu3. The food_consumption dataset contains information about the kilograms of food consumed per person per year in each country in each food category (consumption) as well as information about the carbon footprint of that food category (co2_emissions) measured in kilograms of carbon dioxide, or CO2, per person per year in each country.
In this exercise, you’ll compute measures of center to compare food consumption in the US and Belgium using your pandas and numpy skills.
Import numpy with the alias np.
Create two DataFrames: one that holds the rows of food_consumption for ‘Belgium’ and another that holds rows for ‘USA’. Call these be_consumption and usa_consumption.
Calculate the mean and median of kilograms of food consumed per person per year for both countries.
Code
import warnings
warnings.filterwarnings("ignore")

# Import numpy with alias np
import numpy as np
import pandas as pd

# Importing the dataset
food_consumption = pd.read_csv("datasets/food_consumption.csv")

# Filter for Belgium
be_consumption = food_consumption[food_consumption['country'] == 'Belgium']
# OR
be_consumption = food_consumption.set_index('country').loc[['Belgium']]

# Filter for USA
usa_consumption = food_consumption[food_consumption['country'] == 'USA']
# OR
usa_consumption = food_consumption.set_index('country').loc[['USA']]

# Calculate mean and median consumption in Belgium
print(f"Mean consumption in Belgium is: {np.mean(be_consumption['consumption']):.4f}")
print(f"Median consumption in Belgium is: {np.median(be_consumption['consumption']):.4f}")

# Calculate mean and median consumption in USA
print(f"Mean consumption in USA is: {np.mean(usa_consumption['consumption']):.4f}")
print(f"Median consumption in USA is: {np.median(usa_consumption['consumption']):.4f}")

# Subset food_consumption for rows with data about Belgium and the USA.
# Group the subsetted data by country and select only the consumption column.
# Calculate the mean and median of the kilograms of food consumed per person per year in each country using .agg().

# Subset for Belgium and USA only
be_and_usa = food_consumption[(food_consumption['country'] == 'Belgium') | (food_consumption['country'] == 'USA')]

# Group by country, select consumption column, and compute mean and median
print("\n The mean and median of the kilograms of food consumed per person per year in Belgium and USA using .agg()")
print(be_and_usa.groupby('country')['consumption'].agg([np.mean, np.median]))
Mean consumption in Belgium is: 42.1327
Median consumption in Belgium is: 12.5900
Mean consumption in USA is: 44.6500
Median consumption in USA is: 14.5800
The mean and median of the kilograms of food consumed per person per year in Belgium and USA using .agg()
mean median
country
Belgium 42.132727 12.59
USA 44.650000 14.58
Exercise 1.2
Import matplotlib.pyplot with the alias plt.
Subset food_consumption to get the rows where food_category is ‘rice’.
Create a histogram of co2_emission for rice and show the plot.
Use .agg() to calculate the mean and median of co2_emission for rice.
Code
import warnings
warnings.filterwarnings("ignore")

# Import matplotlib.pyplot with the alias plt.
import matplotlib.pyplot as plt

# Subset food_consumption to get the rows where food_category is 'rice'.
rice_consumption = food_consumption[food_consumption['food_category'] == 'rice']

# Create a histogram of co2_emission for rice and show the plot.
rice_consumption.hist(column='co2_emission')
# or
rice_consumption['co2_emission'].hist()
plt.show()

# Use .agg() to calculate the mean and median of co2_emission for rice.
mean_rice, median_rice = rice_consumption['co2_emission'].agg([np.mean, np.median])
print(f"Mean rice CO2 Emission: {mean_rice}")
print(f"Median rice CO2 Emission: {median_rice}")
Mean rice CO2 Emission: 37.59161538461538
Median rice CO2 Emission: 15.2
Chapter 1.2: Measure of Dispersion
Measures of spread
In this lesson, we’ll talk about another set of summary statistics: measures of spread.
What is spread?
Spread is just what it sounds like - it describes how spread apart or close together the data points are. Just like measures of center, there are a few different measures of spread.
Variance
The first measure, variance, measures the average squared distance from each data point to the data’s mean.
Calculating variance
To calculate the variance, we start by calculating the distance between each point and the mean, so we get one number for every data point. We then square each distance and then add them all together. Finally, we divide the sum of squared distances by the number of data points minus 1, giving us the variance. The higher the variance, the more spread out the data is. It’s important to note that the units of variance are squared, so in this case, it’s 19.8 hours squared. We can calculate the variance in one step using np.var, setting the ddof argument to 1. If we don’t specify ddof equals 1, a slightly different formula is used to calculate variance that should only be used on a full population, not a sample.
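A minimal sketch of both routes, assuming the hypothetical msleep DataFrame from earlier:

# Variance step by step
dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
sq_dists = dists ** 2
print(np.sum(sq_dists) / (len(sq_dists) - 1))

# Variance in one call; ddof=1 because this is a sample, not the full population
print(np.var(msleep['sleep_total'], ddof=1))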
Standard deviation
The standard deviation is another measure of spread, calculated by taking the square root of the variance. It can be calculated using np.std. Just like np.var, we need to set ddof to 1. The nice thing about standard deviation is that the units are usually easier to understand since they’re not squared. It’s easier to wrap your head around 4 and a half hours than 19.8 hours squared.
Mean absolute deviation
Mean absolute deviation takes the absolute value of the distances to the mean, and then takes the mean of those differences. While this is similar to standard deviation, it’s not exactly the same. Standard deviation squares distances, so longer distances are penalized more than shorter ones, while mean absolute deviation penalizes each distance equally. One isn’t better than the other, but SD is more common than MAD.
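Sketched on the same hypothetical data, standard deviation and mean absolute deviation look like this:

print(np.std(msleep['sleep_total'], ddof=1))    # standard deviation

dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
print(np.mean(np.abs(dists)))                   # mean absolute deviation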
Quantiles
Before we discuss the next measure of spread, let’s quickly talk about quantiles. Quantiles, also called percentiles, split up the data into some number of equal parts. Here, we call np.quantile, passing in the column of interest, followed by 0.5. This gives us 10.1 hours, so 50% of mammals in the dataset sleep less than 10.1 hours a day, and the other 50% sleep more than 10.1 hours, so this is exactly the same as the median. We can also pass in a list of numbers to get multiple quantiles at once. Here, we split the data into 4 equal parts. These are also called quartiles. This means that 25% of the data is between 1.9 and 7.85, another 25% is between 7.85 and 10.10, and so on.
Boxplots use quartiles
The boxes in box plots represent quartiles. The bottom of the box is the first quartile, and the top of the box is the third quartile. The middle line is the second quartile, or the median.
Quantiles using np.linspace()
Here, we split the data into five equal pieces, but we can also use np.linspace as a shortcut, which takes in the starting number, the stopping number, and the number of evenly spaced values to generate. Since five pieces need six quantile points, we can compute the same quantiles using np.linspace(0, 1, 6), starting at zero, stopping at one, and generating the six points that split the data into 5 intervals.
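As a sketch on the hypothetical msleep data, an explicit list of quantile points and the np.linspace shortcut give the same answers:

print(np.quantile(msleep['sleep_total'], 0.5))                      # same as the median
print(np.quantile(msleep['sleep_total'], [0, 0.25, 0.5, 0.75, 1]))  # quartiles
print(np.quantile(msleep['sleep_total'], np.linspace(0, 1, 6)))     # six points that split the data into quintiles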
Interquartile range (IQR)
The interquartile range, or IQR, is another measure of spread. It’s the distance between the 25th and 75th percentile, which is also the height of the box in a boxplot. We can calculate it using the quantile function, or using the iqr function from scipy.stats to get 5.9 hours.
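Both routes, sketched on the hypothetical msleep data:

from scipy.stats import iqr

q75 = np.quantile(msleep['sleep_total'], 0.75)
q25 = np.quantile(msleep['sleep_total'], 0.25)
print(q75 - q25)                    # IQR from the quantiles
print(iqr(msleep['sleep_total']))   # same value via scipy.stats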
Outliers
Outliers are data points that are substantially different from the others. But how do we know what a substantial difference is? A rule that’s often used is that any data point less than the first quartile minus 1.5 times the IQR is an outlier, as well as any point greater than the third quartile plus 1.5 times the IQR.
Finding outliers
To find outliers, we’ll start by calculating the IQR of the mammals’ body weights. We can then calculate the lower and upper thresholds following the formulas from the previous slide. We can now subset the DataFrame to find mammals whose body weight is below or above the thresholds. There are eleven body weight outliers in this dataset, including the cow and the Asian elephant.
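A minimal sketch of the outlier search, assuming the body weight column of the hypothetical msleep DataFrame is named bodywt:

from scipy.stats import iqr

iqr_wt = iqr(msleep['bodywt'])
lower = np.quantile(msleep['bodywt'], 0.25) - 1.5 * iqr_wt
upper = np.quantile(msleep['bodywt'], 0.75) + 1.5 * iqr_wt
print(msleep[(msleep['bodywt'] < lower) | (msleep['bodywt'] > upper)])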
All in one go
Many of the summary statistics we’ve covered so far can all be calculated in just one line of code using the .describe method, so it’s convenient to use when you want to get a general sense of your data.
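For example, on the hypothetical msleep data:

print(msleep['sleep_total'].describe())   # count, mean, std, min, quartiles, and max in one call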
Exercise 1.2.1
Calculate the quartiles of the co2_emission column of food_consumption.
Calculate the six quantiles that split up the data into 5 pieces (quintiles) of the co2_emission column of food_consumption.
Calculate the eleven quantiles of co2_emission that split up the data into ten pieces (deciles).
Calculate the variance and standard deviation of co2_emission for each food_category by grouping and aggregating.
Import matplotlib.pyplot with alias plt.
Create a histogram of co2_emission for the beef food_category and show the plot.
Create a histogram of co2_emission for the eggs food_category and show the plot.
Code
import warnings
warnings.filterwarnings("ignore")

# Calculate the quartiles of the co2_emission column of food_consumption.
print("The quartiles of the co2_emission column of food_consumption: \n")
print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 5)))

# Calculate the six quantiles that split up the data into 5 pieces (quintiles) of the co2_emission column of food_consumption.
print("The quintiles of the co2_emission column of food_consumption: \n")
print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 6)))

# Calculate the eleven quantiles of co2_emission that split up the data into ten pieces (deciles).
print("The 11 quantiles of the co2_emission column of food_consumption: \n")
print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 11)))

# Calculate the variance and standard deviation of co2_emission for each food_category by grouping and aggregating.
print("The variance and standard deviation of co2_emission for each food_category by grouping and aggregating \n")
print(food_consumption.groupby('food_category')['co2_emission'].agg([np.var, np.std]))

# Import matplotlib.pyplot with alias plt.
import matplotlib.pyplot as plt

# Create a histogram of co2_emission for the beef food_category and show the plot.
food_consumption[food_consumption['food_category'] == 'beef'].hist(column='co2_emission')
plt.title("Histogram of CO2 emission for the Beef food category")
plt.show()
# or
food_consumption[food_consumption['food_category'] == 'beef']['co2_emission'].hist()
plt.show()

# Create a histogram of co2_emission for the eggs food_category and show the plot.
food_consumption[food_consumption['food_category'] == 'eggs'].hist(column='co2_emission')
plt.title("Histogram of CO2 emission for the Egg food category")
plt.show()
# or
food_consumption[food_consumption['food_category'] == 'eggs']['co2_emission'].hist()
plt.show()
The quartiles of the co2_emission column of food_consumption:
[ 0. 5.21 16.53 62.5975 1712. ]
The quintiles of the co2_emission column of food_consumption:
[ 0. 3.54 11.026 25.59 99.978 1712. ]
The 11 quantiles of the co2_emission column of food_consumption:
[0.00000e+00 6.68000e-01 3.54000e+00 7.04000e+00 1.10260e+01 1.65300e+01
2.55900e+01 4.42710e+01 9.99780e+01 2.03629e+02 1.71200e+03]
The variance and standard deviation of co2_emission for each food_category by grouping and aggregating
var std
food_category
beef 88748.408132 297.906710
dairy 17671.891985 132.935669
eggs 21.371819 4.622966
fish 921.637349 30.358481
lamb_goat 16475.518363 128.356996
nuts 35.639652 5.969895
pork 3094.963537 55.632396
poultry 245.026801 15.653332
rice 2281.376243 47.763754
soybeans 0.879882 0.938020
wheat 71.023937 8.427570
Exercise 1.2.2
Finding outliers using IQR
Outliers can have big effects on statistics like mean, as well as statistics that rely on the mean, such as variance and standard deviation. Interquartile range, or IQR, is another way of measuring spread that’s less influenced by outliers. IQR is also often used to find outliers. The outlier rule states that values are considered outliers if \(x < Q_1 - 1.5 \times IQR\) or \(x > Q_3 + 1.5 \times IQR\). In fact, this is how the lengths of the whiskers in a matplotlib box plot are calculated.
Calculate the total co2_emission per country by grouping by country and taking the sum of co2_emission. Store the resulting DataFrame as emissions_by_country.
Compute the first and third quartiles of emissions_by_country and store these as q1 and q3.
Calculate the interquartile range of emissions_by_country and store it as iqr.
Calculate the lower and upper cutoffs for outliers of emissions_by_country, and store these as lower and upper.
Subset emissions_by_country to get countries with a total emission greater than the upper cutoff or a total emission less than the lower cutoff.
Code
# Calculate the total co2_emission per country by grouping by country and taking the sum of co2_emission.
# Store the resulting DataFrame as emissions_by_country.
emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()
print("The Total CO2 Emission by Country \n")
print(emissions_by_country)

# Compute the first and third quartiles of emissions_by_country and store these as q1 and q3.
q1 = np.quantile(emissions_by_country, 0.25)
q3 = np.quantile(emissions_by_country, 0.75)

# Calculate the interquartile range of emissions_by_country and store it as iqr.
iqr = q3 - q1

# Calculate the lower and upper cutoffs for outliers of emissions_by_country, and store these as lower and upper.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Subset emissions_by_country to get countries with a total emission greater than the upper cutoff or a total emission less than the lower cutoff.
outliers = emissions_by_country[(emissions_by_country < lower) | (emissions_by_country > upper)]
print("The Outliers (countries with a total emission greater than the upper cutoff or a total emission less than the lower cutoff) are")
print(outliers)
The Total CO2 Emission by Country
country
Albania 1777.85
Algeria 707.88
Angola 412.99
Argentina 2172.40
Armenia 1109.93
...
Uruguay 1634.91
Venezuela 1104.10
Vietnam 641.51
Zambia 225.30
Zimbabwe 350.33
Name: co2_emission, Length: 130, dtype: float64
The Outliers (countries with a total emission greater than the upper cutoff or a total emission less than the lower cutoff) are
country
Argentina 2172.4
Name: co2_emission, dtype: float64
Chapter 2
Chapter 2.1: Random Numbers and Probability
What are the chances?
People talk about chance pretty frequently, like what are the chances of closing a sale, of rain tomorrow, or of winning a game? But how exactly do we measure chance?
Measuring chance
We can measure the chances of an event using probability. We can calculate the probability of some event by taking the number of ways the event can happen and dividing it by the total number of possible outcomes. For example, if we flip a coin, it can land on either heads or tails. To get the probability of the coin landing on heads, we divide the 1 way to get heads by the two possible outcomes, heads and tails. This gives us one half, or a fifty percent chance of getting heads. Probability is always between zero and 100 percent. If the probability of something is zero, it’s impossible, and if the probability of something is 100%, it will certainly happen.
Assigning salespeople
Let’s look at a more complex scenario. There’s a meeting coming up with a potential client, and we want to send someone from the sales team to the meeting. We’ll put each person’s name on a ticket in a box and pull one out randomly to decide who goes to the meeting. Brian’s name gets pulled out. The probability of Brian being selected is one out of four, or 25%.
Sampling from a DataFrame
We can recreate this scenario in Python using the sample() method. By default, it randomly samples one row from the DataFrame. However, if we run the same thing again, we may get a different row since the sample method chooses randomly. If we want to show the team how we picked Brian, this won’t work well.
Setting a random seed
To ensure we get the same results when we run the script in front of the team, we’ll set the random seed using np.random.seed. The seed is a number that Python’s random number generator uses as a starting point, so if we provide it with a seed number, it will generate the same random values each time. The number itself doesn’t matter. We could use 5, 139, or 3 million. The only thing that matters is that we use the same seed the next time we run the script. Now, we, or one of the sales-team members, can run this code over and over and get Brian every time.
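A minimal sketch, using a hypothetical DataFrame of the four salespeople (the seed value 5 is arbitrary):

import numpy as np
import pandas as pd

sales_counts = pd.DataFrame({'name': ['Amir', 'Brian', 'Claire', 'Damian']})

np.random.seed(5)
print(sales_counts.sample())    # one row; rerunning with the same seed repeats the result

np.random.seed(5)
print(sales_counts.sample(2))                 # two rows, sampled without replacement
print(sales_counts.sample(5, replace=True))   # with replacement, the same row can appear more than once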
A second meeting
Now there’s another potential client who wants to meet at the same time, so we need to pick another salesperson. Brian has already been picked, and he can’t be in two meetings at once, so we’ll pick between the remaining three. This is called sampling without replacement, since we aren’t replacing the name we already pulled out. This time, Claire is picked, and the probability of this is one out of three, or about 33%.
Sampling twice in Python
To recreate this in Python, we can pass 2 into the sample method, which will give us 2 rows of the DataFrame.
Sampling with replacement
Now let’s say the two meetings are happening on different days, so the same person could attend both. In this scenario, we need to return Brian’s name to the box after picking it. This is called sampling with replacement. Claire gets picked for the second meeting, but this time, the probability of picking her is 25%.
Sampling with/without replacement in Python
To sample with replacement, set the replace argument to True, so names can appear more than once. If there were 5 meetings, all at different times, it’s possible to pick some rows multiple times since we’re replacing them each time.
Independent events
Let’s quickly talk about independence. Two events are independent if the probability of the second event isn’t affected by the outcome of the first event. For example, if we’re sampling with replacement, the probability that Claire is picked second is 25%, no matter who gets picked first. In general, when sampling with replacement, each pick is independent.
Dependent events
Similarly, events are considered dependent when the outcome of the first changes the probability of the second. If we sample without replacement, the probability that Claire is picked second depends on who gets picked first. If Claire is picked first, there’s 0% probability that Claire will be picked second. If someone else is picked first, there’s a 33% probability Claire will be picked second. In general, when sampling without replacement, each pick is dependent.
Exercise 2.1.1: Calculating probabilities
You’re in charge of the sales team, and it’s time for performance reviews, starting with Amir. As part of the review, you want to randomly select a few of the deals that he’s worked on over the past year so that you can look at them more deeply. Before you start selecting deals, you’ll first figure out what the chances are of selecting certain deals.
Count the number of deals Amir worked on for each product type and store in counts.
Calculate the probability of selecting a deal for the different product types by dividing the counts by the total number of deals Amir worked on. Save this as probs.
Code
# Importing the dataset
amir_deals = pd.read_csv("datasets/amir_deals.csv")

# Count the number of deals Amir worked on for each product type and store in counts.
counts = amir_deals['product'].value_counts()
# or
counts = amir_deals.value_counts('product')
print(f"The number of deals Amir worked on for each product type: {counts}")

# Calculate the probability of selecting a deal for the different product types by dividing the counts by the total number of deals Amir worked on. Save this as probs.
probs = amir_deals['product'].value_counts() / 178
print(f"The probability of selecting a deal for the different product types: {probs}")
The number of deals Amir worked on for each product type: product
Product B 62
Product D 40
Product A 23
Product C 15
Product F 11
Product H 8
Product I 7
Product E 5
Product N 3
Product G 2
Product J 2
Name: count, dtype: int64
The probability of selecting a deal for the different product types: product
Product B 0.348315
Product D 0.224719
Product A 0.129213
Product C 0.084270
Product F 0.061798
Product H 0.044944
Product I 0.039326
Product E 0.028090
Product N 0.016854
Product G 0.011236
Product J 0.011236
Name: count, dtype: float64
Exercise 2.1.2 : Sampling deals
In the previous exercise, you counted the deals Amir worked on. Now it’s time to randomly pick five deals so that you can reach out to each customer and ask if they were satisfied with the service they received. You’ll try doing this both with and without replacement.
Additionally, you want to make sure this is done randomly and that it can be reproduced in case you get asked how you chose the deals, so you’ll need to set the random seed before sampling from the deals.
Import the necessary packages.
Set the random seed to 24.
Take a sample of 5 deals without replacement and store them as sample_without_replacement.
Take a sample of 5 deals with replacement and save as sample_with_replacement.
Code
import pandas as pd
import numpy as np

# Set the random seed to 24.
np.random.seed(24)

# Take a sample of 5 deals without replacement and store them as sample_without_replacement.
sample_without_replacement = amir_deals.sample(5)
print("Sample of 5 deals without replacement \n")
print(sample_without_replacement)

# Take a sample of 5 deals with replacement and save as sample_with_replacement.
sample_with_replacement = amir_deals.sample(5, replace=True)
print("Sample of 5 deals with replacement \n")
print(sample_with_replacement)
Sample of 5 deals without replacement
Unnamed: 0 product client status amount num_users
127 128 Product B Current Won 2070.25 7
148 149 Product D Current Won 3485.48 52
77 78 Product B Current Won 6252.30 27
104 105 Product D Current Won 4110.98 39
166 167 Product C New Lost 3779.86 11
Sample of 5 deals with replacement
Unnamed: 0 product client status amount num_users
133 134 Product D Current Won 5992.86 98
101 102 Product H Current Won 5116.34 63
110 111 Product B Current Won 696.88 44
49 50 Product B Current Won 3488.36 79
56 57 Product D Current Won 6820.84 42
What type of sampling is better to use for this situation? If you sample with replacement, you might end up calling the same customer twice.
Chapter 2.2: Discrete distributions
In this lesson, we’ll take a deeper dive into probability and begin looking at probability distributions.
Rolling the dice
Let’s consider rolling a standard, six-sided die. There are six numbers, or six possible outcomes, and every number has one-sixth, or about a 17 percent chance of being rolled. This is an example of a probability distribution.
Choosing salespeople
This is similar to the scenario from earlier, except we had names instead of numbers. Just like rolling a die, each outcome, or name, had an equal chance of being chosen.
Probability distribution
A probability distribution describes the probability of each possible outcome in a scenario. We can also talk about the expected value of a distribution, which is the mean of a distribution. We can calculate this by multiplying each value by its probability (one-sixth in this case) and summing, so the expected value of rolling a fair die is 3.5.
Visualizing a probability distribution
We can visualize this using a barplot, where each bar represents an outcome, and each bar’s height represents the probability of that outcome.
Probability = area
We can calculate probabilities of different outcomes by taking areas of the probability distribution. For example, what’s the probability that our die roll is less than or equal to 2? To figure this out, we’ll take the area of each bar representing an outcome of 2 or less. Each bar has a width of 1 and a height of one-sixth, so the area of each bar is one-sixth. We’ll sum the areas for 1 and 2, to get a total probability of one-third.
Uneven die
Now let’s say we have a die where the two got turned into a three. This means that we now have a 0% chance of getting a 2, and a 33% chance of getting a 3. To calculate the expected value of this die, we now multiply 2 by 0, since it’s impossible to get a 2, and 3 by its new probability, one-third. This gives us an expected value that’s slightly higher than the fair die.
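A quick check of that expected value:

# Expected value of the uneven die: the face showing 2 was turned into a 3
values = [1, 2, 3, 4, 5, 6]
probs = [1/6, 0, 2/6, 1/6, 1/6, 1/6]
print(sum(v * p for v, p in zip(values, probs)))   # about 3.67, slightly above 3.5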
Visualizing uneven probabilities
When we visualize these new probabilities, the bars are no longer even.
Adding areas
With this die, what’s the probability of getting something less than or equal to 2? There’s a one-sixth probability of getting 1, and zero probability of getting 2, which sums to one-sixth.
Discrete probability distributions
The probability distributions you’ve seen so far are both discrete probability distributions, since they represent situations with discrete outcomes. Recall from Chapter 1 that discrete variables can be thought of as counted variables. In the case of a die, we’re counting dots, so we can’t roll a 1.5 or 4.3. When all outcomes have the same probability, like a fair die, this is a special distribution called a discrete uniform distribution.
Sampling from discrete distributions
Just like we sampled names from a box, we can do the same thing with probability distributions like the ones we’ve seen. Here’s a DataFrame called die that represents a fair die, and its expected value is 3.5. We’ll sample from it 10 times to simulate 10 rolls. Notice that we sample with replacement so that we’re sampling from the same distribution every time.
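A minimal sketch of that die DataFrame and the ten simulated rolls (assuming pandas and numpy are imported as before):

die = pd.DataFrame({'number': [1, 2, 3, 4, 5, 6], 'prob': [1/6] * 6})
print(np.sum(die['number'] * die['prob']))   # expected value: 3.5

rolls_10 = die.sample(10, replace=True)      # simulate 10 rolls
print(rolls_10['number'].mean())             # sample mean of the 10 rolls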
Visualizing a sample
We can visualize the outcomes of the ten rolls using a histogram, defining the bins we want using np.linspace.
Sample distribution vs. theoretical distribution
Notice that we have different numbers of 1’s, 2’s, 3’s, and so on since the sample was random, even though on each roll we had the same probability of rolling each number. The mean of our sample is 3.0, which isn’t super close to the 3.5 we were expecting.
A bigger sample
If we roll the die 100 times, the distribution of the rolls looks a bit more even, and the mean is closer to 3.5.
An even bigger sample
If we roll 1000 times, it looks even more like the theoretical probability distribution and the mean closely matches 3.5.
Law of large numbers
This is called the law of large numbers, which is the idea that as the size of your sample increases, the sample mean will approach the theoretical mean.
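A small simulation illustrating the idea (the seed is arbitrary and only makes the output reproducible):

die = pd.DataFrame({'number': [1, 2, 3, 4, 5, 6]})
np.random.seed(137)
for n in [10, 100, 1000]:
    rolls = die['number'].sample(n, replace=True)
    print(n, rolls.mean())   # the sample mean drifts toward 3.5 as n grows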
Exercise 2.2: Creating a probability distribution
A new restaurant opened a few months ago, and the restaurant’s management wants to optimize its seating space based on the size of the groups that come most often. On one night, there are 10 groups of people waiting to be seated at the restaurant, but instead of being called in the order they arrived, they will be called randomly. In this exercise, you’ll investigate the probability of groups of different sizes getting picked first. Data on each of the ten groups is contained in the restaurant_groups DataFrame.
Create a histogram of the group_size column of restaurant_groups, setting bins to [2, 3, 4, 5, 6]. Remember to show the plot.
Count the number of each group_size in restaurant_groups, then divide by the number of rows in restaurant_groups to calculate the probability of randomly selecting a group of each size. Save as size_dist.
Reset the index of size_dist.
Rename the columns of size_dist to group_size and prob.
Calculate the expected value of the size_dist, which represents the expected group size, by multiplying the group_size by the prob and taking the sum.
Calculate the probability of randomly picking a group of 4 or more people by subsetting for groups of size 4 or more and summing the probabilities of selecting those groups.
Sum the probabilities of groups_4_or_more.
Code
# Create a histogram of the group_size column of restaurant_groups, setting bins to [2, 3, 4, 5, 6]. Remember to show the plot.
restaurant_groups['group_size'].hist(bins=[2, 3, 4, 5, 6])
plt.show()

# Count the number of each group_size in restaurant_groups, then divide by the number of rows in restaurant_groups
# to calculate the probability of randomly selecting a group of each size. Save as size_dist.
size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]

# Reset the index of size_dist and rename the columns to group_size and prob.
size_dist = size_dist.reset_index()
size_dist.columns = ['group_size', 'prob']

# Calculate the expected value of size_dist, which represents the expected group size,
# by multiplying the group_size by the prob and taking the sum.
expected_value = (size_dist['group_size'] * size_dist['prob']).sum()
# Or
expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])
print(expected_value)

# Calculate the probability of randomly picking a group of 4 or more people by subsetting
# for groups of size 4 or more and summing the probabilities of selecting those groups.
groups_4_or_more = size_dist[size_dist['group_size'] >= 4]

# Sum the probabilities of groups_4_or_more
prob_4_or_more = np.sum(groups_4_or_more['prob'])
print(prob_4_or_more)
Special note
You learned about the basics of probability distributions, focusing on discrete distributions, and how they apply to real-world scenarios. Specifically, you explored:
Probability Distributions: Understanding that a probability distribution describes the likelihood of each possible outcome in a scenario, like rolling a six-sided die where each outcome has an equal chance.
Expected Value: Learning to calculate the expected value of a distribution as the mean, demonstrated by multiplying each outcome’s value by its probability and summing these products. For a fair die, the expected value is 3.5.
Visualizing Distributions: How to visualize probability distributions with bar plots, where each bar’s height represents the outcome’s probability, and histograms for sample outcomes.
Discrete Uniform Distribution: When all outcomes have the same probability, such as with a fair die, the distribution is a special case called the discrete uniform distribution.
Sampling and the Law of Large Numbers: Through examples, you saw how sampling from a distribution (like rolling a die multiple times) and calculating the sample mean can illustrate the law of large numbers. The larger the sample, the closer the sample mean will be to the theoretical mean.
Code
# Example of calculating expected value for a fair die
expected_value = sum([i * (1/6) for i in range(1, 7)])
Chapter 2.3 Continuous distributions
We can use discrete distributions to model situations that involve discrete or countable variables, but how can we model continuous variables?
Waiting for the bus
Let’s start with an example. The city bus arrives once every twelve minutes, so if you show up at a random time, you could wait anywhere from 0 minutes if you just arrive as the bus pulls in, up to 12 minutes if you arrive just as the bus leaves.
Continuous uniform distribution
Let’s model this scenario with a probability distribution. There are an infinite number of minutes we could wait since we could wait 1 minute, 1.5 minutes, 1.53 minutes, and so on, so we can’t create individual blocks like we could with a discrete variable.
Continuous uniform distribution
Instead, we’ll use a continuous line to represent probability. The line is flat since there’s the same probability of waiting any time from 0 to 12 minutes. This is called the continuous uniform distribution.
Probability still = area
Now that we have our distribution, let’s figure out what the probability is that we’ll wait between 4 and 7 minutes. Just like with discrete distributions, we can take the area from 4 to 7 to calculate probability. The width of this rectangle is 7 minus 4 which is 3. The height is one-twelfth. Multiplying those together to get area, we get 3/12 or 25%.
Uniform distribution in Python
Let’s use the uniform distribution in Python to calculate the probability of waiting 7 minutes or less. We need to import uniform from scipy.stats. We can call uniform.cdf and pass it 7, followed by the lower and upper limits, which in our case is 0 and 12. The probability of waiting less than 7 minutes is about 58%.
“Greater than” probabilities
If we want the probability of waiting more than 7 minutes, we need to take 1 minus the probability of waiting less than 7 minutes.
Combining multiple uniform.cdf() calls
How do we calculate the probability of waiting 4 to 7 minutes using Python? We can start with the probability of waiting less than 7 minutes, then subtract the probability of waiting less than 4 minutes. This gives us 25%.
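Putting the three calculations together for the 0-to-12-minute bus example:

from scipy.stats import uniform

# Wait time is uniform from 0 to 12 minutes (loc=0, scale=12)
print(uniform.cdf(7, 0, 12))                           # P(wait <= 7), about 0.58
print(1 - uniform.cdf(7, 0, 12))                       # P(wait > 7)
print(uniform.cdf(7, 0, 12) - uniform.cdf(4, 0, 12))   # P(4 <= wait <= 7) = 0.25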
Total area = 1
To calculate the probability of waiting between 0 and 12 minutes, we multiply 12 by 1/12, which is 1, or 100%. This makes sense since we’re certain we’ll wait anywhere from 0 to 12 minutes.
Generating random numbers according to uniform distribution
To generate random numbers according to the uniform distribution, we can use uniform.rvs, which takes in the minimum value (loc), the width of the interval (scale, which equals the maximum value when the minimum is 0), followed by the number of random values we want to generate via the size argument. Here, we generate 10 random values between 0 and 5.
Other continuous distributions
Continuous distributions can take forms other than uniform where some values have a higher probability than others. No matter the shape of the distribution, the area beneath it must always equal 1.
Other special types of distributions
This will also be true of other distributions you’ll learn about later on in the course, like the normal distribution or exponential distribution, which can be used to model many real-life situations.
Exercise 2.3.1
Data back-ups
The sales software used at your company is set to automatically back itself up, but no one knows exactly what time the back-ups happen. It is known, however, that back-ups happen exactly every 30 minutes. Amir comes back from sales meetings at random times to update the data on the client he just met with. He wants to know how long he’ll have to wait for his newly-entered data to get backed up. Use your new knowledge of continuous uniform distributions to model this situation and answer Amir’s questions.
To model how long Amir will wait for a back-up using a continuous uniform distribution, save his lowest possible wait time as min_time and his longest possible wait time as max_time. Remember that back-ups happen every 30 minutes.
Import uniform from scipy.stats and calculate the probability that Amir has to wait less than 5 minutes, and store in a variable called prob_less_than_5.
Calculate the probability that Amir has to wait more than 5 minutes, and store in a variable called prob_greater_than_5.
Calculate the probability that Amir has to wait between 10 and 20 minutes, and store in a variable called prob_between_10_and_20.
Code
min_time = 0
max_time = 30

from scipy.stats import uniform

prob_less_than_5 = uniform.cdf(5, min_time, max_time)
print(f"The probability that Amir has to wait less than 5 minutes: {prob_less_than_5}")

# Calculate the probability that Amir has to wait more than 5 minutes, and store in a variable called prob_greater_than_5.
prob_greater_than_5 = 1 - uniform.cdf(5, min_time, max_time)
print(f"The probability that Amir has to wait more than 5 minutes: {prob_greater_than_5}")

# Calculate the probability that Amir has to wait between 10 and 20 minutes, and store in a variable called prob_between_10_and_20.
prob_between_10_and_20 = uniform.cdf(20, min_time, max_time) - uniform.cdf(10, min_time, max_time)
print(f"The probability that Amir has to wait between 10 and 20 minutes: {prob_between_10_and_20}")
The probability that Amir has to wait less than 5 minutes: 0.16666666666666666
The probability that Amir has to wait more than 5 minutes: 0.8333333333333334
The probability that Amir has to wait between 10 and 20 minutes: 0.3333333333333333
Exercise 2.3.2
Simulating wait times
To give Amir a better idea of how long he’ll have to wait, you’ll simulate Amir waiting 1000 times and create a histogram to show him what he should expect. Recall from the last exercise that his minimum wait time is 0 minutes and his maximum wait time is 30 minutes.
Set the random seed to 334.
Import uniform from scipy.stats.
Generate 1000 wait times from the continuous uniform distribution that models Amir’s wait time. Save this as wait_times.
Create a histogram of the simulated wait times and show the plot.
Code
# Set the random seed to 334.
np.random.seed(334)

# Import uniform from scipy.stats.
from scipy.stats import uniform

# Generate 1000 wait times from the continuous uniform distribution that models Amir's wait time. Save this as wait_times.
wait_times = uniform.rvs(0, 30, size=1000)
print("The Wait times Distribution \n")
print(wait_times)

# Create a histogram of the simulated wait times and show the plot.
plt.hist(wait_times)
plt.show()
Chapter 2.4: The binomial distribution
It’s time to further expand your toolbox of distributions. In this lesson, you’ll learn about the binomial distribution.
Coin flipping
We’ll start by flipping a coin, which has two possible outcomes, heads or tails, each with a probability of 50%.
Binary outcomes
This is just one example of a binary outcome, or an outcome with two possible values. We could also represent these outcomes as a 1 and a 0, a success or a failure, and a win or a loss.
A single flip
In Python, we can simulate this by importing binom from scipy.stats and using the binom.rvs function, which takes in the number of coins we want to flip, the probability of heads or success, and an argument called size, which is the number of trials. size is a named argument, so we’ll need to explicitly specify that the third argument corresponds to size, or we’ll get incorrect results. This call will return a 1, which we’ll count as a head, or a 0, which we’ll count as tails. We can call binom.rvs(1, 0.5, size=1) to flip 1 coin, with a 50% probability of heads, 1 time.
One flip many times
To perform eight coin flips, we can change the size argument to 8, which will flip 1 coin with a 50% chance of heads 8 times. This gives us a set of 8 ones and zeros.
Many flips one time
If we swap the first and last arguments, we flip eight coins one time. This gives us one number, which is the total number of heads or successes.
Many flips many times
Similarly, we can pass 3 as the first argument, and set size equal to 10 to flip 3 coins. This returns 10 numbers, each representing the total number of heads from each set of flips.
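A sketch of these four calls side by side (the seed is arbitrary):

from scipy.stats import binom
import numpy as np

np.random.seed(42)
print(binom.rvs(1, 0.5, size=1))    # flip 1 coin once
print(binom.rvs(1, 0.5, size=8))    # flip 1 coin 8 times: eight 0s and 1s
print(binom.rvs(8, 0.5, size=1))    # flip 8 coins once: total number of heads
print(binom.rvs(3, 0.5, size=10))   # flip 3 coins, 10 times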
Other probabilities
We could also have a coin that’s heavier on one side than the other, so the probability of getting heads is only 25%. To simulate flips with this coin, we’ll adjust the second argument of binom.rvs to 0.25. The result has lower numbers, since getting multiple heads isn’t as likely with the new coin.
Binomial distribution
The binomial distribution describes the probability of the number of successes in a sequence of independent trials. In other words, it can tell us the probability of getting some number of heads in a sequence of coin flips. Note that this is a discrete distribution since we’re working with a countable outcome. The binomial distribution can be described using two parameters, n and p. n represents the total number of trials being performed, and p is the probability of success. n and p are also the third and second arguments of binom.rvs. Here’s what the distribution looks like for 10 coins. We have the biggest chance of getting 5 heads total, and a much smaller chance of getting 0 heads or 10 heads.
What’s the probability of 7 heads?
To get the probability of getting 7 heads out of 10 coins, we can use binom.pmf. The first argument is the number of heads or successes. The second argument is the number of trials, n, and the third is the probability of success, p. If we flip 10 coins, there’s about a 12% chance that exactly 7 of them will be heads.
What’s the probability of 7 or fewer heads?
binom.cdf gives the probability of getting a number of successes less than or equal to the first argument. The probability of getting 7 or fewer heads out of 10 coins is about 95%.
What’s the probability of more than 7 heads?
We can take 1 minus the probability of getting 7 or fewer heads to get the probability of a number of successes greater than the first argument.
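The three probability calls from these slides, in one place:

from scipy.stats import binom

print(binom.pmf(7, 10, 0.5))       # P(exactly 7 heads out of 10), about 0.117
print(binom.cdf(7, 10, 0.5))       # P(7 or fewer heads), about 0.945
print(1 - binom.cdf(7, 10, 0.5))   # P(more than 7 heads), about 0.055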
Expected value
The expected value of the binomial distribution can be calculated by multiplying n times p. The expected number of heads we’ll get from flipping 10 coins is 10 times 0.5, which is 5.
Independence
It’s important to remember that in order for the binomial distribution to apply, each trial must be independent, so the outcome of one trial shouldn’t have an effect on the next. For example, if we’re picking randomly from these cards with zeros and ones, we have a 50-50 chance of getting a 0 or a 1. But since we’re sampling without replacement, the probabilities for the second trial are different due to the outcome of the first trial. Since these trials aren’t independent, we can’t calculate accurate probabilities for this situation using the binomial distribution.
Exercise 2.4.1: Simulating sales deals
Assume that Amir usually works on 3 deals per week, and overall, he wins 30% of deals he works on. Each deal has a binary outcome: it’s either lost, or won, so you can model his sales deals with a binomial distribution. In this exercise, you’ll help Amir simulate a year’s worth of his deals so he can better understand his performance.
Import binom from scipy.stats and set the random seed to 10.
Simulate 1 deal worked on by Amir, who wins 30% of the deals he works on.
Simulate a typical week of Amir’s deals, or one week of 3 deals.
Simulate a year’s worth of Amir’s deals, or 52 weeks of 3 deals each, and store in deals.
Print the mean number of deals he won per week.
Code
# Import binom from scipy.stats and set the random seed to 10.
from scipy.stats import binom
np.random.seed(10)

# Simulate 1 deal worked on by Amir, who wins 30% of the deals he works on.
print(f"Outcome of 1 simulated deal (1 = won, 0 = lost): {binom.rvs(1, 0.3, size=1)}")

# Simulate a typical week of Amir's deals, or one week of 3 deals.
print(f"Deals won in 1 simulated week of 3 deals: {binom.rvs(3, 0.3, size=1)}")

# Simulate a year's worth of Amir's deals, or 52 weeks of 3 deals each, and store in deals.
# Print the mean number of deals he won per week.
deals = binom.rvs(3, 0.3, size=52)
print(f"The mean number of deals he won per week: {np.mean(deals)}")
Outcome of 1 simulated deal (1 = won, 0 = lost): [1]
Deals won in 1 simulated week of 3 deals: [0]
The mean number of deals he won per week: 0.8461538461538461
Exercise 2.4.2
Just as in the last exercise, assume that Amir wins 30% of deals. He wants to get an idea of how likely he is to close a certain number of deals each week. In this exercise, you’ll calculate what the chances are of him closing different numbers of deals using the binomial distribution.
What’s the probability that Amir closes all 3 deals in a week? Save this as prob_3.
What’s the probability that Amir closes 1 or fewer deals in a week? Save this as prob_less_than_or_equal_1.
What’s the probability that Amir closes more than 1 deal? Save this as prob_greater_than_1.
Code
# What's the probability that Amir closes all 3 deals in a week? Save this as prob_3.
prob_3 = binom.pmf(3, 3, 0.3)
print(f"The probability that Amir closes all 3 deals in a week: {prob_3}")

# What's the probability that Amir closes 1 or fewer deals in a week? Save this as prob_less_than_or_equal_1.
prob_less_than_or_equal_1 = binom.cdf(1, 3, 0.3)
print(f"The probability that Amir closes 1 or fewer deals in a week: {prob_less_than_or_equal_1}")

# What's the probability that Amir closes more than 1 deal? Save this as prob_greater_than_1.
prob_greater_than_1 = 1 - binom.cdf(1, 3, 0.3)
print(f"The probability that Amir closes more than 1 deal: {prob_greater_than_1}")
The probability that Amir closes all 3 deals in a week: 0.027
The probability that Amir closes 1 or fewer deals in a week: 0.784
The probability that Amir closes more than 1 deal: 0.21599999999999997
Exercise 2.4.3: How many sales will be won?
Now Amir wants to know how many deals he can expect to close each week if his win rate changes. Luckily, you can use your binomial distribution knowledge to help him calculate the expected value in different situations. Recall from the lesson that the expected value of a binomial distribution can be calculated by n x p.
Calculate the expected number of sales out of the 3 he works on that Amir will win each week if he maintains his 30% win rate.
Calculate the expected number of sales out of the 3 he works on that he’ll win if his win rate drops to 25%.
Calculate the expected number of sales out of the 3 he works on that he’ll win if his win rate rises to 35%.
Code
# Calculate the expected number of sales out of the 3 he works on that Amir will win each week if he maintains his 30% win rate.
won_30pct = 3 * 0.3
print(f"Expected number of sales won out of 3 deals at a 30% win rate: {won_30pct}")

# Calculate the expected number of sales out of the 3 he works on that he'll win if his win rate drops to 25%.
won_25pct = 3 * 0.25
print(f"Expected number of sales won out of 3 deals at a 25% win rate: {won_25pct}")

# Calculate the expected number of sales out of the 3 he works on that he'll win if his win rate rises to 35%.
won_35pct = 3 * 0.35
print(f"Expected number of sales won out of 3 deals at a 35% win rate: {won_35pct}")
Expected number of sales won out of 3 deals at a 30% win rate: 0.8999999999999999
Expected number of sales won out of 3 deals at a 25% win rate: 0.75
Expected number of sales won out of 3 deals at a 35% win rate: 1.0499999999999998
Key Points:
You learned about the binomial distribution, a fundamental concept in probability that models events with two possible outcomes, such as flipping a coin. Key points included:
Understanding binary outcomes, which can be success/failure, win/loss, or heads/tails, and how these can be represented numerically (1 or 0).
Using the binom.rvs function from scipy.stats to simulate random variables following a binomial distribution. This function requires specifying the number of trials (n), the probability of success (p), and the size, which determines how many times the experiment is run.
The difference between simulating a single trial multiple times and multiple trials in one go was illustrated with coin flips.
Adjusting the probability of success (p) to model biased outcomes, like a weighted coin, and observing how it affects the results.
Calculating probabilities with the binomial distribution using binom.pmf for the probability of a specific number of successes, and binom.cdf for the probability of up to a certain number of successes.
The expected value of a binomial distribution, which is the average number of successes over many trials, can be calculated with n * p.
For example, to calculate the expected number of sales Amir will win each week with different win rates, you used the formula for the expected value in a binomial distribution:
# Expected number won with 30% win rate
won_30pct = 3 * 0.3
print(won_30pct)

# Expected number won with 25% win rate
won_25pct = 3 * 0.25
print(won_25pct)

# Expected number won with 35% win rate
won_35pct = 3 * 0.35
print(won_35pct)
This lesson emphasized the importance of understanding and applying the binomial distribution to model real-world scenarios with binary outcomes, enhancing your ability to analyze and predict the probability of events.
Chapter 2.5: The normal distribution
The next probability distribution we’ll discuss is the normal distribution. It’s one of the most important probability distributions you’ll learn about since a countless number of statistical methods rely on it, and it applies to more real-world situations than the distributions we’ve covered so far.
What is the normal distribution?
The normal distribution looks like this. Its shape is commonly referred to as a “bell curve”. The normal distribution has a few important properties.
Symmetrical
First, it’s symmetrical, so the left side is a mirror image of the right.
Area = 1
Second, just like any continuous distribution, the area beneath the curve is 1.
Curve never hits 0
Third, the probability never hits 0, even if it looks like it does at the tail ends. Only 0.006% of its area is contained beyond the edges of this graph.
Described by mean and standard deviation
The normal distribution is described by its mean and standard deviation. Here is a normal distribution with a mean of 20 and standard deviation of 3, and here is a normal distribution with a mean of 0 and a standard deviation of 1. When a normal distribution has mean 0 and a standard deviation of 1, it’s a special distribution called the standard normal distribution.
Areas under the normal distribution
For the normal distribution, 68% of the area is within 1 standard deviation of the mean. 95% of the area falls within 2 standard deviations of the mean, and 99.7% of the area falls within three standard deviations. This is sometimes called the 68-95-99.7 rule.
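A quick numerical check of the rule using the standard normal distribution:

from scipy.stats import norm

for k in [1, 2, 3]:
    print(norm.cdf(k) - norm.cdf(-k))   # about 0.68, 0.95, and 0.997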
Lots of histograms look normal
There’s lots of real-world data shaped like the normal distribution. For example, here is a histogram of the heights of women that participated in the National Health and Nutrition Examination Survey. The mean height is around 161 centimeters and the standard deviation is about 7 centimeters.
Approximating data with the normal distribution
Since this height data closely resembles the normal distribution, we can take the area under a normal distribution with mean 161 and standard deviation 7 to approximate what percent of women fall into different height ranges.
What percent of women are shorter than 154 cm?
For example, what percent of women are shorter than 154 centimeters? We can answer this using norm.cdf from scipy.stats, which takes the area of the normal distribution less than some number. We pass in the number of interest, 154, followed by the mean and standard deviation of the normal distribution we’re using. This gives us about 16% of women are shorter than 154 centimeters.
What percent of women are taller than 154 cm?
To find the percent of women taller than 154 centimeters, we can take 1 minus the area on the left of 154, which equals the area to the right of 154.
What percent of women are 154-157 cm?
To get the percent of women between 154 and 157 centimeters tall we can take the area below 157 and subtract the area below 154, which leaves us the area between 154 and 157.
What height are 90% of women shorter than?
We can also calculate percentages from heights using norm.ppf. To figure out what height 90% of women are shorter than, we pass 0.9 into norm.ppf along with the same mean and standard deviation we’ve been working with. This tells us that 90% of women are shorter than 170 centimeters tall.
What height are 90% of women taller than?
We can figure out the height 90% of women are taller than, since this is also the height that 10% of women are shorter than. We can take 1 minus 0.9 to get 0.1, which we’ll use as the first argument of norm.ppf.
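The height calculations from these slides, collected into one sketch:

from scipy.stats import norm

# Heights modeled as normal with mean 161 cm and standard deviation 7 cm
print(norm.cdf(154, 161, 7))                           # shorter than 154 cm, about 16%
print(1 - norm.cdf(154, 161, 7))                       # taller than 154 cm
print(norm.cdf(157, 161, 7) - norm.cdf(154, 161, 7))   # between 154 and 157 cm
print(norm.ppf(0.9, 161, 7))                           # 90% are shorter than about 170 cm
print(norm.ppf(1 - 0.9, 161, 7))                       # 90% are taller than this height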
Generating random numbers
Just like with other distributions, we can generate random numbers from a normal distribution using norm.rvs, passing in the distribution’s mean and standard deviation, as well as the sample size we want.
Exercise 2.5.1
Each deal Amir worked on was worth a different amount, stored in the amount column of amir_deals. For this exercise, these amounts are modeled with a normal distribution with a mean of $5000 and a standard deviation of $2000.
Create a histogram with 10 bins to visualize the distribution of the amount. Show the plot.
What’s the probability of Amir closing a deal worth less than $7500?
What’s the probability of Amir closing a deal worth more than $1000?
What’s the probability of Amir closing a deal worth between $3000 and $7000?
What amount will 25% of Amir’s sales be less than?
Code
from scipy.stats import norm

# Create a histogram with 10 bins to visualize the distribution of the amount. Show the plot.
amir_deals['amount'].hist(bins=10)
plt.title("The distribution of Amir's deal amounts")
plt.show()

# What's the probability of Amir closing a deal worth less than $7500?
prob_less_7500 = norm.cdf(7500, 5000, 2000)
print(f"The probability of Amir closing a deal worth less than $7500 is: {prob_less_7500}")

# What's the probability of Amir closing a deal worth more than $1000?
prob_over_1000 = 1 - norm.cdf(1000, 5000, 2000)
print(f"The probability of Amir closing a deal worth more than $1000 is: {prob_over_1000}")

# What's the probability of Amir closing a deal worth between $3000 and $7000?
prob_3000_to_7000 = norm.cdf(7000, 5000, 2000) - norm.cdf(3000, 5000, 2000)
print(f"The probability of Amir closing a deal worth between $3000 and $7000 is: {prob_3000_to_7000}")

# What amount will 25% of Amir's sales be less than?
pct_25 = norm.ppf(0.25, 5000, 2000)
print(f"25% of Amir's sales will be less than: {pct_25}")
The probability of Amir closing a deal worth less than $7500 is: 0.8943502263331446
The probability of Amir closing a deal worth more than $1000 is: 0.9772498680518208
The probability of Amir closing a deal worth between $3000 and $7000 is: 0.6826894921370859
25% of Amir's sales will be less than: 3651.0204996078364
Exercise 2.5.2: Simulating sales under new market conditions
The company’s financial analyst is predicting that next quarter, the worth of each sale will increase by 20% and the volatility, or standard deviation, of each sale’s worth will increase by 30%. To see what Amir’s sales might look like next quarter under these new market conditions, you’ll simulate new sales amounts using the normal distribution and store these in the new_sales DataFrame, which has already been created for you.
Currently, Amir’s average sale amount is $5000. Calculate what his new average amount will be if it increases by 20% and store this in new_mean.
Amir’s current standard deviation is $2000. Calculate what his new standard deviation will be if it increases by 30% and store this in new_sd.
Create a variable called new_sales, which contains 36 simulated amounts from a normal distribution with a mean of new_mean and a standard deviation of new_sd.
Plot the distribution of the new_sales amounts using a histogram and show the plot.
Code
# Currently, Amir's average sale amount is $5000. Calculate what his new average amount will be if it increases by 20% and store this in new_mean.
new_mean = (0.2 * 5000) + 5000

# Amir's current standard deviation is $2000. Calculate what his new standard deviation will be if it increases by 30% and store this in new_sd.
new_sd = (0.3 * 2000) + 2000

# Create a variable called new_sales, which contains 36 simulated amounts from a normal distribution with a mean of new_mean and a standard deviation of new_sd.
new_sales = norm.rvs(new_mean, new_sd, size=36)

# Plot the distribution of the new_sales amounts using a histogram and show the plot.
plt.hist(new_sales)
plt.title("The distribution of the new sales amounts")
plt.show()
Chapter 3: The central limit theorem
Now that you’re familiar with the normal distribution, it’s time to learn about what makes it so important.
Rolling the dice 5 times
Let’s go back to our dice rolling example. We have a Series of the numbers 1 to 6 called die. To simulate rolling the die 5 times, we’ll call die.sample, passing in the size of the sample and setting replace to True. This gives us the results of 5 rolls. Now, we’ll take the mean of the 5 rolls, which gives us 2. If we roll another 5 times and take the mean, we get a different mean. If we do it again, we get another mean.
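A minimal sketch of one such roll-and-average step, assuming die is a pandas Series holding the numbers 1 to 6:

```python
import pandas as pd
import numpy as np

# A Series representing the six faces of a fair die
die = pd.Series([1, 2, 3, 4, 5, 6])

# Roll 5 times by sampling with replacement, then take the mean of the rolls
samp_5 = die.sample(5, replace=True)
print(np.mean(samp_5))
```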
Rolling the dice 5 times 10 times
Let’s repeat this 10 times: we’ll roll 5 times and take the mean. To do this, we’ll use a for loop. We start by creating an empty list called sample_means to hold our means. We loop from 0 to 9 so that the process is repeated 10 times. Inside the loop, we roll 5 times and append the sample’s mean to the sample_means list. This gives us a list of 10 different sample means. Let’s plot these sample means.
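Continuing the die Series from the sketch above, the loop described here could look roughly like this:

```python
import matplotlib.pyplot as plt

sample_means = []

# Repeat the roll-5-and-take-the-mean process 10 times
for i in range(10):
    samp_5 = die.sample(5, replace=True)
    sample_means.append(np.mean(samp_5))

print(sample_means)

# Plot the 10 sample means
plt.hist(sample_means)
plt.show()
```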
Sampling distributions
A distribution of a summary statistic like this is called a sampling distribution. This distribution, specifically, is a sampling distribution of the sample mean.
100 sample means
Now let’s do this 100 times. If we look at the new sampling distribution, its shape somewhat resembles the normal distribution, even though the distribution of the die is uniform.
1000 sample means
Let’s take 1000 means. This sampling distribution more closely resembles the normal distribution.
Central limit theorem (CLT)
This phenomenon is known as the central limit theorem, which states that a sampling distribution will approach a normal distribution as the number of trials increases. In our example, the sampling distribution became closer to the normal distribution as we took more and more sample means. It’s important to note that the central limit theorem only applies when samples are taken randomly and are independent, for example, randomly picking sales deals with replacement.
Standard deviation and the CLT
The central limit theorem, or CLT, applies to other summary statistics as well. If we take the standard deviation of 5 rolls 1000 times, the sample standard deviations are distributed normally, centered around 1.9, which is the distribution’s standard deviation.
Proportions and the CLT
Another statistic that the CLT applies to is proportion. Let’s sample from the sales team 10 times with replacement and see how many draws have Claire as the outcome. In this case, 10% of draws were Claire. If we take another 10 draws, 40% of them are Claire.
Sampling distribution of proportion
If we repeat this 1000 times and plot the distribution of the sample proportions, it resembles a normal distribution centered around 0.25, since Claire’s name was on 25% of the tickets.
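A hedged sketch of this simulation, assuming a hypothetical four-person sales team Series in which Claire holds one of the four tickets:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical team: Claire's name is on 25% of the tickets
sales_team = pd.Series(["Amir", "Brian", "Claire", "Damian"])

sample_props = []
for i in range(1000):
    draws = sales_team.sample(10, replace=True)
    # Proportion of the 10 draws that came up Claire
    sample_props.append((draws == "Claire").mean())

# The sampling distribution of the proportion piles up around 0.25
plt.hist(sample_props)
plt.show()
```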
Mean of sampling distribution
Since these sampling distributions are normal, we can take their mean to get an estimate of a distribution’s mean, standard deviation, or proportion. If we take the mean of our sample means from earlier, we get 3.48. That’s pretty close to the expected value, which is 3.5! Similarly, the mean of the sample proportions of Claires isn’t far off from 0.25. In these examples, we know what the underlying distributions look like, but if we don’t, this can be a useful method for estimating characteristics of an underlying distribution. The central limit theorem also comes in handy when you have a huge population and don’t have the time or resources to collect data on everyone. Instead, you can collect several smaller samples and create a sampling distribution to estimate what the mean or standard deviation is.
Exercise 3.1
Create a histogram of the num_users column of amir_deals and show the plot.
Set the seed to 104.
Take a sample of size 20 with replacement from the num_users column of amir_deals and store it as samp_20.
Take the mean of samp_20.
Repeat this 100 times using a for loop and store as sample_means. This will take 100 different samples and calculate the mean of each.
Convert sample_means into a pd.Series, create a histogram of the sample_means, and show the plot.
Set the random seed to 321. Take 30 samples (with replacement) of size 20 from all_deals['num_users'] and take the mean of each sample. Store the sample means in sample_means_1.
Print mean of sample_means_1.
Print the mean of the num_users column of amir_deals.
Code
# Create a histogram of the num_users column of amir_deals and show the plot.
amir_deals['num_users'].hist()
plt.title("The distribution of the number of users in Amir's deals")
plt.show()

# Set the seed to 104.
np.random.seed(104)

# Take a sample of size 20 with replacement from the num_users column of amir_deals, and take the mean.
samp_20 = amir_deals['num_users'].sample(20, replace=True)

# Take the mean of samp_20.
print(f"The mean of the 20 samples from num_users is: {np.mean(samp_20)}")

# Repeat this 100 times using a for loop and store as sample_means. This will take 100 different samples and calculate the mean of each.
# Set seed to 104
np.random.seed(104)

# Sample 20 num_users with replacement from amir_deals and take the mean
samp_20 = amir_deals['num_users'].sample(20, replace=True)
np.mean(samp_20)

sample_means = []

# Loop 100 times
for i in range(100):
    # Take a sample of 20 num_users
    samp_20 = amir_deals['num_users'].sample(20, replace=True)
    # Calculate the mean of samp_20
    samp_20_mean = np.mean(samp_20)
    # Append samp_20_mean to sample_means
    sample_means.append(samp_20_mean)

print(f"Distribution of sample means (n=20, iterations=100): {sample_means}")

# Convert sample_means into a pd.Series, create a histogram of the sample_means, and show the plot.
sample_means_series = pd.Series(sample_means)
sample_means_series.hist()
plt.title("Distribution of sample means (n=20, iterations=100)")
# Show plot
plt.show()

# Set the random seed to 321.
np.random.seed(321)

# Take 30 samples (with replacement) of size 20 from all_deals['num_users'] and take the mean of each sample. Store the sample means in sample_means_1.
# sample_means_1 = []
# Loop 30 times to take 30 means
# for i in range(30):
#     # Take a sample of size 20 from the num_users column of all_deals with replacement
#     cur_sample = all_deals['num_users'].sample(20, replace=True)
#     # Take the mean of cur_sample
#     cur_mean = np.mean(cur_sample)
#     # Append cur_mean to sample_means_1
#     sample_means_1.append(cur_mean)

# Print mean of sample_means_1
# print(np.mean(sample_means_1))

# Print the mean of the num_users column of amir_deals.
print(f"Amir's average number of users: {amir_deals['num_users'].mean()}")
Amir's average number of users: 37.651685393258425
Expected output:
Overall average number of users: 38.31333333333332
Amir's average number of users: 37.651685393258425
Conclusion:
We can see that Amir’s average number of users is very close to the overall average, so it looks like he’s meeting expectations. Make sure to note this in his performance review!
Chapter 4: The Poisson distribution
In this lesson, we’ll talk about another probability distribution called the Poisson distribution.
Poisson processes
Before we talk about probability, let’s define a Poisson process. A Poisson process is a process where events appear to happen at a certain rate, but completely at random. For example, the number of animals adopted from an animal shelter each week is a Poisson process - we may know that on average there are 8 adoptions per week, but this number can differ randomly. Other examples would be the number of people arriving at a restaurant each hour, or the number of earthquakes per year in California. The time unit, like hours, weeks, or years, is irrelevant as long as it’s consistent.
Poisson distribution
The Poisson distribution describes the probability of some number of events happening over a fixed period of time. We can use the Poisson distribution to calculate the probability of at least 5 animals getting adopted in a week, the probability of 12 people arriving in a restaurant in an hour, or the probability of fewer than 20 earthquakes in California in a year.
Lambda (\(\lambda\))
The Poisson distribution is described by a value called lambda, which represents the average number of events per time period. In the animal shelter example, this would be the average number of adoptions per week, which is 8. This value is also the expected value of the distribution! The Poisson distribution with \(\lambda\) equals 8 looks like this. Notice that it’s a discrete distribution since we’re counting events, and 7 and 8 are the most likely number of adoptions to happen in a week.
Lambda is the distribution’s peak
Lambda changes the shape of the distribution, so a Poisson distribution with \(\lambda\) equals 1, in blue, looks quite different than a Poisson distribution with lambda equals 8, in green, but no matter what, the distribution’s peak is always at its lambda value.
Probability of a single value
Given that the average number of adoptions per week is 8, what’s the probability of 5 adoptions in a week? Just like the other probability distributions, we can import poisson from scipy.stats. We’ll use the poisson.pmf function, passing 5 as the first argument and 8 as the second argument to indicate the distribution’s mean. This gives us about 9%.
Probability of less than or equal to
To get the probability that 5 or fewer adoptions will happen in a week, use the poisson.cdf function, passing in the same numbers. This gives us about 20%.
Probability of greater than
Just like other probability functions you’ve learned about so far, take 1 minus the “less than or equal to 5” probability to get the probability of more than 5 adoptions. There’s an 81% chance that more than 5 adoptions will occur. If the average number of adoptions rises to 10 per week, there will be a 93% chance that more than 5 adoptions will occur.
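Taken together, these three calculations might look like this in code:

```python
from scipy.stats import poisson

# P(exactly 5 adoptions in a week), with an average of 8 per week
print(poisson.pmf(5, 8))        # about 0.09

# P(5 or fewer adoptions in a week)
print(poisson.cdf(5, 8))        # about 0.20

# P(more than 5 adoptions in a week)
print(1 - poisson.cdf(5, 8))    # about 0.81

# The same question if the average rises to 10 adoptions per week
print(1 - poisson.cdf(5, 10))   # about 0.93
```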
Sampling from a Poisson distribution
Just like other distributions, we can take samples from Poisson distributions using poisson-dot-rvs. Here, we’ll simulate 10 different weeks at the animal shelter. In one week, there are 14 adoptions, but only 6 in another.
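A quick sketch of that simulation, assuming the weekly average of 8 adoptions (the seed is arbitrary):

```python
from scipy.stats import poisson
import numpy as np

np.random.seed(1)  # arbitrary seed for reproducibility

# Simulate 10 weeks of adoptions at an average of 8 per week
print(poisson.rvs(8, size=10))
```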
The CLT still applies!
Just like other distributions, the sampling distribution of sample means of a Poisson distribution looks normal with a large number of samples.
Exercise 4.1: Tracking lead responses
Your company uses sales software to keep track of new sales leads. It organizes them into a queue so that anyone can follow up on one when they have a bit of free time. Since the number of lead responses is a countable outcome over a period of time, this scenario corresponds to a Poisson distribution. On average, Amir responds to 4 leads each day. In this exercise, you’ll calculate probabilities of Amir responding to different numbers of leads.
Import poisson from scipy.stats and calculate the probability that Amir responds to 5 leads in a day, given that he responds to an average of 4.
Amir’s coworker responds to an average of 5.5 leads per day. What is the probability that she answers 5 leads in a day?
What’s the probability that Amir responds to 2 or fewer leads in a day?
What’s the probability that Amir responds to more than 10 leads in a day?
Code
# Import poisson from scipy.stats and calculate the probability that Amir responds to 5 leads in a day, given that he responds to an average of 4.
from scipy.stats import poisson

# Probability of 5 responses
prob_5 = poisson.pmf(5, 4)
print(f"The probability that Amir responds to 5 leads in a day: {prob_5}")  # 0.1562934518505317 (15.6%)

# Amir's coworker responds to an average of 5.5 leads per day. What is the probability that she answers 5 leads in a day?
prob_coworker = poisson.pmf(5, 5.5)
print(f"The probability that Amir's coworker answers 5 leads in a day: {prob_coworker}")  # 0.17140068409793663 (17.1%)

# What's the probability that Amir responds to 2 or fewer leads in a day?
prob_2_or_less = poisson.cdf(2, 4)
print(f"The probability that Amir responds to 2 or fewer leads in a day: {prob_2_or_less}")  # 0.23810330555354436 (23.8%)

# What's the probability that Amir responds to more than 10 leads in a day?
prob_over_10 = 1 - poisson.cdf(10, 4)
print(f"The probability that Amir responds to more than 10 leads in a day: {prob_over_10}")  # 0.0028397661205137315 (0.28%)
The probability that Amir responds to 5 leads in a day: 0.1562934518505317
The probability Amir's coworker responds to an average of 5.5 leads per day: 0.17140068409793663
The probability that Amir responds to 2 or fewer leads in a day: 0.23810330555354436
The probability that Amir responds to more than 10 leads in a day: 0.0028397661205137315
Chapter 4.1: More probability distributions
In this lesson, we’ll discuss a few other probability distributions.
Exponential distribution
The first distribution is the exponential distribution, which represents the probability of a certain time passing between Poisson events. We can use the exponential distribution to predict, for example, the probability of more than 1 day between adoptions, the probability of fewer than 10 minutes between restaurant arrivals, and the probability of 6-8 months passing between earthquakes. Just like the Poisson distribution, the time unit doesn’t matter as long as it’s consistent. The exponential distribution uses the same \(\lambda\) value, which represents the rate, that the Poisson distribution does. Note that lambda and rate mean the same value in this context. It’s also continuous, unlike the Poisson distribution, since it represents time.
Customer service requests
For example, let’s say that one customer service ticket is created every 2 minutes. We can rephrase this so it’s in terms of a time interval of one minute, so half of a ticket is created each minute. We’ll use 0.5 as the \(\lambda\) value. The exponential distribution with a rate of one half looks like this.
Lambda in exponential distribution
The rate affects the shape of the distribution and how steeply it declines.
Expected value of exponential distribution
Recall that lambda is the expected value of the Poisson distribution, which measures frequency in terms of rate or number of events. In our customer service ticket example, this means that the expected number of requests per minute is 0.5. The exponential distribution measures frequency in terms of time between events. The expected value of the exponential distribution can be calculated by taking 1 divided by lambda. In our example, the expected time between requests is 1 over one half, which is 2, so there is an average of 2 minutes between requests.
How long until a new request is created?
Similar to other continuous distributions, we can use expon.cdf to calculate probabilities. The probability of waiting less than 1 minute for a new request is calculated using expon.cdf, passing in 1 and setting scale to 2, which gives us about a 40% chance. Note that the scale is 2, the expected time between requests, not the lambda value of 0.5. The probability of waiting more than 4 minutes can be found using 1 minus expon.cdf of 4 with the same scale, giving about a 13% chance. Finally, the probability of waiting between 1 and 4 minutes can be found by taking expon.cdf of 4 and subtracting expon.cdf of 1. There’s about a 47% chance you’ll wait between 1 and 4 minutes.
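In code, with lambda equal to 0.5 requests per minute and therefore scale equal to 2 minutes, those three probabilities could be sketched as:

```python
from scipy.stats import expon

# scipy's expon uses scale = 1 / lambda; here lambda = 0.5, so scale = 2

# P(waiting less than 1 minute for a new request)
print(expon.cdf(1, scale=2))                           # about 0.39

# P(waiting more than 4 minutes)
print(1 - expon.cdf(4, scale=2))                       # about 0.13

# P(waiting between 1 and 4 minutes)
print(expon.cdf(4, scale=2) - expon.cdf(1, scale=2))   # about 0.47
```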
(Student’s) t-distribution
The next distribution is the t-distribution, which is also sometimes called Student’s t-distribution. Its shape is similar to the normal distribution, but not quite the same. If we compare the normal distribution, in blue, with the t-distribution with one degree of freedom, in orange, the t-distribution’s tails are thicker. This means that in a t-distribution, observations are more likely to fall further from the mean.
Degrees of freedom
The t-distribution has a parameter called degrees of freedom, which affects the thickness of the distribution’s tails. Lower degrees of freedom results in thicker tails and a higher standard deviation. As the number of degrees of freedom increases, the distribution looks more and more like the normal distribution.
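A small illustration of the thicker tails using scipy.stats: the probability of landing more than 2 standard units above the center is larger for a t-distribution with few degrees of freedom than for the standard normal, and the gap shrinks as the degrees of freedom grow.

```python
from scipy.stats import norm, t

# P(observation > 2) under the standard normal
print(1 - norm.cdf(2))         # about 0.023

# The same tail probability under t-distributions
print(1 - t.cdf(2, df=1))      # about 0.15 with 1 degree of freedom (thick tails)
print(1 - t.cdf(2, df=30))     # about 0.027, much closer to the normal
```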
Log-normal distribution
The last distribution we’ll discuss is the log-normal distribution. Variables that follow a log-normal distribution have a logarithm that is normally distributed. This results in distributions that are skewed, unlike the normal distribution. There are lots of real-world examples that follow this distribution, such as the length of chess games, blood pressure in adults, and the number of hospitalizations in the 2003 SARS outbreak.
Exercise 4.2: Modeling time between leads
To further evaluate Amir’s performance, you want to know how much time it takes him to respond to a lead after he opens it. On average, he responds to 1 request every 2.5 hours. In this exercise, you’ll calculate probabilities of different amounts of time passing between Amir receiving a lead and sending a response.
Import expon from scipy.stats. What’s the probability it takes Amir less than an hour to respond to a lead?
What’s the probability it takes Amir more than 4 hours to respond to a lead?
What’s the probability it takes Amir 3-4 hours to respond to a lead?
Code
# Import expon from scipy.stats. What's the probability it takes Amir less than an hour to respond to a lead?
from scipy.stats import expon

# Print probability response takes < 1 hour
print(f"The probability it takes Amir less than an hour to respond to a lead: {expon.cdf(1, scale=2.5)}")  # 0.3296799539643607 (32.97%)

# What's the probability it takes Amir more than 4 hours to respond to a lead?
print(f"The probability it takes Amir more than 4 hours to respond to a lead: {1 - expon.cdf(4, scale=2.5)}")  # 0.20189651799465536 (20.2%)

# What's the probability it takes Amir 3-4 hours to respond to a lead?
print(f"The probability it takes Amir 3-4 hours to respond to a lead: {expon.cdf(4, scale=2.5) - expon.cdf(3, scale=2.5)}")  # 0.09929769391754684 (9.93%)
The probability it takes Amir less than an hour to respond to a lead: 0.3296799539643607
The probability it takes Amir more than 4 hours to respond to a lead: 0.20189651799465536
The probability it takes Amir 3-4 hours to respond to a lead: 0.09929769391754684
Chapter 5: Correlation
Welcome to the final chapter of the course, where we’ll talk about correlation and experimental design.
Relationships between two variables
Before we dive in, let’s talk about relationships between numeric variables. We can visualize these kinds of relationships with scatter plots - in this scatterplot, we can see the relationship between the total amount of sleep mammals get and the amount of REM sleep they get. The variable on the x-axis is called the explanatory or independent variable, and the variable on the y-axis is called the response or dependent variable.
Correlation coefficient
We can also examine relationships between two numeric variables using a number called the correlation coefficient. This is a number between -1 and 1, where the magnitude corresponds to the strength of the relationship between the variables, and the sign, positive or negative, corresponds to the direction of the relationship.
Magnitude = strength of relationship
Here’s a scatterplot of 2 variables, x and y, that have a correlation coefficient of 0.99. Since the data points are closely clustered around a line, we can describe this as a near-perfect or very strong relationship. If we know what x is, we’ll have a pretty good idea of what the value of y could be. Here, x and y have a correlation coefficient of 0.75, and the data points are a bit more spread out. In this plot, x and y have a correlation of 0.56 and are therefore moderately correlated. A correlation coefficient around 0.2 would be considered a weak relationship. When the correlation coefficient is close to 0, x and y have no relationship and the scatterplot looks completely random. This means that knowing the value of x doesn’t tell us anything about the value of y.
Sign = direction
The sign of the correlation coefficient corresponds to the direction of the relationship. A positive correlation coefficient indicates that as x increases, y also increases. A negative correlation coefficient indicates that as x increases, y decreases.
Visualizing relationships
To visualize relationships between two variables, we can use a scatterplot. We’ll use seaborn, which is a plotting package built on top of matplotlib. We import seaborn as sns, which is the alias commonly used for seaborn. We create a scatterplot using sns.scatterplot, passing it the name of the variable for the x-axis, the name of the variable for the y-axis, as well as the msleep DataFrame to the data argument. Finally, we call plt.show.
Adding a trendline
We can add a linear trendline to the scatterplot using seaborn’s lmplot() function. It takes the same arguments as sns.scatterplot, but we’ll set ci to None so that there aren’t any confidence interval margins around the line. Trendlines like this can be helpful to more easily see a relationship between two variables.
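Roughly, assuming the msleep DataFrame from the lesson has sleep_total and sleep_rem columns, the two plots could be created like this:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatterplot of total sleep vs. REM sleep
sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()

# The same relationship with a linear trendline and no confidence band
sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)
plt.show()
```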
Computing correlation
To calculate the correlation coefficient between two Series, we can use the .corr method. If we want the correlation between the sleep_total and sleep_rem columns of msleep, we can take the sleep_total column and call .corr on it, passing in the other Series we’re interested in. Note that it doesn’t matter which Series the method is invoked on and which is passed in since the correlation between x and y is the same thing as the correlation between y and x.
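For instance, continuing with the same assumed msleep columns, the two orderings give identical results:

```python
# Correlation is symmetric, so either Series can call .corr on the other
r1 = msleep["sleep_total"].corr(msleep["sleep_rem"])
r2 = msleep["sleep_rem"].corr(msleep["sleep_total"])
print(r1, r2)  # the two values are the same
```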
Many ways to calculate correlation
There’s more than one way to calculate correlation, but the method we’ve been using in this video is called the Pearson product-moment correlation, which is also written as r. This is the most commonly used measure of correlation. Mathematically, it’s calculated using the formula shown below, where \(\bar{x}\) and \(\bar{y}\) are the means of \(x\) and \(y\), and \(\sigma_x\) and \(\sigma_y\) are the standard deviations of \(x\) and \(y\). The formula itself isn’t important to memorize, but know that there are variations of this formula that measure correlation a bit differently, such as Kendall’s tau (\(\tau\)) and Spearman’s rho (\(\rho\)), but those are beyond the scope of this course.
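One standard way to write the Pearson correlation, using the population standard deviations in the denominator, is:

$$
r = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \, \sigma_y}
$$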
Exercise 5.1: Relationships between variables
In this chapter, you’ll be working with a dataset world_happiness containing results from the 2019 World Happiness Report. The report scores various countries based on how happy people in that country are. It also ranks each country on various societal aspects such as social support, freedom, corruption, and others. The dataset also includes the GDP per capita and life expectancy for each country.
Create a scatterplot of happiness_score vs. life_exp (without a trendline) using seaborn.
Create a scatterplot of happiness_score vs. life_exp with a linear trendline using seaborn, setting ci to None.
Based on the scatterplot, which is most likely the correlation between life_exp and happiness_score ?
Code
# Import the seaborn package
import seaborn as sns

# Import dataset
world_happiness = pd.read_csv("datasets/world_happiness.csv")

# Create a scatterplot of happiness_score vs. life_exp (without a trendline) using seaborn.
sns.scatterplot(x='life_exp', y='happiness_score', data=world_happiness)
plt.title("Scatterplot of happiness score vs. life expectancy (without a trendline)")
# Show plot
plt.show()

# Create a scatterplot of happiness_score vs. life_exp with a linear trendline using seaborn, setting ci to None.
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None)
plt.title("Scatterplot of happiness score vs. life expectancy (with a trendline)")
# Show plot
plt.show()

# Based on the scatterplot, which is most likely the correlation between life_exp and happiness_score?
corr_happy_life = world_happiness['happiness_score'].corr(world_happiness['life_exp'])
print(f"The correlation between life_exp and happiness_score: {corr_happy_life}")  # 0.7802249053272062
The correlation between life_exp and happiness_score: 0.7802249053272062
Chapter 5.1: Correlation caveats
While correlation is a useful way to quantify relationships, there are some caveats.
Non-linear relationships
Consider this data: there is clearly a relationship between x and y, but when we calculate the correlation, we get 0.18.
Non-linear relationships
This is because the relationship between the two variables is a quadratic relationship, not a linear relationship. The correlation coefficient measures the strength of linear relationships, and linear relationships only.
Correlation only accounts for linear relationships
Just like any summary statistic, correlation shouldn’t be used blindly, and you should always visualize your data when possible.
Mammal sleep data
Let’s return to the mammal sleep data.
Body weight vs. awake time
Here’s a scatterplot of each mammal’s body weight versus the time they spend awake each day. The relationship between these variables is definitely not a linear one. The correlation between body weight and awake time is only about 0.3, which is a weak linear relationship.
Distribution of body weight
If we take a closer look at the distribution for bodywt, it’s highly skewed. There are lots of lower weights and a few weights that are much higher than the rest.
Log transformation
When data is highly skewed like this, we can apply a log transformation. We’ll create a new column called log_bodywt which holds the log of each body weight. We can do this using np.log. If we plot the log of bodyweight versus awake time, the relationship looks much more linear than the one between regular bodyweight and awake time. The correlation between the log of bodyweight and awake time is about 0.57, which is much higher than the 0.3 we had before.
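As a short sketch, assuming msleep has bodywt and awake columns as described:

```python
import numpy as np

# Add a column holding the log of each body weight
msleep["log_bodywt"] = np.log(msleep["bodywt"])

# Correlation with awake time before and after the transformation
print(msleep["bodywt"].corr(msleep["awake"]))      # about 0.3
print(msleep["log_bodywt"].corr(msleep["awake"]))  # about 0.57
```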
Other transformations
In addition to the log transformation, there are lots of other transformations that can be used to make a relationship more linear, like taking the square root or reciprocal of a variable. The choice of transformation will depend on the data and how skewed it is. These can be applied in different combinations to x and y, for example, you could apply a log transformation to both x and y, or a square root transformation to x and a reciprocal transformation to y.
Why use a transformation?
So why use a transformation? Certain statistical methods rely on variables having a linear relationship, like calculating a correlation coefficient. Linear regression is another statistical technique that requires variables to be related in a linear manner, which you can learn all about in this course.
Correlation does not imply causation
Let’s talk about one more important caveat of correlation that you may have heard about before: correlation does not imply causation. This means that if x and y are correlated, x doesn’t necessarily cause y. For example, here’s a scatterplot of the per capita margarine consumption in the US each year and the divorce rate in the state of Maine. The correlation between these two variables is 0.99, which is nearly perfect. However, this doesn’t mean that consuming more margarine will cause more divorces. This kind of correlation is often called a spurious correlation.
Confounding
A phenomenon called confounding can lead to spurious correlations. Let’s say we want to know if drinking coffee causes lung cancer. Looking at the data, we find that coffee drinking and lung cancer are correlated, which may lead us to think that drinking more coffee will give you lung cancer. However, there is a third, hidden variable at play, which is smoking. Smoking is known to be associated with coffee consumption. It is also known that smoking causes lung cancer. In reality, it turns out that coffee does not cause lung cancer and is only associated with it, but it appeared causal due to the third variable, smoking. This third variable is called a confounder, or lurking variable. This means that the relationship of interest between coffee and lung cancer is a spurious correlation. Another example of this is the relationship between holidays and retail sales. While it might be that people buy more around holidays as a way of celebrating, it’s hard to tell how much of the increased sales is due to holidays, and how much is due to the special deals and promotions that often run around holidays. Here, special deals confound the relationship between holidays and sales.
Note
The correlation coefficient can’t account for any relationships that aren’t linear, regardless of strength.
Exercise 5.2
Create a scatterplot of happiness_score versus gdp_per_cap and calculate the correlation between them.
Add a new column to world_happiness called log_gdp_per_cap that contains the log of gdp_per_cap.
Create a seaborn scatterplot of happiness_score versus log_gdp_per_cap.
Calculate the correlation between log_gdp_per_cap and happiness_score.
Code
# Create a scatterplot of happiness_score versus gdp_per_cap and calculate the correlation between them.
sns.scatterplot(x='gdp_per_cap', y='happiness_score', data=world_happiness)
plt.title("Scatterplot of happiness score versus GDP per capita")
plt.show()

# Calculate correlation
cor = world_happiness['happiness_score'].corr(world_happiness['gdp_per_cap'])
print(f"The correlation between happiness_score and gdp_per_cap is: {cor}")

# Add a new column to world_happiness called log_gdp_per_cap that contains the log of gdp_per_cap.
world_happiness['log_gdp_per_cap'] = np.log(world_happiness['gdp_per_cap'])

# Create a seaborn scatterplot of happiness_score versus log_gdp_per_cap.
sns.scatterplot(x='log_gdp_per_cap', y='happiness_score', data=world_happiness)
plt.title("Scatterplot of happiness score versus the log of GDP per capita")
plt.show()

# Calculate the correlation between log_gdp_per_cap and happiness_score.
cor_1 = world_happiness['happiness_score'].corr(world_happiness['log_gdp_per_cap'])
print(f"The correlation between happiness_score and log of gdp_per_cap is: {cor_1}")
The correlation between happiness_score and gdp_per_cap is: 0.7279733012222976
The correlation between happiness_score and log of gdp_per_cap is: 0.804314600491829
Chapter 6: Design of experiments
Often, data is created as a result of a study that aims to answer a specific question. However, data needs to be analyzed and interpreted differently depending on how the data was generated and how the study was designed.
Vocabulary
Experiments generally aim to answer a question in the form, “What is the effect of the treatment on the response?” In this setting, treatment refers to the explanatory or independent variable, and response refers to the response or dependent variable. For example, what is the effect of an advertisement on the number of products purchased? In this case, the treatment is an advertisement, and the response is the number of products purchased.
Controlled experiments
In a controlled experiment, participants are randomly assigned to either the treatment group or the control group, where the treatment group receives the treatment and the control group does not. A great example of this is an A/B test. In our example, the treatment group will see an advertisement, and the control group will not. Other than this difference, the groups should be comparable so that we can determine if seeing an advertisement causes people to buy more. If the groups aren’t comparable, this could lead to confounding, or bias. If the average age of participants in the treatment group is 25 and the average age of participants in the control group is 50, age could be a potential confounder if younger people are more likely to purchase more, and this will make the experiment biased towards the treatment.
The gold standard of experiments will use…
The gold standard, or ideal experiment, will eliminate as much bias as possible by using certain tools. The first tool to help eliminate bias in controlled experiments is a randomized controlled trial. In a randomized controlled trial, participants are randomly assigned to the treatment or control group, and their assignment isn’t based on anything other than chance. Random assignment like this helps ensure that the groups are comparable. The second tool is a placebo, which is something that resembles the treatment but has no effect. This way, participants don’t know if they’re in the treatment or control group. This ensures that the effect of the treatment is due to the treatment itself, not the idea of getting the treatment. This is common in clinical trials that test the effectiveness of a drug. The control group will still be given a pill, but it’s a sugar pill that has minimal effects on the response. In a double-blind experiment, the person administering the treatment or running the experiment also doesn’t know whether they’re administering the actual treatment or the placebo. This protects against bias in the response as well as the analysis of the results. These different tools all boil down to the same principle: the fewer opportunities there are for bias to creep into your experiment, the more reliably you can conclude whether the treatment affects the response.
Observational studies
The other kind of study we’ll discuss is the observational study. In an observational study, participants are not randomly assigned to groups. Instead, participants assign themselves, usually based on pre-existing characteristics. This is useful for answering questions that aren’t conducive to a controlled experiment. If you want to study the effect of smoking on cancer, you can’t force people to start smoking. Similarly, if you want to study how past purchasing behavior affects whether someone will buy a product, you can’t force people to have certain past purchasing behavior. Because assignment isn’t random, there’s no way to guarantee that the groups will be comparable in every aspect, so observational studies can’t establish causation, only association. The effects of the treatment may be confounded by factors that got certain people into the control group and certain people into the treatment group. However, there are ways to control for confounders, which can help strengthen the reliability of conclusions about association.
Longitudinal vs. cross-sectional studies
The final important distinction to make is between longitudinal and cross-sectional studies. In a longitudinal study, the same participants are followed over a period of time to examine the effect of treatment on the response. In a cross-sectional study, data is collected from a single snapshot in time. If you wanted to investigate the effect of age on height, a cross-sectional study would measure the heights of people of different ages and compare them. However, the results will be confounded by birth year and lifestyle since it’s possible that each generation is getting taller. In a longitudinal study, the same people would have their heights recorded at different points in their lives, so the confounding is eliminated. It’s important to note that longitudinal studies are more expensive and take longer to perform, while cross-sectional studies are cheaper, faster, and more convenient.
Source Code
---title: "Introduction to Statistics in Python"author: - name: "Lawal's Note" affiliation: "Associate Data Science Course in Python by DataCamp Inc"date: "2025-09-26"toc: trueunnumbered: truehighlight-style: pygmentsformat: html: css: global/style/style.css code-fold: true code-tools: true pdf: geometry: - top=30mm - left=20mm fig-width: 4 fig-height: 3 pdf-engine: xelatex docx: defaultexecute: warning: false echo: true eval: true output: true error: false cache: false include: truejupyter: python3---# Chapter 1 {.unnumbered}## Chapter 1.1: Data type classification {#sec-C1.1}In the course , you will learn about two main types of data: numeric and categorical data. Numeric variables can be classified as either discrete or continuous, and categorical variables can be classified as either nominal or ordinal. These characteristics of a variable determine which ways of summarizing your data will work best.### Measure of Center {.unnumbered}In this lesson, we'll begin to discuss summary statistics, some of which you may already be familiar with, like mean and median.### Mammal sleep data {.unnumbered}In this lesson, we'll look at data about different mammals' sleep habits.### Histograms {.unnumbered}Before we dive in, let's remind ourselves how histograms work. A histogram takes a bunch of data points and separates them into bins, or ranges of values. Here, there's a bin for 0 to 2 hours, 2 to 4 hours, and so on. The heights of the bars represent the number of data points that fall into that bin, so there's one mammal in the dataset that sleeps between 0 to 2 hours, and nine mammals that sleep two to four hours. Histograms are a great way to visually summarize the data, but we can use numerical summary statistics to summarize even further.### How long do mammals in this dataset typically sleep? {.unnumbered}One way we could summarize the data is by answering the question, How long do mammals in this dataset typically sleep? To answer this, we need to figure out what the "typical" or "center" value of the data is. We'll discuss three different definitions, or measures, of center: mean, median, and mode.### Measures of center: mean {.unnumbered}The mean, often called the average, is one of the most common ways of summarizing data. To calculate mean, we add up all the numbers of interest and divide by the total number of data points, which is 83 here. This gives us 10-point-43 hours of sleep. In Python, we can use numpy's mean function, passing it the variable of interest.### Measures of center: median {.unnumbered}Another measure of center is the median. The median is the value where 50% of the data is lower than it, and 50% of the data is higher. We can calculate this by sorting all the data points and taking the middle one, which would be index 41 in this case. This gives us a median of 10-point-1 hours of sleep. In Python, we can use `np.median` to do the calculations for us.### Measures of center: mode {.unnumbered}The mode is the most frequent value in the data. If we count how many occurrences there are of each `sleep_total` and sort in descending order, there are 4 mammals that sleep for 12.5 hours, so this is the mode. The mode of the vore variable, which indicates the animal's diet, is herbivore. We can also find the mode using the mode function from the statistics module. 
Mode is often used for categorical variables, since categorical variables can be unordered and often don't have an inherent numerical representation.### Adding an outlier {.unnumbered}Now that we have lots of ways to measure center, how do we know which one to use? Let's look at an example. Here, we have all of the insectivores in the dataset. We get a mean sleep time of 16.5 hours and a median sleep time of 18.9 hours. Now let's say we've discovered a new mystery insectivore that never sleeps. If we take the mean and median again, we get different results. The mean went down by more than 3 hours, while the median changed by less than an hour. This is because the mean is much more sensitive to extreme values than the median.### Which measure to use? {.unnumbered}Since the mean is more sensitive to extreme values, it works better for symmetrical data like this. Notice that the mean, in black, and median, in red, are quite close.### Skew {.unnumbered}However, if the data is skewed, meaning it's not symmetrical, like this, median is usually better to use. In this histogram, the data is piled up on the right, with a tail on the left. Data that looks like this is called left-skewed data. When data is piled up on the left with a tail on the right, it's right-skewed.### Which measure to use? {.unnumbered}When data is skewed, the mean and median are different. The mean is pulled in the direction of the skew, so it's lower than the median on the left-skewed data, and higher than the median on the right-skewed data. Because the mean is pulled around by the extreme values, it's better to use the median since it's less affected by outliers.### Exercise 1.1 {.unnumbered}#### Only show final grouped result {.unnumbered}```pythonresult = be_and_usa.groupby('country')['consumption'].agg(['mean', 'median'])result```#### Mean and median {.unnumbered}In this chapter, you'll be working with the 2018 Food Carbon Footprint Index from nu3. The `food_consumption` dataset contains information about the kilograms of food consumed per person per year in each country in each food category (consumption) as well as information about the carbon footprint of that food category (co2_emissions) measured in kilograms of carbon dioxide, or CO2, per person per year in each country.In this exercise, you'll compute measures of center to compare food consumption in the US and Belgium using your pandas and numpy skills.1. Import `numpy` with the alias `np`.2. Create two DataFrames: one that holds the rows of `food_consumption` for 'Belgium' and another that holds rows for 'USA'. Call these `be_consumption` and `usa_consumption`.3. 
Calculate the mean and median of kilograms of food consumed per person per year for both countries.```{python}import warningswarnings.filterwarnings("ignore")# Import numpy with alias npimport numpy as npimport pandas as pd# Importing the datasetfood_consumption = pd.read_csv("datasets/food_consumption.csv")# Filter for Belgiumbe_consumption = food_consumption[food_consumption['country'] =='Belgium']# ORbe_consumption = food_consumption.set_index('country').loc[['Belgium']]# Filter for USAusa_consumption = food_consumption[food_consumption['country'] =="USA"]# ORusa_consumption = food_consumption.set_index('country').loc[['USA']]# Calculate mean and median consumption in Belgiumprint(f"Mean consumption in Belgium is: {np.mean(be_consumption['consumption']):.4f}")print(f"Median consumption in Belgium is: {np.median(be_consumption['consumption']):.4f}")# Calculate mean and median consumption in USAprint(f"Mean consumption in USA is: {np.mean(usa_consumption['consumption']):.4f}")print(f"Median consumption in USA is: {np.median(usa_consumption['consumption']):.4f}")# Subset food_consumption for rows with data about Belgium and the USA.# Group the subsetted data by country and select only the consumption column.# Calculate the mean and median of the kilograms of food consumed per person per year in each country using .agg().# Subset for Belgium and USA onlybe_and_usa = food_consumption[(food_consumption['country'] =='Belgium') | (food_consumption['country'] =='USA')]# Group by country, select consumption column, and compute mean and medianprint("\n The mean and median of the kilograms of food consumed per person per year in Belgium and USA using .agg()")print(be_and_usa.groupby('country')['consumption'].agg([np.mean, np.median]))```### Exercise 1.2 {.unnumbered}1. Import `matplotlib.pyplot` with the alias `plt`.2. Subset food_consumption to get the rows where `food_category` is 'rice'.3. Create a histogram of `co2_emission` for rice and show the plot.4. Use `.agg()` to calculate the mean and median of `co2_emission` for rice.```pythonresult = be_and_usa.groupby('country')['consumption'].agg(['mean', 'median'])result```#### Mean and median {.unnumbered}```{python}import warningswarnings.filterwarnings("ignore")# Import matplotlib.pyplot with the alias plt.import matplotlib.pyplot as plt# Subset food_consumption to get the rows where food_category is 'rice'.rice_consumption = food_consumption[food_consumption['food_category'] =='rice']# Create a histogram of co2_emission for rice and show the plot.rice_consumption.hist(column='co2_emission')# orrice_consumption['co2_emission'].hist()plt.show()# Use .agg() to calculate the mean and median of co2_emission for rice.mean_rice, median_rice = mean_rice, median_rice = rice_consumption['co2_emission'].agg([np.mean, np.median])print(f"Mean rice CO2 Emission: {mean_rice}")print(f"Median rice CO2 Emission: {median_rice}")```## Chapter 1.2: Measure of Dispersion {.unnumbered}### Measures of spread {.unnumbered}In this lesson, we'll talk about another set of summary statistics: measures of spread.### What is spread? {.unnumbered}Spread is just what it sounds like - it describes how spread apart or close together the data points are. 
Just like measures of center, there are a few different measures of spread.### Variance {.unnumbered}The first measure, variance, measures the average distance from each data point to the data's mean.### Calculating variance {.unnumbered}To calculate the variance, we start by calculating the distance between each point and the mean, so we get one number for every data point. We then square each distance and then add them all together. Finally, we divide the sum of squared distances by the number of data points minus 1, giving us the variance. The higher the variance, the more spread out the data is. It's important to note that the units of variance are squared, so in this case, it's 19.8 hours squared. We can calculate the variance in one step using `np.var`, setting the `ddof` argument to `1`. If we don't specify `ddof` equals `1`, a slightly different formula is used to calculate variance that should only be used on a full population, not a sample.### Standard deviation {.unnumbered}The standard deviation is another measure of spread, calculated by taking the square root of the variance. It can be calculated using `np.std`. Just like `np.var`, we need to set `ddof` to `1`. The nice thing about standard deviation is that the units are usually easier to understand since they're not squared. It's easier to wrap your head around 4 and a half hours than 19.8 hours squared.### Mean absolute deviation {.unnumbered}Mean absolute deviation takes the absolute value of the distances to the mean, and then takes the mean of those differences. While this is similar to standard deviation, it's not exactly the same. Standard deviation squares distances, so longer distances are penalized more than shorter ones, while mean absolute deviation penalizes each distance equally. One isn't better than the other, but SD is more common than MAD.### Quantiles {.unnumbered}Before we discuss the next measure of spread, let's quickly talk about quantiles. Quantiles, also called percentiles, split up the data into some number of equal parts. Here, we call `np.quantile`, passing in the column of interest, followed by 0.5. This gives us 10.1 hours, so 50% of mammals in the dataset sleep less than 10.1 hours a day, and the other 50% sleep more than 10.1 hours, so this is exactly the same as the median. We can also pass in a list of numbers to get multiple quantiles at once. Here, we split the data into 4 equal parts. These are also called quartiles. This means that 25% of the data is between 1.9 and 7.85, another 25% is between 7.85 and 10.10, and so on.### Boxplots use quartiles {.unnumbered}The boxes in box plots represent quartiles. The bottom of the box is the first quartile, and the top of the box is the third quartile. The middle line is the second quartile, or the median.### Quantiles using np.linspace() {.unnumbered}Here, we split the data in five equal pieces, but we can also use `np.linspace` as a shortcut, which takes in the starting number, the stopping number, and the number intervals. We can compute the same quantiles using `np.linspace starting at zero, stopping at one, splitting into 5 different intervals.### Interquartile range (IQR) {.unnumbered}The interquartile range, or IQR, is another measure of spread. It's the distance between the 25th and 75th percentile, which is also the height of the box in a boxplot. 
We can calculate it using the quantile function, or using the `iqr` function from `scipy.stats` to get 5.9 hours.### Outliers {.unnumbered}Outliers are data points that are substantially different from the others. But how do we know what a substantial difference is ? A rule that's often used is that any data point less than the first quartile minus 1.5 times the IQR is an outlier, as well as any point greater than the third quartile plus 1.5 times the IQR.### Finding outliers {.unnumbered}To find outliers, we'll start by calculating the IQR of the mammals' body weights. We can then calculate the lower and upper thresholds following the formulas from the previous slide. We can now subset the DataFrame to find mammals whose body weight is below or above the thresholds. There are eleven body weight outliers in this dataset, including the cow and the Asian elephant.15. All in one go {.unnumbered}Many of the summary statistics we've covered so far can all be calculated in just one line of code using the `.describe` method, so it's convenient to use when you want to get a general sense of your data.### Exercise 1.2.1 {.unnumbered}1. Calculate the quartiles of the `co2_emission` column of `food_consumption`.2. Calculate the six quantiles that split up the data into 5 pieces (quintiles) of the `co2_emission` column of `food_consumption`.3. Calculate the eleven quantiles of `co2_emission` that split up the data into ten pieces (deciles).4. Calculate the variance and standard deviation of `co2_emission` for each food_category by grouping and aggregating.5. Import matplotlib.pyplot with alias plt.6. Create a histogram of `co2_emission` for the beef `food_category` and show the plot.7. Create a histogram of `co2_emission` for the eggs `food_category` and show the plot.```{python}import warningswarnings.filterwarnings("ignore")# Calculate the quartiles of the co2_emission column of food_consumption.print("The quartiles of the co2_emission column of food_consumption: \n")print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 5)))# Calculate the six quantiles that split up the data into 5 pieces (quintiles) of the co2_emission column of food_consumption.print("The quintiles of the co2_emission column of food_consumption: \n")print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 6)))# Calculate the eleven quantiles of co2_emission that split up the data into ten pieces (deciles).print("The 11 quantiles of the co2_emission column of food_consumption: \n")print(np.quantile(food_consumption['co2_emission'], np.linspace(0, 1, 11)))# Calculate the variance and standard deviation of co2_emission for each food_category by grouping and aggregating.print("The variance and standard deviation of co2_emission for each food_category by grouping and aggregating \n")print(food_consumption.groupby('food_category')['co2_emission'].agg([np.var, np.std]))# Import matplotlib.pyplot with alias plt.import matplotlib.pyplot as plt# Create a histogram of co2_emission for the beef food_category and show the plot.food_consumption[food_consumption['food_category'] =='beef'].hist(column ='co2_emission')plt.title("Histogram of CO2 emission for the Beef food category")plt.show()# orfood_consumption[food_consumption['food_category'] =='beef']['co2_emission'].hist()plt.show()# Create a histogram of co2_emission for the eggs food_category and show the plot.food_consumption[food_consumption['food_category'] =='eggs'].hist(column ='co2_emission')plt.title("Histogram of CO2 emission for the Egg food 
category")plt.show()# orfood_consumption[food_consumption['food_category'] =='eggs']['co2_emission'].hist()plt.show()```### Exercise 1.2.2 {.unnumbered}#### Finding outliers using IQR {.unnumbered}Outliers can have big effects on statistics like mean, as well as statistics that rely on the mean, such as variance and standard deviation. Interquartile range, or IQR, is another way of measuring spread that's less influenced by outliers. IQR is also often used to find outliers. The outlier rule states that values are considered outliers if $x < Q_1 - 1.5 \times IQR$ or $x > Q_3 + 1.5 \times IQR$. In fact, this is how the lengths of the whiskers in a matplotlib box plot are calculated.1. Calculate the total `co2_emission` per country by grouping by country and taking the sum of `co2_emission`. Store the resulting DataFrame as `emissions_by_country`.2. Compute the first and third quartiles of `emissions_by_country` and store these as `q1` and `q3`.3. Calculate the interquartile range of `emissions_by_country` and store it as `iqr`.4. Calculate the lower and upper cutoffs for outliers of `emissions_by_country`, and store these as `lower` and `upper`.5. Subset `emissions_by_country` to get countries with a total emission greater than the upper cutoff or a total emission less than the lower cutoff.```{python}# Calculate the total co2_emission per country by grouping by country and taking the sum of co2_emission. Store the resulting DataFrame as emissions_by_country.emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()print("The Total CO2 Emission by Country \n")print(emissions_by_country)# Compute the first and third quartiles of emissions_by_country and store these as q1 and q3.q1 = np.quantile(emissions_by_country, 0.25)q3 = np.quantile(emissions_by_country, 0.75)# Calculate the interquartile range of emissions_by_country and store it as iqr.iqr = q3 - q1# Calculate the lower and upper cutoffs for outliers of emissions_by_country, and store these as lower and upper.lower = q1 -1.5* iqrupper = q3 +1.5* iqr# Subset emissions_by_country to get countries with a total emission greater than the upper cutoff or a total emission less than the lower cutoff.outliers = emissions_by_country[(emissions_by_country < lower) | (emissions_by_country > upper)]print("The Outliers (countries with a total emission greater than the upper cutoff or a total emission less than the lower cutoff) are")print(outliers)```# Chapter 2 {.unnumbered}## Chapter 2.1: Random Numbers and Probability {.unnumbered}### What are the chances? {.unnumbered}People talk about chance pretty frequently, like what are the chances of closing a sale, of rain tomorrow, or of winning a game? But how exactly do we measure chance?### Measuring chance {.unnumbered}We can measure the chances of an event using probability. We can calculate the probability of some event by taking the number of ways the event can happen and dividing it by the total number of possible outcomes. For example, if we flip a coin, it can land on either heads or tails. To get the probability of the coin landing on heads, we divide the 1 way to get heads by the two possible outcomes, heads and tails. This gives us one half, or a fifty percent chance of getting heads. Probability is always between zero and 100 percent. If the probability of something is zero, it's impossible, and if the probability of something is 100%, it will certainly happen.### Assigning salespeople {.unnumbered}Let's look at a more complex scenario. 
There's a meeting coming up with a potential client, and we want to send someone from the sales team to the meeting. We'll put each person's name on a ticket in a box and pull one out randomly to decide who goes to the meeting. Brian's name gets pulled out. The probability of Brian being selected is one out of four, or 25%.### Sampling from a DataFrame {.unnumbered}We can recreate this scenario in Python using the `sample()` method. By default, it randomly samples one row from the DataFrame. However, if we run the same thing again, we may get a different row since the sample method chooses randomly. If we want to show the team how we picked Brian, this won't work well.### Setting a random seed {.unnumbered}To ensure we get the same results when we run the script in front of the team, we'll set the random seed using `np.random.seed`. The seed is a number that Python's random number generator uses as a starting point, so if we orient it with a seed number, it will generate the same random value each time. The number itself doesn't matter. We could use 5, 139, or 3 million. The only thing that matters is that we use the same seed the next time we run the script. Now, we, or one of the sales-team members, can run this code over and over and get Brian every time.### A second meeting {.unnumbered}Now there's another potential client who wants to meet at the same time, so we need to pick another salesperson. Brian haas already been picked and he can't be in two meetings at once, so we'll pick between the remaining three. This is called **sampling without replacement**, since we aren't replacing the name we already pulled out. This time, Claire is picked, and the probability of this is one out of three, or about 33%.### Sampling twice in Python {.unnumbered}To recreate this in Python, we can pass 2 into the sample method, which will give us 2 rows of the DataFrame.### Sampling with replacement {.unnumbered}Now let's say the two meetings are happening on different days, so the same person could attend both. In this scenario, we need to return Brian's name to the box after picking it. This is called **sampling with replacement**. Claire gets picked for the second meeting, but this time, the probability of picking her is 25%.### Sampling with/without replacement in Python {.unnumbered}To sample with replacement, set the `replace` argument to `True`, so names can appear more than once. If there were 5 meetings, all at different times, it's possible to pick some rows multiple times since we're replacing them each time.### Independent events {.unnumbered}Let's quickly talk about independence. Two events are independent if the probability of the second event isn't affected by the outcome of the first event. For example, if we're sampling with replacement, the probability that Claire is picked second is 25%, no matter who gets picked first. In general, when sampling with replacement, each pick is independent.### Dependent events {.unnumbered}Similarly, events are considered dependent when the outcome of the first changes the probability of the second. If we sample without replacement, the probability that Claire is picked second depends on who gets picked first. If Claire is picked first, there's 0% probability that Claire will be picked second. If someone else is picked first, there's a 33% probability Claire will be picked second. 
## Exercise 2.1.1: Calculating probabilities {#sec-E2.1.1}

You're in charge of the sales team, and it's time for performance reviews, starting with Amir. As part of the review, you want to randomly select a few of the deals that he's worked on over the past year so that you can look at them more deeply. Before you start selecting deals, you'll first figure out what the chances are of selecting certain deals.

1. Count the number of deals Amir worked on for each product type and store in `counts`.
2. Calculate the probability of selecting a deal for the different product types by dividing the counts by the total number of deals Amir worked on. Save this as `probs`.

```{python}
# Importing the dataset
amir_deals = pd.read_csv("datasets/amir_deals.csv")

# Count the number of deals Amir worked on for each product type and store in counts.
counts = amir_deals['product'].value_counts()
# or
counts = amir_deals.value_counts('product')
print(f"The number of deals Amir worked on for each product type: {counts}")

# Calculate the probability of selecting a deal for the different product types by dividing the counts by the total number of deals Amir worked on. Save this as probs.
probs = amir_deals['product'].value_counts() / 178
print(f"The probability of selecting a deal for the different product types: {probs}")
```

## Exercise 2.1.2: Sampling deals

In the previous exercise @sec-E2.1.1, you counted the deals Amir worked on. Now it's time to randomly pick five deals so that you can reach out to each customer and ask if they were satisfied with the service they received. You'll try doing this both with and without replacement.

Additionally, you want to make sure this is done randomly and that it can be reproduced in case you get asked how you chose the deals, so you'll need to set the random seed before sampling from the deals.

1. Import the necessary packages.
2. Set the random seed to 24.
3. Take a sample of 5 deals without replacement and store them as `sample_without_replacement`.
4. Take a sample of 5 deals with replacement and save as `sample_with_replacement`.

```{python}
import pandas as pd
import numpy as np

# Set the random seed to 24.
np.random.seed(24)

# Take a sample of 5 deals without replacement and store them as sample_without_replacement.
sample_without_replacement = amir_deals.sample(5)
print("Sample of 5 deals without replacement \n")
print(sample_without_replacement)

# Take a sample of 5 deals with replacement and save as sample_with_replacement.
sample_with_replacement = amir_deals.sample(5, replace=True)
print("Sample of 5 deals with replacement \n")
print(sample_with_replacement)
```

What type of sampling is better to use for this situation? If you sample with replacement, you might end up calling the same customer twice.

## Chapter 2.2: Discrete distributions {.unnumbered #sec-C2.2}

In this lesson, we'll take a deeper dive into probability and begin looking at probability distributions.

### Rolling the dice {.unnumbered}

Let's consider rolling a standard, six-sided die. There are six numbers, or six possible outcomes, and every number has one-sixth, or about a 17 percent chance of being rolled. This is an example of a probability distribution.

### Choosing salespeople {.unnumbered}

This is similar to the scenario from earlier, except we had names instead of numbers.
Just like rolling a die, each outcome, or name, had an equal chance of being chosen.

### Probability distribution {.unnumbered}

A probability distribution describes the probability of each possible outcome in a scenario. We can also talk about the expected value of a distribution, which is the mean of a distribution. We can calculate this by multiplying each value by its probability (one-sixth in this case) and summing, so the expected value of rolling a fair die is 3.5.

### Visualizing a probability distribution {.unnumbered}

We can visualize this using a barplot, where each bar represents an outcome, and each bar's height represents the probability of that outcome.

### Probability = area {.unnumbered}

We can calculate probabilities of different outcomes by taking areas of the probability distribution. For example, what's the probability that our die roll is less than or equal to 2? To figure this out, we'll take the area of each bar representing an outcome of 2 or less. Each bar has a width of 1 and a height of one-sixth, so the area of each bar is one-sixth. We'll sum the areas for 1 and 2, to get a total probability of one-third.

### Uneven die {.unnumbered}

Now let's say we have a die where the two got turned into a three. This means that we now have a 0% chance of getting a 2, and a 33% chance of getting a 3. To calculate the expected value of this die, we now multiply 2 by 0, since it's impossible to get a 2, and 3 by its new probability, one-third. This gives us an expected value that's slightly higher than the fair die.

### Visualizing uneven probabilities {.unnumbered}

When we visualize these new probabilities, the bars are no longer even.

### Adding areas {.unnumbered}

With this die, what's the probability of getting something less than or equal to 2? There's a one-sixth probability of getting 1, and zero probability of getting 2, which sums to one-sixth.

### Discrete probability distributions {.unnumbered}

The probability distributions you've seen so far are both discrete probability distributions, since they represent situations with discrete outcomes. Recall from chapter 1 (@sec-C1.1) that discrete variables can be thought of as counted variables. In the case of a die, we're counting dots, so we can't roll a 1.5 or 4.3. When all outcomes have the same probability, like a fair die, this is a special distribution called a discrete uniform distribution.

### Sampling from discrete distributions {.unnumbered}

Just like we sampled names from a box, we can do the same thing with probability distributions like the ones we've seen. Here's a DataFrame called die that represents a fair die, and its expected value is 3.5. We'll sample from it 10 times to simulate 10 rolls. Notice that we sample with replacement so that we're sampling from the same distribution every time.

### Visualizing a sample {.unnumbered}

We can visualize the outcomes of the ten rolls using a histogram, defining the bins we want using `np.linspace`.

### Sample distribution vs. theoretical distribution {.unnumbered}

Notice that we have different numbers of 1's, 2's, 3's, and so on since the sample was random, even though on each roll we had the same probability of rolling each number. The mean of our sample is 3.0, which isn't super close to the 3.5 we were expecting.
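A minimal sketch of this simulation, assuming we build the `die` DataFrame from scratch (column names and the seed are illustrative):

```python
import numpy as np
import pandas as pd

# A fair die as a probability distribution
die = pd.DataFrame({"number": [1, 2, 3, 4, 5, 6], "prob": [1/6] * 6})

# Expected value: sum of each value times its probability
print((die["number"] * die["prob"]).sum())  # 3.5

# Simulate 10 rolls by sampling with replacement
np.random.seed(42)
rolls_10 = die["number"].sample(10, replace=True)
print(rolls_10.mean())  # a sample mean that usually differs from 3.5
```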
### A bigger sample {.unnumbered}

If we roll the die 100 times, the distribution of the rolls looks a bit more even, and the mean is closer to 3.5.

### An even bigger sample {.unnumbered}

If we roll 1000 times, it looks even more like the theoretical probability distribution and the mean closely matches 3.5.

### Law of large numbers {.unnumbered}

This is called the law of large numbers, which is the idea that as the size of your sample increases, the sample mean will approach the theoretical mean.

## Exercise 2.2: Creating a probability distribution {.unnumbered #sec-E2.2}

A new restaurant opened a few months ago, and the restaurant's management wants to optimize its seating space based on the size of the groups that come most often. On one night, there are 10 groups of people waiting to be seated at the restaurant, but instead of being called in the order they arrived, they will be called randomly. In this exercise, you'll investigate the probability of groups of different sizes getting picked first. Data on each of the ten groups is contained in the `restaurant_groups` DataFrame.

1. Create a histogram of the `group_size` column of `restaurant_groups`, setting bins to [2, 3, 4, 5, 6]. Remember to show the plot.
2. Count the number of each group_size in `restaurant_groups`, then divide by the number of rows in `restaurant_groups` to calculate the probability of randomly selecting a group of each size. Save as `size_dist`.
3. Reset the index of `size_dist`.
4. Rename the columns of `size_dist` to `group_size` and `prob`.
5. Calculate the expected value of the `size_dist`, which represents the expected group size, by multiplying the `group_size` by the `prob` and taking the `sum`.
6. Calculate the probability of randomly picking a group of 4 or more people by subsetting for groups of size 4 or more and summing the probabilities of selecting those groups.
7. Sum the probabilities of `groups_4_or_more`.

```python
# Create a histogram of the group_size column of restaurant_groups, setting bins to [2, 3, 4, 5, 6]. Remember to show the plot.
restaurant_groups['group_size'].hist(bins=[2, 3, 4, 5, 6])
plt.show()

# Count the number of each group_size in restaurant_groups, then divide by the number of rows in restaurant_groups to calculate the probability of randomly selecting a group of each size. Save as size_dist.
size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]

# Reset the index of size_dist and rename the columns to group_size and prob.
size_dist = size_dist.reset_index()
size_dist.columns = ['group_size', 'prob']

# Calculate the expected value of the size_dist, which represents the expected group size, by multiplying the group_size by the prob and taking the sum.
expected_value = (size_dist['group_size'] * size_dist['prob']).sum()
# Or
expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])
print(expected_value)

# Calculate the probability of randomly picking a group of 4 or more people by subsetting for groups of size 4 or more and summing the probabilities of selecting those groups.
groups_4_or_more = size_dist[size_dist['group_size'] >= 4]

# Sum the probabilities of groups_4_or_more
prob_4_or_more = np.sum(groups_4_or_more['prob'])
print(prob_4_or_more)
```

::: {.callout-note icon="false"}
### Special note

You learned about the basics of probability distributions, focusing on discrete distributions, and how they apply to real-world scenarios.
Specifically, you explored:

- Probability Distributions: Understanding that a probability distribution describes the likelihood of each possible outcome in a scenario, like rolling a six-sided die where each outcome has an equal chance.
- Expected Value: Learning to calculate the expected value of a distribution as the mean, demonstrated by multiplying each outcome's value by its probability and summing these products. For a fair die, the expected value is 3.5.
- Visualizing Distributions: How to visualize probability distributions with bar plots, where each bar's height represents the outcome's probability, and histograms for sample outcomes.
- Discrete Uniform Distribution: Identifying that when all outcomes have the same probability, such as with a fair die, the distribution is a discrete uniform distribution.
- Sampling and the Law of Large Numbers: Through examples, you saw how sampling from a distribution (like rolling a die multiple times) and calculating the sample mean can illustrate the law of large numbers. The larger the sample, the closer the sample mean will be to the theoretical mean.
:::

```{python}
# Example of calculating the expected value for a fair die
expected_value = sum([i * (1/6) for i in range(1, 7)])
```

## Chapter 2.3 Continuous distributions {.unnumbered #sec-C2.3}

We can use discrete distributions to model situations that involve discrete or countable variables, but how can we model continuous variables?

### Waiting for the bus {.unnumbered}

Let's start with an example. The city bus arrives once every twelve minutes, so if you show up at a random time, you could wait anywhere from 0 minutes, if you arrive just as the bus pulls in, up to 12 minutes, if you arrive just as the bus leaves.

### Continuous uniform distribution {.unnumbered}

Let's model this scenario with a probability distribution. There are an infinite number of minutes we could wait, since we could wait 1 minute, 1.5 minutes, 1.53 minutes, and so on, so we can't create individual blocks like we could with a discrete variable. Instead, we'll use a continuous line to represent probability. The line is flat since there's the same probability of waiting any time from 0 to 12 minutes. This is called the continuous uniform distribution.

### Probability still = area {.unnumbered}

Now that we have our distribution, let's figure out what the probability is that we'll wait between 4 and 7 minutes. Just like with discrete distributions, we can take the area from 4 to 7 to calculate probability. The width of this rectangle is 7 minus 4, which is 3. The height is one-twelfth. Multiplying those together to get area, we get 3/12, or 25%.

### Uniform distribution in Python {.unnumbered}

Let's use the uniform distribution in Python to calculate the probability of waiting 7 minutes or less. We need to import `uniform` from `scipy.stats`. We can call `uniform.cdf` and pass it 7, followed by the lower and upper limits, which in our case are 0 and 12. The probability of waiting less than 7 minutes is about 58%.

### "Greater than" probabilities {.unnumbered}

If we want the probability of waiting more than 7 minutes, we need to take 1 minus the probability of waiting less than 7 minutes.

### Combining multiple uniform.cdf() calls {.unnumbered}

How do we calculate the probability of waiting 4 to 7 minutes using Python? We can start with the probability of waiting less than 7 minutes, then subtract the probability of waiting less than 4 minutes. This gives us 25%.
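A minimal sketch of these calculations for the 12-minute bus example (assuming the lower and upper limits are 0 and 12):

```python
from scipy.stats import uniform

# P(wait <= 7) for a uniform distribution between 0 and 12 minutes
print(uniform.cdf(7, 0, 12))                          # ~0.583

# P(wait > 7) = 1 - P(wait <= 7)
print(1 - uniform.cdf(7, 0, 12))                      # ~0.417

# P(4 <= wait <= 7) = P(wait <= 7) - P(wait <= 4)
print(uniform.cdf(7, 0, 12) - uniform.cdf(4, 0, 12))  # 0.25
```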
### Total area = 1 {.unnumbered}

To calculate the probability of waiting between 0 and 12 minutes, we multiply 12 by 1/12, which is 1, or 100%. This makes sense since we're certain we'll wait anywhere from 0 to 12 minutes.

### Generating random numbers according to uniform distribution {.unnumbered}

To generate random numbers according to the uniform distribution, we can use `uniform.rvs`, which takes in the minimum value, maximum value, followed by the number of random values we want to generate. Here, we generate 10 random values between 0 and 5.

### Other continuous distributions {.unnumbered}

Continuous distributions can take forms other than uniform where some values have a higher probability than others. No matter the shape of the distribution, the area beneath it must always equal 1.

### Other special types of distributions {.unnumbered}

This will also be true of other distributions you'll learn about later on in the course, like the normal distribution or exponential distribution, which can be used to model many real-life situations.

## Exercise 2.3.1 {.unnumbered #sec-E2.3.1}

### Data back-ups

The sales software used at your company is set to automatically back itself up, but no one knows exactly what time the back-ups happen. It is known, however, that back-ups happen exactly every 30 minutes. Amir comes back from sales meetings at random times to update the data on the client he just met with. He wants to know how long he'll have to wait for his newly-entered data to get backed up. Use your new knowledge of continuous uniform distributions to model this situation and answer Amir's questions.

To model how long Amir will wait for a back-up using a continuous uniform distribution, save his lowest possible wait time as `min_time` and his longest possible wait time as `max_time`. Remember that back-ups happen every 30 minutes.

1. Import `uniform` from `scipy.stats` and calculate the probability that Amir has to wait less than 5 minutes, and store in a variable called `prob_less_than_5`.
2. Calculate the probability that Amir has to wait more than 5 minutes, and store in a variable called `prob_greater_than_5`.
3. Calculate the probability that Amir has to wait between 10 and 20 minutes, and store in a variable called `prob_between_10_and_20`.

```{python}
min_time = 0
max_time = 30

from scipy.stats import uniform

prob_less_than_5 = uniform.cdf(5, 0, 30)
print(f"The probability that Amir has to wait less than 5 minutes: {prob_less_than_5}")

# Calculate the probability that Amir has to wait more than 5 minutes, and store in a variable called prob_greater_than_5.
prob_greater_than_5 = 1 - uniform.cdf(5, 0, 30)
print(f"The probability that Amir has to wait more than 5 minutes: {prob_greater_than_5}")

# Calculate the probability that Amir has to wait between 10 and 20 minutes, and store in a variable called prob_between_10_and_20.
prob_between_10_and_20 = uniform.cdf(20, 0, 30) - uniform.cdf(10, 0, 30)
print(f"The probability that Amir has to wait between 10 and 20 minutes: {prob_between_10_and_20}")
```

## Exercise 2.3.2 {.unnumbered #sec-E2.3.2}

### Simulating wait times

To give Amir a better idea of how long he'll have to wait, you'll simulate Amir waiting 1000 times and create a histogram to show him what he should expect. Recall from the last exercise that his minimum wait time is 0 minutes and his maximum wait time is 30 minutes.

1. Set the random seed to 334.
2. Import `uniform` from `scipy.stats`.
3. Generate 1000 wait times from the continuous uniform distribution that models Amir's wait time.
Save this as `wait_times`.
4. Create a histogram of the simulated wait times and show the plot.

```{python}
# Set the random seed to 334.
np.random.seed(334)

# Import uniform from scipy.stats.
from scipy.stats import uniform

# Generate 1000 wait times from the continuous uniform distribution that models Amir's wait time. Save this as wait_times.
wait_times = uniform.rvs(0, 30, size=1000)
print("The Wait times Distribution \n")
print(wait_times)

# Create a histogram of the simulated wait times and show the plot.
plt.hist(wait_times)
plt.show()
```

## Chapter 2.4 The binomial distribution {.unnumbered #sec-C2.4}

It's time to further expand your toolbox of distributions. In this lesson, you'll learn about the binomial distribution.

### Coin flipping {.unnumbered}

We'll start by flipping a coin, which has two possible outcomes, heads or tails, each with a probability of 50%.

### Binary outcomes {.unnumbered}

This is just one example of a binary outcome, or an outcome with two possible values. We could also represent these outcomes as a 1 and a 0, a success or a failure, and a win or a loss.

### A single flip {.unnumbered}

In Python, we can simulate this by importing `binom` from `scipy.stats` and using the `binom.rvs` function, which takes in the number of coins we want to flip, the probability of heads or success, and an argument called `size`, which is the number of trials. `size` is a named argument, so we'll need to explicitly specify that the third argument corresponds to `size`, or we'll get incorrect results. This call will return a 1, which we'll count as a head, or a 0, which we'll count as tails. We can use `binom.rvs(1, 0.5, size=1)` to flip 1 coin, with a 50% probability of heads, 1 time.

### One flip many times {.unnumbered}

To perform eight coin flips, we can change the `size` argument to 8, which will flip 1 coin with a 50% chance of heads 8 times. This gives us a set of 8 ones and zeros.

### Many flips one time {.unnumbered}

If we swap the first and last arguments, we flip eight coins one time. This gives us one number, which is the total number of heads or successes.

### Many flips many times {.unnumbered}

Similarly, we can pass 3 as the first argument, and set `size` equal to 10 to flip 3 coins. This returns 10 numbers, each representing the total number of heads from each set of flips.

### Other probabilities {.unnumbered}

We could also have a coin that's heavier on one side than the other, so the probability of getting heads is only 25%. To simulate flips with this coin, we'll adjust the second argument of `binom.rvs` to 0.25. The result has lower numbers, since getting multiple heads isn't as likely with the new coin.

### Binomial distribution {.unnumbered}

The binomial distribution describes the probability of the number of successes in a sequence of independent trials. In other words, it can tell us the probability of getting some number of heads in a sequence of coin flips. Note that this is a discrete distribution since we're working with a countable outcome. The binomial distribution can be described using two parameters, `n` and `p`. `n` represents the total number of trials being performed, and `p` is the probability of success. `n` and `p` are also the first and second arguments of `binom.rvs`. Here's what the distribution looks like for 10 coins. We have the biggest chance of getting 5 heads total, and a much smaller chance of getting 0 heads or 10 heads.

### What's the probability of 7 heads? {.unnumbered}

To get the probability of getting 7 heads out of 10 coins, we can use `binom.pmf`. The first argument is the number of heads or successes. The second argument is the number of trials, `n`, and the third is the probability of success, `p`. If we flip 10 coins, there's about a 12% chance that exactly 7 of them will be heads.
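A minimal sketch of the `binom` calls described above (the seed is illustrative, and the printed arrays will vary with it):

```python
import numpy as np
from scipy.stats import binom

np.random.seed(1)

# Flip 1 coin, P(heads) = 0.5, 8 times -> eight 0/1 outcomes
print(binom.rvs(1, 0.5, size=8))

# Flip 8 coins once -> one number: the total number of heads
print(binom.rvs(8, 0.5, size=1))

# Flip 3 coins, 10 times -> ten totals, one per set of flips
print(binom.rvs(3, 0.5, size=10))

# P(exactly 7 heads out of 10 fair coins)
print(binom.pmf(7, 10, 0.5))  # ~0.117
```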
### What's the probability of 7 or fewer heads? {.unnumbered}

`binom.cdf` gives the probability of getting a number of successes less than or equal to the first argument. The probability of getting 7 or fewer heads out of 10 coins is about 95%.

### What's the probability of more than 7 heads? {.unnumbered}

We can take 1 minus the probability of getting 7 or fewer heads to get the probability of a number of successes greater than the first argument.

### Expected value {.unnumbered}

The expected value of the binomial distribution can be calculated by multiplying `n` times `p`. The expected number of heads we'll get from flipping 10 coins is 10 times 0.5, which is 5.

### Independence {.unnumbered}

It's important to remember that in order for the binomial distribution to apply, each trial must be independent, so the outcome of one trial shouldn't have an effect on the next. For example, if we're picking randomly from these cards with zeros and ones, we have a 50-50 chance of getting a 0 or a 1. But since we're sampling without replacement, the probabilities for the second trial are different due to the outcome of the first trial. Since these trials aren't independent, we can't calculate accurate probabilities for this situation using the binomial distribution.

## Exercise 2.4.1: Simulating sales deals {#sec-E2.4.1}

Assume that Amir usually works on 3 deals per week, and overall, he wins 30% of deals he works on. Each deal has a binary outcome: it's either lost or won, so you can model his sales deals with a binomial distribution. In this exercise, you'll help Amir simulate a year's worth of his deals so he can better understand his performance.

1. Import binom from scipy.stats and set the random seed to 10.
2. Simulate 1 deal worked on by Amir, who wins 30% of the deals he works on.
3. Simulate a typical week of Amir's deals, or one week of 3 deals.
4. Simulate a year's worth of Amir's deals, or 52 weeks of 3 deals each, and store in `deals`.
5. Print the mean number of deals he won per week.

```{python}
# Import binom from scipy.stats and set the random seed to 10.
from scipy.stats import binom

np.random.seed(10)

# Simulate 1 deal worked on by Amir, who wins 30% of the deals he works on.
print(f"Outcome of simulating 1 deal (1 = won, 0 = lost): {binom.rvs(1, 0.3, size=1)}")

# Simulate a typical week of Amir's deals, or one week of 3 deals.
print(f"Deals won in one simulated week of 3 deals: {binom.rvs(3, 0.3, size=1)}")

# Simulate a year's worth of Amir's deals, or 52 weeks of 3 deals each, and store in deals.
deals = binom.rvs(3, 0.3, size=52)

# Print the mean number of deals he won per week.
print(f"The mean number of deals he won per week: {np.mean(deals)}")
```

## Exercise 2.4.2: Calculating binomial probabilities {#sec-E2.4.2}

Just as in the last exercise, @sec-E2.4.1, assume that Amir wins 30% of deals. He wants to get an idea of how likely he is to close a certain number of deals each week. In this exercise, you'll calculate what the chances are of him closing different numbers of deals using the binomial distribution.

1. What's the probability that Amir closes 1 or fewer deals in a week? Save this as `prob_less_than_or_equal_1`.
2. What's the probability that Amir closes more than 1 deal?
Save this as `prob_greater_than_1`.

```{python}
# What's the probability that Amir closes all 3 deals in a week?
prob_3 = binom.pmf(3, 3, 0.3)
print(f"The probability that Amir closes all 3 deals in a week: {prob_3}")

# What's the probability that Amir closes 1 or fewer deals in a week? Save this as prob_less_than_or_equal_1.
prob_less_than_or_equal_1 = binom.cdf(1, 3, 0.3)
print(f"The probability that Amir closes 1 or fewer deals in a week: {prob_less_than_or_equal_1}")

# What's the probability that Amir closes more than 1 deal? Save this as prob_greater_than_1.
prob_greater_than_1 = 1 - binom.cdf(1, 3, 0.3)
print(f"The probability that Amir closes more than 1 deal: {prob_greater_than_1}")
```

## Exercise 2.4.3: How many sales will be won? {#sec-E2.4.3}

Now Amir wants to know how many deals he can expect to close each week if his win rate changes. Luckily, you can use your binomial distribution knowledge to help him calculate the expected value in different situations. Recall from the lesson that the expected value of a binomial distribution can be calculated by `n x p`.

1. Calculate the expected number of sales out of the 3 he works on that Amir will win each week if he maintains his 30% win rate.
2. Calculate the expected number of sales out of the 3 he works on that he'll win if his win rate drops to 25%.
3. Calculate the expected number of sales out of the 3 he works on that he'll win if his win rate rises to 35%.

```{python}
# Calculate the expected number of sales out of the 3 he works on that Amir will win each week if he maintains his 30% win rate.
won_30pct = 3 * 0.3
print(f"Expected number of sales out of 3 worked on at a 30% win rate: {won_30pct}")

# Calculate the expected number of sales out of the 3 he works on that he'll win if his win rate drops to 25%.
won_25pct = 3 * 0.25
print(f"Expected number of sales out of 3 worked on at a 25% win rate: {won_25pct}")

# Calculate the expected number of sales out of the 3 he works on that he'll win if his win rate rises to 35%.
won_35pct = 3 * 0.35
print(f"Expected number of sales out of 3 worked on at a 35% win rate: {won_35pct}")
```

::: {.callout-note icon="false"}
### Key Points:

You learned about the binomial distribution, a fundamental concept in probability that models events with two possible outcomes, such as flipping a coin. Key points included:

- Understanding binary outcomes, which can be success/failure, win/loss, or heads/tails, and how these can be represented numerically (1 or 0).
- Using the `binom.rvs` function from `scipy.stats` to simulate random variables following a binomial distribution.
This function requires specifying the number of trials (`n`), the probability of success (`p`), and the size, which determines how many times the experiment is run.
- The difference between simulating a single trial multiple times and multiple trials in one go was illustrated with coin flips.
- Adjusting the probability of success (`p`) to model biased outcomes, like a weighted coin, and observing how it affects the results.
- Calculating probabilities with the binomial distribution using `binom.pmf` for the probability of a specific number of successes, and `binom.cdf` for the probability of up to a certain number of successes.
- The expected value of a binomial distribution, which is the average number of successes over many trials, can be calculated with `n * p`.
- For example, to calculate the expected number of sales Amir will win each week with different win rates, you used the formula for the expected value in a binomial distribution:

```python
# Expected number won with 30% win rate
won_30pct = 3 * 0.3
print(won_30pct)

# Expected number won with 25% win rate
won_25pct = 3 * 0.25
print(won_25pct)

# Expected number won with 35% win rate
won_35pct = 3 * 0.35
print(won_35pct)
```

- This lesson emphasized the importance of understanding and applying the binomial distribution to model real-world scenarios with binary outcomes, enhancing your ability to analyze and predict the probability of events.
:::

## Chapter 2.5: The normal distribution {#sec-C2.5}

The next probability distribution we'll discuss is the normal distribution. It's one of the most important probability distributions you'll learn about, since countless statistical methods rely on it, and it applies to more real-world situations than the distributions we've covered so far.

### What is the normal distribution? {.unnumbered}

The normal distribution looks like this. Its shape is commonly referred to as a "bell curve". The normal distribution has a few important properties.

### Symmetrical {.unnumbered}

First, it's symmetrical, so the left side is a mirror image of the right.

### Area = 1 {.unnumbered}

Second, just like any continuous distribution, the area beneath the curve is 1.

### Curve never hits 0 {.unnumbered}

Third, the probability never hits 0, even if it looks like it does at the tail ends. Only 0.006% of its area is contained beyond the edges of this graph.

### Described by mean and standard deviation {.unnumbered}

The normal distribution is described by its mean and standard deviation. Here is a normal distribution with a mean of 20 and standard deviation of 3, and here is a normal distribution with a mean of 0 and a standard deviation of 1. When a normal distribution has mean 0 and a standard deviation of 1, it's a special distribution called the standard normal distribution.

### Areas under the normal distribution {.unnumbered}

For the normal distribution, 68% of the area is within 1 standard deviation of the mean, 95% of the area falls within 2 standard deviations of the mean, and 99.7% of the area falls within three standard deviations. This is sometimes called the 68-95-99.7 rule.

### Lots of histograms look normal {.unnumbered}

There's lots of real-world data shaped like the normal distribution. For example, here is a histogram of the heights of women who participated in the National Health and Nutrition Examination Survey. The mean height is around 161 centimeters and the standard deviation is about 7 centimeters.
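Before applying this to the height data, here's a quick check of the 68-95-99.7 rule, a minimal sketch using the standard normal distribution (mean 0, standard deviation 1):

```python
from scipy.stats import norm

# Area within 1, 2, and 3 standard deviations of the mean
print(norm.cdf(1) - norm.cdf(-1))   # ~0.68
print(norm.cdf(2) - norm.cdf(-2))   # ~0.95
print(norm.cdf(3) - norm.cdf(-3))   # ~0.997
```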
### Approximating data with the normal distribution {.unnumbered}

Since this height data closely resembles the normal distribution, we can take the area under a normal distribution with mean 161 and standard deviation 7 to approximate what percent of women fall into different height ranges.

### What percent of women are shorter than 154 cm? {.unnumbered}

For example, what percent of women are shorter than 154 centimeters? We can answer this using `norm.cdf` from `scipy.stats`, which takes the area of the normal distribution less than some number. We pass in the number of interest, 154, followed by the mean and standard deviation of the normal distribution we're using. This tells us that about 16% of women are shorter than 154 centimeters.

### What percent of women are taller than 154 cm? {.unnumbered}

To find the percent of women taller than 154 centimeters, we can take 1 minus the area on the left of 154, which equals the area to the right of 154.

### What percent of women are 154-157 cm? {.unnumbered}

To get the percent of women between 154 and 157 centimeters tall, we can take the area below 157 and subtract the area below 154, which leaves us the area between 154 and 157.

### What height are 90% of women shorter than? {.unnumbered}

We can also go the other way and calculate heights from percentages using `norm.ppf`. To figure out what height 90% of women are shorter than, we pass 0.9 into `norm.ppf` along with the same mean and standard deviation we've been working with. This tells us that 90% of women are shorter than 170 centimeters tall.

### What height are 90% of women taller than? {.unnumbered}

We can figure out the height 90% of women are taller than, since this is also the height that 10% of women are shorter than. We can take 1 minus 0.9 to get 0.1, which we'll use as the first argument of `norm.ppf`.

### Generating random numbers {.unnumbered}

Just like with other distributions, we can generate random numbers from a normal distribution using `norm.rvs`, passing in the distribution's mean and standard deviation, as well as the sample size we want.

## Exercise 2.5.1 {#sec-E2.5.1}

Each deal Amir worked on was worth a different amount, stored in the `amount` column of `amir_deals`; in the code below, these amounts are modeled as a normal distribution with a mean of $5000 and a standard deviation of $2000.

1. Create a histogram with 10 bins to visualize the distribution of the amount. Show the plot.
2. What's the probability of Amir closing a deal worth less than $7500?
3. What's the probability of Amir closing a deal worth more than $1000?
4. What's the probability of Amir closing a deal worth between $3000 and $7000?
5. What amount will 25% of Amir's sales be less than?

```{python}
from scipy.stats import norm

# Create a histogram with 10 bins to visualize the distribution of the amount. Show the plot.
amir_deals['amount'].hist(bins=10)
plt.title("The distribution of Amir's deal amounts")
plt.show()

# What's the probability of Amir closing a deal worth less than $7500?
prob_less_7500 = norm.cdf(7500, 5000, 2000)
print(f"The probability of Amir closing a deal worth less than $7500 is: {prob_less_7500}")

# What's the probability of Amir closing a deal worth more than $1000?
prob_over_1000 = 1 - norm.cdf(1000, 5000, 2000)
print(f"The probability of Amir closing a deal worth more than $1000 is: {prob_over_1000}")

# What's the probability of Amir closing a deal worth between $3000 and $7000?
prob_3000_to_7000 = norm.cdf(7000, 5000, 2000) - norm.cdf(3000, 5000, 2000)
print(f"The probability of Amir closing a deal worth between $3000 and $7000 is: {prob_3000_to_7000}")

# What amount will 25% of Amir's sales be less than?
pct_25 = norm.ppf(0.25, 5000, 2000)
print(f"The amount that 25% of Amir's sales will be less than: {pct_25}")
```

## Exercise 2.5.2: Simulating sales under new market conditions {#sec-E2.5.2}

The company's financial analyst is predicting that next quarter, the worth of each sale will increase by 20% and the volatility, or standard deviation, of each sale's worth will increase by 30%. To see what Amir's sales might look like next quarter under these new market conditions, you'll simulate new sales amounts using the normal distribution and store these in the `new_sales` DataFrame, which has already been created for you.

1. Currently, Amir's average sale amount is $5000. Calculate what his new average amount will be if it increases by 20% and store this in `new_mean`.
2. Amir's current standard deviation is $2000. Calculate what his new standard deviation will be if it increases by 30% and store this in `new_sd`.
3. Create a variable called new_sales, which contains 36 simulated amounts from a normal distribution with a mean of `new_mean` and a standard deviation of `new_sd`.
4. Plot the distribution of the `new_sales` amounts using a histogram and show the plot.

```{python}
# Currently, Amir's average sale amount is $5000. Calculate what his new average amount will be if it increases by 20% and store this in new_mean.
new_mean = (0.2 * 5000) + 5000

# Amir's current standard deviation is $2000. Calculate what his new standard deviation will be if it increases by 30% and store this in new_sd.
new_sd = (0.3 * 2000) + 2000

# Create a variable called new_sales, which contains 36 simulated amounts from a normal distribution with a mean of new_mean and a standard deviation of new_sd.
new_sales = norm.rvs(new_mean, new_sd, size=36)

# Plot the distribution of the new_sales amounts using a histogram and show the plot.
plt.hist(new_sales)
plt.title("The distribution of the New Sales amounts")
plt.show()
```

## Chapter 3: The central limit theorem {#sec-C3}

Now that you're familiar with the normal distribution, it's time to learn about what makes it so important.

### Rolling the dice 5 times {.unnumbered}

Let's go back to our dice rolling example. We have a Series of the numbers 1 to 6 called die. To simulate rolling the die 5 times, we'll call `die.sample`, passing in the size of the sample and setting `replace` to `True`. This gives us the results of 5 rolls. Now, we'll take the mean of the 5 rolls, which gives us 2. If we roll another 5 times and take the mean, we get a different mean. If we do it again, we get another mean.

### Rolling the dice 5 times 10 times {.unnumbered}

Let's repeat this 10 times: we'll roll 5 times and take the mean. To do this, we'll use a `for` loop. We start by creating an empty list called `sample_means` to hold our means. We loop from 0 to 9 so that the process is repeated 10 times. Inside the loop, we roll 5 times and append the sample's mean to the `sample_means` list. This gives us a list of 10 different sample means. Let's plot these sample means.
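A minimal sketch of that loop, assuming a `die` Series containing the numbers 1 through 6:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

die = pd.Series([1, 2, 3, 4, 5, 6])

sample_means = []
for i in range(10):
    # Roll 5 times (sample with replacement) and store the sample mean
    samp_5 = die.sample(5, replace=True)
    sample_means.append(np.mean(samp_5))

# Plot the 10 sample means
pd.Series(sample_means).hist()
plt.show()
```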
### Sampling distributions {.unnumbered}

A distribution of a summary statistic like this is called a sampling distribution. This distribution, specifically, is a sampling distribution of the sample mean.

### 100 sample means {.unnumbered}

Now let's do this 100 times. If we look at the new sampling distribution, its shape somewhat resembles the normal distribution, even though the distribution of the die is uniform.

### 1000 sample means {.unnumbered}

Let's take 1000 means. This sampling distribution more closely resembles the normal distribution.

### Central limit theorem (CLT) {.unnumbered}

This phenomenon is known as the **central limit theorem**, which states that a sampling distribution will approach a normal distribution as the number of trials increases. In our example, the sampling distribution became closer to the normal distribution as we took more and more sample means. It's important to note that the central limit theorem only applies when samples are taken randomly and are independent, for example, randomly picking sales deals with replacement.

### Standard deviation and the CLT {.unnumbered}

The central limit theorem, or CLT, applies to other summary statistics as well. If we take the standard deviation of 5 rolls 1000 times, the sample standard deviations are distributed normally, centered around 1.9, which is the distribution's standard deviation.

### Proportions and the CLT {.unnumbered}

Another statistic that the CLT applies to is proportion. Let's sample from the sales team 10 times with replacement and see how many draws have Claire as the outcome. In this case, 10% of draws were Claire. If we draw again, 40% of the draws are Claire.

### Sampling distribution of proportion {.unnumbered}

If we repeat this 1000 times and plot the distribution of the sample proportions, it resembles a normal distribution centered around 0.25, since Claire's name was on 25% of the tickets.

### Mean of sampling distribution {.unnumbered}

Since these sampling distributions are normal, we can take their mean to get an estimate of a distribution's mean, standard deviation, or proportion. If we take the mean of our sample means from earlier, we get 3.48. That's pretty close to the expected value, which is 3.5! Similarly, the mean of the sample proportions of Claires isn't far off from 0.25. In these examples, we know what the underlying distributions look like, but if we don't, this can be a useful method for estimating characteristics of an underlying distribution. The central limit theorem also comes in handy when you have a huge population and don't have the time or resources to collect data on everyone. Instead, you can collect several smaller samples and create a sampling distribution to estimate what the mean or standard deviation is.

## Exercise 3.1 {#sec-E3.1}

1. Create a histogram of the `num_users` column of `amir_deals` and show the plot.
2. Set the seed to 104.
3. Take a sample of size 20 with replacement from the `num_users` column of `amir_deals` and store it as `samp_20`.
4. Take the mean of `samp_20`.
5. Repeat this 100 times using a `for` loop and store as `sample_means`. This will take 100 different samples and calculate the mean of each.
6. Convert `sample_means` into a `pd.Series`, create a histogram of the `sample_means`, and show the plot.
7. Take 30 samples (with replacement) of size 20 from `all_deals['num_users']` and take the mean of each sample. Store the sample means in `sample_means_1`.
8. Print the mean of `sample_means_1`.
9. Print the mean of the `num_users` column of `amir_deals`.

```{python}
# Create a histogram of the num_users column of amir_deals and show the plot.
amir_deals['num_users'].hist()
plt.title("The Distribution of Number of Users in Amir's deal")
plt.show()

# Set the seed to 104.
np.random.seed(104)

# Take a sample of size 20 with replacement from the num_users column of amir_deals, and take the mean.
samp_20 = amir_deals['num_users'].sample(20, replace=True)

# Take mean of samp_20
print(f"The Mean of the 20 samples from Number of Users is: {np.mean(samp_20)}")

# Repeat this 100 times using a for loop and store as sample_means. This will take 100 different samples and calculate the mean of each.
# Set seed to 104
np.random.seed(104)

# Sample 20 num_users with replacement from amir_deals and take mean
samp_20 = amir_deals['num_users'].sample(20, replace=True)
np.mean(samp_20)

sample_means = []
# Loop 100 times
for i in range(100):
    # Take sample of 20 num_users
    samp_20 = amir_deals['num_users'].sample(20, replace=True)
    # Calculate mean of samp_20
    samp_20_mean = np.mean(samp_20)
    # Append samp_20_mean to sample_means
    sample_means.append(samp_20_mean)

print(f"Distribution of sample means (n=20, iterations=100): {sample_means}")

# Convert sample_means into a pd.Series, create a histogram of the sample_means, and show the plot.
sample_means_series = pd.Series(sample_means)
sample_means_series.hist()
plt.title("Distribution of sample means (n=20, iterations=100)")
# Show plot
plt.show()

# Set the random seed to 321.
np.random.seed(321)

# Take 30 samples (with replacement) of size 20 from all_deals['num_users'] and take the mean of each sample. Store the sample means in sample_means_1.
# sample_means_1 = []
# Loop 30 times to take 30 means
# for i in range(30):
#     # Take sample of size 20 from num_users col of all_deals with replacement
#     cur_sample = all_deals['num_users'].sample(20, replace=True)
#     # Take mean of cur_sample
#     cur_mean = np.mean(cur_sample)
#     # Append cur_mean to sample_means
#     sample_means_1.append(cur_mean)

# Print mean of sample_means_1
# print(np.mean(sample_means_1))

# Print the mean of the num_users column of amir_deals.
print(f"Amir's average number of users: {amir_deals['num_users'].mean()}")
```

```python
Expected output:
Overall average number of users: 38.31333333333332
Amir's average number of users: 37.651685393258425
```

::: {.callout-important icon="false"}
### Conclusion:

We can see that Amir's average number of users is very close to the overall average, so it looks like he's meeting expectations. Make sure to note this in his performance review!
:::

## Chapter 4: The Poisson distribution {#sec-C4}

In this lesson, we'll talk about another probability distribution called the Poisson distribution.

### Poisson processes

Before we talk about probability, let's define a Poisson process. A Poisson process is a process where events appear to happen at a certain rate, but completely at random. For example, the number of animals adopted from an animal shelter each week is a Poisson process - we may know that on average there are 8 adoptions per week, but this number can differ randomly. Other examples would be the number of people arriving at a restaurant each hour, or the number of earthquakes per year in California.
The time unit, like hours, weeks, or years, is irrelevant as long as it's consistent.

### Poisson distribution

The Poisson distribution describes the probability of some number of events happening over a fixed period of time. We can use the Poisson distribution to calculate the probability of at least 5 animals getting adopted in a week, the probability of 12 people arriving in a restaurant in an hour, or the probability of fewer than 20 earthquakes in California in a year.

### Lambda ($\lambda$)

The Poisson distribution is described by a value called lambda, which represents the average number of events per time period. In the animal shelter example, this would be the average number of adoptions per week, which is 8. This value is also the expected value of the distribution! The Poisson distribution with $\lambda$ equal to 8 looks like this. Notice that it's a discrete distribution since we're counting events, and 7 and 8 are the most likely numbers of adoptions to happen in a week.

### Lambda is the distribution's peak

Lambda changes the shape of the distribution, so a Poisson distribution with $\lambda$ equal to 1, in blue, looks quite different than a Poisson distribution with $\lambda$ equal to 8, in green, but no matter what, the distribution's peak is always at its lambda value.

### Probability of a single value

Given that the average number of adoptions per week is 8, what's the probability of 5 adoptions in a week? Just like the other probability distributions, we can import `poisson` from `scipy.stats`. We'll use the `poisson.pmf` function, passing 5 as the first argument and 8 as the second argument to indicate the distribution's mean. This gives us about 9%.

### Probability of less than or equal to

To get the probability that 5 or fewer adoptions will happen in a week, use the `poisson.cdf` function, passing in the same numbers. This gives us about 20%.

### Probability of greater than

Just like other probability functions you've learned about so far, take 1 minus the "less than or equal to 5" probability to get the probability of more than 5 adoptions. There's an 81% chance that more than 5 adoptions will occur. If the average number of adoptions rises to 10 per week, there will be a 93% chance that more than 5 adoptions will occur.

### Sampling from a Poisson distribution

Just like other distributions, we can take samples from Poisson distributions using `poisson.rvs`. Here, we'll simulate 10 different weeks at the animal shelter. In one week, there are 14 adoptions, but only 6 in another.

### The CLT still applies!

Just like other distributions, the sampling distribution of sample means of a Poisson distribution looks normal with a large number of samples.

## Exercise 4.1: Tracking lead responses {#sec-E4.1}

Your company uses sales software to keep track of new sales leads. It organizes them into a queue so that anyone can follow up on one when they have a bit of free time. Since the number of lead responses is a countable outcome over a period of time, this scenario corresponds to a Poisson distribution. On average, Amir responds to 4 leads each day. In this exercise, you'll calculate probabilities of Amir responding to different numbers of leads.

1. Import poisson from `scipy.stats` and calculate the probability that Amir responds to 5 leads in a day, given that he responds to an average of 4.
2. Amir's coworker responds to an average of 5.5 leads per day. What is the probability that she answers 5 leads in a day?
3. What's the probability that Amir responds to 2 or fewer leads in a day?
4. What's the probability that Amir responds to more than 10 leads in a day?

```{python}
# Import poisson from scipy.stats and calculate the probability that Amir responds to 5 leads in a day, given that he responds to an average of 4.
from scipy.stats import poisson

# Probability of 5 responses
prob_5 = poisson.pmf(5, 4)
print(f"The probability that Amir responds to 5 leads in a day: {prob_5}")
# 0.1562934518505317 (15.6%)

# Amir's coworker responds to an average of 5.5 leads per day. What is the probability that she answers 5 leads in a day?
prob_coworker = poisson.pmf(5, 5.5)
print(f"The probability that Amir's coworker answers 5 leads in a day: {prob_coworker}")
# 0.17140068409793663 (17.1%)

# What's the probability that Amir responds to 2 or fewer leads in a day?
prob_2_or_less = poisson.cdf(2, 4)
print(f"The probability that Amir responds to 2 or fewer leads in a day: {prob_2_or_less}")
# 0.23810330555354436 (23.8%)

# What's the probability that Amir responds to more than 10 leads in a day?
prob_over_10 = 1 - poisson.cdf(10, 4)
print(f"The probability that Amir responds to more than 10 leads in a day: {prob_over_10}")
# 0.0028397661205137315 (0.28%)
```

## Chapter 4.1: More probability distributions

In this lesson, we'll discuss a few other probability distributions.

### Exponential distribution

The first distribution is the exponential distribution, which represents the probability of a certain amount of time passing between Poisson events. We can use the exponential distribution to predict, for example, the probability of more than 1 day between adoptions, the probability of fewer than 10 minutes between restaurant arrivals, and the probability of 6-8 months passing between earthquakes. Just like the Poisson distribution, the time unit doesn't matter as long as it's consistent. The exponential distribution uses the same $\lambda$ value, which represents the rate, that the Poisson distribution does. Note that lambda and rate mean the same value in this context. It's also continuous, unlike the Poisson distribution, since it represents time.

### Customer service requests

For example, let's say that one customer service ticket is created every 2 minutes. We can rephrase this so it's in terms of a time interval of one minute, so half of a ticket is created each minute. We'll use 0.5 as the $\lambda$ value. The exponential distribution with a rate of one half looks like this.

### Lambda in exponential distribution

The rate affects the shape of the distribution and how steeply it declines.

### Expected value of exponential distribution

Recall that lambda is the expected value of the Poisson distribution, which measures frequency in terms of the rate or number of events. In our customer service ticket example, this means that the expected number of requests per minute is 0.5. The exponential distribution measures frequency in terms of time between events. The expected value of the exponential distribution can be calculated by taking 1 divided by lambda. In our example, the expected time between requests is 1 over one half, which is 2, so there is an average of 2 minutes between requests.

### How long until a new request is created?

Similar to other continuous distributions, we can use `expon.cdf` to calculate probabilities. The probability of waiting less than 1 minute for a new request is calculated using `expon.cdf`, passing in 1 and setting `scale` to 2, which gives us about a 40% chance. Note that we pass in the expected time between requests, 2, not the lambda value of 0.5. The probability of waiting more than 4 minutes can be found using 1 minus `expon.cdf(4, scale=2)`, giving about a 13% chance. Finally, the probability of waiting between 1 and 4 minutes can be found by taking `expon.cdf(4, scale=2)` and subtracting `expon.cdf(1, scale=2)`. There's about a 47% chance you'll wait between 1 and 4 minutes.
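A minimal sketch of these exponential calculations, with `scale` set to the expected time between requests (2 minutes):

```python
from scipy.stats import expon

# P(wait < 1 minute), with an average of 2 minutes between requests
print(expon.cdf(1, scale=2))                            # ~0.39

# P(wait > 4 minutes)
print(1 - expon.cdf(4, scale=2))                        # ~0.135

# P(1 minute < wait < 4 minutes)
print(expon.cdf(4, scale=2) - expon.cdf(1, scale=2))    # ~0.47
```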
### (Student's) t-distribution

The next distribution is the t-distribution, which is also sometimes called Student's t-distribution. Its shape is similar to the normal distribution, but not quite the same. If we compare the normal distribution, in blue, with the t-distribution with one degree of freedom, in orange, the t-distribution's tails are thicker. This means that in a t-distribution, observations are more likely to fall further from the mean.

### Degrees of freedom

The t-distribution has a parameter called degrees of freedom, which affects the thickness of the distribution's tails. Lower degrees of freedom results in thicker tails and a higher standard deviation. As the number of degrees of freedom increases, the distribution looks more and more like the normal distribution.

### Log-normal distribution

The last distribution we'll discuss is the log-normal distribution. Variables that follow a log-normal distribution have a logarithm that is normally distributed. This results in distributions that are skewed, unlike the normal distribution. There are lots of real-world examples that follow this distribution, such as the length of chess games, blood pressure in adults, and the number of hospitalizations in the 2003 SARS outbreak.

## Exercise 4.2: Modeling time between leads {#sec-E4.2}

To further evaluate Amir's performance, you want to know how much time it takes him to respond to a lead after he opens it. On average, he responds to 1 request every 2.5 hours. In this exercise, you'll calculate probabilities of different amounts of time passing between Amir receiving a lead and sending a response.

1. Import `expon` from `scipy.stats`. What's the probability it takes Amir less than an hour to respond to a lead?
2. What's the probability it takes Amir more than 4 hours to respond to a lead?
3. What's the probability it takes Amir 3-4 hours to respond to a lead?

```{python}
# Import expon from scipy.stats. What's the probability it takes Amir less than an hour to respond to a lead?
from scipy.stats import expon

# Print probability response takes < 1 hour
print(f"The probability it takes Amir less than an hour to respond to a lead: {expon.cdf(1, scale=2.5)}")
# 0.3296799539643607 (32.97%)

# What's the probability it takes Amir more than 4 hours to respond to a lead?
print(f"The probability it takes Amir more than 4 hours to respond to a lead: {1 - expon.cdf(4, scale=2.5)}")
# 0.20189651799465536 (20.2%)

# What's the probability it takes Amir 3-4 hours to respond to a lead?
print(f"The probability it takes Amir 3-4 hours to respond to a lead: {expon.cdf(4, scale=2.5) - expon.cdf(3, scale=2.5)}")
# 0.09929769391754684 (9.93%)
```

## Chapter 5: Correlation {#sec-C5}

Welcome to the final chapter of the course, where we'll talk about correlation and experimental design.

### Relationships between two variables

Before we dive in, let's talk about relationships between numeric variables. We can visualize these kinds of relationships with scatter plots - in this scatterplot, we can see the relationship between the total amount of sleep mammals get and the amount of REM sleep they get.
The variable on the x-axis is called the explanatory or independent variable, and the variable on the y-axis is called the response or dependent variable.

### Correlation coefficient

We can also examine relationships between two numeric variables using a number called the correlation coefficient. This is a number between -1 and 1, where the magnitude corresponds to the strength of the relationship between the variables, and the sign, positive or negative, corresponds to the direction of the relationship.

### Magnitude = strength of relationship

Here's a scatterplot of 2 variables, x and y, that have a correlation coefficient of 0.99. Since the data points are closely clustered around a line, we can describe this as a **near-perfect or very strong relationship**. If we know what x is, we'll have a pretty good idea of what the value of y could be.

Here, `x` and `y` have a correlation coefficient of 0.75, and the data points are a bit more spread out.

In this plot, x and y have a correlation of 0.56 and are therefore **moderately correlated**. A correlation coefficient around 0.2 would be considered a **weak relationship**. When the correlation coefficient is close to 0, x and y have no relationship and the scatterplot looks completely random. This means that knowing the value of x doesn't tell us anything about the value of y.

### Sign = direction

The sign of the correlation coefficient corresponds to the direction of the relationship. A positive correlation coefficient indicates that as x increases, y also increases. A negative correlation coefficient indicates that as x increases, y decreases.

### Visualizing relationships

To visualize relationships between two variables, we can use a scatterplot. We'll use `seaborn`, which is a plotting package built on top of `matplotlib`. We import `seaborn` as `sns`, which is the alias commonly used for `seaborn`. We create a scatterplot using `sns.scatterplot`, passing it the name of the variable for the x-axis, the name of the variable for the y-axis, as well as the `msleep` DataFrame to the data argument. Finally, we call `plt.show`.

### Adding a trendline

We can add a linear trendline to the scatterplot using seaborn's `lmplot()` function. It takes the same arguments as `sns.scatterplot`, but we'll set `ci` to `None` so that there aren't any confidence interval margins around the line. Trendlines like this can be helpful to more easily see a relationship between two variables.

### Computing correlation

To calculate the correlation coefficient between two Series, we can use the `.corr` method. If we want the correlation between the `sleep_total` and `sleep_rem` columns of `msleep`, we can take the `sleep_total` column and call `.corr` on it, passing in the other Series we're interested in. Note that it doesn't matter which Series the method is invoked on and which is passed in, since the correlation between x and y is the same thing as the correlation between y and x.

### Many ways to calculate correlation

There's more than one way to calculate correlation, but the method we've been using in this video is called the Pearson product-moment correlation, which is also written as `r`. This is the most commonly used measure of correlation. Mathematically, it's calculated using this formula, where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$, and $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$:

$$
r = \frac{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sigma_x \, \sigma_y}
$$
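A tiny sketch with made-up numbers, showing that this formula and the pandas `.corr` method agree (the data here is illustrative only):

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson r from the formula: mean product of deviations divided by the product of (population) standard deviations
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std(ddof=0) * y.std(ddof=0))

# Same result from pandas
print(r_manual, x.corr(y))
```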
The formula itself isn't important to memorize, but know that there are variations of this formula that measure correlation a bit differently, such as *Kendall's tau* ($\tau$) and *Spearman's rho* ($\rho$); those are beyond the scope of this course.

## Exercise 5.1: Relationships between variables

In this chapter, you'll be working with a dataset `world_happiness` containing results from the 2019 World Happiness Report. The report scores various countries based on how happy people in that country are. It also ranks each country on various societal aspects such as social support, freedom, corruption, and others. The dataset also includes the GDP per capita and life expectancy for each country.

1. Create a scatterplot of `happiness_score` vs. `life_exp` (without a trendline) using `seaborn`.
2. Create a scatterplot of `happiness_score` vs. `life_exp` with a linear trendline using `seaborn`, setting `ci` to `None`.
3. Based on the scatterplot, which is most likely the correlation between `life_exp` and `happiness_score`?

```{python}
# Import the packages used in this exercise
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import dataset
world_happiness = pd.read_csv("datasets/world_happiness.csv")

# Create a scatterplot of happiness_score vs. life_exp (without a trendline)
sns.scatterplot(x='life_exp', y='happiness_score', data=world_happiness)
plt.title("Scatterplot of happiness score vs. life expectancy (without a trendline)")

# Show plot
plt.show()

# Create a scatterplot of happiness_score vs. life_exp with a linear trendline
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None)
plt.title("Scatterplot of happiness score vs. life expectancy (with a trendline)")

# Show plot
plt.show()

# Which is most likely the correlation between life_exp and happiness_score?
corr_happy_life = world_happiness['happiness_score'].corr(world_happiness['life_exp'])
print(f"The correlation between life_exp and happiness_score: {corr_happy_life}")
# 0.7802249053272062
```

## Chapter 5.1: Correlation caveats

While correlation is a useful way to quantify relationships, there are some caveats.

### Non-linear relationships

Consider the following data: there is clearly a relationship between x and y, but when we calculate the correlation, we get 0.18. This is because the relationship between the two variables is a quadratic relationship, not a linear one. The correlation coefficient measures the strength of linear relationships, and linear relationships only.

### Correlation only accounts for linear relationships

Just like any summary statistic, correlation shouldn't be used blindly, and you should always visualize your data when possible.

### Mammal sleep data

Let's return to the mammal sleep data.

### Body weight vs. awake time

Here's a scatterplot of each mammal's body weight versus the time they spend awake each day. The relationship between these variables is definitely not a linear one. The correlation between body weight and awake time is only about 0.3, which is a weak linear relationship.

### Distribution of body weight

If we take a closer look at the distribution of `bodywt`, it's highly skewed. There are lots of lower weights and a few weights that are much higher than the rest.

### Log transformation

When data is highly skewed like this, we can apply a log transformation. We'll create a new column called `log_bodywt` which holds the log of each body weight. We can do this using `np.log`.
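A minimal sketch of that transformation, assuming the `msleep` DataFrame is loaded and that its awake-time column is named `awake`:

```{python}
# Sketch: log-transform body weight and compare correlations with awake time
# (assumes msleep is already loaded; 'awake' is the awake-time column)
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# New column holding the log of each body weight
msleep["log_bodywt"] = np.log(msleep["bodywt"])

# Correlation with awake time before and after the transformation
print(msleep["bodywt"].corr(msleep["awake"]))      # roughly 0.3
print(msleep["log_bodywt"].corr(msleep["awake"]))  # roughly 0.57

# Scatterplot of awake time vs. the log of body weight
sns.scatterplot(x="log_bodywt", y="awake", data=msleep)
plt.show()
```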
If we plot the log of body weight versus awake time, the relationship looks much more linear than the one between regular body weight and awake time. The correlation between the log of body weight and awake time is about 0.57, which is much higher than the 0.3 we had before.

### Other transformations

In addition to the log transformation, there are lots of other transformations that can be used to make a relationship more linear, like taking the square root or reciprocal of a variable. The choice of transformation will depend on the data and how skewed it is. These can be applied in different combinations to x and y; for example, you could apply a log transformation to both x and y, or a square root transformation to x and a reciprocal transformation to y.

### Why use a transformation?

So why use a transformation? Certain statistical methods rely on variables having a linear relationship, like calculating a correlation coefficient. Linear regression is another statistical technique that requires variables to be related in a linear manner, which you can learn all about in a separate course.

### Correlation does not imply causation

Let's talk about one more important caveat of correlation that you may have heard about before: correlation does not imply causation. This means that if x and y are correlated, x doesn't necessarily cause y. For example, here's a scatterplot of the per capita margarine consumption in the US each year and the divorce rate in the state of Maine. The correlation between these two variables is 0.99, which is nearly perfect. However, this doesn't mean that consuming more margarine will cause more divorces. This kind of correlation is often called a spurious correlation.

### Confounding

A phenomenon called confounding can lead to spurious correlations. Let's say we want to know if drinking coffee causes lung cancer. Looking at the data, we find that coffee drinking and lung cancer are correlated, which may lead us to think that drinking more coffee will give you lung cancer. However, there is a third, hidden variable at play: smoking. Smoking is known to be associated with coffee consumption, and it is also known that smoking causes lung cancer. In reality, it turns out that coffee does not cause lung cancer and is only associated with it; it appeared causal due to the third variable, smoking. This third variable is called a confounder, or lurking variable, and it makes the relationship of interest between coffee and lung cancer a spurious correlation. Another example of this is the relationship between holidays and retail sales. While it might be that people buy more around holidays as a way of celebrating, it's hard to tell how much of the increased sales is due to the holidays themselves and how much is due to the special deals and promotions that often run around holidays. Here, special deals confound the relationship between holidays and sales.

::: {.callout-note}
The correlation coefficient can't account for any relationships that aren't linear, regardless of strength.
:::

## Exercise 5.2

1. Create a scatterplot of `happiness_score` versus `gdp_per_cap` and calculate the correlation between them.
2. Add a new column to `world_happiness` called `log_gdp_per_cap` that contains the log of `gdp_per_cap`.
3. Create a seaborn scatterplot of `happiness_score` versus `log_gdp_per_cap`.
4. Calculate the correlation between `log_gdp_per_cap` and `happiness_score`.
```{python}
# Import numpy for the log transformation
import numpy as np

# Create a scatterplot of happiness_score versus gdp_per_cap
sns.scatterplot(x='gdp_per_cap', y='happiness_score', data=world_happiness)
plt.title("Scatterplot of happiness score versus GDP per capita")
plt.show()

# Calculate correlation
cor = world_happiness['happiness_score'].corr(world_happiness['gdp_per_cap'])
print(f"The correlation between happiness_score and gdp_per_cap is: {cor}")

# Add a new column log_gdp_per_cap containing the log of gdp_per_cap
world_happiness['log_gdp_per_cap'] = np.log(world_happiness['gdp_per_cap'])

# Create a scatterplot of happiness_score versus log_gdp_per_cap
sns.scatterplot(x='log_gdp_per_cap', y='happiness_score', data=world_happiness)
plt.title("Scatterplot of happiness score versus the log of GDP per capita")
plt.show()

# Calculate the correlation between log_gdp_per_cap and happiness_score
cor_1 = world_happiness['happiness_score'].corr(world_happiness['log_gdp_per_cap'])
print(f"The correlation between happiness_score and log of gdp_per_cap is: {cor_1}")
```

## Chapter 6: Design of experiments

Often, data is created as a result of a study that aims to answer a specific question. However, data needs to be analyzed and interpreted differently depending on how the data was generated and how the study was designed.

### Vocabulary

Experiments generally aim to answer a question in the form, "What is the effect of the treatment on the response?" In this setting, treatment refers to the explanatory or independent variable, and response refers to the response or dependent variable. For example, what is the effect of an advertisement on the number of products purchased? In this case, the treatment is the advertisement, and the response is the number of products purchased.

### Controlled experiments

In a controlled experiment, participants are randomly assigned to either the treatment group or the control group, where the treatment group receives the treatment and the control group does not. A great example of this is an A/B test. In our example, the treatment group will see an advertisement, and the control group will not. Other than this difference, the groups should be comparable so that we can determine whether seeing an advertisement causes people to buy more. If the groups aren't comparable, this could lead to confounding, or bias. If the average age of participants in the treatment group is 25 and the average age of participants in the control group is 50, age could be a potential confounder if younger people are more likely to purchase more, and this will bias the experiment towards the treatment.

### The gold standard of experiments will use...

The gold standard, or ideal experiment, will eliminate as much bias as possible by using certain tools. The first tool to help eliminate bias in controlled experiments is to use a randomized controlled trial. In a randomized controlled trial, participants are randomly assigned to the treatment or control group, and their assignment isn't based on anything other than chance. Random assignment like this helps ensure that the groups are comparable (the short simulation sketch at the end of this chapter illustrates this). The second tool is to use a placebo, which is something that resembles the treatment but has no effect. This way, participants don't know if they're in the treatment or control group. This ensures that the effect of the treatment is due to the treatment itself, not the idea of getting the treatment.
This is common in clinical trials that test the effectiveness of a drug. The control group will still be given a pill, but it's a sugar pill that has minimal effects on the response. In a double-blind experiment, the person administering the treatment or running the experiment also doesn't know whether they're administering the actual treatment or the placebo. This protects against bias in the response as well as in the analysis of the results. These different tools all boil down to the same principle: the fewer opportunities there are for bias to creep into your experiment, the more reliably you can conclude whether the treatment affects the response.

### Observational studies

The other kind of study we'll discuss is the **observational study**. In an observational study, participants are not randomly assigned to groups. Instead, participants assign themselves, usually based on pre-existing characteristics. This is useful for answering questions that aren't conducive to a controlled experiment. If you want to study the effect of smoking on cancer, you can't force people to start smoking. Similarly, if you want to study how past purchasing behavior affects whether someone will buy a product, you can't force people to have certain past purchasing behavior. Because assignment isn't random, there's no way to guarantee that the groups will be comparable in every aspect, so observational studies can't establish causation, only association. The effects of the treatment may be confounded by factors that got certain people into the control group and certain people into the treatment group. However, there are ways to control for confounders, which can help strengthen the reliability of conclusions about association.

### Longitudinal vs. cross-sectional studies

The final important distinction to make is between **longitudinal and cross-sectional studies**. In a **longitudinal study**, the same participants are followed over a period of time to examine the effect of treatment on the response. In a **cross-sectional study**, data is collected from a single snapshot in time. If you wanted to investigate the effect of age on height, a cross-sectional study would measure the heights of people of different ages and compare them. However, the results would be confounded by birth year and lifestyle, since it's possible that each generation is getting taller. In a longitudinal study, the same people would have their heights recorded at different points in their lives, so this confounding is eliminated. It's important to note that **longitudinal studies** are more expensive and take longer to perform, while **cross-sectional studies** are cheaper, faster, and more convenient.
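To make the value of random assignment concrete, here is a small, purely illustrative simulation (hypothetical data, not part of the original course material): participants with varying ages are randomly split into a treatment and a control group, and the mean of a potential confounder (age) ends up roughly equal in both groups.

```{python}
# Illustrative simulation: random assignment tends to balance a confounder (age)
# (hypothetical data, not from the course datasets)
import numpy as np

rng = np.random.default_rng(42)

ages = rng.integers(18, 65, size=1000)   # hypothetical participant ages
in_treatment = rng.random(1000) < 0.5    # random assignment by coin flip

print(ages[in_treatment].mean())         # mean age in the treatment group
print(ages[~in_treatment].mean())        # mean age in the control group
# The two means are close, so age is unlikely to bias the comparison.
```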