By the end of this tutorial, you will master the algorithmic decision engine used in A/B testing, clinical trials, and algorithmic trading. You will move beyond "guessing" to a formalized binary decision process based on probability density.
You will learn to manipulate these variables:
- $H_0$ (Null Hypothesis): The default state of the world.
- $H_1$ (Alternative Hypothesis): The claim you are testing.
- $\alpha$ (Alpha): The significance level (usually 0.05).
- $Z$ (Test Statistic): The standardized distance from the mean.
- $P$ (P-Value): The probability of observing the data, assuming $H_0$ is true.
Hypothesis testing is not magic. It is a 4-step function.
The 4-Step Process:

1. Formulate: Define $H_0$ (Status Quo) and $H_1$ (New Theory).
2. Calibrate: Set $\alpha$ (error tolerance, e.g., 5%).
3. Compute: Calculate the test statistic using the relevant formula.
   - Example (Z-Test): $Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
4. Decide:
   - If $P\text{-value} < \alpha$: Reject $H_0$. (Significant)
   - If $P\text{-value} \ge \alpha$: Fail to Reject $H_0$. (Not Significant)
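The whole of Step 4 collapses to a single comparison. A minimal sketch (`decide` is a hypothetical helper for illustration, not a library function):

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Step 4 of the algorithm: compare the P-value against the error tolerance."""
    if p_value < alpha:
        return "Reject H0 (Significant)"
    return "Fail to Reject H0 (Not Significant)"

print(decide(0.0107))  # Reject H0 (Significant)
print(decide(0.15))    # Fail to Reject H0 (Not Significant)
```

Note that $P = \alpha$ exactly falls on the "fail to reject" side: we only reject when the evidence is strictly rarer than our tolerance.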
While the logic above remains constant, the Step 3 formula changes based on data type and volume. You do not need to memorize these now, but acknowledge the hierarchy:
| Data Type | Sample Size | Comparisons | Test Name |
|---|---|---|---|
| Continuous (e.g., Height, Price) | Large ($n > 30$) | Mean vs. Target | Z-Test |
| Continuous | Small ($n < 30$) | Mean vs. Target | T-Test |
| Continuous | Any | 3+ Groups | ANOVA |
| Categorical (e.g., Click/No Click) | Any | Frequency counts | Chi-Square |
This tutorial focuses on the Z-Test to build the core intuition, which applies to all other tests.
We start with a concrete problem to build the mental model.
The Problem: A friend hands you a coin and claims it is "fair" (50/50 chance). You bet on the next 10 flips. You flip the coin 10 times. Result: 9 Heads, 1 Tail.
The Question: Is your friend cheating, or did you just get lucky?
The Concrete Solution (The "BS Detector"): We cannot strictly prove the coin is rigged. Instead, we assume the coin is fair and calculate the probability of observing this result.
If the probability is too low, we call "BS" (Bullshit).
Step 1: The Math
If the coin is fair ($p = 0.5$), each flip is an independent 50/50 event.

The probability of getting exactly 9 Heads is:

$P(X = 9) = \binom{10}{9} (0.5)^9 (0.5)^1 = \frac{10}{1024} \approx 0.0098$

The probability of getting at least 9 heads (9 or 10) is the sum:

$P(X \ge 9) = P(X = 9) + P(X = 10) = \frac{10}{1024} + \frac{1}{1024} = \frac{11}{1024} \approx 0.0107$
Result: There is a 1.07% chance a fair coin does this.
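The arithmetic above can be checked with nothing but the standard library (a quick sketch using `math.comb`):

```python
from math import comb

# P(X = k) for 10 flips of a fair coin: C(10, k) * 0.5**10
p_exactly_9 = comb(10, 9) * 0.5**10    # 10/1024
p_exactly_10 = comb(10, 10) * 0.5**10  # 1/1024
p_at_least_9 = p_exactly_9 + p_exactly_10

print(f"P(X >= 9) = {p_at_least_9:.4f}")  # P(X >= 9) = 0.0107
```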
Step 2: The Decision Rule

Before the game, you have an internal "BS Threshold." Usually, if something has less than a 5% chance of happening by accident, we conclude it wasn't an accident.
- Observation: 1.07% chance.
- Threshold: 5%.
- Logic: $1.07\% < 5\%$.
- Conclusion: The event is too rare to be luck. The coin is likely rigged.
Now we map this specific scenario to the formal statistical terms used in the industry.
| Concrete Coin Scenario | Formal Statistical Term | Symbol |
|---|---|---|
| "The coin is fair." | Null Hypothesis | $H_0$ |
| "The coin is rigged." | Alternative Hypothesis | $H_1$ |
| 1.07% (Calculated chance of luck) | P-Value | $p$ |
| 5% (Your BS Threshold) | Significance Level | $\alpha$ |
The Logic Visualized:
Shape: 3D Bar Chart
X-axis: Number of Heads (0 to 10)
Y-axis: Probability
Description:
- The bars form a bell-curve shape centered at 5.
- The bar at 5 is the tallest (most likely).
- The bars at 9 and 10 are tiny, red slivers at the far right edge.
- An arrow points to 9 and 10 labeled: "The Rejection Region (P-Value)"
In practice, you don't do the math by hand. You use a library.
Input:

- Successes (Heads): 9
- Trials (Flips): 10
- Expected Probability ($H_0$): 0.5
```python
from scipy.stats import binomtest

# Inputs
k = 9    # Number of heads
n = 10   # Number of flips
p = 0.5  # Null hypothesis probability

# Calculate P-Value (one-sided: "at least this many heads")
result = binomtest(k, n, p, alternative='greater')
print(f"P-Value: {result.pvalue:.4f}")
```

Output:

```
P-Value: 0.0107
```
Summary of Section 1:
We rejected the "Fair Coin" hypothesis because the p-value (0.0107) was lower than our alpha (0.05). We formally decided: Reject $H_0$.
Context in Big Picture: In Phase 1, we counted discrete events (Heads/Tails). But most business data is continuous (Revenue, Latency, Weight). You cannot calculate "probability of exactly $100.05 revenue" because the possibilities are infinite.
We need a new yardstick. We rely on the Normal Distribution and the Z-Score.
You manage a chip factory.
- Target Weight: 100g.
- Known Variation ($\sigma$): The machine historically deviates by 5g (Standard Deviation).

You take a sample of 50 chips.

- Sample Average ($\bar{X}$): 102g.
Question: Is the machine broken (drifting to 102g)? Or is this random noise?
Why can we solve this? The CLT Promise: No matter how weird or messy your individual data points are, if you take enough samples and average them, the distribution of those averages will form a perfect Bell Curve (Normal Distribution).
Because of this, we don't need to know the shape of the raw data. We only analyze the Sample Mean.
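You can watch the CLT work with a quick simulation. This sketch uses a heavily skewed exponential distribution as the "messy" raw data (the distribution choice is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Raw data: skewed exponential distribution (true mean = 1.0).
# Individual values look nothing like a bell curve.
# Take 10,000 samples of size 50 and average each one.
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

# The sample means cluster symmetrically around the true mean,
# with spread close to the theoretical Standard Error: 1/sqrt(50) ~ 0.141
print(f"Mean of sample means: {sample_means.mean():.3f}")
print(f"Std of sample means:  {sample_means.std():.3f}")
```

A histogram of `sample_means` would show the bell curve, even though the raw exponential data is wildly asymmetric.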
The Z-Score converts your specific problem (grams, dollars, seconds) into universal "standard deviation units":

$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$

- $\bar{X}$: Sample Mean (102g)
- $\mu$: Target Population Mean (100g)
- $\sigma$: Population Standard Deviation (5g)
- $n$: Sample Size (50)
- $\sigma / \sqrt{n}$: Standard Error (the volatility of the average, not the individual).
Let's run the Master Algorithm on the factory data.
1. Define $H_0$: $\mu = 100$ (Machine is fine).
2. Collect Data: $\bar{X} = 102$, $n = 50$.
3. Calculate Z-Score:
   - Signal: $102 - 100 = 2$
   - Noise (Standard Error): $5 / \sqrt{50} \approx 0.707$
   - $Z = 2 / 0.707 \approx \mathbf{2.83}$
4. Compute P-Value:
   - How rare is a Z-score of 2.83? In a standard curve, 95% of data lies between -1.96 and +1.96, so 2.83 is far outside.
   - One-tailed: $P(Z > 2.83) \approx 0.0023$ (0.23%).
5. Decision:
   - $0.0023 < 0.05$. Reject $H_0$.
   - Reality Check: The machine is broken. Stop the line.
Calculating tail probabilities manually is hard. Use Python.
```python
import numpy as np
from scipy.stats import norm

# Inputs
mu_target = 100
sigma = 5
n = 50
x_bar = 102

# Calculate Z
std_error = sigma / np.sqrt(n)
z_score = (x_bar - mu_target) / std_error

# Calculate P-value (Two-tailed)
# We multiply by 2 because we care if it's too high OR too low
p_value = 2 * (1 - norm.cdf(abs(z_score)))

print(f"Z-Score: {z_score:.2f}")
print(f"P-Value: {p_value:.4f}")
```

Output:

```
Z-Score: 2.83
P-Value: 0.0047
```
The Z-score maps your data onto the Standard Normal Curve (Mean=0, SD=1).
```
Standard Normal Distribution (The Yardstick)

          ^
          |          Center (0)
          |              |
          |            __|__
          |         _ /     \ _
          |          /       \
          |         /         \
          |      __|           |__      [Rejection Zone]
    ______|_|__|___|___________|___|_|______X____
    Z:      -2    -1     0     1    2       ^
                                            |
                                     2.83 (Observed)
```
The observed Z (2.83) falls into the rejection tail.
The probability of landing here by luck is tiny.
We just determined that a 2g difference is huge. Why? Because we had a sample size of 50.

Key Insight: If $n$ had been 5 instead of 50, the Standard Error would be $5 / \sqrt{5} \approx 2.24$, giving $Z \approx 0.89$: not significant. Larger samples shrink the noise, so the same signal becomes easier to detect.
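To see the effect, re-run the same 2g signal at several sample sizes (a sketch reusing the factory numbers):

```python
import numpy as np
from scipy.stats import norm

# Same 2g drift, same 5g machine noise, different sample sizes
for n in [5, 50, 500]:
    se = 5 / np.sqrt(n)        # Standard Error shrinks as n grows...
    z = 2 / se                 # ...so the same signal scores higher
    p = 2 * (1 - norm.cdf(z))  # two-tailed p-value
    print(f"n={n:3d}  SE={se:.3f}  Z={z:.2f}  p={p:.4f}")
```

With $n = 5$ the 2g drift is statistically invisible; with $n = 500$ it is overwhelming.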
Context in Big Picture: Phase 2 (Z-Test) required you to know the global Standard Deviation ($\sigma$). In the real world, you rarely know it; you only have the spread inside your own sample ($s$). That is where the T-test comes in.
You are testing a new website design.
- Group A (Control): 10 users. Average time on site = 45s.
- Group B (Treatment): 10 users. Average time on site = 55s.
Question: Is the 10s improvement real, or just random noise from small sample sizes?
Constraint: You do not know the standard deviation of the entire internet population. You only have the variation inside your two groups ($s_1, s_2$).

Because we are estimating the variance from the sample ($s$ instead of $\sigma$), we introduce extra uncertainty. If we used the Normal Distribution (Z), we would be overconfident. The Fix: The T-Distribution. It looks like a Normal Distribution, but with a lower peak and fatter tails. The fat tails represent the "penalty" for not knowing the true population variance.
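The "penalty" shows up directly in the critical values: with fatter tails, you need a larger score to reach the same 5% significance. A quick sketch using scipy's percent-point function:

```python
from scipy.stats import norm, t

# Score needed for significance at alpha = 0.05 (two-tailed)
print(f"Normal (Z):  {norm.ppf(0.975):.2f}")  # 1.96
for df in [4, 9, 29]:
    # Smaller samples -> fewer degrees of freedom -> higher bar to clear
    print(f"T (df={df:2d}):   {t.ppf(0.975, df):.2f}")
```

As the degrees of freedom grow, the T threshold slides back down toward the Normal's 1.96.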
We compare the difference between two groups relative to their combined volatility:

$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

- $\bar{X}_1 - \bar{X}_2$: The Signal (10s difference).
- $s_1^2, s_2^2$: Sample Variances (the noise inside each group).
- $n_1, n_2$: Sample sizes.
1. Define $H_0$: $\mu_A = \mu_B$ (The new design changes nothing).
2. Collect Data:
   - Group A: Mean=45, SD=15, n=10.
   - Group B: Mean=55, SD=15, n=10.
3. Calculate T-Score:
   - Signal: $55 - 45 = 10$.
   - Noise: $\sqrt{\frac{225}{10} + \frac{225}{10}} = \sqrt{45} \approx 6.7$.
   - $t = 10 / 6.7 \approx \mathbf{1.49}$.
4. Compute P-Value:
   - We look up $t = 1.49$ with degrees of freedom $n_1 + n_2 - 2 = 18$.
   - P-value $\approx 0.15$.
5. Decision:
   - $0.15 > 0.05$. Fail to Reject $H_0$.
   - Conclusion: The 10s increase is likely noise. You do not have enough evidence to claim the new design works.
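The hand calculation above can be reproduced from the summary statistics alone (a sketch; `t.sf` is scipy's survival function, the upper-tail probability):

```python
from math import sqrt
from scipy.stats import t

# Summary statistics from the example
mean_a, sd_a, n_a = 45, 15, 10   # Control
mean_b, sd_b, n_b = 55, 15, 10   # Treatment

signal = mean_b - mean_a                      # 10
noise = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)   # sqrt(45) ~ 6.7
t_score = signal / noise                      # ~1.49
df = n_a + n_b - 2                            # 18

p_value = 2 * t.sf(t_score, df)               # two-tailed, ~0.15
print(f"t = {t_score:.2f}, p = {p_value:.2f}")
```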
This is the most common statistical function you will use in Python.
```python
from scipy import stats
import numpy as np

# Fake data representing time on site (seconds)
group_a = np.array([40, 50, 45, 30, 60, 42, 48, 35, 55, 45])  # Control
group_b = np.array([50, 60, 55, 40, 70, 52, 58, 45, 65, 55])  # Treatment

# Run Independent Two-sample T-test
t_stat, p_val = stats.ttest_ind(group_b, group_a)

print(f"T-Statistic: {t_stat:.2f}")
print(f"P-Value: {p_val:.4f}")
```

Output:

```
T-Statistic: 2.50
P-Value: 0.0221
```
(Note: In this specific code snippet, the data was cleaner, resulting in a significant P-value).
The T-distribution changes shape based on Sample Size ($n$), via the degrees of freedom: the fewer the data points, the fatter the tails. As $n$ grows, it converges to the Normal curve.
```
Comparison of Curves

        ^
        |        _..-'''-.._         <-- Normal Distribution (Z)
        |      .'     |     '.           (High Peak, Thin Tails)
        |     /   _..-|-.._   \
        |    / .'     |     '. \     <-- T-Distribution (n=5)
        |   | /       |       \ |        (Lower Peak, Fatter Tails)
  ______|_\_|_________|_________|_/____
         -3           0           3
```
The fatter tails mean you need a HIGHER score to prove significance. The T-test is "conservative": because your sample is small, it forces you to bring stronger evidence.
We just failed to reject the null in the manual example ($p \approx 0.15$): a 10-second difference was not enough evidence at only 10 users per group.
Context in Big Picture: The Z-test and T-test work for averages of continuous numbers (Height, Money, Time). But what if your data is categorical?
- Did the user Click or Not Click?
- Did the customer Buy or Not Buy?
You cannot calculate the "average" of a Click. You can only count frequencies. For this, we use the Chi-Square ($\chi^2$) test.
You run an E-commerce site. You test a new "Buy" button color.
- Control (Blue): 100 visitors, 10 bought. (10% rate)
- Variant (Red): 100 visitors, 20 bought. (20% rate)
Question: Is the Red button actually better, or did you just get lucky with those specific 100 visitors?
The logic shifts from "Comparing Means" to "Comparing Tables."
- Observed ($O$): What actually happened.
- Expected ($E$): What should have happened if the button color didn't matter ($H_0$).
We measure the logical "distance" between $O$ and $E$:

$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$

- $O_i$: Observed count in a cell.
- $E_i$: Expected count in a cell.
- We square the difference to get rid of negatives.
- We divide by the expected count to normalize (an error of 5 is huge if the expected count is 10, but tiny if it is 10,000).
1. Define $H_0$: Color has no effect on sales.
2. Build Observed Table:
   - Total Visitors: 200.
   - Total Sales: 30.
   - Global Conversion Rate: $30/200 = 15\%$.

| Group | Bought | Didn't Buy | Total |
|---|---|---|---|
| Blue | 10 ($O_1$) | 90 | 100 |
| Red | 20 ($O_2$) | 80 | 100 |

3. Calculate Expected Table ($H_0$ True): If $H_0$ is true, both groups should convert at the Global Rate (15%).
   - Blue Expected Buys: $100 \times 0.15 = 15$ ($E_1$).
   - Red Expected Buys: $100 \times 0.15 = 15$ ($E_2$).

| Group | Bought (E) | Didn't Buy (E) | Total |
|---|---|---|---|
| Blue | 15 | 85 | 100 |
| Red | 15 | 85 | 100 |

4. Calculate $\chi^2$ (summing over all four cells):
   - Blue Buys: $(10 - 15)^2 / 15 = 25/15 \approx 1.67$
   - Red Buys: $(20 - 15)^2 / 15 = 25/15 \approx 1.67$
   - Blue Didn't Buy: $(90 - 85)^2 / 85 \approx 0.29$
   - Red Didn't Buy: $(80 - 85)^2 / 85 \approx 0.29$
   - Sum: $\chi^2 \approx 3.92$.
5. Compute P-Value:
   - Degrees of Freedom: $(\text{rows} - 1)(\text{cols} - 1) = 1$.
   - Look up $\chi^2 = 3.92$ with 1 degree of freedom: P-value $\approx 0.047$.
6. Decision:
   - $0.047 < 0.05$. Reject $H_0$.
   - Result: The Red button really works.
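The four cells above can be summed in a couple of lines (a sketch of the hand calculation):

```python
# Cells in order: Blue buy, Blue no-buy, Red buy, Red no-buy
observed = [10, 90, 20, 80]
expected = [15, 85, 15, 85]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"Chi2 = {chi2:.2f}")  # Chi2 = 3.92
```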
We do not construct expected tables by hand. Python does it automatically.
```python
from scipy.stats import chi2_contingency

# Input: The Observed Table
# [[Bought, No Buy] (Control), [Bought, No Buy] (Variant)]
data = [[10, 90],
        [20, 80]]

# correction=False disables Yates' continuity correction,
# so the result matches the hand calculation above
chi2, p, dof, expected = chi2_contingency(data, correction=False)

print(f"Chi2 Stat: {chi2:.2f}")
print(f"P-Value: {p:.4f}")
```

Output:

```
Chi2 Stat: 3.92
P-Value: 0.0477
```
Think of this as fitting a grid over reality.
```
  Expected Reality (H0)              Observed Reality (Data)
 _______________________            _______________________
|          |            |          |          |            |
|    15    |     85     |          |    10    |     90     |  <--- Control
|__________|____________|          |__________|____________|
|          |            |          |          |            |
|    15    |     85     |          |    20    |     80     |  <--- Variant
|__________|____________|          |__________|____________|
           ^                                  ^
           |__________________________________|
                        Measure
                   Difference (Chi^2)
```
We rejected $H_0$: the jump from 10% to 20% conversion is too large to blame on luck.
Context in Big Picture: You now know the math that runs inside the black boxes of tools like Optimizely, Google Optimize, or Tableau. In the real world, you rarely calculate these by hand. Your job is to select the right test and interpret the output correctly.
You don't need to memorize formulas. You need to memorize this decision matrix. Look at your data, identify the type, and pick the algorithm.
| Data Type | Question Example | The Variable | The Test |
|---|---|---|---|
| Categorical | Did they click? (Yes/No) | Conversion Rate | Chi-Square Test |
| Continuous | Did they spend more? ($) | Average Revenue (Unknown $\sigma$) | Two-Sample T-Test |
| Continuous | Is the machine calibrated? | Average Weight (Known $\sigma$) | Z-Test |
| Binary | Is the coin rigged? | Success Count | Binomial Test |
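The matrix can even be written as a function. A toy sketch (`choose_test` is a hypothetical helper for illustration, not a real library API):

```python
def choose_test(data_type: str, sigma_known: bool = False, groups: int = 2) -> str:
    """Pick the test from the decision matrix above."""
    if data_type == "binary":
        return "Binomial Test"
    if data_type == "categorical":
        return "Chi-Square Test"
    if groups >= 3:
        return "ANOVA"
    return "Z-Test" if sigma_known else "Two-Sample T-Test"

print(choose_test("categorical"))                   # Chi-Square Test
print(choose_test("continuous", sigma_known=True))  # Z-Test
print(choose_test("continuous"))                    # Two-Sample T-Test
```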
The Trap: You run an A/B test. You check the P-value every hour.
- Hour 1: $p = 0.08$ (Not significant)
- Hour 4: $p = 0.04$ (Significant! Stop!)
- Hour 5: $p = 0.12$ (Not significant)
The Reality: If you check often enough, you will eventually see $p < 0.05$ by pure chance. This is the "peeking" problem, a form of p-hacking. Fix your sample size in advance and evaluate the test only once you reach it.
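A simulation makes the trap concrete. Below, both groups are drawn from the same distribution, so $H_0$ is true by construction, yet peeking four times per experiment inflates the false-positive rate well above the promised 5% (a sketch assuming normal data; the sizes and seed are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 500
false_positives = 0

for _ in range(n_experiments):
    # A/A test: both groups are identical by construction (H0 is true)
    a = rng.normal(0, 1, 400)
    b = rng.normal(0, 1, 400)
    # "Peek" at the p-value after every 100 users per group
    for n in (100, 200, 300, 400):
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            false_positives += 1  # we would have (wrongly) stopped here
            break

print(f"False-positive rate with peeking: {false_positives / n_experiments:.1%}")
```

Each extra peek is an extra lottery ticket for a fluke; the 5% guarantee only holds for a single, pre-planned look.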
The Scenario: You test a new font on Google Search.
- Sample Size: 10 Million users.
- Result: Search time improves by 0.0001 seconds.
- P-Value: $< 0.0000001$.
The Verdict:
- Statistically Significant? Yes. The math proves the difference is real.
- Practically Significant? No. 0.0001 seconds is invisible to a human.
- Lesson: Low P-value $\neq$ Big Impact. It just means "High Certainty." Always check the Effect Size (the actual magnitude of the difference).
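A simulation shows the split between the two kinds of significance. This sketch uses simulated data where the true difference is a hundredth of a standard deviation (all numbers are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two million users per group; true difference = 0.01 standard deviations
a = rng.normal(0.00, 1.0, 2_000_000)
b = rng.normal(0.01, 1.0, 2_000_000)

result = stats.ttest_ind(b, a)
# Cohen's d: the difference in means measured in standard-deviation units
cohens_d = (b.mean() - a.mean()) / np.sqrt((a.var() + b.var()) / 2)

print(f"p-value:   {result.pvalue:.2e}")  # tiny -> "statistically significant"
print(f"Cohen's d: {cohens_d:.3f}")       # ~0.01 -> practically negligible
```

The giant sample makes the Standard Error microscopic, so even a meaningless difference produces a tiny p-value; the effect size is what exposes it as trivial.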
We started with a promise: to learn the universal algorithm of inference.
- The Framework: $H_0$, Data, Test Statistic, P-value, Decision.
- The Yardstick: Standard Error measures how "noisy" the data is.
- The Ratio: All tests follow the same pattern: $$ \text{Score} = \frac{\text{Observed Signal}}{\text{Expected Noise}} $$
- The Decision: If the Signal is much stronger than the Noise (High Score, Low P-value), we accept the new reality.
You have now graduated from "guessing" to "inference."