📊 Statistics Fundamentals

A comprehensive guide to statistical tests and sample size calculations

Introduction

Statistical hypothesis testing is fundamental to data analysis and research. Understanding which test to use and how to calculate the required sample size ensures your analysis has sufficient statistical power to detect meaningful effects.

💡 Key Concept: Sample size directly impacts your ability to detect true effects (statistical power) while controlling for false positives (Type I error).

T-Test

Comparing means when population variance is unknown

When to Use

  • One-sample t-test: Compare sample mean to known population mean
  • Two-sample t-test: Compare means of two independent groups
  • Sample size is small (typically n < 30)
  • Population standard deviation is unknown
  • Data is approximately normally distributed

Test Statistic

\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]

where \(\bar{x}\) = sample mean, \(\mu\) = population mean, \(s\) = sample standard deviation, \(n\) = sample size
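As a minimal sketch, the statistic above can be computed with the standard library alone (the sample values and hypothesized mean here are illustrative, not from any real dataset):

```python
import math

def one_sample_t(sample, mu):
    """t = (x̄ - μ) / (s / √n) for a one-sample t-test."""
    n = len(sample)
    xbar = sum(sample) / n
    # Sample standard deviation with the n - 1 (Bessel) correction.
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    return (xbar - mu) / (s / math.sqrt(n))

# Example: five measurements tested against a hypothesized mean of 5.0.
t = one_sample_t([5.1, 4.9, 5.0, 5.2, 4.8], mu=5.0)
```

In practice a library routine (e.g., `scipy.stats.ttest_1samp`) also returns the p-value; the point here is only to mirror the formula.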

Sample Size Formula (Two-Sample T-Test)

\[ n = \frac{2(Z_{\alpha/2} + Z_{\beta})^2 \sigma^2}{\delta^2} \]

• \(Z_{\alpha/2}\) = critical value for significance level (e.g., 1.96 for α=0.05)

• \(Z_{\beta}\) = critical value for power (e.g., 0.84 for 80% power)

• \(\sigma\) = pooled standard deviation

• \(\delta\) = minimum detectable difference (effect size)
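Plugging numbers into the formula above, a sketch (the σ and δ values are made up for illustration; ceiling rounds up to whole subjects):

```python
import math

def n_two_sample(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Per-group sample size: n = 2(Z_{α/2} + Z_β)² σ² / δ², rounded up."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Detect a difference of 5 units when the pooled SD is 10,
# at α = 0.05 (two-tailed) and 80% power.
n = n_two_sample(sigma=10, delta=5)  # 63 per group
```

Note how the formula penalizes small effects: halving δ roughly quadruples the required n.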

Paired T-Test

Comparing means of matched or repeated measurements

When to Use

  • Measurements are paired (before/after, matched subjects)
  • Same subjects measured twice under different conditions
  • Reduces variability by controlling for individual differences
  • Each pair is independent from other pairs

Test Statistic

\[ t = \frac{\bar{d}}{s_d / \sqrt{n}} \]

where \(\bar{d}\) = mean of differences, \(s_d\) = standard deviation of differences

Sample Size Formula (Paired T-Test)

\[ n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \sigma_d^2}{\delta^2} \]

• \(\sigma_d\) = standard deviation of paired differences

• \(\delta\) = minimum detectable mean difference

• Typically requires fewer subjects than independent samples due to reduced variability
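The savings from pairing can be made concrete with a quick sketch (the SD values are hypothetical; the point is that σ_d is usually smaller than the between-subject σ):

```python
import math

def n_paired(sigma_d, delta, z_alpha=1.96, z_beta=0.84):
    """Number of pairs: n = (Z_{α/2} + Z_β)² σ_d² / δ², rounded up."""
    return math.ceil((z_alpha + z_beta) ** 2 * sigma_d ** 2 / delta ** 2)

# Independent two-sample design: between-subject SD 10, δ = 5.
n_indep = math.ceil(2 * (1.96 + 0.84) ** 2 * 10 ** 2 / 5 ** 2)  # 63 per group

# Paired design: pairing removes between-subject variation, so suppose
# the SD of the differences drops to 6.
n_pairs = n_paired(sigma_d=6, delta=5)  # 12 pairs
```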

Z-Test

Comparing means when population variance is known

When to Use

  • Large sample size (typically n ≥ 30)
  • Population standard deviation is known
  • Data is normally distributed or sample is large enough for CLT
  • More powerful than t-test when conditions are met

Test Statistic

\[ Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]

where \(\sigma\) = known population standard deviation
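A minimal sketch of the statistic, with illustrative numbers (a known σ is assumed, as the formula requires):

```python
import math

def z_statistic(xbar, mu, sigma, n):
    """Z = (x̄ - μ) / (σ / √n), with σ known."""
    return (xbar - mu) / (sigma / math.sqrt(n))

# Sample of 36 with mean 105, against μ = 100 and known σ = 15.
z = z_statistic(105, 100, sigma=15, n=36)  # 2.0, which exceeds 1.96,
# so the null is rejected at α = 0.05 (two-tailed).
```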

Sample Size Formula (Z-Test)

\[ n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \sigma^2}{\delta^2} \]

• Uses known population variance \(\sigma^2\)

• Generally requires smaller sample than t-test due to known variance

Statistical Power

Probability of detecting a true effect

Definition

Statistical power is the probability of correctly rejecting the null hypothesis when it is false (i.e., detecting a true effect). Power = 1 - β, where β is the Type II error rate.

Type I Error (α)

False positive - rejecting true null hypothesis. Typically set at 0.05 (5%).

Type II Error (β)

False negative - failing to reject false null hypothesis. Power = 1 - β.

Factors Affecting Power

  • ↑ Sample size: Larger n increases power
  • ↑ Effect size: Larger effects are easier to detect
  • ↑ Significance level (α): Higher α increases power (but more false positives)
  • ↓ Variability: Lower variability increases power

⚡ Conventional Power: Studies typically aim for 80% power (β = 0.20)
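Power can also be checked empirically. The sketch below simulates a one-sample z-test of H0: μ = 0 (an assumed setup, chosen for simplicity) and counts how often the true effect is detected; with n chosen from the sample size formula, the estimate should land near the planned 80%:

```python
import math
import random

def simulated_power(delta, sigma, n, crit=1.96, sims=5000, seed=1):
    """Monte Carlo power estimate for a one-sample z-test of H0: μ = 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        # Draw a sample from the alternative: true mean = delta.
        xbar = sum(rng.gauss(delta, sigma) for _ in range(n)) / n
        z = xbar / (sigma / math.sqrt(n))
        if abs(z) > crit:
            rejections += 1
    return rejections / sims

# n = 32 comes from (1.96 + 0.84)² σ² / δ² with δ = 0.5 and σ = 1.
power = simulated_power(delta=0.5, sigma=1.0, n=32)
```

Setting `delta=0` instead recovers the Type I error rate, which should hover near α = 0.05.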

Confidence Intervals

Range of plausible values for population parameter

Interpretation

A 95% confidence interval means that if we repeated the study many times, 95% of calculated intervals would contain the true population parameter.

Confidence Interval Formula

\[ CI = \bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]

95% CI: \(\bar{x} \pm 1.96 \cdot SE\) (for large samples)

Sample Size for Desired Margin of Error

\[ n = \left(\frac{Z_{\alpha/2} \cdot \sigma}{E}\right)^2 \]

• \(E\) = desired margin of error (half-width of CI)

• Example: For 95% CI with margin of error ±5 and σ=20: n = (1.96×20/5)² ≈ 62
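The worked example can be reproduced directly (ceiling rounds up to a whole subject):

```python
import math

def n_for_margin(sigma, E, z=1.96):
    """n = (Z_{α/2} · σ / E)², rounded up."""
    return math.ceil((z * sigma / E) ** 2)

# 95% CI, margin of error ±5, σ = 20 — matching the example above.
n = n_for_margin(sigma=20, E=5)  # 62
```

As with power-based formulas, halving the margin of error roughly quadruples the required n.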

📋 Sample Size Calculation Summary

General Process

  1. Define effect size (δ): The minimum meaningful difference you want to detect
  2. Set significance level (α): Usually 0.05 (5% false positive rate)
  3. Choose desired power (1-β): Usually 0.80 (80% chance to detect true effect)
  4. Estimate variability (σ): From pilot data or literature
  5. Calculate sample size: Using appropriate formula for your test
  6. Add buffer: Account for dropouts/missing data (typically 10-20%)
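The steps above chain together into one calculation. A sketch for a two-sample design, with illustrative inputs and a 15% dropout buffer:

```python
import math

def planned_n(delta, sigma, z_alpha=1.96, z_beta=0.84, dropout=0.15):
    """Per-group n for a two-sample design, inflated for expected dropout."""
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    # Inflate so that the expected number of completers still meets n.
    return math.ceil(n / (1 - dropout))

# δ = 5, σ = 10, α = 0.05, power = 80%, 15% dropout buffer.
n = planned_n(delta=5, sigma=10)  # 74 per group to enroll
```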

Quick Reference

  • α = 0.05 (two-tailed): Z = 1.96
  • α = 0.01 (two-tailed): Z = 2.58
  • Power = 80%: Z_β = 0.84
  • Power = 90%: Z_β = 1.28

Rule of Thumb

  • ✓ Small effect: ~400 per group
  • ✓ Medium effect: ~64 per group
  • ✓ Large effect: ~26 per group
  • (Cohen's d: 0.2, 0.5, 0.8 respectively, α=0.05, power=0.80)
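These figures can be approximately reproduced by setting σ = 1 in the two-sample formula, so that δ becomes Cohen's d. The normal approximation below comes out slightly under the conventional t-based figures, which the rule of thumb rounds up:

```python
import math

def n_per_group(d, z_alpha=1.96, z_beta=0.84):
    """Normal-approximation per-group n for standardized effect size d."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

small = n_per_group(0.2)   # 392 (rule of thumb: ~400)
medium = n_per_group(0.5)  # 63  (~64)
large = n_per_group(0.8)   # 25  (~26)
```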

Key Takeaways

  • Larger samples give more precise estimates and higher power
  • Paired designs generally require fewer subjects than independent samples
  • Z-tests require fewer samples than t-tests (when σ is known)
  • Always calculate sample size before collecting data
  • Consider practical constraints (budget, time, feasibility) in your design

📊 Statistical analysis is a tool for discovery, not a substitute for thinking.

Always consider the context and practical significance of your findings.