Super AP Stats Prep
By using these notes, you agree to use it solely for reference purposes and not to copy or plagiarize any of it. All work must be your own, and any information or ideas drawn from these notes must be properly cited. These notes are here to help guide you! These are public notes, and NOT your own. Failure to comply may result in academic consequences.
List created by Shaurya V.
  • On the AP Test, SAY WHAT THE VALUE THAT YOU FOUND IS before you explain the context. (ex: “r² = 0.955. Context: 95.5% of the variation in ….”)

Semester 1 Review

Unit 1

Qualitative (Categorical): Takes on Categories of Data (Gender, Political Affiliation) - use with χ²-distribution or Z-Distribution
  • Use: Tables, Pie Charts, Dotplots, Bar Charts, Segmented/Side-by-Side Bar Charts
Quantitative: Is Numerical (Age, Total Cost, etc.) - use w/ T-Distribution
  • Use: Histograms, Scatterplots, Dot Plots, Stem-and-Leaf plots, Boxplots
  • Types of Variables: Discrete (countable) vs. Continuous (measured, not countable - e.g. weight)
To Describe/Compare Distributions: S (Shape) O (Outliers) C (Center) S (Spread)
  • Shape: Modality (How many “Peaks” the data has - Unimodal, Bimodal, etc.) and Shape of Data (Symmetric, Skewed Left/Right, Uniform)
  • Skewed Left: Tail of Data to the LEFT/ Mean pulled towards the left -> Median > Mean
  • Skewed Right: Tail of Data to the RIGHT/ Mean pulled to the right -> Median < Mean
  • In roughly symmetric distributions, Mean ≈ Median
  • Outliers: Use the IQR Rule (outliers fall below Q1 − 1.5·IQR or above Q3 + 1.5·IQR) or use the 2 S.D. Rule (For Normal Distributions, outliers lie outside 2 standard deviations of the mean -> μ ± 2σ)
  • Center: Use mean/median/Q1 or Q3 measures, approximate if necessary.
  • Spread: Use Standard Deviation, Range, IQR, etc.
  • What is an IQR: Q3 − Q1, represents the spread of the middle half (middle 50%) of the data
Largest S.D -> Data with the Largest Spread (Farthest from the Mean - Huge Gaps between Data, for example)
Frequency v. Rel Frequency: Frequency == the counts, Rel. Frequency == the proportions
Percentile: % of Values less than or equal to value -> 25th percentile has 25% of the values below it.
Z Score: Shows # of standard deviations above/below the mean, calculated w/ (xi - μ)/σ
  • Is Standardized (Doesn’t take Units)
  • Both Z-Scores and Percentiles can be used for ANY DISTRIBUTION
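A quick sanity check of the z-score formula (a minimal sketch; the mean, SD, and observation below are made-up values, not from these notes):

```python
# Minimal z-score sketch: z = (x - mu)/sigma (all values hypothetical).
mu, sigma = 70, 8      # population mean and standard deviation
x = 82                 # one individual observation

z = (x - mu) / sigma   # number of SDs above/below the mean
print(z)               # 1.5 -> 82 is 1.5 SDs above the mean
```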
Normal Distribution: Symmetric, Bell Shaped
  • Use NormalCDF to calculate Proportions. When using z-scores w/ Normalcdf, μ=0, σ=1.
  • invNorm calculates the value OR z-score given the proportion (area).
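Outside a calculator, the same two operations exist in scipy (this sketch assumes scipy is installed): normalcdf corresponds to a difference of `norm.cdf` calls, and invNorm corresponds to `norm.ppf`.

```python
from scipy.stats import norm

# normalcdf(lower, upper, mu, sigma) <-> difference of two cdf calls:
p = norm.cdf(1) - norm.cdf(-1)       # area within 1 SD on the z-scale
print(round(p, 4))                   # 0.6827 (the empirical rule's 68%)

# invNorm(area) <-> norm.ppf(area): value with that area to its LEFT
z = norm.ppf(0.975)
print(round(z, 2))                   # 1.96
```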

Example Problems

  1. Scientists working for a water district measure the water level in a lake each day. The daily water level in the lake varies due to weather conditions and other factors. The daily water level has a distribution that is approximately normal with mean water level of 84.07 feet. The probability that the daily water level in the lake is at least 100 feet is 0.064. Which of the following is closest to the probability that on a randomly selected day the water level in the lake will be at least 90 feet?
  • To solve, first find the missing standard deviation. Since P(X ≥ 100) = 0.064, the area to the LEFT of 100 is 1 − 0.064, so find the z-score with invNorm(1 − 0.064, 0, 1) = 1.522. Next, set the z-score equal to the z-score formula -> 1.522 = (100 − 84.07)/σ. Solve for σ, then do the problem as normal.
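The same steps can be scripted (a sketch assuming scipy is available; the numbers come straight from the problem):

```python
from scipy.stats import norm

mu = 84.07                          # given mean water level (feet)

# P(X >= 100) = 0.064, so the area LEFT of 100 is 1 - 0.064:
z = norm.ppf(1 - 0.064)             # z-score of 100, ~1.522
sigma = (100 - mu) / z              # solve z = (100 - mu)/sigma for sigma

p_90 = 1 - norm.cdf(90, loc=mu, scale=sigma)   # P(X >= 90)
print(round(sigma, 2), round(p_90, 3))
```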

Unit 2

Scatterplots: Relate bivariate (2 variable) data - x (explanatory/independent variable - influences or explains change) and y (response/dependent variable - outcome of the change caused by explanatory variable).
  • Describe scatterplots with Strength, Direction, and Form (ex: Strong/Moderate/Weak, Positive/Negative, Linear/Curved/No Pattern)
Least Squares Regression Line: Describes how y changes as x changes (predicts y given x value), and minimizes the residuals (Actual - Predicted y value). Sum of residuals ALWAYS 0.
  • Must Contain (x̄, ȳ)
  • Equation: ŷ = a + bx, ŷ = predicted y, x = reg x ALWAYS DEFINE VARIABLES!!
  • To Interpret Slope: “For every 1-unit increase in x, the y is predicted to change by b.”
  • To Interpret y-Intercept: “When x is 0, the predicted value of y is a.”
  • To Calculate Slope: b = r(Sy/Sx)
  • Extrapolation: When you attempt to estimate values with the LSR line using x-values that are far outside the data range. DO NOT!!!
Residual Plots: Scatterplot of residuals vs. x. Ideally, you want to see no pattern and small residual values -> allows you to conclude that the LSR equation is appropriate (Linear model is a good predictor)
r (Correlation Coefficient): Measure of direction/strength of a LINEAR relationship/ numerical value for correlation
  • Ranges from -1 to 1 (r=±1 -> Perfect Linear Relationship, r<0 -> Neg Association, r>0 -> Pos association - y MUST CHANGE WITH X!)
  • Unaffected by changing units/axes (switching x and y), but is not resistant to outliers.
  • Interpret w/ strength, direction, form (ex: strong, positive, linear)
r² (Coefficient of Determination): % of the variation in y that is explained by the LSR line that uses x as the explanatory variable.
  • r² = 1 − (sum of squared residuals)/(total sum of squares), interpret w/ the sentence above.
Large r and r² values are good!! They help show that the linear model is a good fit.
s (Standard Deviation of the Residuals): Average prediction error when using the LSR line. Ideally, a low s value is good.
  • Calculate w/ s = √(Σ residuals²/(n−2)), interpret w/ the sentence above.
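The summary-stat formulas above (slope from r, line through the means, r²) can be checked directly; a minimal sketch with hypothetical values for r, the SDs, and the means:

```python
# Hypothetical summary statistics (not taken from any dataset in these notes):
r = 0.8                    # correlation coefficient
s_x, s_y = 2.0, 5.0        # SDs of x and y
x_bar, y_bar = 10.0, 50.0  # means of x and y

b = r * (s_y / s_x)        # slope: b = r * (Sy/Sx)
a = y_bar - b * x_bar      # the LSR line must pass through (x_bar, y_bar)
r_squared = r ** 2         # coefficient of determination

print(b, a)                # 2.0 30.0
print(round(r_squared, 2)) # 0.64
```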
Unusual Point: A point that stands out; not a formal category by itself.
Influential Point: A data value that, when removed, will “substantially” change the slope, y-intercept, r, r2, or s values.
  • Outliers: Doesn’t follow the pattern of data AND has a large residual compared to others (Much smaller/larger y value compared to others).
  • High Leverage: Has a much smaller/larger x value compared to other data points.
  • High Leverage points are usually Influential. Points that are both HL and Outliers are the MOST influential.

Unit 3

Biased Sampling Methods: Convenience Sample (selecting individuals that are easy to contact doesn’t represent population well), Voluntary Response Sample (taking volunteers for a sample can cause an over/underestimation of the true parameter)
Best Method is to use Random Sampling: Using chance to select a sample to minimize bias and better represent the population
  • SRS (Label individuals with number, select “n” unique individuals, identify the individuals)
  • Stratified Random Sampling: used to ensure all groups in population are represented proportionally. Divide population into strata (a group that share a characteristic relevant to the study), select SRS from each strata
  • Cluster Sampling: Makes sampling a large/spread-out population easier. Divide the population into “clusters” (heterogeneous group of individuals that are already mixed), select SRS of the clusters, sample all from the selected clusters
  • Systematic Random Sampling: Easiest of the “random” methods. Select every “kth” individual.
  • Multistage Sampling: Allows for control over multiple characteristics, giving you a representative sample of a large population. Simply use more than 1 method (ex: multiple stratified random samples)
  • Using a representative sample allows you to make inferences about the population parameter.
Sampling Errors: Occur due to how we select the sample - can be avoided with proper sampling methods.
  • Undercoverage: Using a bad sampling method or a bad sampling frame (list from which the sample is chosen) can result in some groups of the population being left out of the sampling process; they are not represented.
  • Sample Size: Using a small sample size can prevent representing the population well.
  • Bad (Biased Sampling Methods)
Non-Sampling Errors: Happens after the sample is selected - we don’t have much control over these.
  • Response Bias: When individuals respond inaccurately to the sample. This can occur when the subject at hand is sensitive, and it can depend on how or who the question is asked to.
  • Nonresponse: When individuals chosen from the sample don’t or can’t respond.
  • Question-Wording: Leading questions (providing background information that shows bias towards an answer), open-ended questions (questions that are hard to answer, and therefore, hard to analyze), double negatives (can confuse individuals), word choice, overlapping options, etc.
For the Best Results, use Large Samples obtained using Random Sampling. It’s not that deep.
Experiments: Imposing treatments on subjects to observe a response. This helps to prove causation - response of one variable is due to the change in another.
  • Observational Studies involve simply surveying a sample, and not imposing treatments. With observational studies, you can only show association (e.g. ___ tends to increase/decrease ____)
  • Explanatory (Independent) Variable /Factor: The treatment(s)
  • Level: Strength or Size of an explanatory variable
  • Response (Dependent) Variable: The outcome we measure
  • Experimental Units: The objects that receive the treatments - can be groups (ex: Container of Insects) or Subjects (human/animals)
  • Treatment: Condition/Action applied to the experimental units
  • Control: Efforts to minimize variability in the way subjects are obtained and treated
  • Control Group: Group of subjects that receive no treatment/ a placebo treatment. Used as a comparison to the treatment groups, and help to eliminate lurking variables.
  • Placebo: “Dummy” treatment that should have no impact.
  • Placebo Effect: When subjects respond to a placebo due to a belief that they are receiving the true treatment.
  • Blind: Subjects (and experimenters - double blind) don’t know what treatment they are receiving
  • Randomization: Using random assignment of treatments to the subjects. This helps to reduce or eliminate confounding.
  • Replication: Using many subjects and having many trials
  • Statistical Significance: When an observed effect is so large that it could rarely happen by chance
Lurking Variables: When variables other than the explanatory variable influence the response variable/when an apparent association between two variables is actually common response to a third unseen variable.
  • Results from lack of control. Use blocked designs or stratification.
Confounding: When 2 variables’ effects on a response variable cannot be distinguished (ex: losing weight because of better diet or exercise?)
  • Results from bad experimental design
Experimental Designs: Use Control, Replication, and Randomization.
  • Randomized Comparative Experiment: Randomly assign subjects to treatment groups, compare results together.
  • Blocked Design: Stratifying subjects into groups (not random), assigning treatments to each group (random), comparing results of groups separately.
  • Allows us to control/isolate lurking variables.
  • Matched-Pairs Design: Assigning treatments to “paired” subjects - each subject in a pair receives a different treatment, or each subject receives both treatments in random order. Compare within pairs.

Unit 4

General Addition Rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
  • If Mutually Exclusive: P(A ∩ B) = 0
General Multiplication Rule: P(A ∩ B) = P(A) * P(B|A)
  • If Independent: P(B|A) = P(B)
Conditional Probability Rule: P(A|B) = P(A ∩ B) / P(B)
Mutually Exclusive (Disjoint) vs. Independent: Sharing no common outcomes vs knowing the outcome of one event doesn’t change the probability of another event.
  • Two Events (with nonzero probabilities) CANNOT be both disjoint and independent.
Random Variables: Numerical outcomes that are the result of a chance process. Set X equal to the random variable.
  • Can either be discrete (set # of outcomes, use a histogram or table to show each outcome + probability of each, use 1-var-stats to get μ and σ (outcomes in L1, probabilities in L2)) or continuous (Use the normal distribution to obtain probabilities).
  • Sample Space: Shows all possible outcomes of a chance variable, display it using a Probability Model
  • Transform random variables:
  • μa+bX = a + b·μX | μX±Y = μX ± μY
  • σa+bX = |b|·σX | σX±Y = √(σX² + σY²) (only if X and Y are independent)
  • Interpret μ: “The expected value of X in context is μ”
  • Interpret σ: “On a randomly selected ___, X in context will vary from the expected count of μ by σ on average.”
Average Value/Expected Value: sum of every Probability * outcome
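The expected value, SD, and transformation rules above can be sketched with a small probability model (the outcomes/probabilities below are made up for illustration):

```python
import math

# Hypothetical discrete probability model (probabilities sum to 1).
outcomes = [0, 1, 2, 3]
probs    = [0.1, 0.4, 0.3, 0.2]

mu = sum(x * p for x, p in zip(outcomes, probs))              # expected value
var = sum((x - mu) ** 2 * p for x, p in zip(outcomes, probs))
sigma = math.sqrt(var)

# Transform Y = a + bX: the mean shifts and scales, the SD only scales.
a, b = 5, 2
mu_y = a + b * mu          # mu_{a+bX} = a + b*mu_X
sigma_y = abs(b) * sigma   # sigma_{a+bX} = |b|*sigma_X
print(round(mu, 2), round(sigma, 3), round(mu_y, 2), round(sigma_y, 3))
```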
Geometric Distribution: Follows BITS, μX = 1/p, σX = √(1−p)/p
  • Binary (Only 2 Outcomes), Independent (Each trial independent - just specify), Trials (NO FIXED # of TRIALS), Success (Specify p = __)
  • Let X = # of trials until success, use geometcdf/geometpdf
Binomial Distribution: Follows BINS, μX = np, σX = √(np(1-p))
  • Binary (Only 2 Outcomes), Independent (Each trial independent - just specify OR use 10% rule if random sample), Number of Trials (Specify # of trials, n = __), Success (Specify p = __)
  • Let X = # of successes in the trials, use binomialcdf/binomialpdf
  • When rewriting ≥ with the complement: P(X ≥ k) = 1 − P(X ≤ k−1), i.e. subtract 1 from the k-value before using the cdf.
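The calculator commands have scipy equivalents (sketch assuming scipy; n, p, and k below are hypothetical): `binom.pmf`/`binom.cdf` match binompdf/binomcdf, and `geom.pmf` matches geometpdf with X = trials until first success.

```python
import math
from scipy.stats import binom, geom

n, p = 10, 0.3                       # hypothetical trials and success prob

# binompdf / binomcdf equivalents:
p_eq_3 = binom.pmf(3, n, p)          # P(X = 3)
p_le_3 = binom.cdf(3, n, p)          # P(X <= 3)
p_ge_4 = 1 - binom.cdf(3, n, p)      # P(X >= 4) = 1 - P(X <= 4-1)
print(round(p_eq_3, 4), round(p_le_3, 4), round(p_ge_4, 4))

# Geometric: X = # of trials until the first success
p_first_on_2 = geom.pmf(2, p)        # (1-p)^1 * p
mu_geom = 1 / p                      # mu = 1/p
sigma_geom = math.sqrt(1 - p) / p    # sigma = sqrt(1-p)/p
print(round(p_first_on_2, 4), round(mu_geom, 3), round(sigma_geom, 3))
```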

Example Problems

  1. A mathematics competition uses the following scoring procedure to discourage students from guessing (choosing an answer randomly) on the multiple-choice questions. For each correct response, the score is 7. For each question left unanswered, the score is 2. For each incorrect response, the score is 0. If there are 5 choices for each question, what is the minimum number of choices that the student must eliminate before it is advantageous to guess among the rest?
  • Find the Expected Num of points you can earn for every choice you eliminate. Remember, you score 2 points if you miss the question, so you have to keep finding the expected number of points until it’s greater than 2.
  • Eliminate 1 Q: (¼ chance of getting it right * 7 points earned)+(¾ chance of getting it wrong * 0 points earned) = 7/4 < 2
  • Eliminate 2 Q: (⅓ chance of getting it right * 7 points earned) + (⅔ chance of getting it wrong * 0 points earned) = 7/3 > 2, therefore eliminate 2 choices!!!
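The same expected-value comparison, looped over each number of eliminated choices (exact fractions avoid rounding):

```python
from fractions import Fraction

# With c choices remaining, E(points) = (1/c)*7 + ((c-1)/c)*0;
# guessing beats the 2 points for a blank once E(points) > 2.
results = {}
for eliminated in range(4):
    c = 5 - eliminated              # choices remaining after eliminating
    ev = Fraction(1, c) * 7         # expected points from guessing
    results[eliminated] = ev
    print(eliminated, ev, "guess!" if ev > 2 else "leave blank")
```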

Semester 2 Review

Unit 5

Parameter: Describes Entire Population, Statistic: Computed from a sample, used to estimate the parameter.
Sampling Variability: Statistic values vary in repeated random samples (natural occurrence), reduce by increasing sample size (reduces spread of statistic distribution)
Sampling Distribution: The distribution of a statistic computed from ALL possible samples of the same size from a population. (Symbols: μx̄, σx̄)
  • Can show how much sample results vary from the parameter on average (Standard Error)
  • As n (size of each sample) increases on a sampling distribution: The Data gets closer together/ reduced spread of the distribution
Sample Distribution: A distribution built from some (not all) samples of the same size from a population (Symbols: x̄, Sx)
  • If a sample result is RARE (5% or less) based on a sample/sampling distribution: Then there IS convincing evidence against the claim (would be unusual to obtain proportions less than 5% - SIGNIFICANCE LVL)
Population Distribution: Includes ALL Individuals in the population (Symbols: μ, σ)
Unbiased Estimators: A statistic whose mean of the sampling distribution is equal to the population mean (true value of the parameter being estimated).
  • Means and Proportions are unbiased (μ = μx̄, p = μp̂), but the sample standard deviation Sx is a biased estimator of σ, with the bias depending on the sample size.
Each Dot (On Sample Distribution) Represents: “The Dot is the result of 1 sample of size n from a population where p = __. This sample had ___ out of n ___.”
Describing Distributions: Shape (Normality), Center (Mean or Proportion), and Spread (Standard Dev.)
4-Step Process FOR PROPORTIONS:
  • State: P(p̂ >, <, ≥, ≤ value), define p̂ in context (ex: proportion of students in a random sample of n)
  • Plan: Shape, Center, and Spread (Show Normality w/ np≥10, n(1−p)≥10, show that μp̂ = p, show the 10% rule is valid, and then show σp̂ = √(p(1−p)/n))
  • Do: Use normalcdf using μp̂ and σp̂
  • Conclude: There is a __ chance that (a random sample of n will have ___/ the difference between p̂1 and p̂2 will be ___).
4 Step Process FOR MEANS:
  • State: P(x >, <, ≥, ≤ value), define x in context (ex: mean PSAT score of random sample of n students)
  • Plan: Shape, Center, and Spread (Show Normality w/ CLT, show that μx̄ = μ, show the 10% rule is valid, and then show σx̄ = σ/√n)
  • Do: Use normalcdf using μx̄ and σx̄
  • Conclude: There is a __ chance that (the mean of x1 from a random sample will be ___/ the difference in means of x1 and x2 will be ___).
Central Limit Theorem: For the sampling distribution of the sample MEAN for samples of size n to be deemed approx. normal (when the population itself isn’t normal), n ≥ 30.
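A simulation sketch of the CLT using only the standard library (the exponential population here is my own choice of a strongly right-skewed example):

```python
import random, statistics

random.seed(1)                      # reproducible sketch
n, reps = 30, 2000                  # sample size and number of samples

# Exponential population: mean 1, SD 1, strongly right-skewed.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

center = statistics.fmean(means)    # should be close to mu = 1
spread = statistics.stdev(means)    # should be close to sigma/sqrt(n) ~ 0.183
print(round(center, 2), round(spread, 2))
```

Even though the population is skewed, a histogram of `means` would look roughly normal, which is exactly what the CLT promises for n ≥ 30.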

Unit 6

Significance Level: same as the Probability of a Type 1 Error; increasing the significance level increases power because it lowers the bar for rejecting the null hypothesis.
Type 1 Error: rejecting the null hypothesis when the null hypothesis is true
Type 2 Error: not rejecting the null hypothesis when the null hypothesis is false; increase this by decreasing the significance level, decrease this by decreasing standard error (reduces variability and makes differences easier to detect)
Power: probability of correctly rejecting the null hypothesis when the null hypothesis is false, FOR A SPECIFIC PARAMETER/TRUE VALUE
  • Increase Power by: increasing significance level, increasing sample size (reduces variability), and increasing the distance between your null hypothesis (Ho) and your specific parameter (this makes it easier to detect that the null doesn't equal the true value)
  • To maximize power: pick the largest significance level you're willing to risk (remember, it increases the chance of a type 1 error), and the largest sample size you can afford
  • Bigger Sample Size == smaller standard deviation of sampling distribution == minimum value to reject the null hypothesis decreases == power increases.
p-value: probability of obtaining a value at or more extreme than the observed statistic, given that the null hypothesis is true
test statistic: z value, represents how far a statistic diverges from our expectation if the null hypothesis is true
  • To Calculate: z = (p̂ − p0) / √(p0(1−p0)/n) OR z = ((p̂1 − p̂2) − 0) / √(p̂C(1−p̂C)(1/n1 + 1/n2))
  • TO CHECK CONDITIONS FOR A Z-TEST: USE np0, NOT np̂
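A worked one-proportion z-test sketch (the hypotheses, p-hat, and n are made-up numbers); note the condition check and the standard error both use p0:

```python
import math
from scipy.stats import norm

# Hypothetical test: H0: p = 0.5 vs Ha: p != 0.5,
# with p-hat = 0.56 from a sample of n = 200.
p0, p_hat, n = 0.5, 0.56, 200

# Condition check uses np0 and n(1-p0), NOT p-hat:
assert n * p0 >= 10 and n * (1 - p0) >= 10

se = math.sqrt(p0 * (1 - p0) / n)        # SE under H0
z = (p_hat - p0) / se                    # test statistic
p_value = 2 * (1 - norm.cdf(abs(z)))     # two-sided p-value
print(round(z, 2), round(p_value, 3))
```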
confidence level: overall success rate for calculating confidence interval that contains true parameter value
  • How to interpret: --% of all possible samples of the same size (n) will result in calculating an interval that contains the true parameter (in context)
confidence interval: range of plausible values for the true parameter value based on sample data (point estimate +- margin of error). NOT Probability.
  • How to interpret: We are --% confident that the true parameter (in context) is contained in the interval point estimate +- margin of error
Margin of Error: accounts for sampling variability
  • To Calculate: z* · √(p̂(1−p̂)/n)
  • Increase Margin of Error By: increasing confidence level (makes interval wider for a better success rate), increasing POPULATION standard deviation
  • Decrease Margin of Error By: increasing sample size (reduces variability)
Use p̂c (Pooled Proportion) with: 2 Proportion Z-TESTS
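A sketch of a 2-proportion z-test with the pooled proportion p̂C (all counts below are made up for illustration):

```python
import math

# Hypothetical counts: successes and sample size for each group.
x1, n1 = 60, 100
x2, n2 = 45, 100

p1_hat, p2_hat = x1 / n1, x2 / n2
p_c = (x1 + x2) / (n1 + n2)                     # pooled proportion

# Pooled SE, used only for 2-proportion z-TESTS (H0: p1 - p2 = 0):
se = math.sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se
print(round(p_c, 3), round(z, 2))
```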

Example Problems

  1. An experimenter conducted a two-tailed hypothesis test on a set of data and obtained a p-value of 0.44. If the experimenter had conducted a one-tailed test on the same set of data, which of the following is true about the possible p-value(s) that the experimenter could have obtained?
  • To get the possible p-values of a one-sided test from a two-sided test’s p-value, remember the answer depends on which direction Ha points. If the statistic fell in the direction Ha predicts, halve the p-value: 0.44/2 = 0.22. If Ha points the other way, the p-value is the rest of the distribution: 1 − 0.22 = 0.78. So the two possible values are 0.22 and 0.78.
Critical Values: t*, Z*, determines whether you should reject the null hypothesis (if test statistic > critical value, reject the null hypothesis.)

Unit 7

T Distribution: Symmetric and bell-shaped like the normal, but with larger spread and more area in the tails (using Sx as an estimate of σ creates more variability); shape is dependent on sample size (degrees of freedom -> k = n−1, ALWAYS STATE)
  • As k increases, the normality increases and tail area decreases (smaller t* value)
Robust: Considered valid and accurate even when a condition is violated
  • TLDR: non-normality of a population is fine in a t-distribution EXCEPT WHEN LARGE OUTLIERS AND HEAVY SKEW -> if CLT not met, use sample data graphs (N.P.P, Box Plot)
Paired t-test: Use when 2 treatments applied to 1 sample -> Examine difference in means of the two treatments, treat as 1 sample t-test
Test Statistic: t = (x̄ − μ)/(Sx/√n). Interval: x̄ ± t* · Sx/√n; calculate t* w/ invT((1 − conf_lvl)/2, d.f.) and take the positive value
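The invT step and a full one-sample t-interval, sketched with scipy’s `t.ppf` (the summary stats x̄, Sx, and n are hypothetical):

```python
import math
from scipy.stats import t

# Hypothetical sample summary: mean, SD, and size.
x_bar, s_x, n = 25.0, 4.0, 16
conf = 0.95
df = n - 1                                  # k = n - 1 degrees of freedom

t_star = t.ppf(1 - (1 - conf) / 2, df)      # upper-tail critical value t*
me = t_star * s_x / math.sqrt(n)            # margin of error
print(round(t_star, 3))                     # ~2.131 for df = 15
print(round(x_bar - me, 2), round(x_bar + me, 2))
```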

Unit 8

χ² Distribution: Always skewed right; as k increases, skew decreases.
Chi Squared for GOF: Use when trying to see if a distribution of ppl in a SINGLE CATEGORY from a SINGLE SAMPLE/POP is different from another distribution (AKA matches the expected counts) OR is uniformly distributed (all proportions are the same).
Ho: the distribution of ___ matches the distribution of expected_counts OR the distribution of ___ is uniformly distributed throughout the ___ (week, month, year, city, etc.)
(specify all proportions of the expected counts)
Ha: not uniformly distributed/not the same as expected_counts_distribution
Chi Squared for Homogeneity: Use when trying to see if distributions from MULTIPLE SAMPLES or 2+ TREATMENTS have a difference or not.
Ho: There is no difference in the distributions of ____ based on _____ (response var dist based on treatment -effects based on the applied medicine- OR categorical variable based on population - job problems based on year surveyed)
Ha: There is a difference in the distributions based on ___ (treatment / population)
TO GRAPH: put treatments/population (medicine applied or year surveyed, for example) as X axis, and put different bars based on the types of responses.
Chi squared for Independence/Association: Use to see if the distributions of 2 CATEGORICAL VARIABLES from A SINGLE SAMPLE/SINGLE TREATMENT are associated or not associated (AKA not independent or independent). This does NOT PROVE CAUSATION.
Ho: There is no association between ___ and ___, for insert_pop_here
Ha: There is an association between __ and __ for insert_pop_here
ex: There is no association between jogging and blood pressure, for all U.S. citizens.
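A chi-square test of independence can be sketched with scipy’s `chi2_contingency` (the 2×2 table of counts below is made up; it also returns the expected counts so you can check the large-counts condition):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of observed counts:
# rows = one categorical variable, columns = the other.
table = [[30, 20],
         [10, 40]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(round(chi2, 2), dof)      # test statistic and degrees of freedom
print(p_value < 0.05)           # reject H0 of no association at alpha = 0.05?
print(expected)                 # expected counts; check all are large enough
```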

Unit 9

Conditions for a Test: LINER (Linear, Independent, Normal, Equal Variance, Random)
  • Linear: Make Scatterplot Graph and Residual Graph, comment on them (AP: The true relationship between y and x is linear)
  • Independent: 10% Rule OR Assume subjects/treatments independent
  • Normal: Make an N.P.P of the Residuals (residuals plotted against their expected z-scores), check if roughly linear (AP: The values of y are approximately normally distributed for each value of x)
  • Equal Variance: Check residual graph for equal # of residuals above/below 0, and with random spread (AP: The standard deviation for y does not vary with x - As the value of the explanatory variable increases, the variability of the response variable stays the same (fails condition if scatterplot gets more varied as X increases) )
(violating equal variance == the spread is not equal across all x - the residuals get bigger/smaller, bigger in some places than others)
  • Random: Normally given - just specify “random sample/assignment given”
To Define B: Slope of the population LSR line for x vs. y
Confidence Interval: b ± t* x SEb
T Test Statistic: t= (b-B)/SEb, k= n-2
Concluding Statements:
  • T-Tests: There is a pos/neg linear relationship between y and x.
  • T-Intervals: We are __% confident that the slope of the population LSR line for y vs. x is contained in the interval _____
  • For Collegeboard’s T-Intervals: We are __% confident that a ___ increase in x will result in a predicted increase in y of between _________ (Ex: “We are 98 percent confident that a 10-point increase in placement score will result in a predicted increase in starting salary of between $3,150 and $3,360.”)
SEb: s/(Sx·√(n−1)) -> standard error of b -> In all samples of size n, the slope of the sample LSR line differs from the population slope by SEb on average (Standard deviation of the sampling distribution of b)
s: Standard deviation of the residuals (prediction error) -> The predicted y value is expected to differ from the true y value by s on average -> The average prediction error of y when using the LSR line is s.
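All the slope-inference pieces (b, a, r², SEb, and the two-sided p-value for B = 0) come out of scipy’s `linregress` in one call; the x/y data below are hypothetical (roughly y = 2x):

```python
from scipy.stats import linregress

# Hypothetical bivariate data for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

res = linregress(x, y)
print(round(res.slope, 3), round(res.intercept, 3))   # b and a
print(round(res.rvalue ** 2, 4))                      # r^2
print(round(res.stderr, 4))                           # SE_b for the t-test
print(res.pvalue < 0.05)                              # two-sided test of B = 0
```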
Two-Sided vs. One-Sided Interpretation of P-value:
  • Two-Sided (B ≠ 0): “observing a test statistic as extreme as ___ or more extreme”
  • One-Sided (B > 0, B<0): “Observing a test statistic of ___ or greater/smaller” (Depends on > or <)
  • TLDR: P-value interpretation MUST BE CONSISTENT with Ha)
Review of Unit 2 Stuff:
  • Extrapolation: When you attempt to estimate a value outside of an LSR line’s range of data values (lol)
  • r: Correlation Coefficient, interpret w/ strength, direction, shape (ex: strong, positive, linear)
  • r2: Coefficient of Determination, interpret by saying: r2 of the variation in y_var is accounted for by the LSR line that uses x_var as the explanatory variable.