Monday, December 19, 2022

statistics -- some foundation

Statistics overlaps with Math, of course. We can record samples in a CSV or spreadsheet and make histograms and (Algebra 1) regressions. Yet CSV data are arrays, and arrays are matrices, so they can be treated as vectors and evaluated with Calc 3 tools, and they are Python and R friendly. IMO, because it spans both simple and complex Math, plus CS, Stats is a fun side project at any point in one's Math progression.
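To make the CSV-to-array-to-regression chain concrete, here is a minimal Python sketch. The CSV text, the column names "x" and "y", and the helper names are all made up for illustration; a real project would read a file from disk and might prefer a library.

```python
import csv
import io

# Hypothetical sample data; in practice this would be a file on disk.
CSV_TEXT = """x,y
1,2
2,4
3,6
4,8
"""

def read_columns(text):
    """Parse CSV text into two lists of floats (the 'arrays')."""
    reader = csv.DictReader(io.StringIO(text))
    xs, ys = [], []
    for row in reader:
        xs.append(float(row["x"]))
        ys.append(float(row["y"]))
    return xs, ys

def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs, ys = read_columns(CSV_TEXT)
slope, intercept = least_squares(xs, ys)
print(slope, intercept)
```

The made-up data lie exactly on y = 2x, so the fit should recover a slope of 2 and an intercept of 0.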

Basics are Algebraic if one uses tables for area under curves. If not, a person could do them using simple Calc 1 calculations. TI-84's -- if available -- are also great for checking that kind of calculation (eliminating tables and Calc) and for offloading the drudgery of long lists.
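As a rough illustration of the "table vs. Calc 1" point, this sketch finds the area under the standard normal curve between z = -1 and z = 1 in two ways: a trapezoid-rule sum (the Calc 1 route) and Python's built-in normal CDF (standing in for the printed table). The function names and the choice of 1000 steps are my own assumptions.

```python
import math
from statistics import NormalDist

def normal_pdf(z):
    """Height of the standard normal curve at z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def trapezoid_area(f, a, b, steps=1000):
    """Approximate the area under f between a and b with the trapezoid rule."""
    width = (b - a) / steps
    total = (f(a) + f(b)) / 2
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width

# Calc 1 route: numerically integrate the curve.
area_calc = trapezoid_area(normal_pdf, -1.0, 1.0)

# "Table" route: difference of two cumulative probabilities.
area_table = NormalDist().cdf(1.0) - NormalDist().cdf(-1.0)

print(round(area_calc, 4), round(area_table, 4))
```

Both routes should land near 0.68, the familiar "about 68% within one standard deviation" figure.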

HS AP Stats is essentially an Algebra-only college Stats 1 class, and slightly more user friendly. The Table of Contents of the Princeton Review's AP Stats review book lists roughly 35 subjects, split over 4 categories. I've used it below, with some modifications, to describe what a person might want to cover. Beyond basic or AP, I've added a "next level" section at the bottom, since matrices connect Math to Statistics and arrays/data structures connect Stats to Data Science.

Introductory data
    data collection
    tabular methods
    graphical methods - qualitative
    graphical methods - quantitative
    numerical methods - continuous
    boxplots
    add/mult a constant
    comparing mult. groups
    bivariate data: covar, regression
    categorical frequency
Sampling and experiments
    plan a study
    data collection
    plan a survey
    bias in surveys
    plan an experiment
Anticipating outcomes
    probability
    random variables
    probability dist. - discrete random var's
    probability dist. - continuous random var's
    normal distribution
    combining independent random var's
    sampling distributions
Statistical inference
    confirming models
    parameters
    point estimation
    interval estimation
    confidence interval
    inference: significance tests
    hypothesis: testing and accepting
    estimation and inference: population proportion
    estimation and inference: population mean
    estimation and inference: 2 population proportions
    estimation and inference: 2 population means
    inference: categorical data

For a comprehensive course, there's Khan Academy. On YouTube, I prefer the video compilation by Brandon Foltz, M.Ed., esp. at 0.5x speed. The videos were made 2010-ish, but were ahead of their YouTube time [1]. Below are some from Foltz and from other great teachers, plus videos that include useful nuggets in some way. Reviewing is also a chance to relearn the annoying Greek symbols for parameters: the same Greek letters as in finance, but with different usage.

[1] Another brilliant info-sharing hero with early YT chops was Derek Banas. One of his best might be his investing video, which crosses data science with financials and Python. And of course there's Barry Brown for programming any of it in C.

distributions of data obtained

degrees of freedom, clt

To Z or t (38:16) Brandon Foltz, 2012. The notion behind this choice, without calculation. Almost all stats come from samples, not entire populations, e.g. "the prevalence of depression in those over 65": how would we give everyone the questionnaire? We can't. The smaller the sample, the greater the chance of error. For the avg temp in NYC, what if we used just one day, or 5 days? The larger the sample, the more confident we are that we have captured reality. 16:00, 33:00 decision. 19:00, 30:00 degrees of freedom.
Buying land (32:06) Brantley Blended, 2018. listing, PLAT recency, survey recency (due diligence period), boundaries,
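The sample-size idea in the "To Z or t" note can be put in numbers. This is only a sketch under a loud assumption: it pretends we already know the population standard deviation (the made-up ASSUMED_SIGMA), which is the z case. When the standard deviation has to be estimated from a small sample, the t distribution with n - 1 degrees of freedom is the usual replacement, and its margins come out wider.

```python
import math
from statistics import NormalDist

# Hypothetical standard deviation of daily temperatures, in degrees.
ASSUMED_SIGMA = 10.0

def margin_of_error(sigma, n, confidence=0.95):
    """Half-width of a z-based confidence interval for a mean."""
    # Two-sided interval: put half of the leftover probability in each tail.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sigma / math.sqrt(n)

# One day, five days, a month, a year of readings.
for n in (1, 5, 30, 365):
    print(n, round(margin_of_error(ASSUMED_SIGMA, n), 2))
```

The margin shrinks with the square root of the sample size, so going from 1 day to 5 days helps a lot, while each extra day after a full month helps much less.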

bivariate data: linear regressions, covar

covariance (5:55) Ben Lambert, 2013. intuition behind covariance. Positive, negative, and none. Some formulas, but conceptual.
Variance vs. covariance (Webpage) Investopedia.
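For anyone who wants to see positive and negative covariance fall out of the arithmetic, here is a small sketch. The three lists are invented data, and the function names are mine; it uses the sample (n - 1) form of the formula.

```python
def sample_covariance(xs, ys):
    """Sample covariance: average product of deviations, dividing by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_variance(xs):
    """Variance is just the covariance of a variable with itself."""
    return sample_covariance(xs, xs)

# Made-up data: one series that tends to rise with hours, one that falls.
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]
falling = [6, 4, 5, 4, 2]

print(sample_covariance(hours, rising))   # positive: they move together
print(sample_covariance(hours, falling))  # negative: they move oppositely
print(sample_variance(hours))
```

The sign carries the direction of the relationship; the size depends on the units, which is why correlation is often reported instead.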

combinatorics, permutations, probability

nCr and nPr. These supply the denominators of probability, since they count the universe of possible outcomes against which we determine our odds. If "statistics" are the collected data, then nested inside them we have probability (based on prior events, i.e. statistics), and inside that are the "combinatorics and permutations" which create the denominators of probability fractions.
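A quick sketch of nCr and nPr acting as denominators, using a 5-card hand from a 52-card deck as my own example (it is not from any of the videos). Python's math.comb and math.perm need version 3.8 or later.

```python
import math
from fractions import Fraction

DECK = 52
HAND = 5

# nCr: order does not matter (a hand is a hand however it was dealt).
hands = math.comb(DECK, HAND)

# nPr: order matters (the same cards dealt in a different sequence count again).
deals = math.perm(DECK, HAND)

# The count of possible outcomes becomes the denominator of the probability.
p_one_specific_hand = Fraction(1, hands)

print(hands)                # 2598960
print(deals)                # 311875200
print(p_one_specific_hand)  # 1/2598960
```

The two counts differ by a factor of 5! = 120, the number of orders in which the same five cards can be dealt.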

Permutations and Combinations (17:40) Organic Chemistry Tutor, 2017. Simple review of when to use nCr, nPr, or just the factorial by itself. 14:00 How many ways can we arrange the letters in the word "Alabama".
problem - probability (replace/don't replace) (10:11) Amy Krusemark, 2020. A slight political note at the start, then an AP question. An example of non-replacement probability, and of understanding how permutations/combinatorics affect the denominator. 2:40 unless stated otherwise, an alpha of 0.05 is the threshold for a reason to doubt.
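Both ideas from the two videos above can be checked in a few lines. The "Alabama" count follows the repeated-letters rule (n! divided by the factorial of each letter's count). The bag of 3 red and 2 blue marbles is a made-up example of mine, not the AP question from the video.

```python
import math
from collections import Counter
from fractions import Fraction

def arrangements(word):
    """Distinct orderings of a word's letters, allowing for repeated letters."""
    counts = Counter(word.lower())
    total = math.factorial(len(word))
    for repeats in counts.values():
        total //= math.factorial(repeats)
    return total

print(arrangements("Alabama"))  # 7! / 4! = 210

# Hypothetical bag: 3 red marbles, 2 blue. Chance of drawing red twice.
red, blue = 3, 2
total = red + blue

# With replacement: the bag is the same for the second draw.
with_replacement = Fraction(red, total) * Fraction(red, total)

# Without replacement: one red is gone, so both counts drop by one.
without_replacement = Fraction(red, total) * Fraction(red - 1, total - 1)

print(with_replacement)     # 9/25
print(without_replacement)  # 3/10
```

Using Fraction keeps the denominators visible, which is the part that replacement changes.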


next level

KL Divergence

KL Divergence (18:13) ritvikmath, 2023. Non-negative comparison of two distributions.
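A minimal sketch of KL divergence for two discrete distributions, using the natural log. The coin distributions are invented, and the function assumes every outcome with nonzero probability in p also has nonzero probability in q (otherwise the divergence is infinite and this code would divide by zero).

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.

    Assumes q[i] > 0 wherever p[i] > 0. Terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fair = [0.5, 0.5]    # hypothetical fair coin
biased = [0.9, 0.1]  # hypothetical biased coin

print(kl_divergence(fair, biased))
print(kl_divergence(biased, fair))
print(kl_divergence(fair, fair))  # 0.0
```

The first two numbers differ, which is why KL divergence is called a comparison rather than a distance: it is non-negative and zero for identical distributions, but not symmetric.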

vectors and matrices in statistics

As my friend Bart texted...

Matrices are really just notation for a list [of] equations. Not so profound, not for a long time. Like if I have ten data points then i have that equation ten times. Ten y values. Ten x1 values, ten x2 values, etc. Thirty diff x values, so x could be written as a 10x3 grid. Ok enough [of] that
Beta is 1x3 and x is 3x10. Then when multiplied out gives a 1x10 vector. To say that vector equals the 1x10 y vector is to say each component of y equals the corresponding for RHS. ie ten equations

And of course wherever we have a matrix, we have a potential vector.
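Bart's dimensions can be checked directly. This sketch uses plain nested lists so nothing needs installing; the beta values and the three rows of x are made up, and matmul is a hypothetical helper written for this post, not a library call.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

# Beta is 1x3 (hypothetical coefficients).
beta = [[1.0, 2.0, 3.0]]

# x is 3x10: one row per x variable, one column per data point (made-up values).
x = [
    [1.0] * 10,                         # x1
    [float(j) for j in range(10)],      # x2
    [float(j * j) for j in range(10)],  # x3
]

# The product is 1x10: one predicted y per data point.
y_hat = matmul(beta, x)
print(len(y_hat), len(y_hat[0]))

# The matrix form is shorthand for ten ordinary equations, one per column.
for j in range(10):
    by_hand = beta[0][0] * x[0][j] + beta[0][1] * x[1][j] + beta[0][2] * x[2][j]
    assert by_hand == y_hat[0][j]
```

The loop at the end is the "ten equations" written out: each component of the product is the same sum you would write by hand for that data point.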

Comp Sci v. Data Sci Matrix (10:34) ritvikmath, 2019. Reveals some clear differences between computer matrix use and math/datasci use and how they overlap (eg. CompSci efficiency can work on any math app).

data structure

Necessary for data science. Data Science will evaluate these further, sometimes using Calculus, but we at least need to know what they are, IMO. Not on the AP test. The terminology shifts: Statistics and Math use the term "matrix", but Data Science/Computer Science use the term "array". Depending on what we need to do with the matrix, computing will perform some function on a data array. Usage: imagine an R2 scatter plot, but with a third dimension that attaches error information to each data point.
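The scatter-plot-with-error usage might look like this as an array. Everything here is invented for illustration: the values, the [x, y, error] column order, and the MAX_ERROR cutoff.

```python
# Each row is one data point: [x, y, error]. All values are made up.
points = [
    [1.0, 2.1, 0.2],
    [2.0, 3.9, 0.1],
    [3.0, 6.3, 0.5],
    [4.0, 8.0, 0.1],
]

# The "matrix" view: 4 rows by 3 columns.
n_rows = len(points)
n_cols = len(points[0])

# Element access: row index first, then column index (both start at 0).
third_point_error = points[2][2]

# Pull one column out of the 2D array, as we would to plot it.
errors = [row[2] for row in points]

# Perform "some function" on the array: keep only the well-measured points.
MAX_ERROR = 0.25  # hypothetical cutoff
trusted = [row for row in points if row[2] <= MAX_ERROR]

print(n_rows, n_cols)
print(third_point_error)
print(errors)
print(len(trusted))
```

One variable holds the whole data set, and the same structure reads as a 4x3 matrix to a statistician and as a 2D array to a programmer.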

Linear Algebra: Transformational Matrices Part I (15:43) Computer Science, 2021. transformations on R2 matrices, using geometric examples. There's an entire playlist that's valuable.
Linear Algebra: Transformational Matrices Part II (9:23) Computer Science, 2021. transformations on R3 matrices, as we might do with data structures.
Trig Functions (9:15) PatrickJMT, 2011. this hero scores yet again. Just in case you need them again for the stuff above. So rotten.
1,2, and 3d structures (8:32) GridoWit, 2017. basic terminology and location tracking within different types of arrays and data structures. The intuition that arrays solve a storage problem: we don't want to have a new variable for each piece of data we have collected. C syntax is also provided.
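To tie the transformation and trig videos above together, here is a rough sketch of one R2 transformation: rotating a set of points with a 2x2 rotation matrix. The unit square and the helper names are my own; the videos may set things up differently.

```python
import math

def rotation_matrix(degrees):
    """2x2 matrix that rotates R2 points counterclockwise about the origin."""
    t = math.radians(degrees)
    return [
        [math.cos(t), -math.sin(t)],
        [math.sin(t), math.cos(t)],
    ]

def apply(matrix, point):
    """Multiply a 2x2 matrix by an [x, y] point."""
    x, y = point
    return [
        matrix[0][0] * x + matrix[0][1] * y,
        matrix[1][0] * x + matrix[1][1] * y,
    ]

# Corners of a unit square, stored as an array of [x, y] points.
square = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]

quarter_turn = rotation_matrix(90)
rotated = [apply(quarter_turn, p) for p in square]

for before, after in zip(square, rotated):
    # Rounding hides tiny floating-point leftovers such as 6e-17.
    print(before, [round(v, 6) for v in after])
```

The same matrix is applied to every point in the array, which is the pattern that carries over to larger data structures.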
