Statistics 101 統計學入門 --3

In this post, I'll cover how to measure statistical dispersion, also called variation（離散程度）.

Why do we study statistical dispersion?

If we only look at central tendency (the average, μ), we still don't know how the individual values are spread around it. For example, suppose we are studying the wealth gap in South Africa and find that the average income of an adult is \$5000 (US dollars). The data behind that average could look very different: (1) incomes are clustered tightly around \$5000, or (2) some people earn only \$100 while others earn \$10000, which is far more extreme.

Case (2) clearly shows a much bigger wealth gap, but we cannot see this if we only look at central tendency.

With that in mind, let's look at some of the prominent measures of dispersion.

Statistical Dispersion （離散程度）

Range（全距）

Range is the difference between the maximum value and the minimum value.

Interquartile range（四分位差）IQR

Interquartile range is the difference between the 75th percentile and the 25th percentile.

Recall from the last post that the median is the 50th percentile: 50% of the values are larger than the median and 50% are smaller. The same logic applies to the 75th percentile and the 25th percentile: 75% of the values fall below the 75th percentile, and 25% fall below the 25th percentile.
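As a rough illustration of range and IQR, here is a minimal Python sketch using the standard library. The income figures are made up for this example and are not real data. Note that several conventions exist for computing quartiles, so other tools may give slightly different Q1 and Q3 values for the same data.

```python
import statistics

# Hypothetical incomes (US dollars), invented for illustration only.
incomes = [100, 2000, 4000, 5000, 6000, 8000, 10000]

# Range: maximum value minus minimum value.
value_range = max(incomes) - min(incomes)

# statistics.quantiles with n=4 returns the three quartile cut points:
# Q1 (25th percentile), Q2 (the median), and Q3 (75th percentile).
# method="inclusive" treats the data as the whole population of interest.
q1, q2, q3 = statistics.quantiles(incomes, n=4, method="inclusive")

# Interquartile range: 75th percentile minus 25th percentile.
iqr = q3 - q1

print("range:", value_range)
print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
```

The range is driven entirely by the two most extreme values, while the IQR only describes the middle half of the data, so it is less sensitive to outliers.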

Variance （變異數/方差）

Variance measures how far the values in a distribution are spread out from their average value.

Standard Deviation（標準差）SD

To understand standard deviation, it helps to consider its literal meaning.

Deviation means the difference between a value and the mean. If one subject has a value of 7 and the mean is 5, then the deviation is 7-5=2.

Standard means something like typical or average: a thing that has been "standardized", which we'll talk about in later posts.

Taken together, the standard deviation and the mean give a good indication of how the values are dispersed in a distribution.
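To make this concrete, the sketch below compares two made-up data sets that share the same mean but have very different standard deviations. The numbers are hypothetical, and `statistics.pstdev` is used here on the assumption that each list is treated as a whole population (it divides by N).

```python
import statistics

# Two hypothetical income lists (US dollars), both with a mean of 5000.
tight = [4800, 4900, 5000, 5100, 5200]
spread = [100, 2500, 5000, 7500, 9900]

for name, data in [("tight", tight), ("spread", spread)]:
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # population SD: divides by N
    print(f"{name}: mean={mean}, SD={sd:.1f}")
```

The mean alone cannot tell the two lists apart; the standard deviation can.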

More on Statistical Dispersion

Calculating variance and standard deviation

Two core things to consider when choosing which equation to use for our calculation:

1. Whether the equation is meant for a population or a sample
2. How we understand the equation

Note: when a population parameter is unknown, we use a sample statistic to estimate it.

Recall that σ is the population SD, s is the sample SD, μ is the population mean, and X̄ (X bar) is the sample mean.

The equations for the population and for the sample look almost the same; the only difference is that the denominator changes from N to n-1.
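For reference, these are the standard textbook definitions written with the symbols above. The notation x_i for an individual value, N for the population size, and n for the sample size is assumed here.

```latex
% Population variance and standard deviation (denominator N)
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N},
\qquad
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}

% Sample variance and standard deviation (denominator n - 1)
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1},
\qquad
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n - 1}}
```

In both cases the numerator is the sum of squared deviations from the mean, and the standard deviation is the square root of the variance.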

How do sample size and n-1 affect the SD?

Impact of sample size:

When the sum of squared deviations is held constant, the larger the sample size (N), the smaller the standard deviation, because we are dividing by a larger number.

Impact of n-1:

When the sample size is large, using N or n-1 as the denominator does not make much of a difference (less than 1/1000). But when the sample size is small, the difference is obvious (about 1/5).
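One way to see how the gap shrinks is to compare the two denominators directly. For a fixed sum of squares, the variance computed with n-1 is larger than the one computed with n by a factor of n/(n-1), whatever the data. The sketch below is only an illustration: the helper `relative_gap` and the sample sizes 6, 30, and 1001 are chosen for this example and are not taken from the post.

```python
import math

def relative_gap(n):
    """Return (variance gap, SD gap) between dividing by n - 1 and by n.

    For a fixed sum of squares SS, (SS / (n - 1)) / (SS / n) = n / (n - 1),
    so the relative gap depends only on n, not on the data.
    """
    variance_gap = n / (n - 1) - 1          # equals 1 / (n - 1)
    sd_gap = math.sqrt(n / (n - 1)) - 1     # smaller, because of the square root
    return variance_gap, sd_gap

# Sample sizes are assumptions picked for illustration.
for n in (6, 30, 1001):
    var_gap, sd_gap = relative_gap(n)
    print(f"n={n}: variance gap={var_gap:.4f}, SD gap={sd_gap:.4f}")
```

Under these assumptions, the variance gap is 1/5 at n = 6 and 1/1000 at n = 1001, and the gap in the standard deviation is smaller still.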

Understanding how we calculate variance

We can think of variance as a kind of average. When calculating the average of something, it's intuitive to add all the values up and divide by the number of values, n.

Here, though, we are interested in the average deviation, that is, the average difference between each individual value and the mean.

Do we just add up all the deviations?

No. Some values are larger than the mean and some are smaller, so the positive and negative deviations offset each other when added up. In fact, deviations from the mean always sum to zero.

Therefore, statisticians square each deviation, so that there are no more negative values.

Adding up these squared deviations gives the sum of squared deviations, also called the sum of squares (SS).
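The steps above can be sketched in a few lines of Python. The values are hypothetical and were chosen so that the mean is 5, matching the earlier deviation example (7 - 5 = 2).

```python
import statistics

# Hypothetical values chosen so that the mean is 5.
values = [2, 4, 5, 7, 7]
mean = sum(values) / len(values)

# Deviations from the mean: some negative, some positive.
deviations = [x - mean for x in values]
print("sum of deviations:", sum(deviations))  # they cancel out

# Squaring removes the signs; the total is the sum of squares (SS).
ss = sum(d ** 2 for d in deviations)

# Divide SS by N for a population, or by n - 1 for a sample.
population_variance = ss / len(values)
sample_variance = ss / (len(values) - 1)

print("SS:", ss)
print("population variance:", population_variance)
print("sample variance:", sample_variance)

# Cross-check against the standard library.
print(statistics.pvariance(values), statistics.variance(values))
```

Taking the square root of either variance gives the corresponding standard deviation.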

Sum up

We have discussed range, IQR, variance, and standard deviation, and walked through how each of them is calculated.

In the next post, we will dig into the normal distribution.