## Statistics question: Quantifying sample bias?

Say you have a population distributed across categories: p_1% is A, p_2% is B, and so on. You also have some method of sampling from this population, with a suspected but unknown bias. How can you quantify this bias? In particular, I'm interested in a way to quantify the practical significance of the bias. For example, if we expect 50% A and get 51% A, that doesn't seem practically significant. But if we expect 0.1% A and get 1% A, or if we expect 99% A and get 99.9% A, both of those seem very practically significant.

Alternate formulation, which I believe is equivalent: how can we measure the practical significance of a change from one distribution to another? 50% to 51% does not seem significant, while 99% to 99.9% does.

This seems like a basic enough question that there's probably a standard metric or two for quantifying the significance of these changes.

Last edited by Derek on Sun May 24, 2015 8:32 pm UTC, edited 1 time in total.

### Re: Statistics question: Quantifying sample bias?

How about a chi-squared test?

For example: suppose in a sample of size 1000 we expect 0.1%/99.9% (1 and 999) and find 1%/99% (10 and 990). Then our chi-squared statistic is (10-1)²/1 + (990-999)²/999 ≈ 81.081.

We have 1 degree of freedom, so P(X² ≥ 81.081) ≈ 0; the result is clearly significant.
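This calculation is easy to reproduce in code. Below is a stdlib-only Python sketch (no scipy needed): with 1 degree of freedom the chi-squared variable is the square of a standard normal, so the p-value is erfc(√(X²/2)).

```python
import math

def chi2_stat(observed, expected):
    """Pearson's chi-squared statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf_1df(x):
    """Survival function P(X^2 >= x) for 1 degree of freedom.

    With 1 df, chi-squared is the square of a standard normal,
    so P(X^2 >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2))

# Expected 0.1%/99.9% of n = 1000; observed 1%/99%
stat = chi2_stat([10, 990], [1, 999])
p = chi2_sf_1df(stat)
print(stat)  # ~81.081
print(p)     # effectively zero
```

The same numbers come out of `scipy.stats.chisquare(f_obs, f_exp)` if scipy is available; the helper here just avoids the dependency.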

With the 51/49 versus expected 50/50 case you can see that sample size affects the significance:

n = 100: X² = 1²/50 + 1²/50 = 0.04, p ≈ 0.8415

n = 1000: X² = 10²/500 + 10²/500 = 0.4, p ≈ 0.5271

n = 10000: X² = 100²/5000 + 100²/5000 = 4, p ≈ 0.0455

(The denominator is the expected count in each cell, i.e. 50% of n.)
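The sweep above can be scripted the same way (a stdlib-only sketch, again using the 1-df identity p = erfc(√(X²/2))):

```python
import math

# 51%/49% observed versus a 50%/50% null, at growing sample sizes
for n in (100, 1000, 10000):
    observed = [0.51 * n, 0.49 * n]
    expected = [0.50 * n, 0.50 * n]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p = math.erfc(math.sqrt(stat / 2))  # chi-squared survival fn, 1 df
    print(n, round(stat, 4), round(p, 4))
```

The same 1% discrepancy goes from unremarkable (p ≈ 0.84) to significant at the conventional 0.05 level (p ≈ 0.046) purely because n grew.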


### Re: Statistics question: Quantifying sample bias?

Yes, this seems like the right metric, thanks. Chi-squared is something I never quite grasped in AP Stats (an awful class with an awful teacher, and it was years ago anyway).

If you consider my alternate formulation instead, "How can we measure the practical significance of a change from one distribution to another?", which I realize now is actually different because it involves known distributions instead of finite samples, would it be reasonable to calculate the chi-squared without N and set an arbitrary threshold for what is "practically significant" to you?

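One concrete reading of "chi-squared without N" is to plug proportions rather than counts into the same formula, which gives X²/N (sometimes called the chi-squared distance). A sketch of that idea follows; note this is a distance between distributions, not a significance test, so any threshold on it is necessarily a judgment call:

```python
def chi2_distance(p_obs, p_ref):
    """Chi-squared statistic computed on proportions, i.e. X^2 / N."""
    return sum((p - q) ** 2 / q for p, q in zip(p_obs, p_ref))

# The three cases from the original question:
d_mid = chi2_distance([0.51, 0.49], [0.50, 0.50])     # 50%  -> 51%
d_low = chi2_distance([0.01, 0.99], [0.001, 0.999])   # 0.1% -> 1%
d_high = chi2_distance([0.999, 0.001], [0.99, 0.01])  # 99%  -> 99.9%

print(d_mid)   # tiny
print(d_low)   # far larger
print(d_high)  # an order of magnitude above the 50/51 case
```

This matches the intuition in the question: a roughly one-point move near the boundary of the distribution dwarfs a one-point move in the middle.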

### Re: Statistics question: Quantifying sample bias?

It doesn't really make sense without N. If you flip a coin once and get 100%/0% instead of the expected 50%/50% it's not very surprising. But if you flip it a thousand times and get 100%/0% something is going on. If you just want some arbitrary threshold, play around with random N values and pick one that "feels" right.
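The coin-flip point can be checked exactly without any statistics machinery: under a fair coin, the probability of observing 100%/0% (all heads) in n flips is 0.5^n.

```python
# Chance of an all-heads run from a fair coin, for a few sample sizes
probs = {n: 0.5 ** n for n in (1, 10, 1000)}
for n, p in probs.items():
    print(n, p)
# 1 flip: 0.5 (not surprising); 10 flips: ~0.001 (suspicious);
# 1000 flips: astronomically small (something is going on)
```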
