22 September, 2012

A quick primer on statistics, pt 2. Inferential Stats and Simulation

Last time I talked about statistics, I limited my discussion to the statistics used to describe the distribution of results from random processes.  Those methods are the fundamental parts that can be assembled into the stat methods used to estimate unknown parameters, and differences between unknown parameters.

What follows below is a whirlwind tour of what is essentially at least a quarter-long class in upper-division undergraduate statistics.  Again, Wikipedia and Khan Academy are great resources to learn more.

Inferential Statistics

Inferential statistics describes the set of methods used to estimate the unknown parameters of random events.  The most common of these methods rely on observed data to produce estimates of these unknown parameter values.

A classic example of the use of inferential statistics is estimating the probability of an unfair coin, i.e. a coin that may not come up heads as frequently as it comes up tails when flipped in the air.  A similar (and more applicable) example would be to calculate the probability that an Edge of the Empire dice pool produces more successes than failures.

Presume we have no reliable way to calculate how often this coin will come up heads based on its physical qualities, and need an alternate method to estimate this probability.  Essentially, we have the following situation, expressed in the notation I explained before:

Pr(Heads) = p

But we do not know the value of p, beyond the fact that it lies between 0 (it never comes up heads, the probability of heads is 0%) and 1 (it always comes up heads, the probability of heads is 100%).

Now that we've identified the problem, and what we're trying to find (the value of p), we will make some assumptions to VASTLY simplify our problem:
  1. p is constant during the experiment, i.e. p does not change value between flips.
  2. The results of each flip are independent, i.e. the results of one flip do not affect the results of any other flip.
  3. The only possible outcomes for each flip are heads or tails.
  4. Every flip produces a valid outcome (either heads or tails).
  5. The variable X (the total number of heads in a set of n trials) has a binomial distribution, with parameters p and n.
The first two are basic, and I'm not going to discuss them in any depth, but in statistics we call this combination iid (independent and identically distributed).  The third and fourth assumptions allow us to make the fifth, which states that we will assume X conforms to the binomial distribution.  This is a very commonly used distribution when we want to calculate the probability of an event.  Technically, it is the number of heads produced from a set of flips, X, that is binomially distributed (as stated above), not the probability, p, but I'll reconcile this in a moment.
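To make that fifth assumption a bit more concrete, the binomial distribution gives the probability of seeing exactly k heads in n flips as C(n, k) * p^k * (1 - p)^(n - k).  Here's a minimal Python sketch of that formula (my own illustration, not anything from the original analysis):

    from math import comb

    def binom_pmf(k, n, p):
        # Probability of exactly k successes in n independent trials,
        # each succeeding with probability p.
        return comb(n, k) * p**k * (1 - p)**(n - k)

    # For example, the chance of exactly 7 heads in 20 flips of a fair coin:
    print(binom_pmf(7, 20, 0.5))  # about 0.074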

With the distribution defined, we have a paradigm to work within, and well-defined equations to produce parameter estimates.  The distribution has two parameters: the number of trials (in this case, each flip is a 'trial') and the probability of a trial being a success (in this case, success is the flip coming up heads).  Note that this second parameter is exactly what we are interested in estimating: Pr(Heads) = p.  We also have control over the number of trials we perform, n.  It can be shown that the expected value of X/n, the number of successes divided by the number of trials performed, is exactly p, so the observed X/n is our estimate of p.  Essentially: 

X/n ≈ p = Pr(Heads)

This formula represents two different concepts:
  1. X/n is the proportion of trials in our experiment that came up heads.
  2. X/n is the probability of a single trial in our experiment coming up heads, Pr(Heads).
These two concepts are equivalent: the proportion of successes across all trials may be interpreted as the probability of success on a single trial.  This will be important below.

All we need now is the data.  To generate the data, we have to perform an experiment: simply take the coin and flip it 20 times.  Or 100 times.  Or 100,000 times.  But let's start out small, with 20 coin flips, and we'll say this produced 7 heads.  Now we can calculate X/n:

X/n = 7 heads / 20 trials = 0.35 = estimated Pr(Heads)

This is essentially the scenic route to exactly what you would have done anyway to figure this out.  But by giving the justification and walking through these steps, we have a simple example that shows our route to get what we need: an estimate, or inference, of the value of a previously unknown and incalculable parameter, in a situation where we don't fully understand the underlying mechanism that produces the results.
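For anyone who would rather see that experiment as code, here is a minimal Python sketch.  The true_p value of 0.35 is just an assumption for illustration; in a real experiment nobody hands you the true value:

    import random

    random.seed(1)       # any seed; only here to make the example repeatable
    true_p = 0.35        # assumed "real" Pr(Heads), unknown to the experimenter
    n = 20               # number of flips
    heads = sum(random.random() < true_p for _ in range(n))   # X, the number of heads
    print(heads, heads / n)                                   # X and our estimate X/n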

Point Estimates and Sample Sizes

The value reported above, p = 0.35, is a point estimate of the probability of heads.  Point estimates are a measure of centrality, and indicate the most likely value of the parameter given the data.  If you look back to the previous post, you should be reminded of the difference between an estimate and a parameter, and see that this is an estimate.  Now, if we repeated the experiment (flip a coin 20 times, count the total number of heads), we might get a different point estimate.  This further shows that the estimate is not necessarily the parameter value.

If we wanted to be more confident about our estimate, we could increase our sample size by increasing the number of times we flip the coin.  Many curious minds may ask "why does increasing our sample size increase our confidence in the estimate?", which is a great question.  The details are beyond the scope of this discussion, so I'll simply invoke the Weak Law of Large Numbers, which states that as the sample size increases, the observed mean converges on the actual expected value.  So larger samples tend to produce more reliable (but not necessarily perfect) estimates of parameters.
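A quick simulation sketch (again assuming a hypothetical coin with p = 0.35, purely for illustration) shows this in action: as n grows, the estimate X/n wanders less and settles in near the true value:

    import random

    random.seed(2)
    true_p = 0.35
    for n in (20, 100, 1000, 100000):
        heads = sum(random.random() < true_p for _ in range(n))
        print(n, round(heads / n, 4))   # larger n tends to give estimates closer to 0.35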

Simulation or: "How I Learned to Stop Worrying and Love The RNG"

So, we have shown how to estimate a parameter based on observed data from experiments we have performed when we do not understand the underlying mechanism that produces the result.  Now, let's re-examine our 1d6 example from yesterday.  Let's say we wanted to find the probability of rolling a 5 or 6 on any roll.  In this case, we do understand the underlying distribution that produces the results: there is a 1/6 chance of producing a 1, 2, 3, 4, 5, or 6 on any roll of the die.  We could use our knowledge of expected values to find the parameter value (in this case it would be 1/3), but that would be boring!
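(For the record, the boring direct calculation: rolling a 5 and rolling a 6 are mutually exclusive, so their probabilities simply add, and Pr(5 or 6) = Pr(5) + Pr(6) = 1/6 + 1/6 = 1/3.)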

Instead, we use what we just learned about the binomial distribution and inferential statistics to perform an experiment.  We roll 1d6, physically, 20 times, and get 8 rolls that came up a 5 or a 6.  Based on what we did above, this would lead us to estimate that there is an 8/20 = 0.4 chance that we roll a 5 or 6 on a die.  [Note that this experiment could never produce the REAL probability of 1/3, since no whole number of successes out of 20 trials gives a proportion of exactly 1/3.]

Now if we wanted a more reliable estimate, we could continue to roll the die many more times, recording each result and calculating the overall proportion of trials that produced 5s or 6s, which we can interpret as the probability of any roll coming up a 5 or a 6.  However, this method becomes rather tedious, and we have other tools at our disposal to automate the process.

With some code, we can create a program that will randomly select a value from the set {1, 2, 3, 4, 5, 6}, each with probability 1/6 (exactly the distribution we are sampling from), and calculate the proportion of results that are 5's or 6's, which we have established can be interpreted as a probability.  This is known as Monte Carlo sampling, and relies on the computer's (pseudo)random number generator to randomly sample from known distributions to estimate parameter values.  By invoking the weak law of large numbers, the results of such a simulation should produce parameter estimates that converge to the actual expected values.  This requires no explicit calculation of expected values, which can become very complex in some situations, and much larger sample sizes can be produced in much less time than with physical experiments.  It simply requires that we have a very good understanding of the underlying distributions.
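Here is what such a program might look like in Python (a sketch of the idea, not the exact code behind any numbers in this post):

    import random

    random.seed(3)
    trials = 100000
    # Roll a fair d6 'trials' times and count how many rolls come up 5 or 6.
    hits = sum(random.randint(1, 6) >= 5 for _ in range(trials))
    print(hits / trials)   # the proportion, interpretable as Pr(5 or 6); should land near 1/3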

Technically, the computer is unable to produce truly random numbers, but today's pseudo-random number generators are so good that, for our purposes, there is practically no difference.


Confidence Intervals, Hypothesis Testing, and Simulation

Typically, the purpose of invoking inferential statistical methods is to estimate parameters that are unknown and cannot be calculated, or to estimate the difference between two or more parameters.  The former is typically done by calculating confidence intervals (CI's) from observed data, and the latter by hypothesis testing.  Really, these are two sides of the same coin.  What you need to know is that as the sample size, n, increases, confidence intervals become narrower (to represent that the parameter estimates are more reliable) and differences between parameter estimates are more likely to be declared significant, because larger sample sizes can reliably detect smaller differences.
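To make the "narrower as n grows" point concrete, here is a small sketch using the common normal-approximation CI for a proportion, p ± 1.96 * sqrt(p(1-p)/n).  The particular CI formula is just an assumption for illustration; the only point is how the width shrinks with n:

    from math import sqrt

    p_hat = 0.35    # same estimated proportion at every sample size
    for n in (20, 100, 1000, 100000):
        half_width = 1.96 * sqrt(p_hat * (1 - p_hat) / n)   # 95% normal-approximation CI
        print(n, round(2 * half_width, 4))                  # total CI width shrinks as n grows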

The term "p-value" comes into play at this point, and is frequently recognized and frequently poorly understood concept, even by professionals that use statistics on a daily basis.  For the purposes of this discussion, people passingly familiar with this concept need to understand that everything I say about hypothesis testing bears true for p-value as well.

Back to the point! Which is: our ability to detect a difference in estimates with a hypothesis test depends, at least in part, on our sample size.  This means that because we can make our samples in simulations arbitrarily large, CIs and hypothesis testing become fucking useless.  Further, CIs describe the uncertainty around the mean of the distribution, not the distribution itself.

Enter the Probability Interval

Probability intervals are a concept I was first introduced to while studying Bayesian statistics (seriously, don't worry about it), and are similar to Bayesian credible intervals.  They are typically defined as the narrowest interval that contains XX% of the observations from the entire distribution of observations.  They are derived from the raw observed (or simulated) data, and briefly describe the entire data set, not just the mean (as CI's do).  They become more reliable as sample sizes increase, but do not become substantially narrower as the sample size increases.  This makes them ideal for discussing and reporting simulation data.
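Here is one way such an interval could be computed from raw simulated data, sketched in Python (my own implementation of the "narrowest interval containing XX% of the observations" idea, with 2d6 totals as a made-up example): sort the observations, slide a window covering the required fraction of them, and keep the narrowest window:

    import random

    def probability_interval(observations, coverage=0.90):
        # Narrowest interval containing `coverage` of the observations.
        data = sorted(observations)
        k = int(round(coverage * len(data)))      # how many points the interval must cover
        candidates = [(data[i + k - 1] - data[i], data[i], data[i + k - 1])
                      for i in range(len(data) - k + 1)]
        width, low, high = min(candidates)        # keep the narrowest window
        return low, high

    random.seed(4)
    rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(10000)]   # 2d6 totals
    print(probability_interval(rolls, 0.90))      # the 90% PI for the simulated 2d6 totals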

Some things to remember about PI's:

  • PI's are not centered on the mean, since the mean is not used to calculate them in any way.
  • PI's are not symmetric around the median or the mode, since distributions may be asymmetric.
  • PI's do not rely on or assume an underlying distribution (common CI's rely on the normal distribution).
  • PI's may be reported with different %'s, e.g. a 90% PI covers 90% of the observations, and a 95% PI covers 95% of the observations.
  • PI's are only a synopsis; information is lost when ONLY a PI is reported.  Full histograms are usually necessary to fully visualize a distribution.
Alright... That's enough for now.  With all the tools I need at least mentioned, even though nobody really cares, I can start talking about what I really want to talk about:

The probability implications of the Edge of the Empire dice system... FINALLY!!!

/endofline

EDIT: Sorry for the delay on this post.  It was sitting at around 90% finished most of the week, but I fell into EotE forum discussions and FTL... Which is AWESOME!!! TRY IT!!!  BUY IT!!!
