13 September, 2012

A quick primer on statistics, pt 1. Descriptive Stats

I will be using statistics in my future posts to support, explain, or justify opinions that I hold regarding game design.  To that end, I'd like to have a post that I can refer readers to that explains some of the methods and terminology I use.

As a warning, I am oversimplifying many of these concepts because this is not intended to be a complete course on inferential or descriptive statistics, just a primer to familiarize somewhat educated readers with the definitions of the terms that I must use to communicate these concepts.  If you see some egregious error, please let me know.  If you know enough to see where I gloss over some nuance of a concept, great, but please don't correct every little omission or detail; it's not gonna help anything.

If you want more information on any of these concepts, I strongly suggest you check Wikipedia or spend some time on Khan Academy.

This is as good a place as any to address my qualifications.  As an undergraduate, I took a single statistics course, and as a veterinarian, I received very little statistical training.  Now, however, I am a PhD candidate in Epidemiology at a major public California university.  While the degree is in epidemiology (the study of outbreaks and disease in populations), my focus is statistics and biostatistics.  I have over 60 credit hours of core and elective statistical training, and I have been a teaching assistant for more than 50 hours of statistical courses.  My dissertation involves substantial amounts of simulation modeling and programming in R (an open source statistical package).

Anyway, let's get started.

Probability vs Odds

I'll use probability almost exclusively when discussing outcomes.  Odds are not frequently reported (at least in North America; I'm told it's more common in the UK).  I find odds less intuitive than probability, and probability easier to work with mathematically than odds.  It is easy to calculate the odds from the probability, and the probability from the odds.

The probability of an event occurring is denoted by "Pr(Event)".  For example, the probability of a 6-sided die (a "d6") coming up 3 is written "Pr(3)".  The standard notation for an event not occurring is a bar placed over the event, which I can't typeset here, so I'm going to represent the event not occurring as "Pr(Not Event)".  For example, I will write the probability of a d6 not coming up odd as "Pr(Not Odd)".

A slightly more complicated version of probability is conditional probability, which refers to the probability of an event under certain conditions.  For example, the conditional probability of a 6-sided die coming up 3 when the result is odd is 1/3.  This is represented by the notation "Pr(3|Odd)", which is read "the probability of the result being 3 given that the result is odd."
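Conditional probability is easy to check by simulation: roll a lot of dice, keep only the rolls that satisfy the condition, and count.  Here's a quick sketch (shown in Python for illustration; the same thing is a couple of lines in R, and the seed is arbitrary):

```python
import random

random.seed(42)
rolls = [random.randint(1, 6) for _ in range(100_000)]

# Condition on the event "result is odd", then ask how often it is a 3
odd_rolls = [r for r in rolls if r % 2 == 1]
pr_3_given_odd = sum(1 for r in odd_rolls if r == 3) / len(odd_rolls)
print(round(pr_3_given_odd, 3))  # close to 1/3
```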

Statistics & Parameters

These terms are closely related, and I will probably use them interchangeably.  Technically speaking, statistics are estimated from, and used to briefly describe, data.  Parameters are the REAL values that we are typically attempting to estimate with statistics.  The difference is that the true values of parameters are typically unknown, and probably unknowable, but can be estimated using statistics.

While this seems like a really esoteric and meaningless difference, there is a reason I need to mention it.  When using various methods to estimate parameters, the results are just that: estimates.  When I state that a value is a parameter estimate, I am not claiming this value to be the exact value of the estimated parameter; there may be some sources of error.  However, any estimates provided will be as accurate as possible.  Accuracy means that the point estimate (see below) is as near as possible to the actual parameter value, and any interval (see further below) around the point estimate is as narrow as possible.

Distributions

A distribution is the combination of all of the possible outcomes of a random process and the probability of each individual outcome.  For example, the distribution of results of a 6-sided die would be: 

  • Pr(1) = 1/6
  • Pr(2) = 1/6
  • Pr(3) = 1/6
  • Pr(4) = 1/6
  • Pr(5) = 1/6
  • Pr(6) = 1/6
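You can see this distribution emerge empirically by rolling the die many times and tabulating the relative frequency of each face.  A minimal sketch (Python here, purely for illustration; the seed and roll count are arbitrary):

```python
import random
from collections import Counter

random.seed(1)
n = 60_000
counts = Counter(random.randint(1, 6) for _ in range(n))

# Each face should come up with relative frequency near 1/6 ≈ 0.167
for face in range(1, 7):
    print(face, round(counts[face] / n, 3))
```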

While this is the most informative way of presenting the results, it's clear that for even slightly more complicated distributions, and for joint distributions of 2 or more results, this method becomes far too cumbersome to report, so we use other synoptic values to describe the distribution.

There are a large number of distributions that have been described, named, studied, and labeled.  The normal (aka Gaussian) distribution is probably the most well known of these, but it is of very limited use in these situations.  It imposes a number of assumptions that may not be met in the data I simulate, and it will rarely be invoked in this blog.  I will more frequently be using the binomial distribution, which describes the number of successes in a fixed number of trials repeated under the same conditions.

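As a concrete example of the binomial: the number of 6s rolled across ten d6 is binomially distributed with n = 10 trials and success probability p = 1/6.  A short sketch of the binomial probability formula, checked against simulation (Python for illustration; seed and trial count are arbitrary):

```python
import math
import random

# Binomial pmf: probability of exactly k successes in n independent trials,
# each with success probability p
def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 1/6
print(round(binomial_pmf(2, n, p), 4))  # Pr(exactly two 6s in ten rolls)

# Check the formula against brute-force simulation
random.seed(7)
trials = 100_000
hits = sum(1 for _ in range(trials)
           if sum(random.randint(1, 6) == 6 for _ in range(n)) == 2)
print(round(hits / trials, 4))  # should be close to the pmf value
```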
Measure of Centrality

This phrase refers to statistics that describe the center point or most common results of a distribution.  The arithmetic mean is the statistic most people are familiar with.  While it is valuable, it is also frequently biased (it lies away from the true center point of the data) and can poorly represent the actual information in the data.  The median is the value where 1/2 of the distribution lies above and 1/2 lies below.  It is useful in some circumstances, especially in heavily skewed distributions with many outliers on one side.  In the discipline of statistics, however, we frequently use the expected value of a distribution as the real measure of centrality, abbreviated "E(X)" for the "expected value of X".  The expected value of a discrete distribution (basically ALL of the distributions we will discuss as results of dice rolls) is calculated by multiplying each possible value by the probability of that value occurring, then summing.  So, E(d6) would be calculated:

(1/6)(1) + (1/6)(2) + (1/6)(3) + (1/6)(4) + (1/6)(5) + (1/6)(6) = 1/6 + 2/6 + ... + 6/6 = 21/6 = 3.5

The arithmetic mean over a very, very large number of rolls converges to the expected value (due to the weak law of large numbers).
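Both halves of that claim are easy to verify: compute E(d6) directly from the distribution, then watch the sample mean of many simulated rolls land right next to it.  A sketch (Python for illustration; seed and roll count are arbitrary):

```python
import random

# Exact expected value of a d6: sum of each value times its probability (1/6)
e_d6 = sum(face * (1/6) for face in range(1, 7))
print(e_d6)  # 3.5, up to float rounding

# Weak law of large numbers: the mean of many rolls approaches E(X)
random.seed(3)
rolls = [random.randint(1, 6) for _ in range(200_000)]
print(round(sum(rolls) / len(rolls), 2))  # very close to 3.5
```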

Measure of Dispersion

The measures of dispersion describe how far away individual observations fall from the central point.  It may be thought of, in a VERY limited sense, as how "random" individual observations in this distribution tend to be.  Higher values indicate that individual observations tend to fall further from the middle.  The parameters variance and standard deviation, abbreviated Var and StDev, are commonly used measures of dispersion.  I'm not going to provide equations here.
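Even without the formal equations, the idea is concrete for a d6: variance is the average squared distance of each face from the expected value, and the standard deviation is its square root.  A small sketch (Python for illustration):

```python
import math

# Population variance of a d6: average of (value - E(X))^2 over all faces,
# each face weighted equally at 1/6
e_x = sum(range(1, 7)) / 6                                   # E(d6) = 3.5
var_d6 = sum((face - e_x) ** 2 for face in range(1, 7)) / 6  # 35/12
print(round(var_d6, 4))             # ≈ 2.9167
print(round(math.sqrt(var_d6), 3))  # standard deviation ≈ 1.708
```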

Measure of Dependence

Measures of dependence describe how related individual results from two distributions are to each other.  The two most common measures of dependence, BY FAR, are correlation and covariance, which assume a linear relationship between the two distributions.  The correlation of X and Y (there must ALWAYS be 2 outcomes for these) is denoted as
  • Cor(X,Y)
  • r_(X,Y)
And covariance is denoted Cov(X,Y).

It's interesting to point out that if r_(X,Y) ≠ 0, there is ample evidence that X is not independent of Y, BUT if r = 0, there is NOT ample evidence to state that X and Y are independent.
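The classic illustration of that asymmetry: let Y = X², with X symmetric around 0.  Y is completely determined by X (as dependent as it gets), yet the correlation is essentially zero because the relationship is not linear.  A sketch (Python for illustration, with a hand-rolled correlation function; seed and sample size are arbitrary):

```python
import random

# X uniform on [-1, 1]; Y = X^2 is fully determined by X (dependent),
# yet Cor(X, Y) ≈ 0 because the relationship is not linear
random.seed(11)
xs = [random.uniform(-1, 1) for _ in range(100_000)]
ys = [x * x for x in xs]

def corr(a, b):
    """Pearson correlation: Cov(a, b) / (StDev(a) * StDev(b))."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = (sum((x - ma) ** 2 for x in a) / n) ** 0.5
    sb = (sum((y - mb) ** 2 for y in b) / n) ** 0.5
    return cov / (sa * sb)

print(round(corr(xs, ys), 3))  # near 0 despite perfect dependence
```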

Okay, that's the end of part one.  This became a LOT longer than I anticipated, and I want to get SOMETHING posted.  Next time I'll discuss statistical inference and simulation, which will build on what's posted here to support the methods I use when evaluating game mechanics.

Any questions? Post 'em below.  Thanks for reading.

/endofline
