Tuesday, February 6, 2018

Simple Statistics and the R Language

Introduction

Statistics plays such an important role in so many fields that a large portion of college students will find themselves taking a statistics course. That course will generally involve learning how to use a statistics package, and at the tail end of the second decade of the 21st century that package will likely be the R language. As a new student of statistics you aren't privy to the steps leading up to the R language. You didn't have to live through the use of hand-written routines to calculate statistical functions on a computer, or the days of using a hand calculator with statistical function keys, and you don't need to wade through the statistical tables in the appendix of a book to evaluate your distribution functions. At the same time, though, you become detached from the actual mathematics that is statistics. The pre-packaged functions of the R statistical library hide all that hard work from you.

My introduction to statistics happened rather late in my life, in the first decade of the 21st century, in the Harvard Extension biostatistics class. At that time there were various statistics packages that cost a considerable amount of money to purchase (luckily there were educational discounts). The book used for the course provided solutions to the same examples in a couple of different statistical packages, but centered around one called Stata. By seeing the same problem solved with different packages side by side, the actual mathematics of statistics became the basis for understanding the idiosyncrasies of each package. This maintained a link between the mathematical representation of the statistical formulas and their use in the statistics software.

With R being the popular statistics language, there is less of a tendency to dive back into the mathematical formulas, and you may lose some understanding of the mathematics in the process. This blog post is meant to be an elementary bridge between some of the simple formulas of statistical mathematics and their built-in R counterparts. To do this I am going to use (actually, rely on heavily for example data and answers) a paper by Dr. Keith Smillie of the University of Alberta. He produced a wonderful statistics package for the J language and summarizes it in the paper "J Companion for Statistical Calculations"[1]. I am going to transpose a subset of his implementation into R. Hopefully you will then be able to use that format to extrapolate the mathematical connections for subjects I haven't covered here, and use this type of framework to gain more understanding of the theory and implementation of statistics.

R is an Array Language

R uses vectors, or arrays, as a built-in element of its language. If you aren't used to array languages it can seem pretty strange at first, but once you get the hang of it, it's pretty cool. You do have to think in terms of operating on the whole vector, because R has functions that streamline vector operations. You are welcome to break those operations down so they look like the loops of a conventional programming language, but that may slow down processing when dealing with large amounts of data.

R is an interpreted language

This just means that R takes your R code and tries to execute it as soon as you type it in and hit the enter key. You can also save code in a file and load it into the R console environment, but you can do quite a bit at the command prompt in the R console. Look at the following R console session. The session uses the comment operator '#' to add commentary in line. The '#' character is usually called the 'hash' sign (you might know it as the number sign). When R sees the hash sign it ignores it and everything else to the end of the line. Anything that precedes the hash will be executed as R code.

> # The hash comments out the rest of the line in R
> 2.3 # entering a numeric constant R returns it as a vector of one value
[1] 2.3
> # Assignment using the back arrow operator <-
> w <- 2.3 # assigning 2.3 to the variable w
> w 
[1] 2.3
> # add 2 numeric constants
> 2.3 + 2
[1] 4.3
> # you can use the = sign for assignment as well
> w = 3
> w 
[1] 3
> # see we have changed w from 2.3 to 3

Creating vectors (arrays or lists if you prefer) of more than one value

You create vectors by concatenating (or combining) values together. R has a c() function for doing just that.

> w <- c(2.3,5,3.5,6)  # w will now have multiple values in it
> # display w by typing it in at the command prompt
> w
[1] 2.3 5.0 3.5 6.0
> 
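
Once a vector holds multiple values, arithmetic operators work on the whole thing at once. A quick sketch of the element-wise behavior described in the array language section above:

> w * 2            # every element is doubled in one operation
[1]  4.6 10.0  7.0 12.0
> w + c(1,1,1,1)   # element-wise addition of two vectors
[1] 3.3 6.0 4.5 7.0
> 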

Arithmetic Mean

This is what is commonly known as the average of a list of values. We calculate it by taking a list of numbers, summing them, and dividing by how many numbers are in the list.

\[w = \{2.3, 5, 3.5, 6\} \] \[n = 4 \]

Doing this by hand you compute the mean using the following formula: \[mean = \frac{\displaystyle\sum_{i=1}^{n}w_{i}}{n}\]

So, summing step by step:

2.3 + 5 is 7.3
7.3 + 3.5 is 10.8
10.8 + 6 is 16.8

Then divide 16.8 by the total number of values (4): 16.8/4 is 4.2.

How do we accomplish this in R? There are 3 categories of solutions:

  • The first is the old-fashioned way: loop through the values of interest that we stored in w, accumulate the sum of those numbers, then after the looping is finished divide by the number of values.
  • The second is the vector way: use R vector operations to operate on the whole group of values at once.
  • The third is to use an R built-in function to calculate the mean in one step.

Traditional looping average calculation

Let's create a function myave that does it the first way.

myave <- function (values_vector) {
 sum = 0
 for (i in 1:length(values_vector)) {
  sum = sum + values_vector[i]
 }
 # i should be the value of the last index and therefore the number of values
 sum/i
}

In the R console you will need to open the source editor by clicking the blank page icon in the toolbar at the top of the console. Save the above text into the editor window, then click the 'source .r' icon and type in the name of your script file.

> source("/Users/Nasty/Downloads/LearnBayes/R/myave.R") # where my file ended up
> myave
function (values_vector) {
 sum = 0
 for (i in 1:length(values_vector)) {
  sum = sum + values_vector[i]
 }
 # i should be the value of the last index and therefore the number of values
 sum/i
}
> # the value of myave is the function itself, so typing its name displays it as above
> # Now let's use it. First create the vector of values
> w = c(2.3,5,3.5,6)
> w
[1] 2.3 5.0 3.5 6.0
> # plug it into myave
> myave(w)
[1] 4.2

A more vectorized way to calculate

> sum(w)
[1] 16.8
> sum(w)/length(w)
[1] 4.2
> 

This is pretty simple, and to reduce typing you could create an R function to do it. You don't need the editor since it's a one-liner.

>  myave1 <- function(values) {sum(values)/length(values)}
> myave1
function(values) {sum(values)/length(values)}
> myave1(w)
[1] 4.2
> 

R built-in method

Now, R being a statistics package, it must have a predefined function that does this. It does have something called 'ave', but it has a weird idiosyncrasy that the above functions don't share. Let's try it.

> ave(w)
[1] 4.2 4.2 4.2 4.2

It turns out that the 'ave' function computes averages over subsets (groups) of our values and returns the group average at every position. With no grouping given, that is the overall mean repeated four times. Not quite what we wanted right now. So guess at a function name at your own risk. In the computer world there is an old acronym, RTFM, which in nice language stands for Read The Stinking Manual. So try to look up what you want to do (Google is your friend here).
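
For the curious, here is a small sketch of what 'ave' is actually for: given a second, grouping argument (the vector g below is made up for illustration), it computes the mean within each group and returns that group mean in each position:

> g <- c("a","a","b","b")  # a hypothetical grouping vector
> ave(c(1,2,3,4), g)       # the mean of each group, repeated per element
[1] 1.5 1.5 3.5 3.5
> 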

So the built-in function we really want to use is 'mean'

> mean(w)
[1] 4.2

Now 'mean' has other parameters that come with default values; for our usage the defaults are exactly what we want. Depending on how in-depth your statistics course is, you may discover them. If you're interested, check https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/mean
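
For instance, two defaults you are likely to meet are na.rm (whether to drop missing values) and trim (drop a fraction of the extreme values from each end before averaging). A quick sketch:

> mean(c(w, NA))                      # a missing value makes the mean NA by default
[1] NA
> mean(c(w, NA), na.rm = TRUE)        # drop the NA first
[1] 4.2
> mean(c(1, 2, 3, 100), trim = 0.25)  # trim 25% from each end, then average
[1] 2.5
> 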

Frequencies

When analyzing data, it is often helpful to look at frequencies of occurrence. So how would we use R vector operations to build a list of frequencies? The key is to know the complete range of values that the random process we are looking at can take on, because in a small sample some of the values may not show up and their number of occurrences will be 0. To compare all the values of one vector with all the values of a second vector we use a function known as the 'outer product'. The outer product takes an operator and applies it between the values of the two vectors. R has a built-in outer product function called 'outer'. It takes 3 parameters: 2 vectors and an operator symbol. It then does all the heavy lifting of applying the operator between each element of vector1 and each element of vector2, producing a table of results. In our case, using the equality operator '==', we obtain a matrix of truth values.

> # frequencies from a list of die rolls stored in a vector
> D = c(4,5,1,4,3,6,5,4,6,4,6,1)
> D
 [1] 4 5 1 4 3 6 5 4 6 4 6 1
> # range of values of a single die
> r = c(1:6)
> r
[1] 1 2 3 4 5 6
> # use the outer product function 'outer' to create a table of which values in D == values in r
> outer(r,D,"==")
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
[1,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
[5,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
> # these are all logical values that we want to sum to create frequencies
> # if we multiply by 1 R will convert these to numeric values for us
> 1*outer(r,D,"==")
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    0    0    1    0    0    0    0    0    0     0     0     1
[2,]    0    0    0    0    0    0    0    0    0     0     0     0
[3,]    0    0    0    0    1    0    0    0    0     0     0     0
[4,]    1    0    0    1    0    0    0    1    0     1     0     0
[5,]    0    1    0    0    0    0    1    0    0     0     0     0
[6,]    0    0    0    0    0    1    0    0    1     0     1     0
> # sum the rows of the equality table to obtain the frequencies
> # there happens to be a function called 'rowSums' that will do just that
> rowSums(1*outer(r,D,"=="))
[1] 2 0 1 4 2 3
> 

Next we combine the range values with their respective frequencies into a table.

> # let's assign the frequencies to a variable f
> f = rowSums(1*outer(r,D,"=="))
> f
[1] 2 0 1 4 2 3
> # Now let's combine the range of values with their frequencies
> matrix(c(r,f),length(f),2)
     [,1] [,2]
[1,]    1    2
[2,]    2    0
[3,]    3    1
[4,]    4    4
[5,]    5    2
[6,]    6    3
> # to see them in horizontal representation use the matrix transpose function 't'
> t(matrix(c(r,f),length(f),2))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    2    0    1    4    2    3
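> # an aside: 'rbind' gives the same horizontal layout more directly
> # (and lets us name the rows)
> rbind(value = r, freq = f)
      [,1] [,2] [,3] [,4] [,5] [,6]
value    1    2    3    4    5    6
freq     2    0    1    4    2    3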
> # R has a 'barplot' function to make a nice graph of the frequencies
> # however it's not quite smart enough to break up our matrix representation
> # so we will go back to using r and f vectors
> barplot(f)
>

> # Not very informative. If we look up the function, we can add some parameters
> # to name the bars and add axis labels and a title
> barplot(f,names.arg = r,main="Die Roll Frequencies",xlab="Die values",ylab="Occurrences")
> 

Now, R has a built-in function to create a frequency table. It's called 'table':

> table(D)
D
1 3 4 5 6 
2 1 4 2 3 
> 

It doesn't quite provide the same thing we calculated. This is the frequency list of only the distinct values that actually occurred. Remember the die value 2 had no rolls in our list, so 'table' doesn't include it.
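
If you want the zero counts included, one common idiom (a sketch) is to turn the data into a factor whose levels cover the whole range before tabulating:

> fD <- factor(D, levels = 1:6)  # declare all six faces as possible values
> table(fD)
fD
1 2 3 4 5 6 
2 0 1 4 2 3 
> 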

Median and Quartiles

The median is the "middle" observation when you look at a set of sorted data. Rather than being a complex formula, this is more of a positional definition. Sometimes it actually is the exact middle value (which happens for an odd number of items); for example, given 3 4 5 6 7, then 5 is the median and the actual middle number. But if you had 4 5 6 7, there is no actual middle number. In this case the average of the two middle numbers, 5.5, would be considered the median. To find the median in R we would perform the following vector operations:

  • sort the data into a new sorted vector
  • find the index of the middle number (if there is an odd number of values) or find the middle 2 numbers (if an even number of values)
  • return the number found, or the average of the two middle numbers.

Median for even length vector

> # Median by hand
> # sample data:
> M = c(22, 14, 32, 30, 19, 16, 28, 21, 25, 31)
> M
 [1] 22 14 32 30 19 16 28 21 25 31
> # need to sort the data. We will cheat and use the sort function in R rather than 
> # doing it by hand
> sM = sort(M)
> sM
 [1] 14 16 19 21 22 25 28 30 31 32
> length(sM)
[1] 10
> # length is even so we divide length by 2 and get the value at the calculated index
> # and we need the value at the calculated index + 1 as well
> midx = length(sM)/2
> midx
[1] 5
> midx+1
[1] 6
> # so jumping ahead a couple of steps and putting all together
> (sM[midx]+sM[midx+1])/2
[1] 23.5
> # That's the median that lies between the 2 values
> sM[midx]
[1] 22
> # and
> sM[midx+1]
[1] 25
> 

Median for odd length vector

> # let's add a value to our even vector to make the length odd
> M1 = c(M,40)
> M1
 [1] 22 14 32 30 19 16 28 21 25 31 40
> # the formula for the middle index when the length is odd
> m1idx = (length(M1)+1)/2
> m1idx
[1] 6
> # we still need to sort before we find the median
> sM1 = sort(M1)
> sM1
 [1] 14 16 19 21 22 25 28 30 31 32 40
> # just select the median value at m1idx now
> sM1[m1idx]
[1] 25
>

Package it into an R script/function

mymedian <- function(myvector) {
# we will need to sort the vector for both cases
# let's do that now
 sM = sort(myvector)
 if (0 == length(myvector)%%2) {
  # even length code
  midx = length(sM)/2
  z = (sM[midx]+sM[midx+1])/2
 } else {
  # odd length code
  midx = (length(sM)+1)/2
  z = sM[midx]
 }
 z
}
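
As an aside, you can avoid the if statement entirely: for a sorted vector the median is always the average of the elements at positions floor((n+1)/2) and ceiling((n+1)/2), and those two positions coincide when n is odd. A sketch of that variation (mymedian2 is just an illustrative name, not part of the script above):

mymedian2 <- function(myvector) {
 # as in mymedian, sort first
 sM = sort(myvector)
 n = length(sM)
 # floor and ceiling pick the same middle element when n is odd,
 # and the two middle elements when n is even
 (sM[floor((n+1)/2)] + sM[ceiling((n+1)/2)])/2
}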

Now test the function on the two data sets and compare mymedian against R's built-in median function:

> # Test out mymedian and R's median function
> # first what were the data sets?
> M
 [1] 22 14 32 30 19 16 28 21 25 31
> M1
 [1] 22 14 32 30 19 16 28 21 25 31 40
> # source the mymedian function
> source("/Users/Nasty/Downloads/LearnBayes/R/mymedian.R")
> mymedian
function(myvector) {
# we will need to sort the vector for both cases
# let's do that now
 sM = sort(myvector)
 if (0 == length(myvector)%%2) {
  # even length code
  midx = length(sM)/2
  z = (sM[midx]+sM[midx+1])/2
 } else {
  # odd length code
  midx = (length(sM)+1)/2
  z = sM[midx]
 }
 z
}
> mymedian(M)
[1] 23.5
> mymedian(M1)
[1] 25
> # what about R's built in median function
> median(M)
[1] 23.5
> median(M1)
[1] 25
> 

For any statistical definition, you could code it yourself in R. But R was built with statistics in mind, so someone has probably already beaten you to it. This means that there is probably a built-in function or a library function that already has that functionality. However, by going straight from a statistics book's mathematical definition to the R function, you lose the transition from mathematics into implementation that would cement the math in your mind. Your professor is going to expect that you know the theoretical math to some extent, so don't overlook it. If you're having trouble understanding just what the math is driving at, try to implement it (if you have the time). Your text should have some simple examples with data you can test against. You won't have to do it every time, and you can always Google an implementation and just read it to see what it looks like. It may give you some understanding of what the textbook is trying to define mathematically.

Quartiles

> # Quartiles are the numbers that mark the boundaries such that
> # one quarter of the data lies between any two consecutive quartile numbers
> # the easiest quartile to find is quartile 2. It's just the median
> Q2 = median
> # Q1 is the median of all numbers less than the value found by Q2
> Q1 <- function (myvec){median(myvec[which(Q2(myvec)>myvec)])}
> # lets set up some test data
> u = c(22,14,32,30,19,16,28,21,25,31)
> Q2(u)
[1] 23.5
> Q1(u)
[1] 19
> # Q3 is the median of all numbers greater than Q2 (ie. the median)
> Q3 <- function (myvec){median(myvec[which(Q2(myvec)<myvec)])}
> Q3(u)
[1] 30
> 
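
R also has a built-in 'quantile' function. Be aware that by default it interpolates between data points (its documentation lists nine different calculation types), so its answers won't always match the simple median-of-halves definitions above; for our test data it gives 19.5 and 29.5 where our Q1 and Q3 gave 19 and 30:

> quantile(u)
  0%  25%  50%  75% 100% 
14.0 19.5 23.5 29.5 32.0 
> 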

five-statistic

This is a summary of the data producing the following values:

min, Q1, Q2, Q3, max

> # five-statistic definition
> five <- function(myvec){c(min(myvec),Q1(myvec),Q2(myvec),Q3(myvec),max(myvec))}
> five(u)
[1] 14.0 19.0 23.5 30.0 32.0
> 
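
R's built-in counterpart here is 'fivenum', which computes Tukey's five-number summary. Its hinge rule differs slightly from our Q1/Q3 in general, although for this particular data the numbers agree:

> fivenum(u)
[1] 14.0 19.0 23.5 30.0 32.0
> 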

Mode

One thing that R does not provide a built-in for is the statistical mode, defined as the most frequently occurring data value in your vector of data. So this one we have to define ourselves. Following Prof. Smillie's J example but using R equivalents (including a small selfclassify helper built from the outer product trick used in the frequencies section), this is what the code looks like.

# return the indexes of the most frequent values
# the %in% tests membership and produces a vector of truth values
# 'which' converts the TRUEs to their indexes
# imx - short for indexes of the maximum value of a vector
imx <- function(myvec){which(myvec %in% max(myvec))}

# selfclassify - a helper, echoing J's self-classify idiom: compare the unique
# values of a vector against the whole vector with an outer product, giving a
# 0/1 table just like the 1*outer(r,D,"==") table in the frequencies section
selfclassify <- function(myvec){1*outer(unique(myvec), myvec, "==")}

# ufrq - returns a frequency list of all values that actually occur;
# values in the domain that never appear are not counted
# ufrq - short for unique values' frequencies
ufrq <- function(myvec){rowSums(selfclassify(myvec))}

# simpleMode - the most frequent value (the statistical mode) in a vector
# it relies on the fact that ufrq produces frequencies in the same order that
# 'unique' returns the unique members of the data
simpleMode <- function(myvec){unique(myvec)[imx(ufrq(myvec))]}
> source("/Users/Nasty/Downloads/LearnBayes/R/simpleMode.R")
> imx
function(myvec){which(myvec %in% max(myvec))}
> ufrq
function(myvec){rowSums(selfclassify(myvec))}
> simpleMode
function(myvec){unique(myvec)[imx(ufrq(myvec))]}
> D
 [1] 4 5 1 4 3 6 5 4 6 4 6 1
> simpleMode(D)
[1] 4
> simpleMode(c(1,2,3,2,3,2,3,4))
[1] 2 3
> simpleMode(c(1,2,3,4))
[1] 1 2 3 4
> 
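
An alternative sketch uses the 'table' function from the frequencies section: keep the names of the table entries whose count equals the maximum count. Note that the names come back as character strings, so we convert them back to numbers:

> tb <- table(D)
> as.numeric(names(tb)[tb == max(tb)])
[1] 4
> 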

Variance

This is a measure of how close each element of a data set is to the mean of the data set. It involves summing the squares of the differences between each value and the mean. Some people have trouble with this because it introduces squaring when, logically, if you wanted to find out how far a value is from the mean, you would just subtract the smaller number from the larger and have your answer. So a naive set of steps might look like the following R code:

m <- mean(values)
diff <- numeric(length(values))  # somewhere to put each distance
for (i in 1:length(values)) {
 if (m > values[i]) {
  diff[i] <- m - values[i]
 } else {
  diff[i] <- values[i] - m
 }
}
diff
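
In real R, of course, the whole loop collapses to one vectorized expression using the built-in absolute value function:

> abs(w - mean(w))  # element-wise distance of each value from the mean
[1] 1.9 0.8 0.7 1.8
> 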

That works, but absolute values don't lead to a very succinct mathematical formula. The mathematical way of doing this without all the if statements is the following:

  • For each \(i = 1 \ldots n\), compute \(values_{i} - mean\).
  • That will give some negative numbers, and one mathematical way of getting rid of negatives is to square them.
  • So for each \(i = 1 \ldots n\), compute \((values_{i} - mean)^{2}\).
  • This gives all positive numbers that are related to the actual distances.
  • Separate distance-related numbers don't tell us much on their own, so sum all the squared distances and divide by the number of values; in essence, take the mean of the squared distances.
  • Divide by N or N-1. What's up with that?
    • Population - the total universe of everybody or everything you want to study. If you have data on every single subject of interest, you divide by N.
    • Sample - where you look at some number of subjects less than the whole population (i.e. you 'sample' the population). Since you sampled, you could have been unlucky and chosen subjects that were very similar, with very similar data, while larger variation remains in the population. To account for that you divide by N-1, which enlarges the variance. As your sample size grows, the difference between dividing by N and N-1 becomes negligible. For example, with N=3, 1/2 = 0.5 and 1/3 = 0.33 differ by 0.17; but with N=100, 1/99 = 0.0101 and 1/100 = 0.01 differ by only about 0.0001. Naively you would expect more samples to give higher accuracy; this is just a mathematical way of expressing it.

Putting the steps together, the sample variance is \[s^{2} = \frac{\displaystyle\sum_{i=1}^{n}(values_{i}-mean)^{2}}{n-1}\]
> # the variance calculation done step by step
> # remember the w data from before?
> w
[1] 2.3 5.0 3.5 6.0
> mean(w)
[1] 4.2
> w - mean(w)
[1] -1.9  0.8 -0.7  1.8
> (w - mean(w))^2
[1] 3.61 0.64 0.49 3.24
> sum((w - mean(w))^2)
[1] 7.98
> sum((w - mean(w))^2)/(length(w)-1)
[1] 2.66
> # or using the built-in var function in R
> var(w)
[1] 2.66
> 
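
Note that 'var' uses the N-1 (sample) divisor. If your data really is the whole population, a quick rescaling sketch gives the divide-by-N version:

> # rescale the sample variance to the population (divide-by-N) variance
> var(w) * (length(w) - 1) / length(w)
[1] 1.995
> 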

Standard Deviation

Variance is useful for a variety of things that you may go into in your course. But you are probably thinking: if I just took the absolute value of my numbers, I would have a more exact representation of the distance of each value from the mean. You would be correct, but the accepted mathematical convention is to take the square root of the variance instead and call it the standard deviation.

> var(w)^0.5   # remember that taking the square root is the same as raising to the 1/2 power
[1] 1.630951
> sqrt(var(w)) # in case you didn't believe me
[1] 1.630951
> sd(w)
[1] 1.630951
>

Create a Statistics Summary

Let's put this all together and create a statistics summary: essentially a report or table of everything we have just covered.

ssummary <- function(myvec){
 cat("Sample size\t\t\t",length(myvec),"\n")
 cat("Minimum\t\t\t\t",min(myvec),"\n")
 cat("Maximum\t\t\t\t",max(myvec),"\n")
 cat("Arithmetic mean\t\t",mean(myvec),"\n")
 cat("Variance\t\t\t\t",var(myvec),"\n")
 cat("Standard deviation\t",sd(myvec),"\n")
 cat("First Quartile\t\t",Q1(myvec),"\n")
 cat("Median\t\t\t\t",median(myvec),"\n")
 cat("Third Quartile\t\t",Q3(myvec),"\n")
}

So let's use it in R:

> # using summary on our dice data variable D
> D
 [1] 4 5 1 4 3 6 5 4 6 4 6 1
> ssummary(D)
Sample size          12 
Minimum    1 
Maximum    6 
Arithmetic mean   4.083333 
Variance   2.992424 
Standard deviation  1.729862 
First Quartile   1 
Median    4 
Third Quartile   6 
> 

Now if you want something fancier, there are ways to export this to LaTeX or HTML rather than just printing to the screen in the R console; that is for you to look up and figure out. By the way, R has a built-in 'summary' function (its quartiles are interpolated, so they differ from our simple Q1 and Q3):

> summary(D)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.750   4.000   4.083   5.250   6.000 
>

Conclusion

There is much more on statistics and its implementation in J in Dr. Smillie's paper, the main reference for this post.

Bibliography

  1. Smillie, K. (1999, January). J Companion for Statistical Calculations. Retrieved February 05, 2018, from https://webdocs.cs.ualberta.ca/~smillie

Author: NASTY OLD DOG
