Wednesday, February 21, 2018

R Basic Vector/Matrix Stuff (for the Statistically Inclined but Computer Programming Challenged)

Introduction

After some feedback on my previous R blog I have found that a 'Newbie' R/statistics person needs a better foundation in the vector arithmetic and representation that underlies R. I thought the cursory look provided in my previous blog would suffice. I realize now that R provides multiple ways of accessing vectors and matrices (especially matrices) that hide the "Vectorness" inherent in the language. There are many things in R that older programmers have already had experience with. The original vector language, developed by IBM, was known as APL. Dr. Ken Iverson developed a specialized math syntax while at Harvard, and IBM hired him to implement that syntax as a computer programming language (the original concepts are detailed in reference [2]). This all happened in the 1960s. Those who learned computer science in the 60s and 70s would have had exposure to this language. It has continued on, and there is even a free GNU version available today [6]. The problem for many people was the strange symbols that were the basis of the language. Since APL there have been many offshoots that carry forward this idea of vectors being the built-in data structure of the language, but with a design change that uses the standard characters found on your keyboard for syntax. The language K is probably the most successful commercial implementation of this offshoot [3]. R is probably the most successful open source implementation of these concepts. My personal favorite is the J language, which the late Dr. Iverson developed as a redesign of his APL concepts. J has an active user forum, and its website hosts a great collection of articles on the history of APL and Dr. Iverson, along with many technical articles showing various uses of J in different areas (see reference [1]).

This history, which many professors and teachers experienced first hand, makes R difficult for them to explain. It is very easy to assume something is a simple concept because you forget that you didn't learn it in R: you learned it in some other computer language, programming different types of things. Jumping into R was not that difficult for you, and you appreciate how R has transformed some menial tasks into simple function calls. The 'Newbie', though, is left with many WTF moments as things seem to happen by magic. The goal of this blog post is to show you how the basic vector concepts are in everything that you do. This will help you as you try to dissect your data stored in a table. R layers a lot on top of that data to help facilitate creating charts and statistics, but in the end it is all just vectors and matrices (aka arrays and tables).

Vectors/Arrays

A vector has its roots in physics. The idea behind it is that many physical properties are described by a value and a direction. I may push something along at 25 miles per hour, but that is only part of the story: I am also pushing it along in a certain direction. Once I come up with a way of indicating direction, I must carry 2 values along to let you know exactly what I am doing. So the concept of a vector is a way of carrying around multiple values to describe a single thing. In math and in computers it's not hard to envision that we might want to carry around more than just 2 values. Why not 3? There are, after all, 3 dimensions. Why not 10? Why not 1000? Hence for our purposes a vector is a way of carrying around multiple pieces of information and referencing them by a single name and an index. Mathematics uses a subscript to identify a particular item in a vector:

\[x = \{2,4,6,8\}\] \[x_1 = 2 \] \[x_4 = 8 \]

In R access to individual vector elements is accomplished as follows:

> x = c(2,4,6,8) #combine 2 4 6 8 into a vector and store it in x
> x
[1] 2 4 6 8
> # since subscripting is a pain in the neck R uses square brackets
> x[2]
[1] 4
> x[1]
[1] 2
> x[4]
[1] 8
> 

Seems easy enough. In math rather than write out every element of a vector we can use an ellipsis to continue an established pattern. So for example to represent the numbers from 1 to 100 in a vector in Math we do the following:

\[x = \{1,2,3,4,\ldots,99,100\} \] \[x_3 = 3 \] \[x_{98} = 98 \]

R rotates the ellipsis and uses the ':' (the colon) to implement similar functionality:

> x = c(1:100)
> x
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
 [22]  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42
 [43]  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
 [64]  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84
 [85]  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
> x[3]
[1] 3
> x[98]
[1] 98
> 

Now here is where R can be deceiving. The colon operator is like the ellipsis, but not exactly like it: the colon is only good for generating an increment-by-one pattern. So, for example, in math

\[ x = \{2,4,6,\ldots,20,22\} \]

You instinctively understand I mean to count by 2's up to 22. Trying this in R with the colon operator just increments by 1s from 6 to 20:

> x = c(2,4,6:20,22)
> x
 [1]  2  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 22
> # from 6 to 20 R counts by 1s it doesn't try to infer my pattern 

Now that doesn't mean I have to enter every value for R if I want to count by 2's. But it does mean I have to be more arithmetically explicit in what I tell R to do. Counting by 2's is just counting by 1's up to half the maximum value and multiplying the result by 2. So to accomplish the same thing in R:

> x = 2 * c(1:11)
> x
 [1]  2  4  6  8 10 12 14 16 18 20 22
>
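
Worth knowing: base R also has a 'seq' function that generates stepped sequences directly, so the multiply trick above isn't the only way. A quick sketch:

> # seq takes a start, an end, and a step size via the 'by' parameter
> seq(2, 22, by=2)
 [1]  2  4  6  8 10 12 14 16 18 20 22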

R does do one bit of inference with the colon operator:

> # one thing R will infer is that if you reverse the order and put the larger number first
> # R will count backwards for you
> x = c(11:1)
> x
 [1] 11 10  9  8  7  6  5  4  3  2  1
> 

But even if R didn't do this, it would be easy to reconstruct with some added R functionality: the reverse function 'rev', which returns a vector in reverse order.

> # Create a reverse order without switching
> x = c(1:11)
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11
> rev(x)
 [1] 11 10  9  8  7  6  5  4  3  2  1
> # in one line
> x = rev(c(1:11))
> x
 [1] 11 10  9  8  7  6  5  4  3  2  1
> 

I hope at this point you can extrapolate and realize that by investigating the functions available in R, we can create our own vectors of data without having to resort to reading them in from a file. This comes in handy for putting together simple testing data.

Matrix/Matrices

A Matrix wasn't originally a computer-driven reality built to enslave people and power machines. It is just a mathematical concept for a table of values, and an extension of the concept of a vector. While a vector has multiple values, it is considered a one-dimensional object: I need only one index to obtain a value. If I took a set of vectors of the same length and piled them on top of each other, I would create a table or matrix. In mathematical notation you just put a table of numbers in parentheses:

\[ M = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 11 & 12 & 13 & 14 & 15 \\ 21 & 22 & 23 & 24 & 25 \end{pmatrix} \]

\[ M_{1,2} = 2 \] \[ M_{3,3} = 23 \]

Matrices can be created directly in R, but first a little segue to go from vectors to matrices. In R, start by creating 3 vectors of 5 elements each: Vector1 = {1,2,3,4,5}, Vector2 = {11,12,13,14,15} and Vector3 = {21,22,23,24,25}. To save typing call them V1, V2, and V3. Here is the R session to set that up.

> # 3 Vectors of length 5 (notice I use a little math to help create different values)
> V1 = c(1:5)
> V2 = 10+V1
> V3 = 20+V1
> V1
[1] 1 2 3 4 5
> V2
[1] 11 12 13 14 15
> V3
[1] 21 22 23 24 25
> # notice that R added a number to the whole vector V1
>

Even though I had to type each variable to display the data, notice the natural tabular form in the last 3 lines of numbers above: they look like 3 rows of a table. If I wanted the second element of the first row, the 4th element of the second row and the 1st element of the third row, I could access them all as follows (continuing with the vectors I have set up):

> V1[2]
[1] 2
> V2[4]
[1] 14
> V3[1]
[1] 21
> 

I named the vectors with numbers purposefully. If I could form a table, and R could extend its indexing to account for rows and columns (which it does), I could use one variable name and access any element by just giving the row and column number of that element. V1[2] would be M[1,2] in a table constructed of these vectors and stored in M. Similarly V2[4] -> M[2,4] and V3[1] -> M[3,1]. Not only do I save typing, but I can also create loops able to go through every member of the matrix in almost any conceivable order I can imagine making looping programs do.

Experimenting with R and its matrix creation function, I was able to use the vectors to create a table with each vector above as one row. I did (initially) have to use the matrix transpose function 't'. Transpose flips a matrix by swapping rows for columns (look up matrix transpose if you don't quite understand what it's doing from the session below). In the end I figured out the proper parameters for the matrix function to pile the vectors on top of each other (in row fashion) in one fell swoop.

> # Use matrix function to create a matrix from V1, V2, and V3
> M = matrix(c(V1,V2,V3),nrow=3,ncol=5)
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4   12   15   23
[2,]    2    5   13   21   24
[3,]    3   11   14   22   25
> # matrix fills columns first not rows what to do?
> t(M)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5   11
[3,]   12   13   14
[4,]   15   21   22
[5,]   23   24   25
> # Lets flip the dimensions around and see what happens
> M = matrix(c(V1,V2,V3),nrow=5,ncol=3)
> M
     [,1] [,2] [,3]
[1,]    1   11   21
[2,]    2   12   22
[3,]    3   13   23
[4,]    4   14   24
[5,]    5   15   25
> # since matrix fills columns first lets fill a vector per column by switching dimensions
> # like above. Now transpose should get us the form we were looking for which is a 
> # vector per row
> t(M)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]   11   12   13   14   15
[3,]   21   22   23   24   25
> # so lets put it all into one line to make a matrix of our three vectors with each
> # vector in its own row
> M = t(matrix(c(V1,V2,V3),nrow=5,ncol=3))
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]   11   12   13   14   15
[3,]   21   22   23   24   25
> # Now M[1,2] should match V1[2], M[2,4] = V2[4] and M[3,1] = V3[1]
> M[1,2]
[1] 2
> V1[2]
[1] 2
> M[2,4]
[1] 14
> V2[4]
[1] 14
> M[3,1]
[1] 21
> V3[1]
[1] 21
> # Had I dug a little deeper into the matrix function there is a flag to fill by called 'byrow'
> M = matrix(c(V1,V2,V3),nrow=3,ncol=5,byrow=TRUE)
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]   11   12   13   14   15
[3,]   21   22   23   24   25
> # got the matrix in 1 step

The above session has an important nuance. I assumed that R would think the way I do: put vectors into rows. But as the session unfolded it was clear that R is column oriented by default. I was able to adjust once I saw the way R was doing things. This is important! As you begin to think in terms of vector and matrix operations, you may find the answer coming from R is not formatted properly, or the data doesn't seem to have the right appearance. When you see weird things happening you must break down your operations and make sure you and R are on the same page (more so you, since R is not going to change). When in doubt go to one operation per line and display the results of each operation (or a portion thereof if you have a considerable amount of data). Verify that each operation you are performing does what you expect. You would be surprised how one small typographical error can cause you hours of debugging and anxiety. Your mind will overlook the small error because it fills in a missing operation as you are looking at it (or ignores an extra one). By breaking things down you verify to yourself that each operation works as intended.
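
Incidentally, base R also provides 'rbind' ("row bind"), which stacks vectors as rows directly and would have avoided the transpose dance above entirely. A quick sketch:

> # rbind stacks each argument as one row of the resulting matrix
> M = rbind(V1, V2, V3)
> M
   [,1] [,2] [,3] [,4] [,5]
V1    1    2    3    4    5
V2   11   12   13   14   15
V3   21   22   23   24   25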

Row and Column names

I used the term 'table' above rather loosely. Don't confuse it with any add-on packages that have tables; I mean it in the simplest sense, as a way of describing 2-dimensional data. R has another table-type structure called a 'data frame'. So what's the difference between a matrix (which I have shown as a 'table' of numbers) and an R data frame? In a data frame you can have a mix of data types between columns. Each individual column needs to hold data of a single type, but the next column can have a completely different data type (as long as it's consistent within that column). So in a matrix all the data must be of the same type across all rows and columns, while in a data frame data types can mix on a column-by-column basis.
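
A quick way to see the one-type-per-matrix rule in action is to mix types in a single vector and watch R coerce everything to a common type:

> # the number 1 is silently converted to the character string "1"
> c(1, "a")
[1] "1" "a"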

Now, you access data in a 'data frame' by indexing the same way as you do with a matrix. The trick is not to do any operation on that data that is inconsistent with the data type of the column. So in a matrix (since all the data is the same type) I can add together any 2 selected elements (provided the data is of numeric type).

> # Create a vector of 25 elements from 1 to 25
> v <- 1:25
> v
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> # Use vector v to create a matrix that is 5x5 of those elements
> m <- matrix (v,5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
> # Add m[2,3] and m[3,2] together
> m[2,3]
[1] 12
> m[3,2]
[1] 8
> m[2,3]+m[3,2]
[1] 20
>

Nothing surprising. I make a matrix of integer values and I can add them together any way I please.

What about naming columns and rows? It turns out there are multiple ways of naming columns and rows depending on whether the underlying data structure is a matrix or a 'data frame'. The following calls work the same across both structures. A 'data frame' also has a built-in $ operator that is used to access a whole column of data by name. I include its use in the session below:

> # Give names to the columns and rows
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
> colnames(m) <- c("C1","C2","C3","C4","c5")
> m
     C1 C2 C3 C4 c5
[1,]  1  6 11 16 21
[2,]  2  7 12 17 22
[3,]  3  8 13 18 23
[4,]  4  9 14 19 24
[5,]  5 10 15 20 25
> # Now the rows
> rownames(m) <- c("r1","R2","r3","R4","r5")
> m
   C1 C2 C3 C4 c5
r1  1  6 11 16 21
R2  2  7 12 17 22
r3  3  8 13 18 23
R4  4  9 14 19 24
r5  5 10 15 20 25
> # We can still access with number indexes as before
> m[2,3]
[1] 12
> # But now we can use names as indexes instead
> m ["R2","C3"]
[1] 12
> # Is this where we can start using the $ in the variable name?
> m$C2
Error in m$C2 : $ operator is invalid for atomic vectors
> # No we can't use that type of access for a matrix
> # Turn m into a dataframe d and see what we can do
> d <- as.data.frame(m)
> d
   C1 C2 C3 C4 c5
r1  1  6 11 16 21
R2  2  7 12 17 22
r3  3  8 13 18 23
R4  4  9 14 19 24
r5  5 10 15 20 25
> # It doesn't look that much different but here are the different ways
> # to access data.
> d[2,3]
[1] 12
> d["R2","C3"]
[1] 12
> d["R2",]$C3
[1] 12
> d$C3
[1] 11 12 13 14 15
> d[2,]
   C1 C2 C3 C4 c5
R2  2  7 12 17 22
> d["R2",]
   C1 C2 C3 C4 c5
R2  2  7 12 17 22
> 

Data Frames

The data frame's strength comes from being able to handle tabular data of different data types. The following session creates a data frame with a mix of data types and shows how you have to be careful which operations you choose to do. By supplying column names in the creation of the 'data frame' there is no need to perform a separate operation to insert them into the 'data frame'.

> d2 <- data.frame(C1=c(1:5),C2=c("a","b","c","d","e"),C3=c("john","joesph","james","jane","janet"))
> d2
  C1 C2     C3
1  1  a   john
2  2  b joesph
3  3  c  james
4  4  d   jane
5  5  e  janet
> d2[1,1]+d2[3,1]
[1] 4
> d2[1,1]+d2[1,2]
[1] NA
Warning message:
In Ops.factor(d2[1, 1], d2[1, 2]) : ‘+’ not meaningful for factors
> # We can do some comparisons on the character data
> "a" == d2[2,2]
[1] FALSE
> "a" == d2[1,2]
[1] TRUE
> "james" == d2[3,2]
[1] FALSE
> "james" == d2[3,3]
[1] TRUE
> d2[1,]
  C1 C2   C3
1  1  a john
> d2$C2
[1] a b c d e
Levels: a b c d e
> 
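
A side note on those factors: in the R versions current at this writing, data.frame converts character columns to factors because its stringsAsFactors argument defaults to TRUE. Passing FALSE keeps plain character vectors; a quick sketch:

> # keep the character column as characters rather than factor levels
> d3 <- data.frame(C2=c("a","b","c"), stringsAsFactors=FALSE)
> d3$C2
[1] "a" "b" "c"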

The other strength of a 'data frame' is that it works seamlessly with functions that read in comma-separated values. This allows you to pull in data sets from databases or websites and operate on them easily. Since comma-separated value files usually include a first line of column names, the 'data frame' will already have column names in place after a read operation.
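
As a minimal sketch (assuming a hypothetical file named scores.csv whose first line holds column names), reading one in looks like this:

# read.csv returns a data frame; header=TRUE is the default, so the
# first line of scores.csv becomes the column names automatically
d4 <- read.csv("scores.csv")
colnames(d4)  # the names that came from the file's header line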

Conclusion

These topics are covered in more depth in the PDF text "An Introduction to R" [7]. Hopefully this blog has provided some insight into the workings of R and vector languages in general. The purpose here was to give just enough vector stuff to get you through debugging a statistics assignment when things go wrong. Usually the data is structured in a manner that's different from how your mind perceives it, which causes you to make improper function calls. I can't say this enough: when in doubt, break things down! Try functions on smaller pieces of data and make sure you get an answer you expect. Once things are operating the way you expect, you can extrapolate up to larger datasets.

References

  1. http://www.jsoftware.com/ A great vector-based language with an excellent forum to search on various subjects. There is an R interface to the J language, so you can work in J and use R when you need something statistical that J doesn't have. Search the website for Ken Iverson; they have some excellent essays on the beginnings of APL and vector languages.
  2. Iverson, Kenneth E. “A Programming Language.” A Programming Language, J Software Inc., 13 Oct. 2009, www.jsoftware.com/papers/APL.htm.
  3. https://kx.com/ The company that produces the K-language and Kdb (a database based on the K-language)
  4. http://www.r-tutor.com/ offers nice tutorials on various aspects of R. It also has some nice deep-learning info. Always seems to come up first when googling an R language reference
  5. https://stackoverflow.com/questions/2281353/row-names-column-names-in-r discussion on matrix and dataframe row and column names
  6. https://www.gnu.org/software/apl/ GNU's APL implementation
  7. https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf A good general (not so statistical) introduction to the language that covers many of these details in greater depth. It's a PDF; you should download a copy.

Author: NASTY OLD DOG

Tuesday, February 6, 2018

Simple Statistics and the R Language

Introduction

Statistics plays such an important role in so many different fields that a large portion of college students will find themselves having to take a statistics course. That course will generally involve learning how to use a statistics package, and at the tail end of the second decade of the 21st century that package will likely be the R language. As a new student to statistics you aren't privy to the steps leading up to the R language. You didn't have to live through the use of hand-written routines to calculate the statistical functions on a computer. You didn't have to live through the days of using a hand calculator with statistical function keys. You don't need to wade through statistical function tables in the appendix of a book to calculate your statistical distribution functions. At the same time, though, you become detached from the actual mathematics that is statistics. The pre-packaged functions of the R statistical library hide all that hard work from you.

My introduction to statistics happened rather late in my life, in the first decade of the 21st century, in the Harvard Extension biostatistics class. At that time there were various statistics packages that cost a considerable amount of money to purchase (luckily there were educational discounts). The book used for the course provided examples of solutions in a couple of different statistical packages but centered around one called Stata. By seeing the same problem solved with different packages side by side, the actual mathematics of statistics became the basis for understanding the idiosyncrasies between the statistics packages. This maintained a link between the mathematical representation of the statistics formulas and their use in the statistics software.

With R being the popular statistics language there is less of a tendency to dive back into the mathematical formulas, and you may lose some understanding of the mathematics in the process. This blog post is meant to be an elementary bridge between some of the simple formulas of statistical mathematics and their R built-in counterparts. To do this I am going to use (actually, rely on heavily for example data and answers) a paper by Dr. Keith Smillie of the University of Alberta. He produced a wonderful statistics package for the J language and summarizes it in the paper "J Companion for Statistical Calculations" [1]. I am going to transpose a subset of his implementation into R. Hopefully you will then be able to use that format to extrapolate the mathematical connections for subjects I haven't covered here. Use this type of framework to help you gain more understanding of the theory and implementation of statistics.

R is an Array Language

R uses vectors, or arrays, as a built-in element of its language. If you aren't used to array languages it can seem pretty strange at first, but once you get the hang of it it's pretty cool. You do have to think in terms of operating on the whole vector, because R has functions that streamline vector operations. You are welcome to break down those operations so they look like those of a regular programming language, but that may slow down processing time when dealing with large amounts of data.
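
A quick taste of what "operating on the whole vector" means (a trivial sketch):

> # one expression scales and shifts every element; no loop required
> v <- c(1, 2, 3, 4)
> v * 2 + 1
[1] 3 5 7 9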

R is an interpreted language

This just means that R grabs your R code and tries to execute it as soon as you type it in and hit the enter key. You can also save code in a file and load it into the R console environment, but you can do quite a bit at the command prompt in the R console. Look at the following R console session. The session uses the comment operator '#' to add commentary in line. The '#' is usually called the 'hash' sign (you might know it as the number sign). When R sees the hash sign it ignores the sign and everything after it to the end of the line. Anything that precedes the hash will be executed as R code.

> # The hash comments out the rest of the line in R
> 2.3 # entering a numeric constant R returns it as a vector of one value
[1] 2.3
> # Assignment using the back arrow operator <-
> w <- 2.3 # assigning 2.3 to the variable w
> w 
[1] 2.3
> # add 2 numeric constants
> 2.3 + 2
[1] 4.3
> # you can use the = sign for assignment as well
> w = 3
> w 
[1] 3
> # see we have changed w from 2.3 to 3

Creating vectors (arrays or lists if you prefer) of more than 1 value

You create vectors or lists by Concatenating (or Combining) values together. R has a c() function for doing just that.

> w <- c(2.3,5,3.5,6)  # w will now have multiple values in it
> # display w by typing it in at the command prompt
> w
[1] 2.3 5.0 3.5 6.0
> 

Arithmetic Mean

This is what is commonly known as the average of a list of values. We calculate it by taking a list of numbers, summing them, and dividing by how many numbers are in the list.

\[w = \{2.3,5,3.5,6\} \] \[n = 4 \]

Doing this by hand you compute the mean using the following formula: \[mean = \frac{\displaystyle\sum_{i=1}^{n}w_{i}}{n}\]

so

2.3 + 5 is 7.3
7.3 + 3.5 is 10.8
10.8 + 6 is 16.8

divide 16.8 by the total number of values (4): 16.8/4 is 4.2

How do we accomplish this in R? There are 3 categories of solutions:

  • The first is the old fashioned way: Loop through the values of interest that we stored in w, accumulate the sum of those numbers then after the looping is finished divide by the number of values.
  • The second is the vector way: use R vector operations to operate on the whole group of values
  • The third is to use an R built-in function to calculate in one step

Traditional looping average calculation

Let's create a function myave that does it the first way.

myave <- function (values_vector) {
 sum = 0
 for (i in 1:length(values_vector)) {
  sum = sum + values_vector[i]
 }
 # i should be the value of the last index and therefore the number of values
 sum/i
}

In the R console you will need to open the source editor by clicking the blank page icon in the tool bar at the top of the console. Save the above text into the editor window. Then click the 'source .r' icon and type in the name of your script.

> source("/Users/Nasty/Downloads/LearnBayes/R/myave.R") # where my file ended up
> myave
function (values_vector) {
 sum = 0
 for (i in 1:length(values_vector)) {
  sum = sum + values_vector[i]
 }
 # i should be the value of the last index and therefore the number of values
 sum/i
}
> # myave has a value of the function code so you can see it as above
> # Now lets use it create the vector of values
> w = c(2.3,5,3.5,6)
> w
[1] 2.3 5.0 3.5 6.0
> # plug it into myave
> myave(w)
[1] 4.2

A more vector way to calculate

> sum(w)
[1] 16.8
> sum(w)/length(w)
[1] 4.2
> 

This is pretty simple and to reduce typing you could create an R function to do it. You don't need the editor since it's a one-liner.

>  myave1 <- function(values) {sum(values)/length(values)}
> myave1
function(values) {sum(values)/length(values)}
> myave1(w)
[1] 4.2
> 

R built-in method

Now, R being a statistics package, it must have a predefined function that will do this. It does have something called 'ave', but 'ave' has a weird idiosyncrasy that the above functions don't share. Let's try it.

> ave(w)
[1] 4.2 4.2 4.2 4.2

It turns out that the 'ave' function runs the average on subsets of our values, so it produces a vector of averages, one per input element. Not quite what we wanted right now. So guess at a function name at your own risk. In the computer world there is an old acronym, RTFM, which in nice language stands for Read The Stinking Manual. So try to look up what you want to do (Google is your friend here).

So the built-in function we really want to use is 'mean'

> mean(w)
[1] 4.2

Now, mean has other parameters with default values; for our usage the defaults work. Depending on how in-depth your statistics course is, you may discover them. If you're interested, check https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/mean
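
For instance, one of those parameters is na.rm (default FALSE), which tells mean to drop missing values before computing. A quick illustration:

> # with an NA present, mean returns NA unless na.rm=TRUE
> mean(c(2.3, NA, 3.5, 6))
[1] NA
> mean(c(2.3, NA, 3.5, 6), na.rm=TRUE)
[1] 3.933333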

Frequencies

It is often helpful to look at frequencies of occurrences when analyzing data. So how would we use R vector operations to build a list of frequencies? The key is to know the complete range of values that the random process we are looking at can take on, because in a small sample some of the values may not show up, and their number of occurrences will be 0. To compare all the values of one vector with all the values of a second vector we use a function known as the 'outer product'. The outer product takes an operator and applies it between the values of the 2 vectors. R has an outer product built-in function called 'outer'. It takes 3 parameters: 2 vectors and an operator symbol. It then does all the heavy lifting of applying the operator between each element of vector1 and each element of vector2, producing a table of calculations. In our case, using the equality operator '==', we obtain a matrix of truth values.

> # frequencies from a list of die rolls stored in a vector
> D = c(4,5,1,4,3,6,5,4,6,4,6,1)
> D
 [1] 4 5 1 4 3 6 5 4 6 4 6 1
> # range of values of a single die
> r = c(1:6)
> r
[1] 1 2 3 4 5 6
> # use the outer product function 'outer' to compare each value in r with each value in D
> outer(r,D,"==")
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
[1,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
[5,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
> # these are all logical values that we want to sum to create frequencies
> # if we multiply by 1 R will convert these to numeric values for us
> 1*outer(r,D,"==")
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    0    0    1    0    0    0    0    0    0     0     0     1
[2,]    0    0    0    0    0    0    0    0    0     0     0     0
[3,]    0    0    0    0    1    0    0    0    0     0     0     0
[4,]    1    0    0    1    0    0    0    1    0     1     0     0
[5,]    0    1    0    0    0    0    1    0    0     0     0     0
[6,]    0    0    0    0    0    1    0    0    1     0     1     0
> # sum the rows of the equality table to obtain the frequencies
> # there happens to be a function called 'rowSums' that will do just that
> rowSums(1*outer(r,D,"=="))
[1] 2 0 1 4 2 3
> 

Next we combine the range values with their respective frequencies into a table.

> # lets assign the frequencies to a variable f
> f = rowSums(1*outer(r,D,"=="))
> f
[1] 2 0 1 4 2 3
> # Now lets combine the range of values with their frequencies
> matrix(c(r,f),length(f),2)
     [,1] [,2]
[1,]    1    2
[2,]    2    0
[3,]    3    1
[4,]    4    4
[5,]    5    2
[6,]    6    3
> # to see them in horizontal representation use the matrix transpose function 't'
> t(matrix(c(r,f),length(f),2))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3    4    5    6
[2,]    2    0    1    4    2    3
> # R has a 'barplot' function to make a nice graph of the frequencies
> # however it's not quite smart enough to break up our matrix representation
> # so we will go back to using r and f vectors
> barplot(f)
>

> # Not very informative. If we look up the function, we can add some parameters
> # to name the bars and place labels and a title
> barplot(f,names.arg = r,main="Die Roll Frequencies",xlab="Die values",ylab="Occurrences")
> 

Now, R has a built-in function to create a frequency table. It's called 'table':

> table(D)
D
1 3 4 5 6 
2 1 4 2 3 
> 

It doesn't quite provide the same thing we calculated: this is the frequency list of only the distinct values that appear. Remember the die value 2 had no rolls in our list, so 'table' doesn't include it.
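
If you want the zero counts included, one approach (a sketch using R's factor type, which carries the full set of possible levels along with the data) is:

> # a factor with explicit levels tells table the full domain of values,
> # so the unrolled 2 shows up with a count of 0
> f6 <- factor(D, levels=r)
> table(f6)
f6
1 2 3 4 5 6 
2 0 1 4 2 3 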

Median and Quartiles

Median is the "middle" observation when you look at a set of sorted data. So rather than being a complex formula this is more of a positional definition. Sometimes it actually is in the exact middle (which happens for an odd number of items) for example: 3 4 5 6 7 then 5 is the median and the actual middle number. But if you had: 4 5 6 7, there is no actual middle number. In this case 5.5 would be condidered the median. To find the median in R we would perform the following vector operations:

  • sort the data into a new sorted vector
  • find the index of the middle number (if there is an odd number of values) or find the middle 2 numbers (if an even number of values)
  • return the number found or calculated above.

Median for even length vector

> # Median by hand
> # sample data:
> M = c(22, 14, 32, 30, 19, 16, 28, 21, 25, 31)
> M
 [1] 22 14 32 30 19 16 28 21 25 31
> # need to sort the data. we will cheat and use the sort function in R rather than 
> # doing it by hand
> sM = sort(M)
> sM
 [1] 14 16 19 21 22 25 28 30 31 32
> length(sM)
[1] 10
> # length is even so we divide length by 2 and get the value at the calculated index
> # and we need the value at the calculated index + 1 as well
> midx = length(sM)/2
> midx
[1] 5
> midx+1
[1] 6
> # so jumping ahead a couple of steps and putting it all together
> (sM[midx]+sM[midx+1])/2
[1] 23.5
> # That's the median that lies between the 2 values
> sM[midx]
[1] 22
> # and
> sM[midx+1]
[1] 25
> 

Median for odd length vector

> # let's add a value to our even vector to make the length odd
> M1 = c(M,40)
> M1
 [1] 22 14 32 30 19 16 28 21 25 31 40
> # formula for the index of odd length
> m1idx = (length(M1)+1)/2
> m1idx
[1] 6
> # we still need to sort before we find the median
> sM1 = sort(M1)
> sM1
 [1] 14 16 19 21 22 25 28 30 31 32 40
> # just select the median value at m1idx now
> sM1[m1idx]
[1] 25
>

Package it into an R script/function

mymedian <- function(myvector) {
# we will need to sort the vector for both cases
# let's do that now
 sM = sort(myvector)
 if (0 == length(myvector)%%2) {
  # even length code
  midx = length(sM)/2
  z = (sM[midx]+sM[midx+1])/2
 } else {
  # odd length code
  midx = (length(sM)+1)/2
  z = sM[midx]
 }
 z
}

Now test the function on the 2 data sets, comparing mymedian against R's built-in median function:

> # Test out mymedian and R's median function
> # first what were the data sets?
> M
 [1] 22 14 32 30 19 16 28 21 25 31
> M1
 [1] 22 14 32 30 19 16 28 21 25 31 40
> # source the mymedian function
> source("/Users/Nasty/Downloads/LearnBayes/R/mymedian.R")
> mymedian
function(myvector) {
# we will need to sort the vector for both cases
# let's do that now
 sM = sort(myvector)
 if (0 == length(myvector)%%2) {
  # even length code
  midx = length(sM)/2
  z = (sM[midx]+sM[midx+1])/2
 } else {
  # odd length code
  midx = (length(sM)+1)/2
  z = sM[midx]
 }
 z
}
> mymedian(M)
[1] 23.5
> mymedian(M1)
[1] 25
> # what about R's built in median function
> median(M)
[1] 23.5
> median(M1)
[1] 25
> 

For any statistical definition, you could code it yourself in R. But R was built with statistics in mind, and someone has probably already beaten you to it. This means there is probably a built-in function or a library function that already has that functionality. However, by going straight from a statistics book's mathematical definition to the ready-made R function, you lose the transition from mathematics into implementation that would seal the math into your mind. Your professor is going to expect that you know the theoretical math to some extent, so don't overlook it. If you're having trouble understanding just what the math is driving at, try to implement it (if you have the time). Your text should have some simple examples with data you can test against. You won't have to do it every time, and you can always google an implementation and just read it to see what it would look like. It may give you some understanding of what the textbook is trying to define mathematically.

Quartiles

> # Quartiles are the numbers that mark the boundaries such that
> # one quarter of the data lies between any consecutive pair of quartiles
> # the easiest quartile to find is quartile 2: it's just the median
> Q2 = median
> # Q1 is the median of all numbers less than the value found by Q2
> Q1 <- function (myvec){median(myvec[which(Q2(myvec)>myvec)])}
> # lets set up some test data
> u = c(22,14,32,30,19,16,28,21,25,31)
> Q2(u)
[1] 23.5
> Q1(u)
[1] 19
> # Q3 is the median of all numbers greater than Q2 (ie. the median)
> Q3 <- function (myvec){median(myvec[which(Q2(myvec)<myvec)])}
> Q3(u)
[1] 30
> 
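
For what it's worth, R's built-in quantile function computes all of these at once, though be aware that its default interpolation method differs slightly from the median-of-the-halves definition used above (quantile actually supports nine different algorithms via its type parameter):

> quantile(u)
  0%  25%  50%  75% 100% 
14.0 19.5 23.5 29.5 32.0 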

five-statistic

This is a summary of the data producing the following values:

min,Q1,Q2,Q3,max

> # five-statistic definition
> five <- function(myvec){c(min(myvec),Q1(myvec),Q2(myvec),Q3(myvec),max(myvec))}
> five(u)
[1] 14.0 19.0 23.5 30.0 32.0
> 
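
R also has a built-in fivenum function that computes Tukey's five-number summary, which agrees with our five function on this data (the hinge definitions can differ slightly on other data):

> fivenum(u)
[1] 14.0 19.0 23.5 30.0 32.0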

Mode

One thing R does not provide a built-in for is the statistical mode. This is defined as the most frequently occurring data value in your vector of data, so this one we have to define ourselves. Following Dr. Smillie's J example but using R equivalents, this is what the code looks like (the selfclassify helper builds the same unique-values-versus-data equality table we built with 'outer' in the Frequencies section):

# return the indexes of the most frequent values
# the %in% tests membership and produces a vector of truth values
# which converts the TRUEs to their indexes
# imx - short for indexes of maximum value of a vector
imx <- function(myvec){which(myvec %in% max(myvec))}

# selfclassify - the 'self-classify' table: rows are the unique values
# of the vector, columns test each data element for equality, giving a
# 0/1 matrix (the outer-product trick from the Frequencies section)
selfclassify <- function(myvec){1*outer(unique(myvec),myvec,"==")}

# ufrq - returns a frequency list of all accounted for values.
# values in the domain that don't appear are not accounted for
# ufrq short for unique values frequencies
ufrq <- function(myvec){rowSums(selfclassify(myvec))}

# simpleMode - the simple most frequent value (statistical mode) in a vector list
# it relies on the fact that ufrq will produce frequencies in the order that 
# unique returns the unique members of the data
simpleMode <- function(myvec){unique(myvec)[imx(ufrq(myvec))]}

> source("/Users/Nasty/Downloads/LearnBayes/R/simpleMode.R")
> imx
function(myvec){which(myvec %in% max(myvec))}
> ufrq
function(myvec){rowSums(selfclassify(myvec))}
> simpleMode
function(myvec){unique(myvec)[imx(ufrq(myvec))]}
> D
 [1] 4 5 1 4 3 6 5 4 6 4 6 1
> simpleMode(D)
[1] 4
> simpleMode(c(1,2,3,2,3,2,3,4))
[1] 2 3
> simpleMode(c(1,2,3,4))
[1] 1 2 3 4
> 
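
For comparison, a common base-R idiom gets the same tie-aware answer straight from a frequency table. A sketch (note that names() returns character strings, hence the as.numeric conversion, which assumes numeric data):

> # keep every value whose count equals the maximum count
> tbl <- table(D)
> as.numeric(names(tbl)[tbl == max(tbl)])
[1] 4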

Variance

This is a measure of how close each element of a data set is to the mean of the data set. It involves summing the squares of the differences between each value and the mean. Some people have trouble with this because it introduces squaring, when logically, if you wanted to find out how far a value is from the mean, you would just subtract the smaller number from the larger and have your answer. So a naive set of steps might look like the following R-like pseudo code:

m <- mean(values)
diffs <- numeric(length(values))  # holds each value's distance from the mean
for (i in 1:length(values)) {
 if (m > values[i]) {
  diffs[i] <- m - values[i]
 } else {
  diffs[i] <- values[i] - m
 }
}
diffs

Now you could fix that up and get the distance of each value from the mean, but we don't have a very succinct mathematical formula. The mathematical way of doing this without all the if statements is the following:

  • For each \(i = 1, \ldots, N\), compute \(values_i - mean\).
  • Now that will give some negative numbers, and one mathematical way of getting rid of negatives is to square the numbers:
  • For each \(i = 1, \ldots, N\), compute \((values_i - mean)^2\).
  • This formula gives all positive numbers that are related to the actual distances.
  • But having separate distance-related numbers doesn't tell us much, so sum all the squared distances and divide by the number of values. In essence, take the mean of the squared distances.
  • Divide by N or N-1. What's up with that?
    • Population - the total universe of everybody or everything you want to study. If you have data on every single subject of interest, you divide by N.
    • Sample - where you look at some number of subjects less than the total population (i.e. you 'sample' the population). Since you sample, you could have been unlucky and chosen subjects that were very similar, with very similar data; there may still be larger variation in the population. To account for that, the variance is divided by N-1. This enlarges the variance, but as your sample size approaches the population size the difference between dividing by N and N-1 becomes negligible.
      • For example: (N=2) 1/2 = 0.5 and 1/3 = 0.33, a difference of 0.17.
      • But: (N=100) 1/100 = 0.01 and 1/99 = 0.0101, a difference of only 0.0001.
      • Naively you would think the more samples you have, the higher the accuracy; this is just a mathematical way to express it.
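
Putting those steps together, the sample variance formula (dividing by n-1) that the session below reproduces is:

\[ s^2 = \frac{\displaystyle\sum_{i=1}^{n}(w_i - \bar{w})^2}{n-1} \]
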
> # the built in variance calculation
> # remember the w data from before?
> w
[1] 2.3 5.0 3.5 6.0
> mean(w)
[1] 4.2
> w - mean(w)
[1] -1.9  0.8 -0.7  1.8
> (w - mean(w))^2
[1] 3.61 0.64 0.49 3.24
> sum((w - mean(w))^2)
[1] 7.98
> sum((w - mean(w))^2)/(length(w)-1)
[1] 2.66
> # or using the built-in var function in r
> var(w)
[1] 2.66
> 

Standard Deviation

Variance is useful for a variety of things that you may go into in your course. But you are probably thinking: if I just took the absolute value of my numbers, I would have a more exact representation of the distance each value is from the mean. You would be correct, but the accepted mathematical practice is to take the square root of the variance instead and call it the Standard Deviation.

> var(w)^0.5   # you should remember that square root is the same as taking to the 1/2 power
[1] 1.630951
> sqrt(var(w)) # in case you didn't believe me
[1] 1.630951
> sd(w)
[1] 1.630951
>

Create a Statistics Summary

Let's put this all together and create a statistics summary. Essentially this is a report or table of everything we have just covered.

ssummary <- function(myvec){
 cat("Sample size\t\t\t",length(myvec),"\n")
 cat("Minimum\t\t\t\t",min(myvec),"\n")
 cat("Maximum\t\t\t\t",max(myvec),"\n")
 cat("Arithmetic mean\t\t",mean(myvec),"\n")
 cat("Variance\t\t\t\t",var(myvec),"\n")
 cat("Standard deviation\t",sd(myvec),"\n")
 cat("First Quartile\t\t",Q1(myvec),"\n")
 cat("Median\t\t\t\t",median(myvec),"\n")
 cat("Third Quartile\t\t",Q3(myvec),"\n")
}

So let's use it in R:

> # using summary on our dice data variable D
> D
 [1] 4 5 1 4 3 6 5 4 6 4 6 1
> ssummary(D)
Sample size          12 
Minimum    1 
Maximum    6 
Arithmetic mean   4.083333 
Variance   2.992424 
Standard deviation  1.729862 
First Quartile   1 
Median    4 
Third Quartile   6 
> 

Now if you want something fancier, there are ways to export this stuff to LaTeX or HTML rather than just printing to the screen within the R console. That is for you to look up and figure out. By the way, R has a built-in 'summary' function:

> summary(D)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.750   4.000   4.083   5.250   6.000 
>

Conclusion

There is much more on statistics and its implementation in J in Dr. Smillie's paper, the main reference for this post.

Bibliography

  1. Smillie, K. (1999, January). J Companion for Statistical Calculations. Retrieved February 05, 2018, from https://webdocs.cs.ualberta.ca/~smillie

Author: NASTY OLD DOG
