R Basic Vector/Matrix Stuff (for the Statistically Inclined but Computer Programming Challenged)
Introduction
After some feedback on my previous R blog I have found that a 'Newbie' R/Statistics person needs a better foundation in the vector arithmetic and representation that underlies R. I thought the cursory look provided in my previous blog would suffice. I realize now that R provides multiple ways of accessing vectors and matrices (esp. matrices) that hide the "Vectorness" that is inherent in the language.
There are many things in R that older programmers have already had experience with. The original vector language developed by IBM was known as APL. Dr. Ken Iverson developed a specialized math syntax while at Harvard, and IBM hired him to implement that syntax as a computer programming language (original concepts detailed in reference [2]). This all happened in the 1960s. Those who learned Computer Science in the 60s and 70s would have had exposure to this language. It has continued on and there is even a free GNU version available today [6]. The problem for many people was the strange symbols that were the basis of the language. Since APL there have been many offshoots that have carried forward this idea of 'Vectors' being the built-in data structure of the language, but with a design change: the syntax uses the characters found on a standard keyboard. The language K is probably the most successful commercial implementation of this offshoot [3]. R is probably the most successful open source implementation of these concepts. My personal favorite is the J language, which the late Dr. Iverson developed as a redesign of his APL concepts. J has an active user forum, and its website has a great collection of articles on the history of APL and Dr. Iverson, along with many technical articles showing uses of J in many different areas (see reference [1]).
Because many professors and teachers lived through this history first hand, it can be difficult for them to explain these concepts. It is very easy to assume that something is a simple concept because you forget that you didn't learn it in R. You learned it in some other computer language, programming different types of things. Jumping into R was not that difficult, and you appreciate how R has transformed some of the menial tasks into simple function calls. The 'Newbie', on the other hand, is left with many WTF moments as things seem to happen by magic. The goal of this blog post is to show you how the basic vector concepts are in everything that you do. This will help you as you try to dissect your data stored in a table. R has many layers on that data that help facilitate creating charts and statistics, but in the end it is all just vectors and matrices (aka arrays and tables).
Vectors/Arrays
A vector has its roots in physics. The idea behind it is that many physical properties are described by a value and a direction. I may push something along at 25 miles per hour but that is only part of the story. I am also pushing it along in a certain direction. Once I come up with a way of telling direction I now must carry 2 values along to let you know exactly what I am doing. So the concept of a vector is a way of carrying around multiple values to describe a single concept. In math and in computers it's not hard to envision that we might want to carry around more than just 2 values. Why not 3? There are after all 3 dimensions. Why not 10? Why not 1000? Hence for our purposes a vector is a way of carrying around multiple pieces of information and referencing them by a single name and an index. Mathematics uses a subscript to identify a particular item in a vector:
\[ x = \{2,4,6,8\} \] \[ x_1 = 2 \] \[ x_4 = 8 \]
In R access to individual vector elements is accomplished as follows:
> x = c(2,4,6,8) #combine 2 4 6 8 into a vector and store it in x
> x
[1] 2 4 6 8
> # since subscripting is a pain in the neck R uses square brackets
> x[2]
[1] 4
> x[1]
[1] 2
> x[4]
[1] 8
>
Seems easy enough. In math rather than write out every element of a vector we can use an ellipsis to continue an established pattern. So for example to represent the numbers from 1 to 100 in a vector in Math we do the following:
\[ x = \{1,2,3,4,\ldots,99,100\} \] \[ x_3 = 3 \] \[ x_{98} = 98 \]
R rotates the ellipsis and uses the ':' (the colon) to implement similar functionality:
> x = c(1:100)
> x
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
 [22]  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42
 [43]  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
 [64]  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84
 [85]  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
> x[3]
[1] 3
> x[98]
[1] 98
>
Now here is where R can be deceiving. The colon operator is like the ellipsis but not exactly alike. The colon is only good for generating an increment by one pattern. So for example in math
\[ x = \{2,4,6,\ldots,20,22\} \]
You instinctively understand I mean to count by 2's up to 22. Trying this in R with the colon operator just increments by 1s from 6 to 20:
> x = c(2,4,6:20,22)
> x
 [1]  2  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 22
> # from 6 to 20 R counts by 1s it doesn't try to infer my pattern
Now that doesn't mean I have to enter every value if I want to count by 2's. But it does mean I have to be more explicit about the arithmetic I want R to do. Counting by 2's is just counting by 1's up to half the maximum value and multiplying the result by 2. So to accomplish the same thing in R:
> x = 2 * c(1:11)
> x
 [1]  2  4  6  8 10 12 14 16 18 20 22
>
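Base R also has a function, 'seq', that takes the step size directly. Here is a minimal sketch of the same count-by-2's vector (the variable names 'evens' and 'same' are just for illustration):

```r
# seq() takes a start value, an end value, and a step size
evens <- seq(2, 22, by = 2)
print(evens)

# It produces the same values as the multiply-by-2 approach
same <- all(evens == 2 * c(1:11))
print(same)
```

Either form works; the multiplication version shows the vector arithmetic at work, while 'seq' states the intent more directly.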
R does do one bit of inference with this operator:
> # one thing R will infer is that if you reverse the order and put the larger number first
> # R will count backwards for you
> x = c(11:1)
> x
 [1] 11 10  9  8  7  6  5  4  3  2  1
>
But if R didn't do this, it would be easy to reconstruct with some added R functionality: the reverse function 'rev'. This function gives the reverse order of a vector.
> # Create a reverse order without switching
> x = c(1:11)
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11
> rev(x)
 [1] 11 10  9  8  7  6  5  4  3  2  1
> # in one line
> x = rev(c(1:11))
> x
 [1] 11 10  9  8  7  6  5  4  3  2  1
>
I hope at this point you can extrapolate and realize that by investigating the functions available in R we can create our own vectors of data without having to resort to reading it in from a file. This comes in handy for putting together some simple testing data.
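As a rough sketch of what that can look like, here are a few base R functions combined to build small test vectors. The particular values and variable names are made up for illustration:

```r
# rep() repeats values; here the pattern 1 2 3 is repeated twice
pattern <- rep(c(1, 2, 3), times = 2)
print(pattern)   # 1 2 3 1 2 3

# seq() with length.out spreads values evenly between two endpoints
steps <- seq(0, 1, length.out = 5)
print(steps)     # 0.00 0.25 0.50 0.75 1.00

# Functions can be nested, just like rev(c(1:11)) above
countdown <- rev(seq(10, 50, by = 10))
print(countdown) # 50 40 30 20 10
```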
Matrix/Matrices
A Matrix wasn't originally a computer driven reality to enslave people to provide power to machines. It is just a mathematical concept for a table of values. It is an extension of the concept of a vector. While a vector has multiple values it is considered a one-dimensional object. This means I only need one index to obtain a value. If I took a set of vectors of the same length and piled them on top of each other I would create a table or Matrix. In mathematical notation you just put a table of numbers in parentheses:
\[ M = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 11 & 12 & 13 & 14 & 15 \\ 21 & 22 & 23 & 24 & 25 \end{pmatrix} \]
\[ M_{1,2} = 2 \] \[ M_{3,3} = 23 \]
Matrices can be created directly in R. But first a little segue to go from vectors to matrices. In R, start by creating 3 vectors of 5 elements each: Vector1 = {1,2,3,4,5}, Vector2 = {11,12,13,14,15} and Vector3 = {21,22,23,24,25}. To save typing call them V1, V2, and V3. Here is the R session to set that up.
> # 3 Vectors of length 5 (notice I use a little math to help create different values)
> V1 = c(1:5)
> V2 = 10+V1
> V3 = 20+V1
> V1
[1] 1 2 3 4 5
> V2
[1] 11 12 13 14 15
> V3
[1] 21 22 23 24 25
> # notice that R added a number to the whole vector V1
>
Even though I had to type each variable to display the data, notice the natural tabular form that appears when looking at the last 3 lines of numbers above. They look like 3 rows of a table. If I wanted the second element of the first row, the 4th element of the second row and the 1st element of the third row, I could access them all as follows (continuing with the vectors I have set up):
> V1[2]
[1] 2
> V2[4]
[1] 14
> V3[1]
[1] 21
>
I named the vectors with numbers purposefully. If I could form a table and R could extend its access to account for rows and columns (which it does), I could use one variable name and access any element by just giving the row and column number of that element. V1[2] would be M[1,2] in a table constructed of these vectors and stored in M. Similarly V2[4] -> M[2,4] and V3[1] -> M[3,1]. Not only do I save typing, but I can also write loops that go through every member of the matrix in almost any order I can imagine.
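The same whole-vector behavior applies when both operands are vectors. A minimal sketch using V1 and V2 from the session above (the names 'total' and 'doubled' are only for illustration):

```r
V1 <- c(1:5)
V2 <- 10 + V1

# Adding two vectors of the same length adds them element by element
total <- V1 + V2
print(total)    # 12 14 16 18 20

# A single number is applied to every element of the vector
doubled <- 2 * V1
print(doubled)  # 2 4 6 8 10
```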
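To make the looping idea concrete, here is a small sketch that visits every element of such a table with two nested loops. It builds M with 'rbind', which is just one of several ways to stack vectors as rows and is my assumption here, not the route taken in the session that follows:

```r
V1 <- c(1:5)
V2 <- 10 + V1
V3 <- 20 + V1

# rbind() stacks the vectors as rows (one of several ways to build the table)
M <- rbind(V1, V2, V3)

# Visit every element: outer loop over rows, inner loop over columns
total <- 0
for (i in seq_len(nrow(M))) {
  for (j in seq_len(ncol(M))) {
    total <- total + M[i, j]
  }
}
print(total)  # 195
```

Swapping the two loops would walk the table column by column instead of row by row.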
Experimenting with R and its matrix creation function I was able to use the vectors to create a table with each vector above as one row. I did have to use the matrix transpose function 't' (initially). Transpose will flip the matrix by swapping rows for columns (look up matrix transpose if you don't quite understand what it's doing from the session below). In the end I figured out the proper parameters for the matrix function to pile the vectors on top of each other (in row fashion) in one fell swoop.
> # Use matrix function to create a matrix from V1, V2, and V3
> M = matrix(c(V1,V2,V3),nrow=3,ncol=5)
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4   12   15   23
[2,]    2    5   13   21   24
[3,]    3   11   14   22   25
> # matrix fills columns first not rows what to do?
> t(M)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5   11
[3,]   12   13   14
[4,]   15   21   22
[5,]   23   24   25
> # Lets flip the dimensions around and see what happens
> M = matrix(c(V1,V2,V3),nrow=5,ncol=3)
> M
     [,1] [,2] [,3]
[1,]    1   11   21
[2,]    2   12   22
[3,]    3   13   23
[4,]    4   14   24
[5,]    5   15   25
> # since matrix fills columns first lets fill a vector per column by switching dimensions
> # like above. Now transpose should get us the form we were looking for which is a 
> # vector per row
> t(M)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]   11   12   13   14   15
[3,]   21   22   23   24   25
> # so lets put it all into one line to make a matrix of our three vectors with each
> # vector in its own row
> M = t(matrix(c(V1,V2,V3),nrow=5,ncol=3))
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]   11   12   13   14   15
[3,]   21   22   23   24   25
> # Now M[1,2] should match V1[2], M[2,4] = V2[4] and M[3,1] = V3[1]
> M[1,2]
[1] 2
> V1[2]
[1] 2
> M[2,4]
[1] 14
> V2[4]
[1] 14
> M[3,1]
[1] 21
> V3[1]
[1] 21
> # Had I dug a little deeper into the matrix function there is a flag to fill by row called 'byrow'
> M = matrix(c(V1,V2,V3),nrow=3,ncol=5,byrow=TRUE)
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]   11   12   13   14   15
[3,]   21   22   23   24   25
> # got the matrix in 1 step
The above session has an important nuance. I assumed that R would think the way I do: put vectors into rows. But as the session unfolded it was clear that R is column oriented by default. I was able to adjust once I saw the way R was doing things. This is important! As you begin to think in terms of vector and matrix operations you may find that the answer coming from R is not formatted properly or the data doesn't seem to have the right appearance. When you see weird things happening you must break down your operations and make sure you and R are on the same page (more so you, since R is not going to change). When in doubt, go to one operation per line and display the results of each operation (or a portion thereof if you have a considerable amount of data). Verify that each operation you are performing does what you expect. You would be surprised how one small typographical error can cause you hours of debugging and anxiety. Your mind will overlook the small error because it will fill in a missing operation as you are looking at it (or ignore it if there is an extra operation). By breaking it down you are verifying to yourself that each operation works as intended.
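Here is a minimal sketch of that one-operation-per-line habit applied to the matrix above. The checks use 'dim' and 'all', and the variable names are only for illustration:

```r
V1 <- c(1:5)
V2 <- 10 + V1
V3 <- 20 + V1

# Step 1: build the matrix and look at its shape before doing anything else
M <- matrix(c(V1, V2, V3), nrow = 3, ncol = 5, byrow = TRUE)
print(dim(M))   # 3 5  (rows, then columns)

# Step 2: confirm that a row really holds the vector I think it does
row_ok <- all(M[2, ] == V2)
print(row_ok)   # TRUE

# Step 3: the same check fails without byrow, which exposes the column fill
M_cols <- matrix(c(V1, V2, V3), nrow = 3, ncol = 5)
col_fill_ok <- all(M_cols[2, ] == V2)
print(col_fill_ok)  # FALSE
```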
Row and Column names
I use the term 'table' rather loosely above. Don't confuse this with any add-on packages that have tables. I mean it in the simplest sense as a way of describing 2 dimensional data. R has another table type structure called a 'data frame'. So what's the difference between a matrix (which I have shown as a 'table' of numbers) and an R data frame? In an R data frame you can have a mix of data types between columns. Each individual column needs to have data of the same type, but the next column can have a completely different datatype (as long as it's consistent within that column). So in a matrix all the data must be of the same type across all rows and columns, while in a data frame data types can be mixed on a column by column basis.
Now you access data in a 'data frame' by indexing the same way as you do with a matrix. The trick is not to do any operation on that data that is inconsistent with the datatype of the column. In a matrix, since all the data is the same type, I can add together any 2 selected elements (if the data is of numeric type).
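A short sketch of that difference: when numbers and strings are mixed in a matrix, R converts everything to one common type, while a data frame keeps each column's own type. The column names 'num' and 'chr' are made up for this example, and stringsAsFactors is set explicitly so the result does not depend on the R version:

```r
# Mixing numbers and strings in a matrix turns every element into a string
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
print(class(mixed[1, 1]))  # "character"

# A data frame keeps each column's own type
df <- data.frame(num = c(1, 2), chr = c("a", "b"), stringsAsFactors = FALSE)
col_types <- sapply(df, class)
print(col_types)
```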
> # Create a vector of 25 elements from 1 to 25
> v <- 1:25
> v
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> # Use vector v to create a matrix that is 5x5 of those elements
> m <- matrix (v,5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
> # Add m[2,3] and m[3,2] together
> m[2,3]
[1] 12
> m[3,2]
[1] 8
> m[2,3]+m[3,2]
[1] 20
>
Nothing surprising. I make a matrix of integer values and I can add them together any way I please.
What about naming columns and rows? It turns out there are multiple ways of naming columns and rows, depending on whether the underlying data structure is a matrix or a 'data frame'. The 'colnames' and 'rownames' calls in the following session work the same across both structures. A 'data frame' also has a built-in $ operator that is used to access a whole column of data by name. I include its use in the session below:
> # Give names to the columns and rows
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
> colnames(m) <- c("C1","C2","C3","C4","c5")
> m
     C1 C2 C3 C4 c5
[1,]  1  6 11 16 21
[2,]  2  7 12 17 22
[3,]  3  8 13 18 23
[4,]  4  9 14 19 24
[5,]  5 10 15 20 25
> # Now the rows
> rownames(m) <- c("r1","R2","r3","R4","r5")
> m
   C1 C2 C3 C4 c5
r1  1  6 11 16 21
R2  2  7 12 17 22
r3  3  8 13 18 23
R4  4  9 14 19 24
r5  5 10 15 20 25
> # We can still access with number indexes as before
> m[2,3]
[1] 12
> # But now we can use names as indexes instead
> m ["R2","C3"]
[1] 12
> # Is this where we can start using the $ in the variable name?
> m$C2
Error in m$C2 : $ operator is invalid for atomic vectors
> # No we can't use that type of access for a matrix
> # Turn m into a dataframe d and see what we can do
> d <- as.data.frame(m)
> d
   C1 C2 C3 C4 c5
r1  1  6 11 16 21
R2  2  7 12 17 22
r3  3  8 13 18 23
R4  4  9 14 19 24
r5  5 10 15 20 25
> # It doesn't look that much different but here are the different ways
> # to access data.
> d[2,3]
[1] 12
> d["R2","C3"]
[1] 12
> d["R2",]$C3
[1] 12
> d$C3
[1] 11 12 13 14 15
> d[2,]
   C1 C2 C3 C4 c5
R2  2  7 12 17 22
> d["R2",]
   C1 C2 C3 C4 c5
R2  2  7 12 17 22
> 
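Although the $ operator does not work on a matrix, a whole named column or row can still be pulled out of one by leaving the other index empty. A minimal sketch using the same matrix as above ('col2' and 'row2' are illustrative names):

```r
m <- matrix(1:25, 5)
colnames(m) <- c("C1", "C2", "C3", "C4", "c5")
rownames(m) <- c("r1", "R2", "r3", "R4", "r5")

# Leave the row index empty to get an entire column by name
col2 <- m[, "C2"]
print(col2)

# Leave the column index empty to get an entire row by name
row2 <- m["R2", ]
print(row2)
```

The results are vectors that carry the row or column names along with the values.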
Data Frames
The data frame's strength comes from being able to handle tabular data of different data types. The following session creates a data frame with a mix of data types and shows how you have to be careful about which operations you choose to do. By supplying column names in the creation of the 'data frame' there is no need to perform a separate operation to insert them into the 'data frame'.
> d2 <- data.frame(C1=c(1:5),C2=c("a","b","c","d","e"),C3=c("john","joesph","james","jane","janet"))
> d2
  C1 C2     C3
1  1  a   john
2  2  b joesph
3  3  c  james
4  4  d   jane
5  5  e  janet
> d2[1,1]+d2[3,1]
[1] 4
> d2[1,1]+d2[1,2]
[1] NA
Warning message:
In Ops.factor(d2[1, 1], d2[1, 2]) : ‘+’ not meaningful for factors
> # We can do some comparisons on the character data
> "a" == d2[2,2]
[1] FALSE
> "a" == d2[1,2]
[1] TRUE
> "james" == d2[3,2]
[1] FALSE
> "james" == d2[3,3]
[1] TRUE
> d2[1,]
  C1 C2   C3
1  1  a john
> d2$C2
[1] a b c d e
Levels: a b c d e
> 
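The warning about factors and the 'Levels' line in the session above reflect the R version it was recorded on. Since R 4.0.0, 'data.frame' no longer converts strings to factors by default, so on a current installation the same addition stops with an error instead of returning NA. Here is a small sketch that sets 'stringsAsFactors' explicitly to show both behaviors (the message text inside 'tryCatch' is my own wording, not R's):

```r
# stringsAsFactors = FALSE keeps the text column as plain character strings
d2 <- data.frame(C1 = c(1:5),
                 C2 = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)
print(class(d2$C2))  # "character"

# Adding a number to a string is an error; tryCatch lets the script continue
result <- tryCatch(d2[1, 1] + d2[1, 2],
                   error = function(e) "cannot add a number and a string")
print(result)

# stringsAsFactors = TRUE turns the column into a factor, as in the session above
d2f <- data.frame(C2 = c("a", "b", "c", "d", "e"), stringsAsFactors = TRUE)
print(class(d2f$C2))  # "factor"
```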
The other strength of a 'data frame' is that it can be used seamlessly with functions that read in comma separated values. This allows you to pull in data sets from databases or websites and operate on them easily. Since comma separated value files usually include a first line of column names, the 'data frame' will already have column names inside after a read operation.
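As a rough illustration, 'read.csv' returns a data frame with the column names taken from the first line. To keep the sketch self-contained it reads from a string through the 'text' argument instead of a file, and the data values are made up:

```r
# A small comma separated data set; the first line holds the column names.
# (The values are made up for illustration.)
csv_text <- "name,age,score
john,31,88.5
jane,27,92.0
james,45,79.5"

# read.csv() returns a data frame; text= reads from a string instead of a file
people <- read.csv(text = csv_text)

print(names(people))     # "name" "age" "score"
print(mean(people$age))  # 34.33333
```

With a real file you would pass the file name as the first argument instead of using 'text'.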
Conclusion
These topics are covered in more depth in the PDF text "An Introduction to R" [7]. Hopefully this blog has provided some insight into the workings of R and vector languages in general. The purpose here was to give just enough vector stuff to get you through debugging a statistics assignment when things go wrong. Usually the data is structured in a manner that's different from how your mind is perceiving it. This causes you to make improper function calls. I can't say this enough: when in doubt, break things down! Try functions on smaller pieces of data and make sure you get the answer you expect. Once things are operating the way you expect, you can extrapolate up to larger datasets.
References
- [1] http://www.jsoftware.com/ A great vector-based language with an excellent forum to search on various subjects. There is an R interface to the J language, so you can work in J and use R when you need something statistical that J doesn't have. Search the website for Ken Iverson; they have some excellent essays on the beginnings of APL and vector languages.
- [2] Iverson, Kenneth E. "A Programming Language." J Software Inc., 13 Oct. 2009, www.jsoftware.com/papers/APL.htm.
- [3] https://kx.com/ The company that produces the K language and Kdb (a database based on the K language).
- [4] http://www.r-tutor.com/ Offers nice tutorials on various aspects of R. It also has some nice deep-learning info. Always seems to come up first when googling an R language reference.
- [5] https://stackoverflow.com/questions/2281353/row-names-column-names-in-r Discussion on matrix and data frame row and column names.
- [6] https://www.gnu.org/software/apl/ GNU's APL implementation.
- [7] https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf A good general (not so statistical) introduction to the language that covers many of these details in greater depth. It's a PDF, so you should download a copy.
 
 