Friday, September 14, 2018

Python Environment (Shell and IDEs)

The Python Environment

Python is an interpreter that provides its own environment. At a Unix/Linux/macOS shell prompt, if you run the 'python' command with a file name, Python loads the file, interprets it (producing output if it has been programmed to) and then returns to the shell. This is how Python was used in the introductory blog post "Introduction to python: first steps".

If you don't supply any argument (file name or otherwise) and just run the 'python' command, you are taken into the Python shell environment. You can still load files into the Python shell and execute them from what looks like any other Linux shell command prompt. To test this out, change the hello world script so that it defines a function instead of executing immediately. Call that script hello1.py:

#!/Users/tmcguire/anaconda3/bin/python
def hello(text):   # 'text' avoids shadowing Python's built-in str type
  print(text)

Now go into Python and import the hello1 script. If you started in the right directory when you called the python command you will not need to supply a pathname. The Python shell uses the directory it's started in as the current working directory. Notice that when you go into the shell environment Python responds with 3 greater-than signs: '>>>'. This is Python's default prompt, and Python uses it to indicate it's your turn to enter a command. Now load the script using the 'import' command and try to execute the function hello:

$ python
Python 3.6.0 |Anaconda 4.3.1 (x86_64)| (default, Dec 23 2016, 13:19:00) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hello1
>>> hello('Hello World')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'hello' is not defined
>>> hello1.hello('Hello World')
Hello World
>>> 

Notice the error when you try to simply execute the 'hello' function. This is because of how Python creates and handles what are called namespaces. A plain 'import' puts the functions from a file into a namespace named after that file (or library), so each function must be prefixed with the module name. This convention means that as modules are added, the names of individual functions can be accessed unambiguously. So in the above example Python couldn't find a function called "hello" but it could find the function "hello1.hello".
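The same rule can be tried with a library that ships with Python. The following is only a minimal sketch using the standard math module: a plain import requires the module prefix, while the 'from ... import' form copies a single name into the current namespace so the bare name works.

```python
# Two ways to reach the same function in the standard math library.
import math                  # binds only the name 'math' in this namespace
print(math.sqrt(25))         # qualified access through the module name

from math import sqrt        # copies the name 'sqrt' into this namespace
print(sqrt(25))              # now the bare name works too

# Both names refer to the very same function object.
print(sqrt is math.sqrt)
```

The 'from' form saves typing, but the prefixed form keeps it obvious where each function came from.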

Using the Python Shell Environment

I can refactor the scripts from the previous blog post to take advantage of the Python environment. This gets rid of the while loop and the interactive input statement, because the function can just be called by name with the number that used to be input as an argument, and Python will take care of the rest. An example script that used to use a while loop was the quick and easy square root function, now saved as sqrt1.py:

#!/Users/tmcguire/anaconda3/bin/python
import math

def squareroot(x):
  if x >= 0:
    return math.sqrt(x)
  else:
    print(x,' does not have a real square root')
    return 0

$ python
Python 3.6.0 |Anaconda 4.3.1 (x86_64)| (default, Dec 23 2016, 13:19:00) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqrt1
>>> sqrt1.squareroot(25)
5.0
>>> sqrt1.squareroot(2)
1.4142135623730951
>>> math.sqrt(25)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'math' is not defined
>>> import math
>>> math.sqrt(25)
5.0
>>> sqrt1.squareroot(2)
1.4142135623730951
>>> 

So with the function in place, just calling it at the command prompt with different arguments accomplishes what the while loop was doing. The import statement in the file sqrt1.py, while valid for the code in that file, does not apply to the shell's own namespace. So trying to use the math library function "sqrt" directly fails because the Python environment can't find the name 'math'. It appears to be hidden inside the sqrt1 module. But use import on the Python command line to bring in the math library and the 'sqrt' function becomes available directly as math.sqrt.

Is sqrt really hidden, or did Python just prepend the name sqrt1 to it? Namespaces are weird entities and different language interpreters have all sorts of rules behind their use. Not being aware of the complete functionality of namespaces in Python, I wondered: did Python just prepend sqrt1 to the math library imported in the file? That would mean it is really available for use, only under a slightly longer name. So rather than pore over manuals, let's just test the theory directly in the environment.

$ python
Python 3.6.0 |Anaconda 4.3.1 (x86_64)| (default, Dec 23 2016, 13:19:00) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqrt1
>>> sqrt1.math.sqrt(25)
5.0
>>> math.sqrt(25)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'math' is not defined
>>> 

In the above session, Python has been restarted to clear all the imports. When sqrt1.math.sqrt() is used, Python finds the function. However, maybe it was somehow left over from the previous session, so math.sqrt was run as well to show that the environment was indeed cleared of the earlier import of the math library.
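The same experiment can be repeated without creating a file. The sketch below is an illustration only: it builds a stand-in for sqrt1 in memory with types.ModuleType and exec (an assumption made so the example is self-contained; a real 'import sqrt1' of the file behaves the same way for this purpose) and then lists the names that live inside the module.

```python
import types

# Source text standing in for the file sqrt1.py
source = '''
import math

def squareroot(x):
    return math.sqrt(x)
'''

# Build a module object by hand and run the source inside its namespace
sqrt1 = types.ModuleType("sqrt1")
exec(source, sqrt1.__dict__)

# The names the module defines, ignoring the double-underscore bookkeeping names
public_names = sorted(name for name in vars(sqrt1) if not name.startswith("__"))
print(public_names)            # ['math', 'squareroot']

# 'math' is an ordinary attribute of the module, reachable by a dotted name
print(sqrt1.math.sqrt(25))     # 5.0
```

So the imported math library is simply one more name stored in the sqrt1 namespace, next to squareroot.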

Important Programming Point!

You should read the manuals available on python. But here is an important point about programming. Just because you read the manual doesn't mean you fully understand how python (or any interpreter or computer language) is going to respond. Test it out in a real python environment to make sure your understanding of the 'theory' of how python works is actually the way python works in real life.

Important Debugging Point!

When you come across what look like discrepancies, there is either a bug in the interpreter, an error in your program, an error in the manual, or an error in your judgement.

Important Corollary to the Important Debugging Point!

When in doubt about where the problem lies, start with your judgement and work backwards from there.

More Scripts to Fix and Run

# Program: divisors.py
# This loops through values 2 to number, and finds all divisors of the number
# No computer magic just loop through all values and test them with the
# modulo operator (%) which returns 0 if the number is a divisor
def divisors(x):
  print('The divisors of ',x,' are: ')
  for divisor in range(2,x):
    if (x % divisor)==0:
      print(divisor)

$ python
Python 3.6.0 |Anaconda 4.3.1 (x86_64)| (default, Dec 23 2016, 13:19:00) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import divisors
>>> divisors.divisors(12)
The divisors of  12  are: 
2
3
4
6
>>> divisors.divisors(18)
The divisors of  18  are: 
2
3
6
9
>>> 

The quadratic solver, saved as quadratic1.py, gets the same treatment:

# solve quadratic equation ax^2 + bx + c = 0
# a, b, c are passed in as arguments
#
# Program: quadratic1.py
#
import math

# Function definitions:
def quadratic(a,b,c):
  if (a==0) and (b==0):
    print('The equation is degenerate')
  else:
    if a == 0:
      print('the only root is ', -c/b)
    else:
      if c == 0:
        print('the roots are ', (-b/a),' and 0')
      else:
        re = -b / (2 * a)
        discriminant =  (b*b) - 4 * a * c
        im = math.sqrt(abs(discriminant))/(2*a)
        if discriminant >= 0:
          print('the roots are ', re+im, ' and ', re - im)
        else:
          print('the roots are complex: ', re,' + j',im,' and ', re, '- j',im)

$ python
Python 3.6.0 |Anaconda 4.3.1 (x86_64)| (default, Dec 23 2016, 13:19:00) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import quadratic1
>>> quadratic1.quadratic(0,0,7)
The equation is degenerate
>>> quadratic1.quadratic(0,10,2)
the only root is  -0.2
>>> quadratic1.quadratic(2,3,0)
the roots are  -1.5  and 0
>>> quadratic1.quadratic(1,5,6)
the roots are  -2.0  and  -3.0
>>> quadratic1.quadratic(1,1,1)
the roots are complex:  -0.5  + j 0.8660254037844386  and  -0.5 - j 0.8660254037844386
>>> 

That shows how the rest of the previous programs can be rewritten to run in the Python shell environment.

Python Distributions and IDE

There are different Python distributions and there are many IDEs (Integrated Development Environments). Distributions usually differ in how many libraries they prepackage with Python. Rather than try to catalog them I will just briefly discuss the one I use.

Anaconda3 python

I happen to have the Anaconda distribution of Python because of its integration with Jupyter Notebook (a web tool that creates notebooks of text and runnable Python code). I came across the Anaconda3 distribution because Martin Saurer made a Jupyter Notebook implementation for the J programming language. Jupyter Notebook is not quite an IDE but it is a good prototyping platform where you can mix text, code and pictures to produce a runnable notebook. It's easy to use and not too difficult to set up, but it does need to run a local server application, so it requires some minimal knowledge of how to set up network programs.

IDEs

There are python plugins that are available for the 2 major Java IDEs: Netbeans and Eclipse. This would be for the Java developer that now needs to use python for access to a popular python library.

If you are using an Apple Mac as your programming environment then Apple's Xcode IDE can be set up to edit Python programs. A Google search seems to imply that, like the other IDEs, there is a process you have to go through to get Xcode to work the way it should with Python programs. I don't do a lot of Xcode development so I didn't bother to set it up. In the references I put a 2016 set of directions for getting Xcode to use Python 3.5. If you're interested you can probably work through the same steps to get the latest Python version in place.

Anaconda3 provides 2 IDEs in its distribution. One is Microsoft's Visual Studio and the other is Spyder. I haven't used Microsoft VS in a while so I didn't bother with this IDE. The more compelling IDE is Spyder. It's written in Python itself, so it follows a design philosophy similar to the one the Java IDEs Eclipse and Netbeans have for Java. Spyder is just 'there' in the Anaconda3 distribution. There are some minor installation issues that will need to be addressed, like upgrading Spyder to its latest release. But the whole process amounts to:

  • Download the latest distribution of anaconda
  • Run anaconda-navigator and upgrade any highlighted packages (the version number will have an arrow in front of it and the arrow and version number will be colored blue)
  • Type spyder at the command line to bring up the IDE

Spyder does the typical IDE sort of things: grouping programs into a project, debugging facilities, and of course a code editor that highlights the features of Python code for easy reading. The nice thing about Spyder as part of the Anaconda distribution is that it gives you easy integration with the data science libraries, such as numpy, the Python mathematical extension library.

For now let's run one of the previous sample programs in Spyder:

  • Bring up a terminal window with a bash shell (or your favorite shell environment)
  • Type 'spyder' at the command line

If you have installed Anaconda3 and updated Spyder your screen should look similar to:

[Screenshot: Spyder main window]

Now use the File menu and open the program that calculates the square root.

Click the green arrow button in the row of controls at the top to run the program. If all goes well you should see the correct output in the lower right-hand corner of the IDE.

Conclusion

There are many ways to develop in Python. If you prefer a command line environment then the vi editor or emacs are fine for editing code. Emacs has a language module so you get the nice highlighting of an IDE. There are the legacy IDEs Netbeans, Eclipse, Visual Studio and Xcode you can use. These may be preferred if your development spans multiple languages. But if you're just looking for something to do Python in that is easy to install and use, Spyder is a good choice. By the way, I don't have a screen shot to show you, but the console used for output is a full IPython shell, which means you can type Python statements directly at the 'In' prompt and IPython will execute them. So you can try things in real time and then move them easily back to your code in the IDE.

References

  1. python anaconda distribution: https://www.anaconda.com/
  2. spyder ide: https://www.spyder-ide.org/
  3. Jupyter Notebook: https://jupyter.org/
  4. numpy package for scientific computing: http://www.numpy.org/
  5. XCode integration by Erica Sadun: https://ericasadun.com/2016/12/04/running-python-in-xcode-step-by-step/
  6. Martin Saurer J Jupyter notebook page at J Software site: https://code.jsoftware.com/wiki/Guides/Jupyter

Author: Nasty Old Dog

Tuesday, August 21, 2018

Python Intro: First Steps

Keywords

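Every programming language has a set of special predefined words that are reserved for statements in the language. For Python the list below applies (it follows the Python 2 keyword list plus False, None and True; in Python 3, print and exec became ordinary built-in functions and nonlocal was added as a keyword):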

and del from not while
as elif global or with
assert else if pass yield
break except import print False
class exec in raise None
continue finally is return True
def for lambda try  

This and following posts will deal with an important subset of these keywords. I will cover enough so the reader will be able to learn the rest on their own after they are comfortable writing simple programs.
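The exact list depends on the Python version, so it is worth asking the interpreter itself. A minimal sketch, using the standard library's keyword module, prints the reserved words of whichever Python is running it and checks a few of the words from the table above:

```python
import keyword

# kwlist holds the reserved words of the interpreter running this code
print(len(keyword.kwlist))
print(keyword.kwlist)

# On Python 3, print and exec are built-in functions rather than keywords
for word in ("print", "exec", "nonlocal", "def"):
    print(word, keyword.iskeyword(word))
```

The count and the list itself vary between releases, which is why no fixed output is shown here.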

Hello World (first Keyword: print)

Python has a large following and, more importantly, seems to be a language of choice for Artificial Intelligence and Big Data applications. Python is a scripting language and interpreter. I like to start by using things in a Unix shell environment.

Save the following to hello.py:

#!/path-to-python/bin/python

print('Hello World!')

When the program is run using the python command:

$ python hello.py
Hello World!
$

Not very much, but it's a start. Python supports a functional style of programming (among others). This means that most of the work is done in functional blocks. There is a whole lot of theory behind why functional programming is an important programming paradigm, but that is beyond the scope of this introduction. It will "flavor" how programs are structured. But you shouldn't need very much experience with the theory of functional programming to become a useful Python programmer.

Function syntax

A more useful program would be one to find the square root of a number. So let's create our own function to do just that. I have transcribed part of the Python grammar into a pseudogrammar to help show how functions are structured in Python.

function_definition = 'def' <function_name> '('parameter_list')' ':' single_stm |
                      'def' <function_name> '('parameter_list')' ':' 
                          indented_statements

The first thing a function definition must have is the word 'def'. This is followed by a function name, a parenthesized list of the parameters we expect to be passed in, and a colon. Finally it is completed by a single statement on the same line or by any number of indented statements. To simplify this, let's use the math library function call as the guts of our function definition (later in this article I provide a way to actually calculate the square root). I use the comment symbol # to provide in-program explanations of new Python commands and functions.

#!/path-to-python/bin/python

# For now cheat and use the python math library to make the function work
# Python uses an import statement to bring in libraries of functions
import math

# define a new function called squareroot taking 1 parameter x
def squareroot(x):
  return math.sqrt(x)

# In the above function definition the return statement belongs to the function
# due to its indentation level. If the return statement were even with the def
# it would no longer be part of the function and Python would report an error.

# the input statement displays a prompt and accepts user text.
# the text needs to be converted to a number, python provides a float() function
# and an int() function to convert text to numbers
X = float(input('Enter value: '))
print(squareroot(X))

$ python sqrt.py
Enter value: 25
5.0
$

Keywords covered so far

The keywords covered in the preceding program are print, import, def and return:

and del from not while
as elif global or with
assert else if pass yield
break except import print False
class exec in raise None
continue finally is return True
def for lambda try  

If - Else statement

Now there is a problem with this simple program. If a negative number is entered the program fails with error messages:

$ python sqrt.py
Enter value: -25
Traceback (most recent call last):
  File "sqrt.py", line 8, in <module>
    print(squareroot(X)) 
  File "sqrt.py", line 5, in squareroot
    return math.sqrt(x)
ValueError: math domain error

The math.sqrt function doesn't calculate square roots of negative numbers; the result is undefined for it because the root of a negative number is an imaginary number. You can catch this problem in the program using an if-statement.

if statement syntax

The grammar text below tries to show how the actual program statement might be structured. It is verbose on purpose.

if_stm = ('if' test_expression ':' single_stm |           # | line symbol means or (one or the other but not both)
         'if' test_expression ':' 
             indented_statements)                         # parentheses are used to group a section of the grammar to be done first
         ['else' ':' (single_stm | indented_statements)]  # square brackets mean optional: 0 or 1 of the statements in the brackets

Let's shorten the grammar syntax to a version that includes the else clause. Understand that in the if portion I am trying to give you a sense of how the statements will look structurally with indentation and newlines. Once you compare the syntax and the actual statements I think you will see that the following is an equivalent syntax definition that saves on space:

if_stm = 'if' test_expression ':' (single_stm | indented_statements) # if statement start
         ['else' ':' (single_stm | indented_statements)]             # optional else clause

In the above syntax statements you must keep in mind that the indented statements, if used, follow underneath the 'if' keyword and test_expression part of the statement. Again, square brackets around a syntax specification mean that it is optional. You may place a clause there or you may leave it out. This is 0 or 1 statements.

import math

def squareroot(x):
  if x >= 0:
    return math.sqrt(x)
  else:
    print(x,' does not have a real square root')
    return 0

X = float(input('Enter value: '))
print(squareroot(X))

$ python sqrt.py
Enter value: -25
-25.0  does not have a real square root
0
$
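The program above uses the indented_statements alternative of the grammar. To illustrate the single_stm alternative, here is a minimal sketch of the same test written both ways. The name squareroot_indented is made up for the comparison, and the warning message is left out to keep the two versions short.

```python
import math

def squareroot(x):
    # single_stm form: the statement sits on the same line as the colon
    if x >= 0: return math.sqrt(x)
    else: return 0

def squareroot_indented(x):
    # indented_statements form: the body goes on the following lines
    if x >= 0:
        return math.sqrt(x)
    else:
        return 0

print(squareroot(25), squareroot_indented(25))      # 5.0 5.0
print(squareroot(-25), squareroot_indented(-25))    # 0 0
```

Both forms behave the same; the indented form is usually easier to read once a branch needs more than one statement.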

Compound If statement

Some of the most difficult to read programs are ones with complex compound if - then - else statements. In python this is an if - else statement, the then is implied. In this case we will solve the quadratic equation

\[ax^2 + bx + c = 0\] which you should remember is solved by the following formula: \[x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\]

# solve quadratic equation ax^2 + bx + c = 0
# accept a,b,c from the user
#
# Program: quadratic.py
#
import math

# Function definitions:
def quadratic(a,b,c):
  if (a==0) and (b==0):
    print('The equation is degenerate')
  else:
    if a == 0:
      print('the only root is ', -c/b)
    else:
      if c == 0:
        print('the roots are ', (-b/a),' and 0')
      else:
        re = -b / (2 * a)
        discriminant =  (b*b) - 4 * a * c
        im = math.sqrt(abs(discriminant))/(2*a)
        if discriminant >= 0:
          print('the roots are ', re+im, ' and ', re - im)
        else:
          print('the roots are complex: ', re,' + j',im,' and ', re, '- j',im)

# Begin Main Program:
a = float(input('a: '))
b = float(input('b: '))
c = float(input('c: '))
quadratic(a,b,c)

$ python quadratic.py
a: 0
b: 0
c: 7
The equation is degenerate
$ python quadratic.py
a: 0
b: 10
c: 2
the only root is  -0.2
$ python quadratic.py
a: 2
b: 3
c: 0
the roots are  -1.5  and 0
$ python quadratic.py
a: 1
b: 5
c: 6
the roots are  -2.0  and  -3.0
$ python quadratic.py
a: 1  
b: 1
c: 1
the roots are complex:  -0.5  + j 0.8660254037844386  and  -0.5 - j 0.8660254037844386
$

elif statement

A companion keyword "elif" forms an else-if clause and helps flatten a compound if statement.

grammar syntax

if_stm = 'if' test_expression ':' (single_stm | indented_statements)        # must have this
         {'elif' test_expression ':' (single_stm | indented_statements)}    # curly braces mean 0 or more
         ['else' ':' (single_stm | indented_statements)]                    # square brackets optional part of statement

Here is the quadratic program now rewritten with the "elif" keyword (notice how it helps keep the indentation under control):

# solve quadratic equation ax^2 + bx + c = 0
# accept a,b,c from the user
#
# Program: quadratic.py
#
import math

# Function definitions:
def quadratic(a,b,c):
  if (a==0) and (b==0):
      print('The equation is degenerate')
  elif a == 0:
      print('the only root is ', -c/b)
  elif c == 0:
      print('the roots are ', (-b/a),' and 0')
  else:
      re = -b / (2 * a)
      discriminant =  (b*b) - 4 * a * c
      im = math.sqrt(abs(discriminant))/(2*a)
      if discriminant >= 0:
          print('the roots are ', re+im, ' and ', re - im)
      else:
          print('the roots are complex: ', re,' + j',im,' and ', re, '- j',im)

# Begin Main Program:
a = float(input('a: '))
b = float(input('b: '))
c = float(input('c: '))
quadratic(a,b,c)
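To check that flattening with elif does not change which branch is taken, the two shapes can be compared side by side. This is only a sketch: classify_nested and classify_flat are hypothetical helpers that return a label for the branch instead of printing roots, and the test cases are the coefficient sets used in the sessions above.

```python
def classify_nested(a, b, c):
    # nested if-else: each new test pushes the code one level deeper
    if a == 0 and b == 0:
        return 'degenerate'
    else:
        if a == 0:
            return 'linear'
        else:
            if c == 0:
                return 'zero root'
            else:
                return 'general'

def classify_flat(a, b, c):
    # elif chain: same tests in the same order, one indentation level
    if a == 0 and b == 0:
        return 'degenerate'
    elif a == 0:
        return 'linear'
    elif c == 0:
        return 'zero root'
    else:
        return 'general'

cases = [(0, 0, 7), (0, 10, 2), (2, 3, 0), (1, 5, 6), (1, 1, 1)]
for case in cases:
    print(case, classify_nested(*case), classify_flat(*case))
```

Each line of output shows the same label twice, since the tests are evaluated in the same order in both versions.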

Keywords covered so far

Here is the score card of keywords. The new ones are if, else, elif and and; the previously covered keywords are print, import, def and return:

and del from not while
as elif global or with
assert else if pass yield
break except import print False
class exec in raise None
continue finally is return True
def for lambda try  

Loops: While loop, For loop

Now we will introduce the usage of for and while loops

for grammar syntax

for_stm = 'for' expression 'in' list ':' (single_stm | indented_statements) # must have this
         ['else' ':' (single_stm | indented_statements)]                    # square brackets optional usage not shown below

while grammar syntax

while_stm = 'while' test_expression ':' (single_stm | indented_statements)  # must have this
         ['else' ':' (single_stm | indented_statements)]                    # square brackets optional usage not shown below
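The optional else clause in both grammars is not used by the programs in this post. As a minimal sketch of what it does (the function names and the list of candidates are made up for illustration), the else block of a loop runs only when the loop finishes without executing a break:

```python
def first_small_factor(x):
    # hypothetical helper: first of 2, 3, 5, 7 that divides x, or None
    for candidate in [2, 3, 5, 7]:
        if x % candidate == 0:
            found = candidate
            break
    else:
        # reached only if the loop ended without a break
        found = None
    return found

def count_halvings(n):
    # hypothetical helper: how many times n can be halved before reaching 0
    steps = 0
    while n > 0:
        n = n // 2
        steps += 1
    else:
        # there is no break above, so this runs once the test fails
        print('loop finished after', steps, 'steps')
    return steps

print(first_small_factor(15))   # 3
print(first_small_factor(13))   # None
print(count_halvings(8))        # prints the 'loop finished' line, then 4
```

The clause is mostly useful for search loops, where "the loop ran out of items" needs different handling from "the item was found".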

In the example below the for loop in the function divisors uses a Python function called 'range'. In Python 2, range returned a list of all integers beginning with the first argument and ending with the number just before the last argument. In Python 3 range is an object that has been especially designed to work in a 'for' loop. Objects of this kind are called iterables: they give the 'for' loop the next item in the sequence of the underlying object each time around. For now it's best to think of 'range' as returning a list. To see the underlying list that 'range' represents here is an example:

>>> list(range(2,10))
[2, 3, 4, 5, 6, 7, 8, 9]

Python makes use of lists as a built-in way to store data. Even though it's convenient to think of 'range' as a simple list, 'range' doesn't store all the values in memory. For example, what if we wanted to loop 10,000,000 times? Why waste the memory on a list of 10,000,000 values when 'range' can calculate the next value when needed? You wouldn't, so range is set up to mimic a list in 'for' loops without pre-storing the entire list in memory.
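The memory point can be checked directly. The sketch below assumes the standard CPython interpreter, where sys.getsizeof reports the size of the object itself; the variable names are made up for the example.

```python
import sys

big = range(10000000)              # no list of ten million values is built
small_list = list(range(1000))     # a real list holding 1000 values

print(len(big))                    # the length is computed, not counted
print(big[5], big[-1])             # indexing is computed as well
# the huge range object takes less memory than even the small real list
print(sys.getsizeof(big) < sys.getsizeof(small_list))
```

So a range of ten million numbers costs less memory than a list of one thousand, because the range only remembers its start, stop and step.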

# Program: divisor.py
# This loops through values 2, up to but not including the number, 
# and finds all divisors of the number
# No computer magic just loop through all values and test them with the 
# modulo operator (%) which returns 0 if the number is a divisor
def divisors(x):
  for divisor in range(2,x):  # this would work with list(range(2,x)) as well
    if (x % divisor)==0:
      print(divisor)

# Begin python program:
while True:          # begin repeat-until loop using while loop statement
  x = int(input('Input Integer: '))
  if x > 0:
    print('The divisors of ',x,' are: ')
    divisors(x)

  # break out of loop if the input is 0 or less
  if x <= 0:    # i.e. until x <= 0 
    break
  # otherwise return to the top of the while loop

Here is a quick session with the new program

$ python divisor.py
Input Integer: 12
The divisors of  12  are: 
2
3
4
6
Input Integer: 18
The divisors of  18  are: 
2
3
6
9
Input Integer: -1
$

Keywords covered so far

The new keywords in this section are for, in, while, break and True:

and del from not while
as elif global or with
assert else if pass yield
break except import print False
class exec in raise None
continue finally is return True
def for lambda try  

Conclusion

I selected a few short program examples from Peter Grogono's Programming in Pascal book and transcribed them into python. As you can see a significant portion of the keywords of python have now been covered. In the next installment I quickly show how to use the Python shell environment. It takes some small rewrites and all of the programs so far will run in the python shell command line.

References

  1. Grogono, Peter. Programming in Pascal. Addison-Wesley, 1984.
  2. “Welcome to Python.org.” The Python Tutorial, Python Software Foundation, 19 Aug. 2018, www.python.org/.

Author: Nasty Old Dog

Wednesday, February 21, 2018

R Basic Vector/Matrix Stuff (for the Statistically Inclined but Computer Programming Challenged)

Introduction

After some feedback on my previous R blog post I have found that a 'Newbie' R/Statistics person needs a better grounding in the vector arithmetic and representation that is the foundation of R. I thought the cursory look provided in my previous post would suffice. I realize now that R provides multiple ways of accessing vectors and matrices (esp. matrices) that hide the "Vectorness" that is inherent in the language. There are many things in R that older programmers have already had experience with. The original vector language developed by IBM was known as APL. Dr. Ken Iverson developed a specialized math syntax while at Harvard. IBM hired him to implement that syntax as a computer programming language (original concepts detailed in reference [2]). This all happened in the 1960s. Those who learned Computer Science in the 60s and 70s would have had exposure to this language. It has continued on and there is even a free GNU version available today [6]. The problem for many people was the strange symbols that were the basis of the language. Since APL there have been many offshoots that have carried forward this idea of 'Vectors' being the built-in data structure of the language, but with a design change that uses the characters found on a standard keyboard for syntax. The language K is probably the most successful commercial implementation of this offshoot [3]. R is probably the most successful open source implementation of these concepts. My personal favorite is the J language, which the late Dr. Iverson developed as a redesign of his APL concepts. J has an active user forum and a great collection of articles on its website covering the history of APL, Dr. Iverson, and various uses of J in many different areas (see reference [1]).

Because many professors and teachers experienced this history first hand, it can be difficult for them to explain R to a beginner. It is very easy to assume that something is a simple concept because you forget that you didn't learn it in R. You learned it in some other computer language, programming different types of things. Jumping into R was not that difficult and you appreciate how R has transformed some of the menial tasks into simple function calls. The 'Newbie', however, is left with many WTF moments as things seem to happen by magic. The goal of this blog post is to show you how the basic vector concepts are in everything that you do. This will help you as you try to dissect your data stored in a table. R has many layers on top of that data that help facilitate creating charts and statistics, but in the end it is all just vectors and matrices (aka arrays and tables).

Vectors/Arrays

A vector has its roots in physics. The idea behind it is that many physical properties are described by a value and a direction. I may push something along at 25 miles per hour but that is only part of the story. I am also pushing it along in a certain direction. Once I come up with a way of telling direction I now must carry 2 values along to let you know exactly what I am doing. So the concept of a vector is a way of carrying around multiple values to describe a single concept. In math and in computers it's not hard to envision that we might want to carry around more than just 2 values. Why not 3? There are after all 3 dimensions. Why not 10? Why not 1000? Hence for our purposes a vector is a way of carrying around multiple pieces of information and referencing them by a single name and an index. Mathematics uses a subscript to identify a particular item in a vector:

\[x = \{2,4,6,8\}\] \[x_1 = 2 \] \[x_4 = 8 \]

In R access to individual vector elements is accomplished as follows:

> x = c(2,4,6,8) #combine 2 4 6 8 into a vector and store it in x
> x
[1] 2 4 6 8
> # since subscripting is a pain in the neck R uses square brackets
> x[2]
[1] 4
> x[1]
[1] 2
> x[4]
[1] 8
> 

Seems easy enough. In math rather than write out every element of a vector we can use an ellipsis to continue an established pattern. So for example to represent the numbers from 1 to 100 in a vector in Math we do the following:

\[x = \{1,2,3,4,\ldots,99,100\} \] \[x_3 = 3 \] \[x_{98} = 98 \]

R rotates the ellipsis and uses the ':' (the colon) to implement similar functionality:

> x = c(1:100)
> x
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
 [22]  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42
 [43]  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63
 [64]  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84
 [85]  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
> x[3]
[1] 3
> x[98]
[1] 98
> 

Now here is where R can be deceiving. The colon operator is like the ellipsis but not exactly alike. The colon is only good for generating an increment-by-one pattern. So for example in math:

\[ x = \{2,4,6,\ldots,20,22\} \]

You instinctively understand I mean to count by 2's up to 22. Trying this in R with the colon operator just increments by 1s from 6 to 20:

> x = c(2,4,6:20,22)
> x
 [1]  2  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 22
> # from 6 to 20 R counts by 1s it doesn't try to infer my pattern 

Now that doesn't mean I have to enter in every value for R if I want to count by 2's. But it does mean I have to be more arithmetically distinct in what I tell R to do. Counting by 2's is just counting by 1's up to half the maximum value and multiplying the result by 2. So to accomplish the same thing in R:

> x = 2 * c(1:11)
> x
 [1]  2  4  6  8 10 12 14 16 18 20 22
>

R does do one bit of inference with this operator:

> # one thing R will infer is that if you reverse the order and put the larger number first
> # R will count backwards for you
> x = c(11:1)
> x
 [1] 11 10  9  8  7  6  5  4  3  2  1
> 

But if R didn't do this, it would be easy to reconstruct with some added R functionality: the reverse function 'rev'. This function gives the reverse order of a vector

> # Create a reverse order without switching
> x = c(1:11)
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11
> rev(x)
 [1] 11 10  9  8  7  6  5  4  3  2  1
> # in one line
> x = rev(c(1:11))
> x
 [1] 11 10  9  8  7  6  5  4  3  2  1
> 

I hope at this point you can extrapolate and realize that by investigating the functions available in R we can create our own vectors of data without having to resort to reading it in from a file. This comes in handy for putting together some simple testing data.
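
As a rough illustration of that idea, here are a few other base R ways to generate small test vectors. The variable names (labels, grid, squares) are made up for this sketch and are not part of the sessions above.

```r
# rep() repeats values: handy for labels or constant columns
labels <- rep(c("a", "b"), times = 3)   # "a" "b" "a" "b" "a" "b"
print(labels)

# seq() with length.out spaces values evenly between two endpoints
grid <- seq(0, 1, length.out = 5)       # 0.00 0.25 0.50 0.75 1.00
print(grid)

# Arithmetic on a colon vector gives other regular patterns
squares <- (1:5)^2                      # 1 4 9 16 25
print(squares)
```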

Matrix/Matrices

A Matrix wasn't originally a computer driven reality to enslave people to provide power to machines. It is just a mathematical concept for a table of values. It is an extension of the concept of a vector. While a vector has multiple values, it is considered a one-dimensional object. This means I only need one index to obtain a value. If I took a set of vectors of the same length and piled them on top of each other I would create a table or Matrix. In mathematical notation you just put a table of numbers in parentheses:

\[ M = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 11 & 12 & 13 & 14 & 15 \\ 21 & 22 & 23 & 24 & 25 \end{pmatrix} \]

\[ M_{1,2} = 2 \] \[ M_{3,3} = 23 \]

Matrices can be created directly in R. But first a little segue to go from vectors to matrices. In R, start by creating 3 vectors of 5 elements each: Vector1 = {1,2,3,4,5}, Vector2 = {11,12,13,14,15} and Vector3 = {21,22,23,24,25}. To save typing call them V1, V2, and V3. Here is the R session to set that up.

> # 3 Vectors of length 5 (notice I use a little math to help create different values)
> V1 = c(1:5)
> V2 = 10+V1
> V3 = 20+V1
> V1
[1] 1 2 3 4 5
> V2
[1] 11 12 13 14 15
> V3
[1] 21 22 23 24 25
> # notice that R added a number to the whole vector V1
>

Even though I had to type each variable to display the data, notice the natural tabular form that appears when looking at the last 3 lines of numbers above. They look like 3 rows of a table. If I wanted the second element of the first row, the 4th element of the second row and the 1st element of the third row, I could access them all as follows (continuing with the vectors I have set up):

> V1[2]
[1] 2
> V2[4]
[1] 14
> V3[1]
[1] 21
> 

I named the vectors with numbers purposefully. If I could form a table and R could extend its access to account for rows and columns (which it does), I could use one variable name and access any element by just giving the row and column number of that element. V1[2] would be M[1,2] in a table constructed of these vectors and stored in M. Similarly V2[4] -> M[2,4] and V3[1] -> M[3,1]. Not only do I save typing, but I can also write loops that go through every member of the matrix in almost any order I can imagine.

Experimenting with R and its matrix creation function I was able to use the vectors to create a table with each vector above as one row. I did have to use the matrix transpose function 't' (initially). Transpose will flip the matrix by swapping rows for columns (look up matrix transpose if you don't quite understand what it's doing from the session below). In the end I figured out the proper parameters for the matrix function to pile the vectors on top of each other (in row fashion) in one fell swoop.

> # Use matrix function to create a matrix from V1, V2, and V3
> M = matrix(c(V1,V2,V3),nrow=3,ncol=5)
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4   12   15   23
[2,]    2    5   13   21   24
[3,]    3   11   14   22   25
> # matrix fills columns first not rows what to do?
> t(M)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5   11
[3,]   12   13   14
[4,]   15   21   22
[5,]   23   24   25
> # Lets flip the dimensions around and see what happens
> M = matrix(c(V1,V2,V3),nrow=5,ncol=3)
> M
     [,1] [,2] [,3]
[1,]    1   11   21
[2,]    2   12   22
[3,]    3   13   23
[4,]    4   14   24
[5,]    5   15   25
> # since matrix fills columns first lets fill a vector per column by switching dimensions
> # like above. Now transpose should get us the form we were looking for which is a 
> # vector per row
> t(M)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]   11   12   13   14   15
[3,]   21   22   23   24   25
> # so lets put it all into one line to make a matrix of our three vectors with each
> # vector in its own row
> M = t(matrix(c(V1,V2,V3),nrow=5,ncol=3))
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]   11   12   13   14   15
[3,]   21   22   23   24   25
> # Now M[1,2] should match V1[2], M[2,4] = V2[4] and M[3,1] = V3[1]
> M[1,2]
[1] 2
> V1[2]
[1] 2
> M[2,4]
[1] 14
> V2[4]
[1] 14
> M[3,1]
[1] 21
> V3[1]
[1] 21
> # Had I dug a little deeper into the matrix function, there is a flag to fill by row called 'byrow'
> M = matrix(c(V1,V2,V3),nrow=3,ncol=5,byrow=TRUE)
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]   11   12   13   14   15
[3,]   21   22   23   24   25
> # got the matrix in 1 step
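
Another route to the same layout, not used in the session above, is the base R function 'rbind', which binds its arguments together as rows. This is only a sketch; M2 is a name introduced here for comparison.

```r
V1 <- 1:5
V2 <- 10 + V1
V3 <- 20 + V1

# rbind() stacks its arguments as rows, so no transpose or byrow flag is needed
M2 <- rbind(V1, V2, V3)
print(M2)

# Same values as the byrow=TRUE matrix; only the row names differ
M <- matrix(c(V1, V2, V3), nrow = 3, ncol = 5, byrow = TRUE)
print(all(M2 == M))   # TRUE
```

One side effect is that 'rbind' labels the rows with the names of the vectors (V1, V2, V3), which may or may not be what you want.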

The above session has an important nuance. I assumed that R would think the way I do: put vectors into rows. But as the session unfolded it was clear that R is column oriented by default. I was able to adjust once I saw the way R was doing things. This is important! As you begin to think in terms of vector and matrix operations you may find the answer coming from R is not formatted properly or the data doesn't seem to have the right appearance. When you see weird things happening you must break down your operations and make sure you and R are on the same page (more so you, since R is not going to change). When in doubt, go to one operation per line and display the results of each operation (or a portion thereof if you have a considerable amount of data). Verify that each operation you are performing does what you expect. You would be surprised how one small typographical error can cause you hours of debugging and anxiety. Your mind will overlook the small error because it will fill in a missing operation as you are looking at it (or ignore it if there is an extra operation). By breaking it down you are verifying to yourself that each operation works as intended.
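
Here is one way that one-operation-per-line habit might look, using the same three vectors. The checks chosen ('length', 'dim', and a single spot check) are just a suggestion, and 'combined' is a name introduced for this sketch.

```r
V1 <- 1:5
V2 <- 10 + V1
V3 <- 20 + V1

# Step 1: build the combined vector and check its length
combined <- c(V1, V2, V3)
print(length(combined))   # expect 15

# Step 2: build the matrix and check its shape before using it
M <- matrix(combined, nrow = 3, ncol = 5)
print(dim(M))             # rows first, then columns

# Step 3: spot-check one element against the vector it should come from
print(M[2, 4] == V2[4])   # FALSE here: the default column fill scrambled the rows
```

The spot check in step 3 is what exposes the column-first fill, before any later calculation quietly uses the wrong numbers.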

Row and Column names

I use the term 'table' rather loosely above. Don't confuse this with any add-on packages that have tables. I mean it in the simplest sense, as a way of describing 2-dimensional data. R has another table-type structure called a 'data frame'. So what's the difference between a matrix (which I have shown as a 'table' of numbers) and an R data frame? In an R data frame you can have a mix of data types between columns. Each individual column needs to have data of the same type, but the next column can have a completely different data type (as long as it's consistent within that column). So in a matrix all the data must be the same type across all rows and columns, and in a data frame there can be some mixing of data types on a column by column basis.

Now you access data in a 'data frame' by indexing the same way as you do with a matrix. The trick is not to do any operation on that data that is inconsistent with the data type of the column. In a matrix (since all the data is the same type) I can add together any 2 selected elements (if the data is of numeric type).

> # Create a vector of 25 elements from 1 to 25
> v <- 1:25
> v
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> # Use vector v to create a matrix that is 5x5 of those elements
> m <- matrix (v,5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
> # Add m[2,3] and m[3,2] together
> m[2,3]
[1] 12
> m[3,2]
[1] 8
> m[2,3]+m[3,2]
[1] 20
>

Nothing surprising. I make a matrix of integer values and I can add them together any way I please.

What about naming columns and rows? It turns out there are multiple ways of naming columns and rows, depending on whether the underlying data structure is a matrix or a 'data frame'. The calls below, colnames and rownames, work the same across both structures. A 'data frame' also has a built-in $ operator, which is used to access a whole column of data by name. I include its use in the session below:

> # Give names to the columns and rows
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
> colnames(m) <- c("C1","C2","C3","C4","c5")
> m
     C1 C2 C3 C4 c5
[1,]  1  6 11 16 21
[2,]  2  7 12 17 22
[3,]  3  8 13 18 23
[4,]  4  9 14 19 24
[5,]  5 10 15 20 25
> # Now the rows
> rownames(m) <- c("r1","R2","r3","R4","r5")
> m
   C1 C2 C3 C4 c5
r1  1  6 11 16 21
R2  2  7 12 17 22
r3  3  8 13 18 23
R4  4  9 14 19 24
r5  5 10 15 20 25
> # We can still access with number indexes as before
> m[2,3]
[1] 12
> # But now we can use names as indexes instead
> m ["R2","C3"]
[1] 12
> # Is this where we can start using the $ in the variable name?
> m$C2
Error in m$C2 : $ operator is invalid for atomic vectors
> # No we can't use that type of access for a matrix
> # Turn m into a dataframe d and see what we can do
> d <- as.data.frame(m)
> d
   C1 C2 C3 C4 c5
r1  1  6 11 16 21
R2  2  7 12 17 22
r3  3  8 13 18 23
R4  4  9 14 19 24
r5  5 10 15 20 25
> # It doesn't look that much different but here are the different ways
> # to access data.
> d[2,3]
[1] 12
> d["R2","C3"]
[1] 12
> d["R2",]$C3
[1] 12
> d$C3
[1] 11 12 13 14 15
> d[2,]
   C1 C2 C3 C4 c5
R2  2  7 12 17 22
> d["R2",]
   C1 C2 C3 C4 c5
R2  2  7 12 17 22
> 
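
Since $ is not available on a matrix, a whole named column can still be pulled out by leaving the row index empty. The sketch below rebuilds the same m and d so that it stands alone; the 'dimnames' and double-bracket calls are base R additions that the session above does not use.

```r
m <- matrix(1:25, 5)
colnames(m) <- c("C1", "C2", "C3", "C4", "c5")
rownames(m) <- c("r1", "R2", "r3", "R4", "r5")

# A whole column from a matrix: leave the row index empty
print(m[, "C2"])

# dimnames() returns the row and column names together as a list
print(dimnames(m))

# On a data frame, double brackets are an alternative to $
d <- as.data.frame(m)
print(d[["C3"]])
```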

Data Frames

The data frame's strength comes from being able to handle tabular data of different data types. The following session creates a data frame with a mix of data types and shows how you have to be careful what operations you choose to do. By supplying column names in the creation of the 'data frame' there is no need to perform a separate operation to insert them into the 'data frame'.

> d2 <- data.frame(C1=c(1:5),C2=c("a","b","c","d","e"),C3=c("john","joesph","james","jane","janet"))
> d2
  C1 C2     C3
1  1  a   john
2  2  b joesph
3  3  c  james
4  4  d   jane
5  5  e  janet
> d2[1,1]+d2[3,1]
[1] 4
> d2[1,1]+d2[1,2]
[1] NA
Warning message:
In Ops.factor(d2[1, 1], d2[1, 2]) : ‘+’ not meaningful for factors
> # We can do some comparisons on the character data
> "a" == d2[2,2]
[1] FALSE
> "a" == d2[1,2]
[1] TRUE
> "james" == d2[3,2]
[1] FALSE
> "james" == d2[3,3]
[1] TRUE
> d2[1,]
  C1 C2   C3
1  1  a john
> d2$C2
[1] a b c d e
Levels: a b c d e
> 
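
The "Levels:" line and the Ops.factor warning in the session above show that the character columns were stored as factors, which was the data.frame default before R 4.0.0. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so the same commands keep the strings as plain character data and the mixed addition stops with an error instead of returning NA. Either way, a quick type check before doing arithmetic avoids the surprise. This is a minimal sketch with the flag set explicitly so it behaves the same on both old and new versions.

```r
d2 <- data.frame(C1 = 1:5,
                 C2 = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)

# str() lists each column together with its type
str(d2)

# Check a column's type before doing arithmetic on it
print(is.numeric(d2$C1))   # TRUE
print(is.numeric(d2$C2))   # FALSE

# Add only when both columns are numeric
if (is.numeric(d2$C1) && is.numeric(d2$C2)) {
  print(d2[1, 1] + d2[1, 2])
} else {
  print("C2 is not numeric, skipping the addition")
}
```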

The other strength of a 'data frame' is that it can be used seamlessly with functions that read in comma separated values. This allows you to pull in data sets from databases or websites and operate on them easily. Since comma separated value files usually include a first line of column names, the 'data frame' will already have column names inside after a read operation.
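
As a small illustration, 'read.csv' can parse comma separated text into a 'data frame'. The CSV content, column names, and the variable names csv_text and people are invented for this sketch; in practice the first argument would be a file path or URL rather than the text argument.

```r
# Hypothetical CSV content standing in for a real file
csv_text <- "name,age,score
john,31,88.5
jane,27,92.0
james,45,79.5"

# header = TRUE (the read.csv default) takes column names from the first line
people <- read.csv(text = csv_text, header = TRUE)

# The column names come straight from the header line
print(colnames(people))

# Numeric columns are ready for arithmetic right after the read
print(mean(people$score))
```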

Conclusion

These topics are covered in more depth in the PDF text "An Introduction to R" [7]. Hopefully this blog has provided some insight into the workings of R and vector languages in general. The purpose here was to give just enough vector stuff to get you through debugging a statistics assignment when things go wrong. Usually the data is structured in a manner that's different from how your mind is perceiving it. This causes you to make improper function calls. I can't say this enough: when in doubt, break things down! Try functions on smaller pieces of data and make sure you get an answer you expect. Once things are operating the way you expect you can extrapolate up to larger datasets.

References

  1. http://www.jsoftware.com/ A great vector-based language with an excellent forum to search on various subjects. There is an R interface to the J language, so you can work in J and use R when you need something statistical that J doesn't have. Search the website for Ken Iverson; they have some excellent essays on the beginnings of APL and vector languages.
  2. Iverson, Kenneth E. “A Programming Language.” A Programming Language, J Software Inc., 13 Oct. 2009, www.jsoftware.com/papers/APL.htm.
  3. https://kx.com/ The company that produces the K language and Kdb (a database based on the K language).
  4. http://www.r-tutor.com/ Offers nice tutorials on various aspects of R. It also has some nice deep-learning info. Always seems to come up first when googling an R language reference.
  5. https://stackoverflow.com/questions/2281353/row-names-column-names-in-r Discussion of matrix and data frame row and column names.
  6. https://www.gnu.org/software/apl/ GNU's APL implementation.
  7. https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf A good general (not so statistical) introduction to the language that covers many of these details in greater depth. It's a PDF, so you should download a copy.

Author: NASTY OLD DOG
