Data types

R can be used as a simple calculator. Try these arithmetic operators:

> 1
> 10 + 10
> 10 - 20
> 10 / 3
> 5 * 3
> 5^3
> 5 %/% 3 #integer division
> 5 %% 3 #modulo

Well, these operations are too trivial. Let’s compute the square root of 25 using R’s base function sqrt. In R, functions expect one or more inputs and return a result, which is either printed to stdout (standard output) or assigned to a variable. Functions are written in prefix form with the input in parentheses, i.e. myfunction(myinput). Concretely, for the square root of 25 type the following:

> sqrt(25)

For more details, look at the R documentation:

> ?sqrt #or help(sqrt)

In the previous example we worked with numeric data, but R provides further data types such as character and logical. Character data are enclosed in quotes (“”), logical data have the value TRUE or FALSE.

> "my first string!"
> FALSE
> TRUE

However, R is case-sensitive, meaning that TRUE is not the same as true. The same holds for variable names, function names, and everything else.

> TRUE
> true
> sqrt(25)
> SQRT(25)

Now, let’s try to apply arithmetic operators to logical data types. What do you expect?

> TRUE + TRUE
> FALSE - TRUE
> FALSE / TRUE

In computers, the logical values TRUE and FALSE are represented as 1 and 0, respectively. To concatenate strings, we can use the paste function.

> paste("my", "first", "R example")

By the way, if you don’t know which function to use, use the help.search function and look at the base package.

> help.search("concatenate")

Besides arithmetic operators, there are also logical and relational operators, which will be discussed after the lunch break. The important thing now is to store your results. For this, R provides the assignment operator ‘<-’.

> x <- 1+1
> x
> 
> y<- "my_string"
> y
> 
> z <- paste("1+1", x)

Data structures

Now we know how to work with numeric, character, and logical data types and how to assign values to variables. However, real-life problems are much more complicated and we need more than a simple calculator, especially in the 21st century. For this reason, we introduce objects for holding more complex data: vectors, matrices, arrays, data frames, and lists. They differ in the type of data they can hold, how they are created, their structural complexity, and the notation used to identify and access individual elements. A great overview is in the figure below (from R in Action: Data Analysis and Graphics with R).

R data structures

Vector

The basic data structure in R is the vector. A vector is a 1d structure that is homogeneous (all elements must be of the same type) and is characterized by its length and type (e.g. logical, numeric, or character). Note that scalars are one-element vectors.

Vectors are usually created with c(), short for combine:

> myvector <- c(1,2,3,4,5,6,20,150)
> myvectorl <- c(TRUE, TRUE, FALSE)
> myvectors <- c("these are", "some string")

Note that vectors are always flat:

> c(1,c(2, c(3,4)))
> #the same as
> c(1,2,3,4)

For faster sequence creation, we can use seq():

> seq(1,10)
> seq(1,10, by = 2)
> seq(1,10, length.out = 100)
> #also use the colon
> 1:10

For sequences containing repeating values, use rep():

> rep(0, 10)
> rep(c(1,3), each=5)
> rep(c(1,3), times=5)

Example: Using any function, generate all odd numbers in the interval 1-100. One solution: seq(1,100, by=2)

However, as mentioned above, all elements of a vector must be of the same type, so when you attempt to combine different types they will be coerced to the most flexible type. Which of the data types (numeric, character, and logical) is the most flexible? Which one is the least flexible? Try it! (character > numeric > logical)

> x <- c(TRUE, FALSE, TRUE)
> typeof(x)
> x <- c(TRUE, FALSE, TRUE, 123.12)
> typeof(x)

Now it’s time for indexing vectors! Suppose we have a vector and would like to get only a subset of its elements. To access an element of a vector, R uses square brackets “[]”, where a number represents a position in the vector. Note that the first element of a vector has index 1!

> x <- seq(1,50)
> x[1] #first element
> x[length(x)] #last element
> x[c(1,5,10,25,36)]
> #more examples, later!

Matrix

In comparison to a vector, a matrix is a 2d array that is also homogeneous. The general format is:

myymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns, byrow=logical_value, dimnames=list(char_vector_rownames, char_vector_colnames))

Oftentimes, an example is better than a thousand words. So, let’s make our first matrix.

myymatrix <- matrix(seq(1,9), nrow=3, ncol=3, byrow=TRUE, dimnames=list(c("x1", "x2", "x3"), c("y1", "y2", "y3")))
myymatrix

For matrices, additional operators are defined, such as element-wise multiplication, matrix multiplication, outer product, transposition, etc.

A <- matrix(seq(1,9), nrow=3, ncol=3, byrow=TRUE, dimnames=list(c("x1", "x2", "x3"), c("y1", "y2", "y3")))
B <- matrix(seq(9,1), nrow=3, ncol=3, byrow=TRUE, dimnames=list(c("x1", "x2", "x3"), c("y1", "y2", "y3")))

A+B
A-B
A*B #element-wise multiplication
A %*% B #matrix multiplication
A %o% B #outer product
t(A) #transposition
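
To see how element-wise and matrix multiplication differ, a small sketch with a 2x2 matrix may help. The names M and I2 are hypothetical and chosen only for this illustration; multiplying by the identity matrix makes the results easy to check by eye.

```r
# M and I2 are illustrative names, not part of the course material
M  <- matrix(1:4, nrow = 2, byrow = TRUE)  # rows: (1, 2) and (3, 4)
I2 <- diag(2)                              # 2x2 identity matrix

elementwise <- M * I2         # multiplies matching positions: only the diagonal survives
matprod     <- M %*% I2       # matrix multiplication: the identity leaves M unchanged
outer_dims  <- dim(M %o% I2)  # the outer product of two 2x2 matrices has four dimensions

elementwise
matprod
outer_dims
t(M)                          # transposition swaps rows and columns
```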

In contrast to a vector, a matrix is a 2d array, so the indexing has to be different. You can identify rows, columns, or elements of a matrix by using subscripts and brackets. A[i,] refers to the i-th row of matrix A, A[,j] refers to the j-th column, and A[i, j] refers to the ij-th element. The subscripts i and j can be numeric vectors in order to select multiple rows or columns, as shown in the following example.

A <- matrix(seq(1,9), nrow=3, ncol=3, byrow=TRUE, dimnames=list(c("x1", "x2", "x3"), c("y1", "y2", "y3")))

A[2,3]
A[2,]
A[,3]
A[c(1,2),]
A[c(1,2), c(2,3)]
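
Because the matrix A above has row and column names, its elements can presumably also be addressed by name. The following sketch shows this together with the drop argument of the bracket operator, which is not covered in the text above; the variable names by_name, one_row, and row_mat are made up for the illustration.

```r
A <- matrix(seq(1, 9), nrow = 3, ncol = 3, byrow = TRUE,
            dimnames = list(c("x1", "x2", "x3"), c("y1", "y2", "y3")))

by_name <- A["x2", "y3"]         # same element as A[2,3]
one_row <- A[2, ]                # a single row is simplified to a named vector
row_mat <- A[2, , drop = FALSE]  # drop = FALSE keeps the result a 1x3 matrix

by_name
is.matrix(one_row)
dim(row_mat)
```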

Array

Arrays are very similar to matrices; they are also homogeneous. However, an array is the general case of a matrix, allowing n dimensions instead of 2. The array function has the following form:

myarray <- array(vector, dimensions, dimnames)

Let’s have a look at an example.

myarray <- array(seq(1,24), c(4,3,2), dimnames = list(c("x1", "x2", "x3", "x4"), c("y1", "y2", "y3"), c("z1", "z2")))
myarray

Indexing is similar to a matrix.

myarray <- array(seq(1,24), c(4,3,2), dimnames = list(c("x1", "x2", "x3", "x4"), c("y1", "y2", "y3"), c("z1", "z2")))

myarray[1,1,1]
myarray[1,1,]
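
As a rough sketch of how the third dimension behaves, the same array can be sliced by name. The variable names first_slice and by_name are only illustrative.

```r
myarray <- array(seq(1, 24), c(4, 3, 2),
                 dimnames = list(c("x1", "x2", "x3", "x4"),
                                 c("y1", "y2", "y3"),
                                 c("z1", "z2")))

dim(myarray)                           # the three dimensions: 4, 3, and 2
first_slice <- myarray[, , "z1"]       # fixing the third index leaves a 4x3 matrix
by_name <- myarray["x1", "y1", "z2"]   # the array is filled column by column, slice by slice

dim(first_slice)
by_name
```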

List

The first representative of a heterogeneous structure, where the contents can be of different types, is the list. In other words, a list allows you to gather a variety of objects under one name. For example, one list can contain several vectors, matrices, data frames, and other lists. It is the most complex structure in R. The list function has the following form:

mylist <- list(object1, object2, ...)

Now, let’s make an example of a list of vectors.

mylist <- list(one = rep(1,5), two = rep(2,100), three= rep(3,20))
mylist

We created a list that consists of three vectors named ‘one’, ‘two’, and ‘three’. To get or change the names of the elements, use the ‘names’ function in the following way:

mylist <- list(one = rep(1,5), two = rep(2,100), three= rep(3,20))
names(mylist) #print all names
names(mylist) <- c("myOne", "myTwo", "myThree")
mylist

Elements are accessed via the dollar sign ($):

mylist <- list(one = rep(1,5), two = rep(2,100), three= rep(3,20))
mylist$one
mylist$two

Or by giving an element number or a name within double brackets:

mylist <- list(one = rep(1,5), two = rep(2,100), three= rep(3,20))
mylist[[1]]
mylist[["one"]]
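
Single brackets also work on lists, but they behave differently from double brackets. The sketch below contrasts the two; the variable names single and inner and the added element four are hypothetical.

```r
mylist <- list(one = rep(1, 5), two = rep(2, 100), three = rep(3, 20))

single <- mylist[1]     # single brackets return a list of length 1
inner  <- mylist[[1]]   # double brackets return the element itself

class(single)
class(inner)
lengths(mylist)         # length of every element

# A list can hold elements of different types
mylist$four <- c("a", "b")
str(mylist)
```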

Data frame

Finally, the second representative of a heterogeneous structure is the data frame, the most common way of storing data in R. A data frame is a 2d dataset made of equal-length vectors. This structure shares properties of the matrix and the list. Each column can be considered an element of a list. Consequently, each column can be of a different data type (numeric, string, or logical) but, of course, all elements within a column must be of the same data type. The data.frame function has the following form:

mydata <- data.frame(col1, col2, col3,...)

Now, let’s make our first data frame.

mycol1 <- c(TRUE, FALSE, TRUE)
mycol2 <- c(1,57,698)
mycol3 <- c("first", "second", "third")
mydata <- data.frame(col1 = mycol1, col2 = mycol2, col3 = mycol3)
mydata

As mentioned above, a data frame is similar to a list because its columns are treated as list elements. So, columns are accessed via the dollar sign ($) or with double brackets.

mydata$col1
mydata[["col1"]]

On the other hand, the data frame is also close to the matrix.

mydata[2,]
mydata[2,1]
mydata[,1]

Names for columns and rows can be read or set with the colnames() and rownames() functions, respectively. Note that row names have to be unique, and column names should be unique as well.

colnames(mydata)
rownames(mydata)

colnames(mydata) <- c("mycol1", "mycol2", "mycol3")
rownames(mydata) <- c("myrow1", "myrow2", "myrow3")
mydata
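
The list-like and matrix-like sides of a data frame can be seen side by side in a short sketch. The name demo_df and the added column col4 are assumptions made for this illustration only.

```r
# demo_df is a hypothetical copy of the example data
demo_df <- data.frame(col1 = c(TRUE, FALSE, TRUE),
                      col2 = c(1, 57, 698),
                      col3 = c("first", "second", "third"))

n_elements <- length(demo_df)  # list view: one element per column
n_rows     <- nrow(demo_df)    # matrix view: number of rows
sapply(demo_df, class)         # each column keeps its own type

# A new column is added like a new list element; it must have nrow(demo_df) values
demo_df$col4 <- demo_df$col2 * 2
demo_df
```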

Now, let’s have a look at the structure of the data frame. Use the str() function. What do you expect to get?

str(mydata)

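As you have seen, ‘mycol1’ is logical and ‘mycol2’ is numeric; however, ‘mycol3’ is a factor (in R versions before 4.0.0; newer versions keep it as character unless stringsAsFactors = TRUE is set).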

Factor as a data type

A factor is a vector that can contain only predefined (categorical) values. This type represents a nominal or ordinal variable. (man, woman) is an example of a nominal variable, where man is coded as 1 and woman as 2 in the data and no order is implied. An example of an ordinal variable is (small, normal, big), where the order is evident, i.e. big > normal > small. Ordinal variables imply order but not amount.

An example of nominal variable:

sex <- factor(c("man", "woman"))

An example of ordinal variable:

size <- factor(c("small", "normal", "big"), levels = c("small", "normal", "big"), ordered = TRUE) #levels listed in increasing order
size
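
What "coded as 1, 2, ..." means in practice can be sketched as follows. The data values and the names codes, below_big, and counts are made up for the illustration; the levels are passed to factor() directly so that the order small < normal < big is explicit.

```r
size <- factor(c("small", "big", "normal", "small"),
               levels = c("small", "normal", "big"),
               ordered = TRUE)

levels(size)                  # the predefined values, in increasing order
codes <- as.integer(size)     # the integer codes stored behind the labels
below_big <- size < "big"     # comparisons are meaningful for ordered factors
counts <- table(size)         # number of observations per level

codes
below_big
counts
```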

In R versions before 4.0.0, data.frame() implicitly converts character vectors to factors. To avoid it, set stringsAsFactors = FALSE (the default since R 4.0.0).

mycol1 <- c(TRUE, FALSE, TRUE)
mycol2 <- c(1,57,698)
mycol3 <- c("first", "second", "third")
mydata <- data.frame(col1 = mycol1, col2 = mycol2, col3 = mycol3, stringsAsFactors = FALSE)
mydata[3,3]

Reading data into R

Now that we know the data structures, we need to put some data in them. R supports importing data from various formats. We can import data from text files (.txt, xml, …), databases (MySQL, Oracle, Access, …), statistical packages (SAS, SPSS, …), Excel, and, of course, standard input (keyboard). This course focuses on the basics, so in the next section we cover only CSV files. The CSV format is fully sufficient for our requirements.

Base R datasets

Before we start with the CSV format, note that R provides a set of example datasets. For more information, have a look at the ‘datasets’ package.

library(help = "datasets")

Let’s try a dataset. First, we attach the dataset ‘mtcars’ to our environment, then we inspect it.

attach(mtcars)
View(mtcars)
head(mtcars)
str(mtcars)

Now, save the dataset as ‘mymtcars.csv’.

write.csv(mtcars, "mymtcars.csv")

CSV files

Probably the easiest way to store your data is the CSV (comma-separated values) format. A CSV file is a plain text file that uses commas to separate values. In our case, we work with data stored in a table: columns are separated by commas and rows are separated by a newline character.

R provides the read.csv function to load data in CSV format. The syntax is as follows:

mycsv <- read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

Practice makes perfect! Read our “mymtcars.csv” file that we saved in the section above.

mydata1 <- read.csv("mymtcars.csv")
mydata2 <- read.csv("mymtcars.csv", row.names = 1)

Look at the structure! Are these two datasets the same? Why not?
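
One way to compare the two ways of reading the file is sketched below. It writes to a temporary file instead of ‘mymtcars.csv’ so that it runs on its own; the names csv_path, with_column, and with_rownames are assumptions made for this sketch.

```r
# csv_path is a temporary stand-in for "mymtcars.csv"
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path)          # row names are written as the first column

with_column   <- read.csv(csv_path)                 # row names arrive as an ordinary column
with_rownames <- read.csv(csv_path, row.names = 1)  # first column is used as row names

ncol(with_column)
names(with_column)[1]                # the unnamed first column gets a generated name
ncol(with_rownames)
head(rownames(with_rownames))
```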

Subsetting data

From the previous subsections we know how to isolate a single element or a row/column vector. Now we will look at some more sophisticated cases. Generate a matrix of 10 rows and 10 columns whose elements are drawn randomly from the uniform distribution between 0 and 1.

mymatrix <- matrix(runif(100,0,1), nrow = 10)

Indexing with vector

If you need not just a single element or vector but a subset of the data (e.g. a subset of rows, a subset of columns, or both simultaneously), you can put a vector into the square brackets instead of a single index.

mymatrix[c(1,5,9), ] #subset of rows
mymatrix[, c(1,5,9)] #subset of cols
mymatrix[c(1,5,9), c(1,5,9)] #both

Indexing with logicals

The second way to index vectors is via a logical vector, i.e. a vector that contains only TRUE or FALSE values. If you put a logical vector in brackets [], only the elements at positions with TRUE are returned. So the logical vector works like a filter.

ht  <- c(10,20,30,40,50,60)
wt <- c(21,32,34,40,50,52)

mydataframe <- data.frame(weight = wt, height = ht)

mydataframe

indexing

mydataframe[1,1]
mydataframe[2,]
mydataframe[,1]
mydataframe[1:2,1]
mydataframe[c(1,3),]
mydataframe[-c(1,2),1]
mydataframe$weight
mydataframe[, "weight"]
colnames(mydataframe)
rownames(mydataframe)
rownames(mydataframe) <- c("smpl1", "smpl2", "smpl3", "smpl4", "smpl5", "smpl6")
mydataframe["smpl1", "weight"]

binary operators

3 > 2
2 == 3
2 != 4
4 <= 5
myvect <- c(1,2,3,4,5,6)
myvect > 3
myvect[myvect > 3]

mydataframe$weight > 30
mydataframe[mydataframe$weight >= 40,]
mean(mydataframe$weight)
mean(mydataframe$height)
colMeans(mydataframe)

sd(mydataframe$weight)
median(mydataframe$weight)
sum(mydataframe$weight)
colSums(mydataframe)
summary(mydataframe)
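
Logical filters can also be combined. The following sketch uses the element-wise operator & to filter on two columns at once and which() to turn a logical vector into positions; the name heavy_and_tall is made up for the illustration.

```r
ht <- c(10, 20, 30, 40, 50, 60)
wt <- c(21, 32, 34, 40, 50, 52)
mydataframe <- data.frame(weight = wt, height = ht)

# Keep rows where both conditions are TRUE
heavy_and_tall <- mydataframe[mydataframe$weight >= 40 & mydataframe$height > 40, ]
heavy_and_tall

positions <- which(mydataframe$weight > 30)  # positions of the TRUE values
n_heavy   <- sum(mydataframe$weight > 30)    # TRUE counts as 1, so this counts matches
positions
n_heavy
```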

Control structures

In the previous sections, we assumed that your commands are executed one by one as a sequence. But oftentimes we need to change the flow of a program. For this purpose, R offers several control flow statements. In this course, we introduce the following control structures:

if-else statement #conditions
for cycle #loops
ifelse statement

IF statement

Sometimes it’s necessary to execute part of your code only when a given condition is satisfied. For this case, R has the control structure IF, with the following syntax:

if(condition)
{
  # code executed when the condition is valid (TRUE)
}

If the condition is valid (TRUE), the statement in the body gets executed. If it is FALSE, nothing happens. But what does it mean for a condition to be valid or FALSE? First, let’s recapitulate the binary logical operators, which, as their name suggests, return a logical value TRUE or FALSE.

4 > 3 #greater than
4 < 3 #less than
4 == 3 #exactly equal to
4 != 3 #not equal to

Now, let’s try our first example of an IF condition. If the number four is bigger than three, we print some text to the standard output.

if(4 > 3)
{
  print("Four is bigger than three!")
}
if(4 < 3)
{
  print("Three is bigger than four")
}

Furthermore, the conditional part of the IF statement can consist of more than one condition, connected by the OR (||) or AND (&&) logical operators. Do you still remember the AND and OR operators from propositional logic?

Logic table (AND, OR)

Try a few examples.

TRUE && TRUE # AND
FALSE && TRUE
TRUE || FALSE # OR

If three is greater than four OR five is greater than or equal to the maximum value of the sequence one to five, then print “body” to the standard output.

if(3 > 4 || 5 >= max(seq(1,5)))
{
  print("body")
}
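
R also has the single-character operators & and |, which are not discussed above. A minimal sketch of the difference: & and | work element by element on vectors, while && and || stop evaluating as soon as the result is known. The variable names are illustrative only.

```r
# Element-wise AND on two logical vectors
elementwise <- c(TRUE, FALSE) & c(TRUE, TRUE)
elementwise

# && and || evaluate the right-hand side only when it is needed,
# so the stop() calls below are never reached
short_circuit <- FALSE && stop("this is never evaluated")
either        <- TRUE || stop("this is never evaluated either")
short_circuit
either
```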

In the next example, we present the IF-ELSE statement, which has the following syntax:

if(condition)
{
  # code executed when the condition is valid
} else
{
  # code executed when the condition is invalid
}

x <- 2
if(x > 3)
{
  print("x is bigger than three!")
} else
{
  print("x is less or equal to three!")
}

Finally, the third version of the IF condition is the IF-ELSEIF-ELSE statement. The syntax is as follows:

if(condition1)
{
  # condition1 is valid
} else if(condition2)
{
  # condition1 is invalid, condition2 is valid
} else
{
  # condition1 is invalid, condition2 is invalid
}

Alternatively, we can rewrite the code above as follows:

if(condition1)
{
  # condition1 is valid
} else
{
  if(condition2)
  {
    # condition1 is invalid, condition2 is valid
  } else
  {
    # condition1 is invalid, condition2 is invalid
  }
}

In the example below, we compare the variable x with the number one. Generally, we can observe three states. The first is equality, where x is equal to one; the second is the case where x is greater than one; and in the third, x is less than one. The resulting state is printed to standard output.

x <- 2
if(x == 1)
{
  print('x is equal to one')
} else if (x > 1)
{
  print('x is greater than one')
} else
{
  print('x is less than one')
}
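
The ifelse statement listed at the start of this section applies a condition to every element of a vector at once. A minimal sketch, assuming the same three states as in the example above (the vector values and the name state are made up):

```r
x <- c(0, 1, 2)

# ifelse(test, value_if_true, value_if_false) works element by element;
# nesting a second ifelse() covers the third state
state <- ifelse(x == 1, "equal to one",
                ifelse(x > 1, "greater than one", "less than one"))
state
```

Unlike if, which expects a single TRUE or FALSE, ifelse returns one result per element of x.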

FOR cycles

Looping, cycling, or iterating is an operation where a specific part of the code is repeated; in a FOR cycle, it is repeated once for each element of a sequence. In this course, we focus on the FOR cycle, which has the following syntax:

for(variable in sequence)
{
  # repeated chunk of code
}

For printing ….

print(paste("The year is", 2015))
print(paste("The year is", 2016))
print(paste("The year is", 2017))
print(paste("The year is", 2018))

This becomes annoying for more than 5, 10, or 50 items. Fortunately, we have FOR cycles …

for(year in 2015:2018)
{
  print(paste("The year is", year))
}

A more sophisticated example:

x <- vector()
for(i in 0:10){
  if(i > 5){
    i <- 5 - (i - 5)
  }
  x[length(x)+1] <- i
}
plot(0:10, x)
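
For comparison, the same values can be produced without growing the vector step by step. This is only a sketch of two alternatives, not a replacement for the loop above: x_pre preallocates the result before looping, and x_vec uses pmin(), relying on the fact that 5 - (i - 5) equals 10 - i. Both names are hypothetical.

```r
i <- 0:10

# Loop with a preallocated result vector
x_pre <- numeric(length(i))
for(k in seq_along(i)){
  x_pre[k] <- if(i[k] > 5) 10 - i[k] else i[k]
}

# Vectorized version: element-wise minimum of i and 10 - i
x_vec <- pmin(i, 10 - i)

x_pre
x_vec
```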

Another example. First, we construct a square matrix with 10 columns and 10 rows.

mymatrix <- matrix(1:100, ncol = 10)
head(mymatrix)

Now we would like to get basic statistics, e.g. the sum of each column, the mean of each column, etc.

colSums(mymatrix)
rowSums(mymatrix)
colMeans(mymatrix)
rowMeans(mymatrix)
# colSds(mymatrix) ??? there is no such function in base R

However, a function for computing the standard deviation of each column is missing. What can we do?

FOR cycle, of course!

# Iterate over the values 1:ncol(mymatrix)
for(icol in 1:ncol(mymatrix))
{
  print(sd(mymatrix[,icol]))
}
# Iterate over indexes (seq_len() is an assumed choice for the sequence)
for(icol in seq_len(ncol(mymatrix)))
{
  print(sd(mymatrix[,icol]))
}

But R offers a similar, more compact solution for iterating over matrices: apply(), where MARGIN = 1 means rows and MARGIN = 2 means columns.

apply(mymatrix, MARGIN = 1, sd)
apply(mymatrix, MARGIN = 2, sd)
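
That the FOR cycle and apply() give the same numbers can be checked with a short sketch. The names loop_sd and col_sd are assumptions; the loop stores its results in a preallocated vector instead of printing them.

```r
mymatrix <- matrix(1:100, ncol = 10)

# Column standard deviations with a FOR cycle
loop_sd <- numeric(ncol(mymatrix))
for(icol in seq_len(ncol(mymatrix)))
{
  loop_sd[icol] <- sd(mymatrix[, icol])
}

# The same with apply() over columns
col_sd <- apply(mymatrix, MARGIN = 2, sd)

all.equal(loop_sd, col_sd)
```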

Vectorized operation

library(microbenchmark)

myvect1 <- seq(1,500)
myvect2 <- seq(2,501)
myres <- myvect1 + myvect2 #vectorized addition

microbenchmark(myres <- myvect1 + myvect2, myvect1 + myvect2, times = 1000)

#the same addition written as an explicit loop
mySumFce <- function(myvect1, myvect2){
  myres <- rep(0, length(myvect1)) #preallocate the result
  for(i in seq_along(myvect1))
  {
    myres[i] <- myvect1[i] + myvect2[i]
  }
  myres #return the result
}
microbenchmark(myres <- myvect1 + myvect2, myvect1 + myvect2, mySumFce(myvect1, myvect2), times = 1000)

If you perform an operation on two or more vectors of unequal length, R applies the recycling rule. Suppose two vectors a and b, where the length of a is 10 and the length of b is 5. When R reaches the end of the shorter vector b, it starts again at the first element of b.

a <- 1:10
b <- 1:5
a+b

Maybe it seems weird, but it is very useful when you want to add some value to every element of a vector. Remember, scalars in R are represented as vectors of length 1.

a <- 1:10
b <- 5
a+b
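
Recycling can be made explicit with rep(): the sketch below repeats the shorter vector by hand and checks that the result is the same. The variable names recycled, explicit, scalar_sum, and scalar_explicit are illustrative only.

```r
a <- 1:10
b <- 1:5

recycled <- a + b                  # R recycles b automatically
explicit <- a + rep(b, times = 2)  # the same thing written out by hand
identical(recycled, explicit)

# A scalar is a vector of length 1, so it is recycled ten times
scalar_sum      <- a + 5
scalar_explicit <- a + rep(5, times = 10)
identical(scalar_sum, scalar_explicit)
```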

However, when the length of the longer vector is not a multiple of the length of the shorter one, R does not stay silent: a warning is given.

a <- 1:10
b <- 1:4
a+b

When possible, use vectorized operations instead of loops because they are much faster!

Hmmm… Which one is faster? Let’s make a simple test. R has a package called ‘microbenchmark’ that measures the running time of R code. We create a matrix with 1000 rows and 1000 columns with values from a uniform distribution. Each element is multiplied by itself.

library(microbenchmark)
myvect <- runif(1000000)
microbenchmark(myvect, myvect^2, times = 1000)
#square each element in an explicit loop (the result is discarded, only the timing matters)
myfunc1 <- function(x){for(i in x){i^2}}
microbenchmark(myfunc1(myvect), myvect^2, times = 1000)

randomMatrix <- matrix(runif(1000000), ncol = 1000)
rootCol <- function(x){for(i in seq_len(ncol(x))){x[,i]^2}} #column by column
rootRow <- function(x){for(i in seq_len(nrow(x))){x[i,]^2}} #row by row
rootEach <- function(x){for(i in seq_len(nrow(x))){ #element by element
  for(j in seq_len(ncol(x))){
    x[i,j]^2
  }
}
}
#fewer repetitions keep the running time reasonable
microbenchmark(rootCol(randomMatrix), rootRow(randomMatrix), rootEach(randomMatrix), apply(randomMatrix, 2, function(x){x^2}), apply(randomMatrix, 1, function(x){x^2}), apply(randomMatrix, c(1,2), function(x){x^2}), times = 100)


library(microbenchmark)
randomMatrix <- matrix(runif(1000000), ncol = 1000)
rootCol <- function(x){for(i in seq_len(ncol(x))){res <- sqrt(x[,i])}} #column by column
rootRow <- function(x){for(i in seq_len(nrow(x))){sqrt(x[i,])}} #row by row

rootEach <- function(x){for(i in seq_len(nrow(x))){ #element by element
  for(j in seq_len(ncol(x))){
    sqrt(x[i,j])
  }
}
}
microbenchmark(rootCol(randomMatrix), rootRow(randomMatrix), rootEach(randomMatrix), apply(randomMatrix, 2, sqrt), apply(randomMatrix, 1, sqrt), apply(randomMatrix, c(1,2), sqrt), sqrt(randomMatrix), times = 100)
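
If the microbenchmark package is not installed, base R's system.time() gives a rougher but dependency-free comparison. The sketch below is an alternative to the benchmarks above, not a reproduction of them: the function square_loop, the vector v, and its size are assumptions, and a single run is far noisier than 100 repetitions.

```r
v <- runif(100000)

# Explicit loop that squares each element into a preallocated vector
square_loop <- function(x){
  out <- numeric(length(x))
  for(k in seq_along(x)){
    out[k] <- x[k]^2
  }
  out
}

t_loop <- system.time(res_loop <- square_loop(v))  # timing of the loop
t_vec  <- system.time(res_vec <- v^2)              # timing of the vectorized version

all.equal(res_loop, res_vec)  # both approaches give the same values
t_loop["elapsed"]
t_vec["elapsed"]
```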

The floating point trap :-)

0.1 == (0.3/3)
0.1*3 == 0.3
round(0.1*3, 1) == 0.3
seq(0, 1, by=.1)
unique(c(.3, .4 - .1, .5 - .2, .6 - .3, .7 - .4))
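
A common way around this trap is to compare numbers with a tolerance instead of ==. The sketch below uses all.equal(), wrapped in isTRUE() because all.equal() returns a description of the difference rather than FALSE when the values differ; the explicit tolerance 1e-9 is an arbitrary choice for the illustration.

```r
exact     <- 0.1 * 3 == 0.3                  # exact comparison of doubles
tolerant  <- isTRUE(all.equal(0.1 * 3, 0.3)) # comparison up to a small tolerance
by_hand   <- abs(0.1 * 3 - 0.3) < 1e-9       # the same idea written out by hand

exact
tolerant
by_hand
```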