Tidyr&dplyr

Dataframe Manipulation with dplyr

Manipulation of dataframes means many things to many researchers, we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:

Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.

Here we’re going to cover 5 of the most commonly used functions as well as using pipes (%>%) to combine them.

select()
filter()
group_by()
summarize()
mutate()

library("tidyverse")
library("dplyr")
df<-read.csv("//primus.img.cas.cz/data/83_BIOINFORMATICS/Vendula_BIOINFORMATICS/R course/mice.csv", check.names = F, stringsAsFactors = FALSE)

#replacement of nonsence value by NA
for (i in 1:ncol(df) ){
  df[,i]<- str_replace(df[,i],"(IMPC.*)", replacement = NA_character_)
}



#convert into numbers
for (i in c(10,15:106)) {
  
  df[,i] <- as.numeric(as.character(df[,i]))
}
#replace NA by number
df<-df%>% replace_na(list(Click=95))

data_sel<-filter(df, cohort=="c007" )
data_sel<-dplyr::select(data_sel,c(ID:group, albumin:Cl))

data_sel3<-df%>%
  filter(cohort=="c007" )%>%
  dplyr::select(c(1:4, 102:106))

Tasks

read the csv file mouse.csv
select the cohort c009
select the descriptive variable (column 1 to 4) and variable contains kHz (use the help ) save it into data_sel3

Mutate

df1<-mutate(df, weight_w9_mg=weight_w9*1000)
df1<-transmute(df, weight_w9_mg=weight_w9*1000)

Tasks

convert the measumerment of glucose from g/l to mol/l (multiply it by 18)

group_by, summarise

data_mean<-data_sel%>%
  group_by(strain, sex)%>%
  summarise(avg=mean(Fe))


data_mean<-data_sel%>%
  group_by(strain, sex)%>%
  summarise_if(is.numeric, mean)

Tasks

calculate the mean, median and standart deviation for each column contain numeric value for each strain and sex in whole dataset calculate the number of animals for each strain and sex and cohort (hint function n())

Dataframe Manipulation with tidyr

Researchers often want to manipulate their data from the ‘wide’ to the ‘long’ format, or vice-versa. The ‘long’ format is where:

each column is a variable
each row is an observation

In the ‘long’ format, you usually have 1 column for the observed variable and the other columns are ID variables.

For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R’s functions have been designed assuming you have ‘long’ format data. This tutorial will help you efficiently transform your data regardless of original format.

data_long<-data_sel3%>%
  gather(`30 kHz`, `24 kHz`, `18 kHz`, `12 kHz`,`6 kHz`, key="Freq", value = "values" )%>%
  separate(col=Freq, into=c("Frequency", "Unit"), sep = "[[:blank:]]")

data_sel2<-df[1:14,c(1:5,38:57)]
data_long2<-data_sel2%>%
  gather(-(ID:cohort), key = "Variable", value = "values")%>%
  separate(col="Variable", into=c("Var", "Time"), sep = -2)%>%
  spread(key=Var, value = values)

data_long3<-data_sel2%>%
  gather(-(ID:cohort), key = "Variable", value = "values")%>%
  separate(col="Variable", into=c("Var", "Time"), sep =-2)%>%
  mutate(Var=str_replace_all(Var,"\\s+$",replacement = "" ))%>%
  spread(key=Var, value = values)

Tasks

convert the glucose data into long format (columns 16:20), save the information about the mice (column 1:5), is should be useful to rename the columns, function colnames

JOin two tables together

DOB<-read.csv("//primus.img.cas.cz/data/83_BIOINFORMATICS/Vendula_BIOINFORMATICS/R course/animal_DOB_DOS.csv", check.names = F, stringsAsFactors = FALSE)

all<-inner_join(DOB, df, by="ID")

Tasks

test function left_join, right_join, ful_join, semi_join, anti_join