Manipulation of dataframes means many things to many researchers, we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:
Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.
Here we’re going to cover 5 of the most commonly used functions as well as using pipes (%>%) to combine them.
library("tidyverse")
library("dplyr")
df<-read.csv("//primus.img.cas.cz/data/83_BIOINFORMATICS/Vendula_BIOINFORMATICS/R course/mice.csv", check.names = F, stringsAsFactors = FALSE)
#replacement of nonsence value by NA
for (i in 1:ncol(df) ){
df[,i]<- str_replace(df[,i],"(IMPC.*)", replacement = NA_character_)
}
#convert into numbers
for (i in c(10,15:106)) {
df[,i] <- as.numeric(as.character(df[,i]))
}
#replace NA by number
df<-df%>% replace_na(list(Click=95))
data_sel<-filter(df, cohort=="c007" )
data_sel<-dplyr::select(data_sel,c(ID:group, albumin:Cl))
data_sel3<-df%>%
filter(cohort=="c007" )%>%
dplyr::select(c(1:4, 102:106))
df1<-mutate(df, weight_w9_mg=weight_w9*1000)
df1<-transmute(df, weight_w9_mg=weight_w9*1000)
data_mean<-data_sel%>%
group_by(strain, sex)%>%
summarise(avg=mean(Fe))
data_mean<-data_sel%>%
group_by(strain, sex)%>%
summarise_if(is.numeric, mean)
Researchers often want to manipulate their data from the ‘wide’ to the ‘long’ format, or vice-versa. The ‘long’ format is where:
In the ‘long’ format, you usually have 1 column for the observed variable and the other columns are ID variables.
For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of R’s functions have been designed assuming you have ‘long’ format data. This tutorial will help you efficiently transform your data regardless of original format.
data_long<-data_sel3%>%
gather(`30 kHz`, `24 kHz`, `18 kHz`, `12 kHz`,`6 kHz`, key="Freq", value = "values" )%>%
separate(col=Freq, into=c("Frequency", "Unit"), sep = "[[:blank:]]")
data_sel2<-df[1:14,c(1:5,38:57)]
data_long2<-data_sel2%>%
gather(-(ID:cohort), key = "Variable", value = "values")%>%
separate(col="Variable", into=c("Var", "Time"), sep = -2)%>%
spread(key=Var, value = values)
data_long3<-data_sel2%>%
gather(-(ID:cohort), key = "Variable", value = "values")%>%
separate(col="Variable", into=c("Var", "Time"), sep =-2)%>%
mutate(Var=str_replace_all(Var,"\\s+$",replacement = "" ))%>%
spread(key=Var, value = values)
DOB<-read.csv("//primus.img.cas.cz/data/83_BIOINFORMATICS/Vendula_BIOINFORMATICS/R course/animal_DOB_DOS.csv", check.names = F, stringsAsFactors = FALSE)
all<-inner_join(DOB, df, by="ID")