R for physicists

a short intro

outline

  • R: what is it for ?
  • R: base syntax
  • R: the packages you need to know/learn to be efficient
  • Demos on physicists data
  • The pitfalls to avoid
  • how to learn and where to look

R is created in 1995

An open source implementation of S (Bell labs statistical software)

Designed to make data analysis and statistics easy

interpreted language, high level

Primarily used in academics, moving to enterprise lately

Pros

  • open source, so free
  • Wonderful IDE: Rstudio
    • variable exploration
    • help, plot tools
    • github integrated

Rstudio.PNG

  • huge community! so lot of helps

stack_R_community.PNG

  • Lots of tools (libraries)

number-of-submitted-packages-to-CRAN.png

  • easy to paralelize, can be used on baobab

baobab.PNG

Cons

  • Slow (not for Whab or Nico, but some library are coded in C)
  • lots of ways to do the same things, so a bit messy in the beginning
  • hard to find the appropriates tools in the beginning (but i ll give you the good ones)
  • steep learning curve, but accesible for experienced programmers or physicists

Basic syntax

When you install R, it comes with base package and 6 others.

R is interpreted, and guess the type you declare (like matlab and python)

In [6]:
a <- 2
b = 2
a
b
2
2

data type

In [247]:
a <- "plouf"
class(a)
'character'
In [248]:
b <- 2
class(b)
'numeric'
In [249]:
c <- 2L
class(c)
'integer'
In [250]:
d <- as.factor("plouf")
class(d)
'factor'

WTF factor ?

In [251]:
a
d
as.numeric(a)
as.numeric(d)
'plouf'
plouf
Warning message in eval(expr, envir, enclos):
"NAs introduits lors de la conversion automatique"
<NA>
1

used in statistical model (generalised linear models, etc), but dangerous

data structure type

vector

In [41]:
# vector
a <- c(2,5,7,8)
a
class(a)
  1. 2
  2. 5
  3. 7
  4. 8
'numeric'
In [42]:
a[1]
2
In [43]:
a[1:2]
  1. 2
  2. 5
In [48]:
a[1,4]
Error in a[1, 4]: nombre de dimensions incorrect
Traceback:
In [49]:
a[c(1,4)]
  1. 2
  2. 8

everything same class, so coerce things

In [30]:
a <- c(2,5,7,"plouf")
class(a)
a[1]
'character'
'2'

matrix

In [36]:
b <- matrix(c(3,4,5,8),nrow = 2, ncol = 2)
b
35
48
In [37]:
b[1,]
  1. 3
  2. 5
In [38]:
b[,2]
  1. 5
  2. 8
In [39]:
b[1]
3

all matrix operation exists, equation solving etc.

list

In [50]:
c <- list(trucs = c(1,5,7), plouf = "des choses", matrice = matrix(sample(1:100,25),5,5) )
c
$trucs
  1. 1
  2. 5
  3. 7
$plouf
'des choses'
$matrice
63 31 2878 29
44 57 4638 34
76 84 9551 83
17 15 3625 66
12 81 10069 70
In [56]:
c$trucs
c[[1]]
c[["trucs"]]
c["trucs"]
  1. 1
  2. 5
  3. 7
  1. 1
  2. 5
  3. 7
  1. 1
  2. 5
  3. 7
$trucs =
  1. 1
  2. 5
  3. 7

data frame: une liste avec contrainte de dimension !

C'est le coeur du traitement de données dans R

In [287]:
d <- data.frame(experiment = rep(LETTERS[1:5], each = 2), data1 = sample(c(0,1),10,replace = T), data2 = sample(1:100,10))
d
experimentdata1data2
A 1 98
A 1 22
B 0 87
B 1 85
C 1 52
C 1 72
D 0 84
D 1 38
E 0 75
E 1 16
In [288]:
#idem list:
d$experiment
d[,"experiment"]
  1. A
  2. A
  3. B
  4. B
  5. C
  6. C
  7. D
  8. D
  9. E
  10. E
  1. A
  2. A
  3. B
  4. B
  5. C
  6. C
  7. D
  8. D
  9. E
  10. E
In [289]:
d$experiment[1]
d[1,"experiment"]
d[1,1]
A
A
A

Esay to subset using boolean.

In [290]:
d$experiment
d$experiment == "A"
  1. A
  2. A
  3. B
  4. B
  5. C
  6. C
  7. D
  8. D
  9. E
  10. E
  1. TRUE
  2. TRUE
  3. FALSE
  4. FALSE
  5. FALSE
  6. FALSE
  7. FALSE
  8. FALSE
  9. FALSE
  10. FALSE
In [293]:
d[d$experiment == "A",]
experimentdata1data2
A 1 98
A 1 22

loops and conditional

In [294]:
d[d$data2 > 50 & d$data1 == 0,]
experimentdata1data2
3B 0 87
7D 0 84
9E 0 75
In [295]:
d[d$data2 > 50 | d$data1 == 0,]
experimentdata1data2
1A 1 98
3B 0 87
4B 1 85
5C 1 52
6C 1 72
7D 0 84
9E 0 75
In [84]:
for(i in 1:4){
    if(i>2){
        print(paste0(i," est sup à 2" ))}else{
        print(paste0(i," est inf à 2" ))}
}
[1] "1 est inf à 2"
[1] "2 est inf à 2"
[1] "3 est sup à 2"
[1] "4 est sup à 2"

Specific of R, really usefull: lapply and sapply (and others)

In [116]:
plouf <- lapply(1:4,function(x){x^2})
plouf
class(plouf)

plouf2 <- sapply(1:4,function(x){x^2})
plouf2
class(plouf2)
  1. 1
  2. 4
  3. 9
  4. 16
'list'
  1. 1
  2. 4
  3. 9
  4. 16
'numeric'
In [101]:
sapply(c("data1","data2"),function(col){
    summary(d[,col])
})
data1data2
Min.0.0 5.00
1st Qu.0.0 21.25
Median0.0 47.50
Mean0.4 46.60
3rd Qu.1.0 59.25
Max.1.0 95.00

Libraries

In [117]:
library(ggplot2) # pour les graphiques
In [119]:
head(mpg)
manufacturermodeldisplyearcyltransdrvctyhwyflclass
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5)f 21 29 p compact
audi a4 2.0 2008 4 manual(m6)f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5)f 18 26 p compact
In [296]:
ggplot(data = mpg) + 
  geom_point(aes(x = displ, y = hwy, color = class))
In [245]:
ggplot(data = mpg) + 
  geom_point(aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)
In [135]:
library(data.table) # pour le data management
dt <- setDT(copy(mpg))
class(dt)
  1. 'data.table'
  2. 'data.frame'
In [130]:
plus <- data.table(manufacturer = c('audi','chevrolet','dodge','ford','honda','hyundai','jeep','land rover','lincoln','mercury','nissan','pontiac','subaru','toyota','volkswagen'),
                   country = c("DE","US","US","US","JP","JP","US","UK","UK","US","JP","US","DE","JP","DE"))
head(plus)
manufacturercountry
audi DE
chevroletUS
dodge US
ford US
honda JP
hyundai JP
In [132]:
merged <- merge(dt,plus,on = "manufacturer")
head(merged)
manufacturermodeldisplyearcyltransdrvctyhwyflclasscountry
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact DE
audi a4 1.8 1999 4 manual(m5)f 21 29 p compact DE
audi a4 2.0 2008 4 manual(m6)f 20 31 p compact DE
audi a4 2.0 2008 4 auto(av) f 21 30 p compact DE
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact DE
audi a4 2.8 1999 6 manual(m5)f 18 26 p compact DE
In [134]:
merged[country == "DE",.(mean_comsum = mean(hwy)),by = .(manufacturer,model)]
manufacturermodelmean_comsum
audi a4 28.28571
audi a4 quattro 25.75000
audi a6 quattro 24.00000
subaru forester awd25.00000
subaru impreza awd 26.00000
volkswagen gti 27.40000
volkswagen jetta 29.11111
volkswagen new beetle 32.83333
volkswagen passat 27.57143

You would have made a double loop with condition. With 10 000 000 rows, 100000 groups you can't. Here it is efficient and concise. Alternative :

In [ ]:
library(dplyr) # more or less equal perf
In [136]:
library(stringr) # to deal efficiently with strings
In [144]:
dose <- c("20 g/kg", "30g/kg lkd","15000 mg/kg")
dose
  1. '20 g/kg'
  2. '30g/kg lkd'
  3. '15000 mg/kg'
In [145]:
dose_corr <- data.frame( unit1 = str_extract(dose,"[a-z]+(?=/)"), 
                  unit2 = str_extract(dose,"(?<=/)[a-z]+"),
                  value = str_extract(dose,"[0-9]+"))
dose_corr
unit1unit2value
g kg 20
g kg 30
mg kg 15000
In [185]:
library(lmerTest) 
library(lme4)

regressions linéaires, non linéaires, multiniveau, lineaire généralisé etc

In [269]:
a1_mean = 3
b1_mean = 0
curves <- data.table( curve = rep(letters[1:5],each = 10),
                     time = rep(1:10,5),
                     a1 = rep(rnorm(5,a1_mean,1),each = 10), 
                     b1 = rep(rnorm(5,b1_mean,1),each = 10))

curves[, y_data := a1*time + b1 + rnorm(.N,0,2), by = curve]
curves[, y_real := a1*time + b1 , by = curve]

Create 5 curves with slope normaly distrib around a1_mean and intercept around b1_mean, with a bit of noise.

In [298]:
p <- ggplot(data = curves )+
geom_point(aes(time,y_data,color = curve),size = 3)+
geom_line(aes(time,y_real,color= curve),linetype = "dashed")+
theme_light()
p
In [271]:
fit <- lmer(y_data ~ time + (1 + time|curve) , data = curves)
summary(fit)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to
  degrees of freedom [lmerMod]
Formula: y_data ~ time + (1 + time | curve)
   Data: curves

REML criterion at convergence: 242

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.96349 -0.76208 -0.08607  0.71772  1.87458 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 curve    (Intercept) 0.1645   0.4055        
          time        1.1770   1.0849   -1.00
 Residual             5.1073   2.2599        
Number of obs: 50, groups:  curve, 5

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)   
(Intercept)   0.2499     0.7138 19.4780   0.350  0.73002   
time          3.2979     0.4978  4.0060   6.625  0.00268 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
     (Intr)
time -0.439
In [299]:
curves[,y_fit := predict(fit, newdata = curves)]

p + geom_line(aes(time,y_fit,color= curve),linetype = "solid")

Some others

  • librudate for date handling
  • rshiny to generate java/html interactives pages with buttons and graphs (amazing)
  • reporters to generate table directly in word
  • parallel to do parallel calculation

now real physicist examples:

  • the join experiment I did with PSI in 2014.
  • one hour in R , weeks during my post doc and matlab.

The experiment: PSI_setup.PNG

three main devices :

  • arduino, say if laser is on or off
  • grimm for particles sizes
  • AMS, for composition (and size)
In [3]:
library(ggplot2)
library(data.table)
library(stringr)
library(lubridate)
setwd("C:/Users/dmongin/Documents/presentation_R_pourlesnuls/PSI")
list.files(pattern =  ".txt$")
  1. 'arduino_tot.txt'
  2. 'grimm_tot.txt'
  3. 'grimm_tot_M.txt'
  4. 'testo_tot.txt'
  5. 'timeserie_ams4.txt'
In [4]:
arduino <- fread("arduino_tot.txt")
head(arduino)
V1V2V3V4V5V6V7V8V9V10
2014 6 17 16 59 28 611680.00 45.6 0
2014 6 17 16 59 28 611680.00 45.6 0
2014 6 17 16 59 28 611680.00 45.6 0
2014 6 17 16 59 28 611680.00 45.6 0
2014 6 17 16 59 28 611680.00 45.6 0
2014 6 17 16 59 28 611680.35 45.6 0

Here a stupid person didn't put any name (and its me), and didn't use any formating for the date. Here is the date formating

In [7]:
arduino[,Date := ymd_hms(paste0(V1,".",V2,".",V3," ",V4,":",V5,":",V6))] # make a string with all year month day and HMS
head(arduino)
V1V2V3V4V5V6V7V8V9V10Date
2014 6 17 16 59 28 61168 0.00 45.6 0 2014-06-17 16:59:28
2014 6 17 16 59 28 61168 0.00 45.6 0 2014-06-17 16:59:28
2014 6 17 16 59 28 61168 0.00 45.6 0 2014-06-17 16:59:28
2014 6 17 16 59 28 61168 0.00 45.6 0 2014-06-17 16:59:28
2014 6 17 16 59 28 61168 0.00 45.6 0 2014-06-17 16:59:28
2014 6 17 16 59 28 61168 0.35 45.6 0 2014-06-17 16:59:28
In [5]:
arduino[,laser := V10]
arduino[,ozone := V9]
table(arduino[,.N,by = Date]$N) # table of line number per date (i.e. per second)
    1     2     3     8    11    29    45 
86003   165     1     1     2     1     1 

Multiple lines with the same exact date !! Only one measure per second and variable of interest:

In [8]:
arduino[,N := .N,by = Date] # N is the number of line per Date
arduino_clean <- arduino[N == 1,.(Date,laser,ozone)] # subselection
head(arduino_clean)
Datelaserozone
2014-06-17 16:59:290 49.2
2014-06-17 16:59:300 49.2
2014-06-17 16:59:310 49.2
2014-06-17 16:59:320 49.2
2014-06-17 16:59:330 49.2
2014-06-17 16:59:340 49.2
In [12]:
ggplot(data = arduino_clean,aes(Date,ozone))+ # define data, x and y
  geom_line()+ # curve type
  geom_ribbon(aes(Date,ymin = 0,ymax = laser*500),fill = "red", alpha = 0.1)+ # make area trasparent
  theme_light()+ # choose theme
  facet_wrap(~date(Date),scales = "free") # one plot per day
In [13]:
testo <- fread("testo_tot.txt") # read
testo[,Date := ymd_hms(paste0(V1,".",V2,".",V3," ",V4,":",V5,":",V6))]# define the date
testo[,temp := V9] # define temp column
testo[,humidity := V8]
testo_clean <- testo[,.(Date,temp,humidity)] # subselect the columns of interest
head(testo_clean)
Datetemphumidity
2014-06-17 16:59:3524.802 40.739
2014-06-17 16:59:4024.840 38.029
2014-06-17 16:59:4524.819 38.965
2014-06-17 16:59:5024.271 40.590
2014-06-17 16:59:5524.440 40.947
2014-06-17 17:00:0024.594 41.591
In [14]:
plouf <- merge(testo,arduino_clean,all.x = T,on = "Date") # merge arduino and testo
ggplot(data = plouf)+ # plot
  geom_line(aes(Date,temp,color = "temperature"))+ 
  geom_line(aes(Date,humidity,color = "humidity"))+
  geom_ribbon(aes(Date,ymin = 0,ymax = laser*100,fill = "laser"), alpha = 0.2)+
  theme_light()+  facet_wrap(~date(Date),scales = "free")
In [15]:
summary(lm(ozone ~ laser + humidity,data = plouf))
Call:
lm(formula = ozone ~ laser + humidity, data = plouf)

Residuals:
    Min      1Q  Median      3Q     Max 
-103.64  -23.30   -4.18   13.89  456.08 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.18843    1.09169   26.74   <2e-16 ***
laser       67.01037    0.52723  127.10   <2e-16 ***
humidity     0.15619    0.01437   10.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 56.69 on 53319 degrees of freedom
  (700 observations deleted due to missingness)
Multiple R-squared:  0.2326,	Adjusted R-squared:  0.2326 
F-statistic:  8080 on 2 and 53319 DF,  p-value: < 2.2e-16
In [16]:
summary(lm(humidity ~ laser + temp,data = plouf))
Call:
lm(formula = humidity ~ laser + temp, data = plouf)

Residuals:
     Min       1Q   Median       3Q      Max 
-24.3458  -2.7637   0.3438   3.4444  27.4333 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 121.593124   0.085689 1419.00   <2e-16 ***
laser        -1.377861   0.056112  -24.56   <2e-16 ***
temp         -2.575838   0.004224 -609.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.051 on 53319 degrees of freedom
  (700 observations deleted due to missingness)
Multiple R-squared:  0.876,	Adjusted R-squared:  0.876 
F-statistic: 1.883e+05 on 2 and 53319 DF,  p-value: < 2.2e-16
In [47]:
grim <- fread("grimm_tot.txt") # read file
setnames(grim,"[d&t31.12.2035 11:50:55]","date") # change weird name
grim[,Date := dmy_hms(date)] # make date
grim[,N := .N,by = Date] # count the number of line per date
grim <- grim[N == 1] # select just one line per s
head(grim)
dateCounts-1320 [1/l]Diameter-1320 [nm]0.265 µm0.290 µm0.325 µm0.375 µm0.425 µm0.475 µm0.540 µm...11.250 µm13.750 µm16.250 µm18.750 µm22.500 µm27.500 µm31.000 µm34.000 µmDateN
17.06.2014 17:01:10100 0 2075 700 325 275 50 0 0 ... 0 0 0 0 0 0 0 0 2014-06-17 17:01:101
17.06.2014 17:01:30100 0 2325 1175 800 325 100 50 25 ... 0 0 0 0 0 0 0 0 2014-06-17 17:01:301
17.06.2014 17:01:40100 0 2650 1000 675 350 0 0 0 ... 0 0 0 0 0 0 0 0 2014-06-17 17:01:401
17.06.2014 17:01:50100 0 2025 1075 625 400 150 25 0 ... 0 0 0 0 0 0 0 0 2014-06-17 17:01:501
17.06.2014 17:02:10100 0 2000 1350 675 300 100 0 25 ... 0 0 0 0 0 0 0 0 2014-06-17 17:02:101
17.06.2014 17:02:20100 0 2375 1425 425 375 250 0 25 ... 0 0 0 0 0 0 0 0 2014-06-17 17:02:201

I switch to long format instead of broad:

In [48]:
grim_clean <- melt(grim,measure.vars = patterns("m$"),variable.name = "size",value.name = "concentration")
head(grim_clean)
dateCounts-1320 [1/l]Diameter-1320 [nm]DateNsizeconcentration
17.06.2014 17:01:10100 0 2014-06-17 17:01:101 0.265 µm 2075
17.06.2014 17:01:30100 0 2014-06-17 17:01:301 0.265 µm 2325
17.06.2014 17:01:40100 0 2014-06-17 17:01:401 0.265 µm 2650
17.06.2014 17:01:50100 0 2014-06-17 17:01:501 0.265 µm 2025
17.06.2014 17:02:10100 0 2014-06-17 17:02:101 0.265 µm 2000
17.06.2014 17:02:20100 0 2014-06-17 17:02:201 0.265 µm 2375

I change the size to numbers

In [49]:
grim_clean[,size := as.numeric(str_extract(size,"[0-9]+\\.[0-9]+"))] # extract just numbers of size
grim_clean <- grim_clean[,.(Date,size,concentration)]
head(grim_clean)
Datesizeconcentration
2014-06-17 17:01:100.265 2075
2014-06-17 17:01:300.265 2325
2014-06-17 17:01:400.265 2650
2014-06-17 17:01:500.265 2025
2014-06-17 17:02:100.265 2000
2014-06-17 17:02:200.265 2375

Example : plot the mean concentration per hour, for each day of the experiment:

In [50]:
moy_grim <- grim_clean[,.(concentration_m = mean(concentration)),by = .(size,hour(Date),date(Date))]
head(moy_grim)
sizehourdateconcentration_m
0.265 17 2014-06-1743433.83
0.265 18 2014-06-1728027.25
0.265 19 2014-06-1720778.73
0.265 17 2014-06-1869960.71
0.265 18 2014-06-1868438.87
0.265 19 2014-06-1874782.11
In [56]:
ggplot(data = moy_grim,aes(as.numeric(size),concentration_m))+
  geom_line(aes(color = as.factor(hour),group = hour))+
  facet_wrap(~date,scales = "free")+ theme_light() +
  scale_y_log10() +
  labs(x = "size um", y = "concentration", color = "day hour")
Warning message:
"Transformation introduced infinite values in continuous y-axis"
In [24]:
AMS <- fread("timeserie_ams4.txt")
AMS[,Date := dmy_hms(time_stamp)]
head(AMS)
time_stampTotOrgSO4NO3NH4ChlOrg43Org44Date
18:06:2014 22:17:496.81623 3.75423 1.33178 0.798375 0.931821 1.45099e-05 0.310066 0.651729 2014-06-18 22:17:49
18:06:2014 22:18:106.36363 3.51243 1.27318 0.726966 0.845720 5.33109e-03 0.302963 0.617536 2014-06-18 22:18:10
18:06:2014 22:18:316.18041 3.29982 1.34318 0.761861 0.769005 6.54514e-03 0.277555 0.589916 2014-06-18 22:18:31
18:06:2014 22:18:536.38510 3.49896 1.34846 0.704156 0.824175 9.34887e-03 0.283127 0.606004 2014-06-18 22:18:53
18:06:2014 22:19:146.45665 3.50046 1.39680 0.696631 0.857673 5.08134e-03 0.300025 0.637838 2014-06-18 22:19:14
18:06:2014 22:19:355.71712 3.15600 1.19864 0.570948 0.779798 1.17294e-02 0.281378 0.546123 2014-06-18 22:19:35

define three experiment (dont remember):

In [25]:
# define 3 experiment with the dates
exp1 <- ymd(c("2014-06-17","2014-06-18","2014-06-19"))
exp2 <- ymd(c("2014-06-23"))
exp3 <- ymd(c("2014-06-25","2014-06-26"))

# for the date in each group, define new variable exp
grim_clean[date(Date) %in% exp1,exp := "exp1"]
grim_clean[date(Date) %in% exp2,exp := "exp2"]
grim_clean[date(Date) %in% exp3,exp := "exp3"]

AMS[date(Date) %in% exp1,exp := "exp1"]
AMS[date(Date) %in% exp2,exp := "exp2"]
AMS[date(Date) %in% exp3,exp := "exp3"]

I merge arduino results and AMS to have the laser

In [26]:
plouf <- merge(arduino_clean,AMS,all.y = T,by = "Date")
plouf <- plouf[!is.na(laser)]
head(plouf)
Datelaserozonetime_stampTotOrgSO4NO3NH4ChlOrg43Org44exp
2014-06-18 22:17:49 0 35 18:06:2014 22:17:496.81623 3.75423 1.33178 0.798375 0.931821 1.45099e-05 0.310066 0.651729 exp1
2014-06-18 22:18:10 0 34 18:06:2014 22:18:106.36363 3.51243 1.27318 0.726966 0.845720 5.33109e-03 0.302963 0.617536 exp1
2014-06-18 22:18:31 0 33 18:06:2014 22:18:316.18041 3.29982 1.34318 0.761861 0.769005 6.54514e-03 0.277555 0.589916 exp1
2014-06-18 22:18:53 0 33 18:06:2014 22:18:536.38510 3.49896 1.34846 0.704156 0.824175 9.34887e-03 0.283127 0.606004 exp1
2014-06-18 22:19:14 0 33 18:06:2014 22:19:146.45665 3.50046 1.39680 0.696631 0.857673 5.08134e-03 0.300025 0.637838 exp1
2014-06-18 22:19:35 0 33 18:06:2014 22:19:355.71712 3.15600 1.19864 0.570948 0.779798 1.17294e-02 0.281378 0.546123 exp1

change to long format

In [27]:
AMS_long <- melt(plouf,measure.vars = c("Tot","Org","SO4","NO3","NH4","Chl","Org43","Org44"),value.name = "concentration",variable.name = "type")
head(AMS_long)
Datelaserozonetime_stampexptypeconcentration
2014-06-18 22:17:49 0 35 18:06:2014 22:17:49exp1 Tot 6.81623
2014-06-18 22:18:10 0 34 18:06:2014 22:18:10exp1 Tot 6.36363
2014-06-18 22:18:31 0 33 18:06:2014 22:18:31exp1 Tot 6.18041
2014-06-18 22:18:53 0 33 18:06:2014 22:18:53exp1 Tot 6.38510
2014-06-18 22:19:14 0 33 18:06:2014 22:19:14exp1 Tot 6.45665
2014-06-18 22:19:35 0 33 18:06:2014 22:19:35exp1 Tot 5.71712
In [37]:
# calculate the mean per aerosol type, laser state and experiment
result_AMS <- AMS_long[type != "Tot",.(concentration_m = mean(concentration)),by = .(laser,type,exp)]
# do the plot
ggplot(data = result_AMS,aes( as.factor(laser) ,concentration_m,fill = type))+
  geom_col()+
  facet_wrap(~ exp, scales = "free")+
  labs( x = "laser", y = "concentration ppm", color = "type", fill = "type" ) + theme_light()

To visualize the time serie in time: not complicated either

In [285]:
ggplot(data = AMS_long[,laser_n := laser*max(concentration),by = exp],aes(Date,concentration))+
  geom_ribbon(aes(Date,ymin = 0,ymax = laser_n),fill = "red", alpha = 0.1)+
  geom_line(aes(colour = type))+ facet_wrap(~exp,scales = "free")+ theme_light()

Pitfalls

don't re-descover, use the libraries

use long format, grouping operation, and ggplot to make your life easy.

  • Long format : each column a variable.
  • Example grimm: All columns are particles concentration -> two columns, one concentration, the other size

spending time on data management makes life easier after. So:

  • check if your unique values are unique
  • always look out for missing values
  • define the column type when importing data: make numeric variable numeric, avoid factors

Ressources

  • http://r4ds.had.co.nz/ : R for Data science. Author is head of Rstudio, inventor of ggplot, dplyr, libridate, Rshiny etc. Ressource free online.
  • stackoverflow ! Carefull:
    • Check if the question is already asked
    • Give a reproducible small example of your problem, so people can give solution(s)
    • Be specific, show that you tried something
    • If so, 20 minutes to have an answer max
  • Rlunchs at unige ! http://use-r-carlvogt.github.io/prochains-lunchs/

Want more (I have more)?

Questions ?

Want to see my Schlieren experiment ?

Proof of warming in geneva in 20 lines of R

In [313]:
setwd("C:/Users/dmongin/Documents/presentation_R_pourlesnuls")
Geneva <- read.csv("test.txt",sep = ";",skip = 27)
head(Geneva)
Geneva <- setDT(Geneva)
YearMonthTemperaturePrecipitation
1864 1 -4.2 13.6
1864 2 -0.7 17.2
1864 3 5.3 32.7
1864 4 8.3 35.2
1864 5 13.5 68.9
1864 6 15.6 115.2
In [314]:
Geneva[,dec := Year %/% 10]
Geneva[,Temperature_dec := mean(Temperature),by = .(dec,Month)]

ggplot(data = Geneva,aes(Month,Temperature_dec))+
  geom_line(aes(color = as.factor(dec*10)))+
  theme_light()+
  labs(  color = "decennie", title = "mean per decennie",  y = "Temperature",   x = "Months"  )
options(repr.plot.width=8, repr.plot.height=4.5)
In [315]:
library(zoo)

nyear <- 5
Geneva[,Temperature_rollmean := c(rollmean(Temperature,nyear,align = "center",fill = NA)), by = Month]
fit <- lmer(Temperature ~ I(Year/100) + (1+I(Year/100)|Month),data = Geneva)
Geneva[,Temperature_predict := predict(fit,newdata = Geneva)]
summary(fit)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to
  degrees of freedom [lmerMod]
Formula: Temperature ~ I(Year/100) + (1 + I(Year/100) | Month)
   Data: Geneva

REML criterion at convergence: 7166

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.5947 -0.6501  0.0148  0.6508  3.6801 

Random effects:
 Groups   Name        Variance Std.Dev. Corr 
 Month    (Intercept) 89.59250 9.465         
          I(Year/100)  0.02251 0.150    -0.94
 Residual              2.65071 1.628         
Number of obs: 1855, groups:  Month, 12

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept) -15.96749    3.18911  11.02700  -5.007 0.000395 ***
I(Year/100)   1.31851    0.09514  11.01000  13.858 2.59e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
I(Year/100) -0.824
In [316]:
ggplot(data = Geneva)+
  geom_point(aes(Year,Temperature_rollmean,color = as.factor(Month)))+
  geom_line(aes(Year,Temperature_predict,color = as.factor(Month)))+
  theme_light()+
  labs(   color = "Month",  title = "Evolution",   y = "Temperature",    x = "Year"  )
Warning message:
"Removed 48 rows containing missing values (geom_point)."

calculation of gradient

In [319]:
taille <- 100
field <- data.table( x = rep(seq(-taille,taille,1),2*taille+1), y = rep(seq(-taille,taille,1),each = 2*taille+1))
field[,temp := exp(-(x^2 + y^2)/10000)]
p <- ggplot(data = field,aes(x,y)) +
  geom_tile(aes(fill = temp))+
  scale_fill_gradient(low = "green", high = "red")
p
In [321]:
gradx <- field[,.(gradx = c(NA,temp[3:.N],NA)-c(NA,temp[1:(.N-2)],NA), x = x),by = y] #gradient selon x
grady <- field[,.(grady = c(NA,temp[3:.N],NA)-c(NA,temp[1:(.N-2)],NA), y = y),by = x] # gradient selon y
grad <- merge(gradx,grady, by = c("x","y")) # merge des deux

# je prend 10 fleche par lignes, sinon y en a trop. 
gradplot <- grad[x %in% seq(-taille,taille,round(taille/10)) & y %in% seq(-taille,taille,round(taille/10))] 
# je renormalise le gradient pour que la longueur des fleches soit appreciable sur le plot
gradplot[,c("gradx","grady"):= .(gradx/max(gradx,na.rm = T)*10,grady/max(grady,na.rm = T)*10)]
In [322]:
p + geom_segment(data = gradplot,aes(x,y,xend =x+ gradx,yend =  y+grady), arrow = arrow(length = unit(0.2,"cm")))
Warning message:
"Removed 80 rows containing missing values (geom_segment)."