Regression with R: tools and applications

Denis Mongin

11/11/2019

img[alt=drawing] { width: 200px; }

Regression in R

Regression is a wide subject.

Objectif : present the main tools available

Give examples and result explanation.

The main parts

Linear regression
- fixed effect
- generalized model
- mixed effect
Nonlinear regression
Parametric and smoothing regressions
Structural equations

The packages

rm(list = ls())
# install.packages(c("lme4","lmerTest","nls2","ggplot2","data.table","dplyr","ISLR"))
library(lme4)
library(lmerTest)
library(nls2)
# for data handling
library(ggplot2)
library(data.table)
library(dplyr)
# example data
library(ISLR)
#SEM
library(lavaan)

summary(Auto)

##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##                                                                 
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##                  name    
##  amc matador       :  5  
##  ford pinto        :  5  
##  toyota corolla    :  5  
##  amc gremlin       :  4  
##  amc hornet        :  4  
##  chevrolet chevette:  4  
##  (Other)           :365

Lets see how simple linear regression works. We will study fuel consumption (mpg) function of the other variables.

# set AUto as data.table
Auto <- setDT(copy(Auto))
# create the new variable
Auto[,consumption := 1/mpg*3.785/1.6093*100]
Auto[,weight := weight*0.4536]
Auto[,displacement := displacement*16.387]

I create the variable that makes sense for european:

L of gasoline per 100 km.

US	SI
1 Mile	1.6093 km
1 Galon	3.785 L
1 cubic inched	16.387 cm^3
1 lbs	0.4536 kg

melt(Auto[,c(-1,-8,-9)],id.vars = "consumption") %>%
  ggplot()+  geom_point(aes(value,consumption,color = variable))+
  facet_wrap(~variable,scales = "free")+  theme_bw()+ theme(legend.position = "none")

Univariate regression

fit <- lm(consumption ~ weight ,data = Auto)
fit

## 
## Call:
## lm(formula = consumption ~ weight, data = Auto)
## 
## Coefficients:
## (Intercept)       weight  
##    -0.89434      0.00899

linear trend

summary

summary(fit)

## 
## Call:
## lm(formula = consumption ~ weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4535 -1.2004 -0.0534  1.0254  7.7311 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8943423  0.3362310   -2.66  0.00814 ** 
## weight       0.0089898  0.0002394   37.55  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.824 on 390 degrees of freedom
## Multiple R-squared:  0.7833, Adjusted R-squared:  0.7828 
## F-statistic:  1410 on 1 and 390 DF,  p-value: < 2.2e-16

Std Error is the standard deviation of the gaussian distribution of the estimators.
p value is the probability to obtain the same effect or stronger under the null hypothesis. Here:
- slope = 0 for the slope,
- intercept = 0, for theintercept.
F statistics is the Fisher statistic of the whole model (null hypothesis : all coeffs are 0), gives the overall p value.
\[ R^2 = 1 - \frac{\sum(y_i - f_i)^2}{\sum(y_i - \bar{y})^2}\] is the percentage of explained variance by the model.

par(mfrow=c(2,2))
plot(fit)

regression plots

Left : Homoscedasticity: residuals depends on value ?
Right top: normal ?
Right bottom : is there any outlier (i.e. some points contribute a lots to coeffs) ?

qnorm

predict

predict(fit) %>%
  head()

##        1        2        3        4        5        6 
## 13.39423 14.16493 13.11694 13.10471 13.16995 16.80734

Predict by default predict for all non missing predictors in the initial data frame.

Can choose new preditors values:

predict(fit,newdata = data.frame(weight = c(2,1000,300000)))

##            1            2            3 
##   -0.8763627    8.0954913 2696.0557534

predict

Auto[,fit1 := predict(fit)]%>%
ggplot()+
  geom_point(aes(weight,consumption),size = 2)+
  geom_line(aes(weight,fit1),color = "red",size = 1)+theme_light()

multivariate analysis

fit2 <- lm(consumption ~ weight + horsepower + year ,data = Auto)
summary(fit2)

## 
## Call:
## lm(formula = consumption ~ weight + horsepower + year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6442 -0.7781 -0.0415  0.6282  6.2600 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.4539933  1.6526441  14.192  < 2e-16 ***
## weight       0.0064535  0.0003563  18.113  < 2e-16 ***
## horsepower   0.0190473  0.0037303   5.106 5.17e-07 ***
## year        -0.3015613  0.0205980 -14.640  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.356 on 388 degrees of freedom
## Multiple R-squared:  0.8809, Adjusted R-squared:   0.88 
## F-statistic: 956.8 on 3 and 388 DF,  p-value: < 2.2e-16

model comparison

anova(lm(consumption ~ weight  ,data = Auto),
           lm(consumption ~ weight + horsepower ,data = Auto),
           lm(consumption ~ weight + horsepower + year ,data = Auto),
           lm(consumption ~ weight + horsepower + year + displacement ,data = Auto))

## Analysis of Variance Table
## 
## Model 1: consumption ~ weight
## Model 2: consumption ~ weight + horsepower
## Model 3: consumption ~ weight + horsepower + year
## Model 4: consumption ~ weight + horsepower + year + displacement
##   Res.Df     RSS Df Sum of Sq        F Pr(>F)    
## 1    390 1297.55                                 
## 2    389 1107.01  1    190.53 103.4037 <2e-16 ***
## 3    388  713.09  1    393.92 213.7875 <2e-16 ***
## 4    387  713.09  1      0.00   0.0013 0.9712    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

interaction term

lm(consumption ~ weight + year + year:weight ,data = Auto) %>%
  summary()

## 
## Call:
## lm(formula = consumption ~ weight + year + year:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4581 -0.7477 -0.0982  0.6232  6.5769 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.550e+00  5.565e+00   0.818    0.414    
## weight       2.470e-02  4.182e-03   5.906 7.68e-09 ***
## year        -5.003e-02  7.385e-02  -0.677    0.499    
## weight:year -2.237e-04  5.598e-05  -3.997 7.69e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.373 on 388 degrees of freedom
## Multiple R-squared:  0.8779, Adjusted R-squared:  0.877 
## F-statistic: 930.3 on 3 and 388 DF,  p-value: < 2.2e-16

plot(Auto$year,Auto$weight)

calculated var

lm(consumption ~ horsepower + I(horsepower^2),data = Auto)%>%
  summary()

## 
## Call:
## lm(formula = consumption ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9056 -1.2469 -0.1442  1.2677  7.8468 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.214e+00  8.178e-01  -1.484    0.139    
## horsepower       1.484e-01  1.414e-02  10.495  < 2e-16 ***
## I(horsepower^2) -2.453e-04  5.545e-05  -4.424 1.26e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.987 on 389 degrees of freedom
## Multiple R-squared:  0.7436, Adjusted R-squared:  0.7423 
## F-statistic: 564.1 on 2 and 389 DF,  p-value: < 2.2e-16

use of formula

var <- "weight"
form <- as.formula(paste0("consumption ~",var))
form

## consumption ~ weight

lm(form,data = Auto) %>% summary()

## 
## Call:
## lm(formula = form, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4535 -1.2004 -0.0534  1.0254  7.7311 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8943423  0.3362310   -2.66  0.00814 ** 
## weight       0.0089898  0.0002394   37.55  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.824 on 390 degrees of freedom
## Multiple R-squared:  0.7833, Adjusted R-squared:  0.7828 
## F-statistic:  1410 on 1 and 390 DF,  p-value: < 2.2e-16

looping

vars <- c("cylinders","weight","year","acceleration")
res <- lapply(vars,function(var){
  form <- as.formula(paste0("consumption ~",var))
  lm(form,data = Auto) %>%
    summary()%>%
    .$coefficients %>%
    .[,"Estimate"]%>%
    t()%>%
    `colnames<-`(NULL)%>%
    as.data.table()%>%
    .[,variable := var]
})%>% rbindlist() %>%
  setnames(c("V1","V2"),c("intercept","slope"))
res

##     intercept        slope     variable
## 1:  0.7099151  1.925771049    cylinders
## 2: -0.8943423  0.008989834       weight
## 3: 56.3094049 -0.593077529         year
## 4: 21.3079389 -0.647327145 acceleration

mixed effect

lmer(weight ~ I(year-70) + (1+ I(year-70) |origin),data = Auto)%>% 
  summary()

## boundary (singular) fit: see ?isSingular

## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: weight ~ I(year - 70) + (1 + I(year - 70) | origin)
##    Data: Auto
## 
## REML criterion at convergence: 5539.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.88501 -0.64362 -0.07317  0.71990  2.24129 
## 
## Random effects:
##  Groups   Name         Variance Std.Dev. Corr 
##  origin   (Intercept)  146073.1 382.2         
##           I(year - 70)    519.8  22.8    -1.00
##  Residual               81411.1 285.3         
## Number of obs: 392, groups:  origin, 3
## 
## Fixed effects:
##              Estimate Std. Error       df t value Pr(>|t|)  
## (Intercept)  1269.904    223.038    2.704   5.694   0.0142 *
## I(year - 70)  -10.836     13.804    2.745  -0.785   0.4946  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## I(year-70) -0.981
## convergence code: 0
## boundary (singular) fit: see ?isSingular

fixed effect

##                Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)  1543.96097  35.362243 43.661285 3.184570e-152
## I(year - 70)  -32.33144   5.036824 -6.419012  3.986518e-10

mixed effect

## boundary (singular) fit: see ?isSingular

##                Estimate Std. Error       df    t value   Pr(>|t|)
## (Intercept)  1269.90374  223.03783 2.704301  5.6936697 0.01415205
## I(year - 70)  -10.83628   13.80439 2.745352 -0.7849882 0.49455593

## boundary (singular) fit: see ?isSingular

generalized models

summary(Default)

##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

Default[,"default"] <- relevel(Default$default,ref = "Yes")

binomial distribution

glm(default~student + balance,data = Default,family = binomial) %>% 
  summary()

## 
## Call:
## glm(formula = default ~ student + balance, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7435   0.0203   0.0559   0.1422   2.4578  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 10.7494959  0.3691914  29.116  < 2e-16 ***
## studentYes   0.7148776  0.1475190   4.846 1.26e-06 ***
## balance     -0.0057381  0.0002318 -24.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.7  on 9997  degrees of freedom
## AIC: 1577.7
## 
## Number of Fisher Scoring iterations: 8

!!! danger

Here the regression is on

\[log(ODD) = log(\frac{p_{defaukt}}{1-p_{default}})\]

glm(default~student + balance,data = Default,family = binomial) %>% 
  summary() %>%
  .$coefficients%>%
  .[,"Estimate"]%>%
  exp()

##  (Intercept)   studentYes      balance 
## 4.660653e+04 2.043937e+00 9.942783e-01

Being a student multiply the odds by 2 of having default

Use the predict

predict(glm(default~student + balance,data = Default,family = binomial), 
        data.frame(student = c("No","Yes"),balance = c(2000,2000)), 
        type="response")

##         1         2 
## 0.3259166 0.4970413

family	link	variable
binomial	(link = “logit”)	binary
gaussian	(link = “identity”)	normal
Gamma	(link = “inverse”)	continuous non normal
inverse.gaussian	(link = “1/mu^2”)	continuous positive non normal
poisson	(link = “log”)	count variable

nonlinear regression

nls(consumption ~ a*exp(-acceleration/b)+c,
    start = c(a = 20, b = 10, c = 1),
    data = Auto)%>%
  summary()

## 
## Formula: consumption ~ a * exp(-acceleration/b) + c
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a   50.343     19.212   2.620 0.009129 ** 
## b    5.992      1.673   3.581 0.000385 ***
## c    7.082      1.306   5.424 1.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.418 on 389 degrees of freedom
## 
## Number of iterations to convergence: 13 
## Achieved convergence tolerance: 3.969e-06

Hard to find starting values

nls(consumption ~ a*exp(-acceleration/b)+c,
    start = c(a = 1, b = 1, c = 1),
    data = Auto)

## Error in nls(consumption ~ a * exp(-acceleration/b) + c, start = c(a = 1, : gradient singulier

use nls2

fitnls2 <- nls2(consumption ~ a*exp(-acceleration/b)+c,
    start = data.frame(a = c(0,50), b = c(1,50), c = c(0,1)),
    # algorithm = "plinear-random",
    data = Auto)

## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   le pas 0.000488281 est devenu inférieur à 'minFactor' de 0.000976562
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   le pas 0.000488281 est devenu inférieur à 'minFactor' de 0.000976562
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   le pas 0.000488281 est devenu inférieur à 'minFactor' de 0.000976562
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier

summary(fitnls2)

## 
## Formula: consumption ~ a * exp(-acceleration/b) + c
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a   50.343     19.212   2.620 0.009128 ** 
## b    5.992      1.673   3.581 0.000385 ***
## c    7.082      1.306   5.424 1.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.418 on 389 degrees of freedom
## 
## Number of iterations to convergence: 12 
## Achieved convergence tolerance: 3.634e-06

use of function in nls

damped oscillator numerical integration

\[ \frac{d^2y}{dt^2} + 2\zeta\omega_n \frac{dy}{dt} + \omega_n^2 y = 0 \]

\[\omega_n = \frac{2\pi}{T} = \sqrt{\frac{k}{m}} \]

drawing

library(deSolve)

ODE2_nls <- function(t, x, parms) {with(as.list(c(parms, x)), {
  # import <- excfunc(time)
  dS2 <- dS1
  dS1 <- -amort*2*2*pi/period*dS1  - (2*pi/period)^2*S1 
  res <- c(dS2,dS1)
  list(res)})}## 

solution_analy_ODE2 = function(amort,period,time,yo,yoder){
  parms  <- c(amort = amort,period = period)
  xstart = c(S1 =  yo,dS1 = yoder)
  out <-  lsoda(xstart, time, ODE2_nls, parms)
  return(out[,2])
}

test <- data.table(time = rep(0:99,10), amort = rep(seq(0,2,length.out = 10),each = 100))
test[,y:= solution_analy_ODE2(amort,40,time,3,0),by = amort]
ggplot(test)+
  geom_line(aes(time,y,color = as.factor(round(amort,1))))+
  theme_light()+
  labs(color = "damping")

test <- data.table(time = 0:99)
test[,y_true:= solution_analy_ODE2(0.4,40,time,3,2)]
test[,y_noise := y_true +  rnorm(.N,0,1.5)]
ggplot(test,aes(time,y_noise))+
  geom_point()+theme_light()

fitnls2 <- nls(y_noise ~ solution_analy_ODE2(amort,period,time,yo,yoder),
        data = test,
        start = list(amort = 0.5, period = 10,yo = 1,yoder = 0),
        lower = c(0,2,0,0),
        upper = c(2,50,10,10),
        algorithm = "port",
        control=list(maxiter=70, warnOnly=T,minFactor = 1e-10))

summary(fitnls2)

## 
## Formula: y_noise ~ solution_analy_ODE2(amort, period, time, yo, yoder)
## 
## Parameters:
##        Estimate Std. Error t value Pr(>|t|)    
## amort   0.36644    0.04179   8.769 6.54e-14 ***
## period 42.70112    1.96623  21.717  < 2e-16 ***
## yo      3.85154    0.95628   4.028 0.000113 ***
## yoder   1.58328    0.28232   5.608 1.97e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.615 on 96 degrees of freedom
## 
## Algorithm "port", convergence message: relative convergence (4)

SEM

Structural Equation Modelling allow the test a multivariate model on data
It uses Maximum Likelihood estimation
It allows for Latent variables: unobserved variables (questionnaires)
Used in behavioral science
General linear model is a special case of SEM

I am not going to explain SEM (because I don’t understand it)

SEM

It allows constrained regression !

lavaan package

myModel <- ' # regressions
             consumption ~ a*weight + b*horsepower + c*year
           '
fit <- sem(myModel, data = Auto)
# summary(fit, standardized=TRUE)
coef(fit)

##                        a                        b                        c 
##                    0.006                    0.019                   -0.302 
## consumption~~consumption 
##                    1.819

myModel <- ' # regressions
             consumption ~ a*weight + b*horsepower + c*year
             # constraints
             b == a*3
             c < -b
           '
fit <- sem(myModel, data = Auto)
# summary(fit, standardized=TRUE)
# parameterEstimates(fit)
coef(fit)

##                        a                        b                        c 
##                    0.006                    0.019                   -0.301 
## consumption~~consumption 
##                    1.819

So much more

geeglm for generalized estimating equation
autoregressive regression
state space modelling estimation with OpenMx
survival fits and Kaplan Mayer in medicine
Bayesian world of Bugs
….

Thanks

questions ? Me : https://scitilab.com