Regression with R: tools and applications

Denis Mongin

11/11/2019

img[alt=drawing] { width: 200px; }

Regression in R

Regression is a wide subject.

Objectif : present the main tools available

Give examples and result explanation.

The main parts


  • Linear regression
    • fixed effect
    • generalized model
    • mixed effect
  • Nonlinear regression
  • Parametric and smoothing regressions
  • Structural equations

The packages

##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##                                                                 
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##                  name    
##  amc matador       :  5  
##  ford pinto        :  5  
##  toyota corolla    :  5  
##  amc gremlin       :  4  
##  amc hornet        :  4  
##  chevrolet chevette:  4  
##  (Other)           :365

Lets see how simple linear regression works. We will study fuel consumption (mpg) function of the other variables.

I create the variable that makes sense for european:

L of gasoline per 100 km.

US SI
1 Mile 1.6093 km
1 Galon 3.785 L
1 cubic inched 16.387 cm^3
1 lbs 0.4536 kg

Univariate regression

## 
## Call:
## lm(formula = consumption ~ weight, data = Auto)
## 
## Coefficients:
## (Intercept)       weight  
##    -0.89434      0.00899

linear trend

summary

## 
## Call:
## lm(formula = consumption ~ weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4535 -1.2004 -0.0534  1.0254  7.7311 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8943423  0.3362310   -2.66  0.00814 ** 
## weight       0.0089898  0.0002394   37.55  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.824 on 390 degrees of freedom
## Multiple R-squared:  0.7833, Adjusted R-squared:  0.7828 
## F-statistic:  1410 on 1 and 390 DF,  p-value: < 2.2e-16

  • Std Error is the standard deviation of the gaussian distribution of the estimators.

  • p value is the probability to obtain the same effect or stronger under the null hypothesis. Here:
    • slope = 0 for the slope,
    • intercept = 0, for theintercept.
  • F statistics is the Fisher statistic of the whole model (null hypothesis : all coeffs are 0), gives the overall p value.

  • \[ R^2 = 1 - \frac{\sum(y_i - f_i)^2}{\sum(y_i - \bar{y})^2}\] is the percentage of explained variance by the model.

regression plots



  • Left : Homoscedasticity: residuals depends on value ?
  • Right top: normal ?
  • Right bottom : is there any outlier (i.e. some points contribute a lots to coeffs) ?

qnorm

predict

##        1        2        3        4        5        6 
## 13.39423 14.16493 13.11694 13.10471 13.16995 16.80734

Predict by default predict for all non missing predictors in the initial data frame.

Can choose new preditors values:

##            1            2            3 
##   -0.8763627    8.0954913 2696.0557534

predict

multivariate analysis

## 
## Call:
## lm(formula = consumption ~ weight + horsepower + year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6442 -0.7781 -0.0415  0.6282  6.2600 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.4539933  1.6526441  14.192  < 2e-16 ***
## weight       0.0064535  0.0003563  18.113  < 2e-16 ***
## horsepower   0.0190473  0.0037303   5.106 5.17e-07 ***
## year        -0.3015613  0.0205980 -14.640  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.356 on 388 degrees of freedom
## Multiple R-squared:  0.8809, Adjusted R-squared:   0.88 
## F-statistic: 956.8 on 3 and 388 DF,  p-value: < 2.2e-16

model comparison

## Analysis of Variance Table
## 
## Model 1: consumption ~ weight
## Model 2: consumption ~ weight + horsepower
## Model 3: consumption ~ weight + horsepower + year
## Model 4: consumption ~ weight + horsepower + year + displacement
##   Res.Df     RSS Df Sum of Sq        F Pr(>F)    
## 1    390 1297.55                                 
## 2    389 1107.01  1    190.53 103.4037 <2e-16 ***
## 3    388  713.09  1    393.92 213.7875 <2e-16 ***
## 4    387  713.09  1      0.00   0.0013 0.9712    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

interaction term

## 
## Call:
## lm(formula = consumption ~ weight + year + year:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4581 -0.7477 -0.0982  0.6232  6.5769 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.550e+00  5.565e+00   0.818    0.414    
## weight       2.470e-02  4.182e-03   5.906 7.68e-09 ***
## year        -5.003e-02  7.385e-02  -0.677    0.499    
## weight:year -2.237e-04  5.598e-05  -3.997 7.69e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.373 on 388 degrees of freedom
## Multiple R-squared:  0.8779, Adjusted R-squared:  0.877 
## F-statistic: 930.3 on 3 and 388 DF,  p-value: < 2.2e-16

calculated var

## 
## Call:
## lm(formula = consumption ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9056 -1.2469 -0.1442  1.2677  7.8468 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.214e+00  8.178e-01  -1.484    0.139    
## horsepower       1.484e-01  1.414e-02  10.495  < 2e-16 ***
## I(horsepower^2) -2.453e-04  5.545e-05  -4.424 1.26e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.987 on 389 degrees of freedom
## Multiple R-squared:  0.7436, Adjusted R-squared:  0.7423 
## F-statistic: 564.1 on 2 and 389 DF,  p-value: < 2.2e-16

use of formula

## consumption ~ weight
## 
## Call:
## lm(formula = form, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4535 -1.2004 -0.0534  1.0254  7.7311 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8943423  0.3362310   -2.66  0.00814 ** 
## weight       0.0089898  0.0002394   37.55  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.824 on 390 degrees of freedom
## Multiple R-squared:  0.7833, Adjusted R-squared:  0.7828 
## F-statistic:  1410 on 1 and 390 DF,  p-value: < 2.2e-16

looping

##     intercept        slope     variable
## 1:  0.7099151  1.925771049    cylinders
## 2: -0.8943423  0.008989834       weight
## 3: 56.3094049 -0.593077529         year
## 4: 21.3079389 -0.647327145 acceleration

mixed effect

## boundary (singular) fit: see ?isSingular
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: weight ~ I(year - 70) + (1 + I(year - 70) | origin)
##    Data: Auto
## 
## REML criterion at convergence: 5539.8
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.88501 -0.64362 -0.07317  0.71990  2.24129 
## 
## Random effects:
##  Groups   Name         Variance Std.Dev. Corr 
##  origin   (Intercept)  146073.1 382.2         
##           I(year - 70)    519.8  22.8    -1.00
##  Residual               81411.1 285.3         
## Number of obs: 392, groups:  origin, 3
## 
## Fixed effects:
##              Estimate Std. Error       df t value Pr(>|t|)  
## (Intercept)  1269.904    223.038    2.704   5.694   0.0142 *
## I(year - 70)  -10.836     13.804    2.745  -0.785   0.4946  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr)
## I(year-70) -0.981
## convergence code: 0
## boundary (singular) fit: see ?isSingular

fixed effect

##                Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)  1543.96097  35.362243 43.661285 3.184570e-152
## I(year - 70)  -32.33144   5.036824 -6.419012  3.986518e-10

mixed effect

## boundary (singular) fit: see ?isSingular
##                Estimate Std. Error       df    t value   Pr(>|t|)
## (Intercept)  1269.90374  223.03783 2.704301  5.6936697 0.01415205
## I(year - 70)  -10.83628   13.80439 2.745352 -0.7849882 0.49455593
## boundary (singular) fit: see ?isSingular

generalized models

##  default    student       balance           income     
##  No :9667   No :7056   Min.   :   0.0   Min.   :  772  
##  Yes: 333   Yes:2944   1st Qu.: 481.7   1st Qu.:21340  
##                        Median : 823.6   Median :34553  
##                        Mean   : 835.4   Mean   :33517  
##                        3rd Qu.:1166.3   3rd Qu.:43808  
##                        Max.   :2654.3   Max.   :73554

binomial distribution

## 
## Call:
## glm(formula = default ~ student + balance, family = binomial, 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7435   0.0203   0.0559   0.1422   2.4578  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 10.7494959  0.3691914  29.116  < 2e-16 ***
## studentYes   0.7148776  0.1475190   4.846 1.26e-06 ***
## balance     -0.0057381  0.0002318 -24.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.7  on 9997  degrees of freedom
## AIC: 1577.7
## 
## Number of Fisher Scoring iterations: 8

!!! danger

Here the regression is on

\[log(ODD) = log(\frac{p_{defaukt}}{1-p_{default}})\]

##  (Intercept)   studentYes      balance 
## 4.660653e+04 2.043937e+00 9.942783e-01

Being a student multiply the odds by 2 of having default

Use the predict

##         1         2 
## 0.3259166 0.4970413

family link variable
binomial (link = “logit”) binary
gaussian (link = “identity”) normal
Gamma (link = “inverse”) continuous non normal
inverse.gaussian (link = “1/mu^2”) continuous positive non normal
poisson (link = “log”) count variable

nonlinear regression

## 
## Formula: consumption ~ a * exp(-acceleration/b) + c
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a   50.343     19.212   2.620 0.009129 ** 
## b    5.992      1.673   3.581 0.000385 ***
## c    7.082      1.306   5.424 1.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.418 on 389 degrees of freedom
## 
## Number of iterations to convergence: 13 
## Achieved convergence tolerance: 3.969e-06

Hard to find starting values

## Error in nls(consumption ~ a * exp(-acceleration/b) + c, start = c(a = 1, : gradient singulier

use nls2

## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   le pas 0.000488281 est devenu inférieur à 'minFactor' de 0.000976562
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   le pas 0.000488281 est devenu inférieur à 'minFactor' de 0.000976562
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   le pas 0.000488281 est devenu inférieur à 'minFactor' de 0.000976562
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(),  : 
##   gradient singulier

## 
## Formula: consumption ~ a * exp(-acceleration/b) + c
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a   50.343     19.212   2.620 0.009128 ** 
## b    5.992      1.673   3.581 0.000385 ***
## c    7.082      1.306   5.424 1.03e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.418 on 389 degrees of freedom
## 
## Number of iterations to convergence: 12 
## Achieved convergence tolerance: 3.634e-06

use of function in nls

damped oscillator numerical integration


\[ \frac{d^2y}{dt^2} + 2\zeta\omega_n \frac{dy}{dt} + \omega_n^2 y = 0 \]


\[\omega_n = \frac{2\pi}{T} = \sqrt{\frac{k}{m}} \]

drawing

## 
## Formula: y_noise ~ solution_analy_ODE2(amort, period, time, yo, yoder)
## 
## Parameters:
##        Estimate Std. Error t value Pr(>|t|)    
## amort   0.36644    0.04179   8.769 6.54e-14 ***
## period 42.70112    1.96623  21.717  < 2e-16 ***
## yo      3.85154    0.95628   4.028 0.000113 ***
## yoder   1.58328    0.28232   5.608 1.97e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.615 on 96 degrees of freedom
## 
## Algorithm "port", convergence message: relative convergence (4)

SEM

  • Structural Equation Modelling allow the test a multivariate model on data
  • It uses Maximum Likelihood estimation
  • It allows for Latent variables: unobserved variables (questionnaires)
  • Used in behavioral science
  • General linear model is a special case of SEM

I am not going to explain SEM (because I don’t understand it)

SEM

It allows constrained regression !

lavaan package

##                        a                        b                        c 
##                    0.006                    0.019                   -0.302 
## consumption~~consumption 
##                    1.819

##                        a                        b                        c 
##                    0.006                    0.019                   -0.301 
## consumption~~consumption 
##                    1.819

So much more

  • geeglm for generalized estimating equation
  • autoregressive regression
  • state space modelling estimation with OpenMx
  • survival fits and Kaplan Mayer in medicine
  • Bayesian world of Bugs
  • ….

Thanks

questions ? Me : https://scitilab.com