img[alt=drawing] { width: 200px; }
Regression is a wide subject.
Objectif : present the main tools available
Give examples and result explanation.
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0
##
## weight acceleration year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2225 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2978 Mean :15.54 Mean :75.98 Mean :1.577
## 3rd Qu.:3615 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
##
## name
## amc matador : 5
## ford pinto : 5
## toyota corolla : 5
## amc gremlin : 4
## amc hornet : 4
## chevrolet chevette: 4
## (Other) :365
Lets see how simple linear regression works. We will study fuel consumption (mpg) function of the other variables.
# set AUto as data.table
Auto <- setDT(copy(Auto))
# create the new variable
Auto[,consumption := 1/mpg*3.785/1.6093*100]
Auto[,weight := weight*0.4536]
Auto[,displacement := displacement*16.387]
I create the variable that makes sense for european:
L of gasoline per 100 km.
US | SI |
---|---|
1 Mile | 1.6093 km |
1 Galon | 3.785 L |
1 cubic inched | 16.387 cm^3 |
1 lbs | 0.4536 kg |
melt(Auto[,c(-1,-8,-9)],id.vars = "consumption") %>%
ggplot()+ geom_point(aes(value,consumption,color = variable))+
facet_wrap(~variable,scales = "free")+ theme_bw()+ theme(legend.position = "none")
##
## Call:
## lm(formula = consumption ~ weight, data = Auto)
##
## Coefficients:
## (Intercept) weight
## -0.89434 0.00899
##
## Call:
## lm(formula = consumption ~ weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4535 -1.2004 -0.0534 1.0254 7.7311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8943423 0.3362310 -2.66 0.00814 **
## weight 0.0089898 0.0002394 37.55 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.824 on 390 degrees of freedom
## Multiple R-squared: 0.7833, Adjusted R-squared: 0.7828
## F-statistic: 1410 on 1 and 390 DF, p-value: < 2.2e-16
Std Error is the standard deviation of the gaussian distribution of the estimators.
F statistics is the Fisher statistic of the whole model (null hypothesis : all coeffs are 0), gives the overall p value.
\[ R^2 = 1 - \frac{\sum(y_i - f_i)^2}{\sum(y_i - \bar{y})^2}\] is the percentage of explained variance by the model.
## 1 2 3 4 5 6
## 13.39423 14.16493 13.11694 13.10471 13.16995 16.80734
Predict by default predict for all non missing predictors in the initial data frame.
Can choose new preditors values:
## 1 2 3
## -0.8763627 8.0954913 2696.0557534
Auto[,fit1 := predict(fit)]%>%
ggplot()+
geom_point(aes(weight,consumption),size = 2)+
geom_line(aes(weight,fit1),color = "red",size = 1)+theme_light()
##
## Call:
## lm(formula = consumption ~ weight + horsepower + year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6442 -0.7781 -0.0415 0.6282 6.2600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.4539933 1.6526441 14.192 < 2e-16 ***
## weight 0.0064535 0.0003563 18.113 < 2e-16 ***
## horsepower 0.0190473 0.0037303 5.106 5.17e-07 ***
## year -0.3015613 0.0205980 -14.640 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.356 on 388 degrees of freedom
## Multiple R-squared: 0.8809, Adjusted R-squared: 0.88
## F-statistic: 956.8 on 3 and 388 DF, p-value: < 2.2e-16
anova(lm(consumption ~ weight ,data = Auto),
lm(consumption ~ weight + horsepower ,data = Auto),
lm(consumption ~ weight + horsepower + year ,data = Auto),
lm(consumption ~ weight + horsepower + year + displacement ,data = Auto))
## Analysis of Variance Table
##
## Model 1: consumption ~ weight
## Model 2: consumption ~ weight + horsepower
## Model 3: consumption ~ weight + horsepower + year
## Model 4: consumption ~ weight + horsepower + year + displacement
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 1297.55
## 2 389 1107.01 1 190.53 103.4037 <2e-16 ***
## 3 388 713.09 1 393.92 213.7875 <2e-16 ***
## 4 387 713.09 1 0.00 0.0013 0.9712
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = consumption ~ weight + year + year:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4581 -0.7477 -0.0982 0.6232 6.5769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.550e+00 5.565e+00 0.818 0.414
## weight 2.470e-02 4.182e-03 5.906 7.68e-09 ***
## year -5.003e-02 7.385e-02 -0.677 0.499
## weight:year -2.237e-04 5.598e-05 -3.997 7.69e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.373 on 388 degrees of freedom
## Multiple R-squared: 0.8779, Adjusted R-squared: 0.877
## F-statistic: 930.3 on 3 and 388 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = consumption ~ horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9056 -1.2469 -0.1442 1.2677 7.8468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.214e+00 8.178e-01 -1.484 0.139
## horsepower 1.484e-01 1.414e-02 10.495 < 2e-16 ***
## I(horsepower^2) -2.453e-04 5.545e-05 -4.424 1.26e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.987 on 389 degrees of freedom
## Multiple R-squared: 0.7436, Adjusted R-squared: 0.7423
## F-statistic: 564.1 on 2 and 389 DF, p-value: < 2.2e-16
## consumption ~ weight
##
## Call:
## lm(formula = form, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4535 -1.2004 -0.0534 1.0254 7.7311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.8943423 0.3362310 -2.66 0.00814 **
## weight 0.0089898 0.0002394 37.55 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.824 on 390 degrees of freedom
## Multiple R-squared: 0.7833, Adjusted R-squared: 0.7828
## F-statistic: 1410 on 1 and 390 DF, p-value: < 2.2e-16
vars <- c("cylinders","weight","year","acceleration")
res <- lapply(vars,function(var){
form <- as.formula(paste0("consumption ~",var))
lm(form,data = Auto) %>%
summary()%>%
.$coefficients %>%
.[,"Estimate"]%>%
t()%>%
`colnames<-`(NULL)%>%
as.data.table()%>%
.[,variable := var]
})%>% rbindlist() %>%
setnames(c("V1","V2"),c("intercept","slope"))
res
## intercept slope variable
## 1: 0.7099151 1.925771049 cylinders
## 2: -0.8943423 0.008989834 weight
## 3: 56.3094049 -0.593077529 year
## 4: 21.3079389 -0.647327145 acceleration
## boundary (singular) fit: see ?isSingular
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: weight ~ I(year - 70) + (1 + I(year - 70) | origin)
## Data: Auto
##
## REML criterion at convergence: 5539.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.88501 -0.64362 -0.07317 0.71990 2.24129
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## origin (Intercept) 146073.1 382.2
## I(year - 70) 519.8 22.8 -1.00
## Residual 81411.1 285.3
## Number of obs: 392, groups: origin, 3
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1269.904 223.038 2.704 5.694 0.0142 *
## I(year - 70) -10.836 13.804 2.745 -0.785 0.4946
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## I(year-70) -0.981
## convergence code: 0
## boundary (singular) fit: see ?isSingular
fixed effect
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1543.96097 35.362243 43.661285 3.184570e-152
## I(year - 70) -32.33144 5.036824 -6.419012 3.986518e-10
mixed effect
## boundary (singular) fit: see ?isSingular
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1269.90374 223.03783 2.704301 5.6936697 0.01415205
## I(year - 70) -10.83628 13.80439 2.745352 -0.7849882 0.49455593
## boundary (singular) fit: see ?isSingular
## default student balance income
## No :9667 No :7056 Min. : 0.0 Min. : 772
## Yes: 333 Yes:2944 1st Qu.: 481.7 1st Qu.:21340
## Median : 823.6 Median :34553
## Mean : 835.4 Mean :33517
## 3rd Qu.:1166.3 3rd Qu.:43808
## Max. :2654.3 Max. :73554
##
## Call:
## glm(formula = default ~ student + balance, family = binomial,
## data = Default)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.7435 0.0203 0.0559 0.1422 2.4578
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.7494959 0.3691914 29.116 < 2e-16 ***
## studentYes 0.7148776 0.1475190 4.846 1.26e-06 ***
## balance -0.0057381 0.0002318 -24.750 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2920.6 on 9999 degrees of freedom
## Residual deviance: 1571.7 on 9997 degrees of freedom
## AIC: 1577.7
##
## Number of Fisher Scoring iterations: 8
Here the regression is on
\[log(ODD) = log(\frac{p_{defaukt}}{1-p_{default}})\]
glm(default~student + balance,data = Default,family = binomial) %>%
summary() %>%
.$coefficients%>%
.[,"Estimate"]%>%
exp()
## (Intercept) studentYes balance
## 4.660653e+04 2.043937e+00 9.942783e-01
Being a student multiply the odds by 2 of having default
Use the predict
predict(glm(default~student + balance,data = Default,family = binomial),
data.frame(student = c("No","Yes"),balance = c(2000,2000)),
type="response")
## 1 2
## 0.3259166 0.4970413
family | link | variable |
---|---|---|
binomial | (link = “logit”) | binary |
gaussian | (link = “identity”) | normal |
Gamma | (link = “inverse”) | continuous non normal |
inverse.gaussian | (link = “1/mu^2”) | continuous positive non normal |
poisson | (link = “log”) | count variable |
nls(consumption ~ a*exp(-acceleration/b)+c,
start = c(a = 20, b = 10, c = 1),
data = Auto)%>%
summary()
##
## Formula: consumption ~ a * exp(-acceleration/b) + c
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 50.343 19.212 2.620 0.009129 **
## b 5.992 1.673 3.581 0.000385 ***
## c 7.082 1.306 5.424 1.03e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.418 on 389 degrees of freedom
##
## Number of iterations to convergence: 13
## Achieved convergence tolerance: 3.969e-06
Hard to find starting values
## Error in nls(consumption ~ a * exp(-acceleration/b) + c, start = c(a = 1, : gradient singulier
use
nls2
fitnls2 <- nls2(consumption ~ a*exp(-acceleration/b)+c,
start = data.frame(a = c(0,50), b = c(1,50), c = c(0,1)),
# algorithm = "plinear-random",
data = Auto)
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## le pas 0.000488281 est devenu inférieur à 'minFactor' de 0.000976562
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## le pas 0.000488281 est devenu inférieur à 'minFactor' de 0.000976562
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## le pas 0.000488281 est devenu inférieur à 'minFactor' de 0.000976562
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## gradient singulier
## Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
## gradient singulier
##
## Formula: consumption ~ a * exp(-acceleration/b) + c
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 50.343 19.212 2.620 0.009128 **
## b 5.992 1.673 3.581 0.000385 ***
## c 7.082 1.306 5.424 1.03e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.418 on 389 degrees of freedom
##
## Number of iterations to convergence: 12
## Achieved convergence tolerance: 3.634e-06
\[ \frac{d^2y}{dt^2} + 2\zeta\omega_n \frac{dy}{dt} + \omega_n^2 y = 0 \]
\[\omega_n = \frac{2\pi}{T} = \sqrt{\frac{k}{m}} \]
library(deSolve)
ODE2_nls <- function(t, x, parms) {with(as.list(c(parms, x)), {
# import <- excfunc(time)
dS2 <- dS1
dS1 <- -amort*2*2*pi/period*dS1 - (2*pi/period)^2*S1
res <- c(dS2,dS1)
list(res)})}##
solution_analy_ODE2 = function(amort,period,time,yo,yoder){
parms <- c(amort = amort,period = period)
xstart = c(S1 = yo,dS1 = yoder)
out <- lsoda(xstart, time, ODE2_nls, parms)
return(out[,2])
}
test <- data.table(time = rep(0:99,10), amort = rep(seq(0,2,length.out = 10),each = 100))
test[,y:= solution_analy_ODE2(amort,40,time,3,0),by = amort]
ggplot(test)+
geom_line(aes(time,y,color = as.factor(round(amort,1))))+
theme_light()+
labs(color = "damping")
test <- data.table(time = 0:99)
test[,y_true:= solution_analy_ODE2(0.4,40,time,3,2)]
test[,y_noise := y_true + rnorm(.N,0,1.5)]
ggplot(test,aes(time,y_noise))+
geom_point()+theme_light()
fitnls2 <- nls(y_noise ~ solution_analy_ODE2(amort,period,time,yo,yoder),
data = test,
start = list(amort = 0.5, period = 10,yo = 1,yoder = 0),
lower = c(0,2,0,0),
upper = c(2,50,10,10),
algorithm = "port",
control=list(maxiter=70, warnOnly=T,minFactor = 1e-10))
##
## Formula: y_noise ~ solution_analy_ODE2(amort, period, time, yo, yoder)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## amort 0.36644 0.04179 8.769 6.54e-14 ***
## period 42.70112 1.96623 21.717 < 2e-16 ***
## yo 3.85154 0.95628 4.028 0.000113 ***
## yoder 1.58328 0.28232 5.608 1.97e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.615 on 96 degrees of freedom
##
## Algorithm "port", convergence message: relative convergence (4)
I am not going to explain SEM (because I don’t understand it)
It allows constrained regression !
lavaan package
myModel <- ' # regressions
consumption ~ a*weight + b*horsepower + c*year
'
fit <- sem(myModel, data = Auto)
# summary(fit, standardized=TRUE)
coef(fit)
## a b c
## 0.006 0.019 -0.302
## consumption~~consumption
## 1.819
myModel <- ' # regressions
consumption ~ a*weight + b*horsepower + c*year
# constraints
b == a*3
c < -b
'
fit <- sem(myModel, data = Auto)
# summary(fit, standardized=TRUE)
# parameterEstimates(fit)
coef(fit)
## a b c
## 0.006 0.019 -0.301
## consumption~~consumption
## 1.819
questions ? Me : https://scitilab.com