Linear Regression and Modeling
Key points from the four-week Duke University course Linear Regression and Modeling on Coursera
2. Outliers & Inference for regression
Outliers
A leverage point lies away from the center of the data in the horizontal direction
An influential point influences the slope of the regression line; it usually lies away from the trajectory of the other points
Don't remove outliers from the data without a good reason
A level of a categorical explanatory variable with few observations can act as an influential point
t-test for explanatory variable
Determine whether x is a significant predictor of the response variable
Formulas
t-value: t = (b1 − 0) / SE_b1, with null value 0 (no relationship between x and y)
degrees of freedom: df = n − 2 (two parameters estimated: slope and intercept)
CI: b1 ± t*_df × SE_b1
The t-value for the intercept often indicates non-significance; this is because x = 0 often lies outside the range of the data, and the regression equation does not need to be rewritten
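A minimal R sketch of reading this test off the output (d, x, y, and m are hypothetical names, not from the course):
set.seed(1)
d <- data.frame(x = 1:30)                    # hypothetical toy data
d$y <- 2 + 0.5 * d$x + rnorm(30)
m <- lm(y ~ x, data = d)
summary(m)   # the "x" row gives b1, SE_b1, t value = b1/SE_b1, and its p-value (df = n - 2)
confint(m)   # the CI: b1 +/- t* × SE_b1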
3. Multiple Linear Regression
Multiple predictors
y=β0+β1*x1+β2*x2+...+βk*xk
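A minimal sketch of fitting such a model in R, with a hypothetical data frame d2 (columns y, x1, x2, x3) that is reused in the sketches below:
set.seed(2)
d2 <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
d2$y <- 1 + 2 * d2$x1 - d2$x2 + rnorm(50)   # x3 carries no signal
m2 <- lm(y ~ x1 + x2 + x3, data = d2)
summary(m2)   # one row of estimates per predictor, plus the intercept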
Adjusted R squared
applies a penalty for the number of predictors included in the model
R²_adj = 1 − (SSE / SST) × (n − 1) / (n − k − 1), where n: the number of observations, k: the number of predictors, SSE: sum of squared errors, SST: total sum of squares
adjusted R² only increases if the added variable makes a meaningful contribution to the amount of explained variability in y
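Continuing the hypothetical m2/d2 sketch, adjusted R² can be read from the fit or recomputed from the formula above:
summary(m2)$adj.r.squared                 # straight from the fit
sse <- sum(residuals(m2)^2)               # SSE
sst <- sum((d2$y - mean(d2$y))^2)         # SST
n <- nrow(d2); k <- 3
1 - (sse / sst) * (n - 1) / (n - k - 1)   # same value, from the formula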
Interaction variables: allow different slopes for different categories
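A sketch of an interaction in R, adding a hypothetical two-level factor group to d2:
d2$group <- factor(rep(c("A", "B"), length.out = nrow(d2)))
m_int <- lm(y ~ x1 * group, data = d2)   # x1 * group expands to x1 + group + x1:group
coef(m_int)   # "x1:groupB" is how much the slope of x1 for level B differs from the baseline level A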
Inference for MLR
β0: the expected value of the response variable when all predictors equal 0, on average
βk: all else held constant, for each unit increase in xk, we would expect y to be higher/lower by βk on average
Model Selection
Definition: identifying the best model for predicting a given response variable
Collinearity
Definition: high correlation between two independent variables, resulting in redundant information in the model
Consequence of multicollinearity: unreliable estimates of the coefficients
Principle: if two predictors are highly correlated, include only one of them in the model
Parsimony: prefer simpler (parsimonious) models over more complicated ones
Occam's razor: among competing hypotheses, the one with the fewest assumptions should be selected
Stepwise model selection
Types: backward elimination and forward selection (either criterion below works with both)
based on p-values: drop variables that are not significant
based on adjusted R²: choose the model with the highest adjusted R² (sketched after this list)
Full model: model with all predictors
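A sketch of one backward-elimination step by adjusted R², using the hypothetical m2/d2 from above:
summary(m2)$adj.r.squared                           # full model
summary(lm(y ~ x2 + x3, data = d2))$adj.r.squared   # drop x1
summary(lm(y ~ x1 + x3, data = d2))$adj.r.squared   # drop x2
summary(lm(y ~ x1 + x2, data = d2))$adj.r.squared   # drop x3
# keep the drop with the highest adjusted R²; stop when no drop improves on the current model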
Model Diagnostics
F-test
assess the significance of the model as a whole
H0: β1 = β2 = ... =βk = 0
H1: At least one βi != 0
df: k (numerator) and n − k − 1 (denominator)
usually reported at the bottom of the regression output
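For the hypothetical m2 above, the F statistic can be pulled out and its p-value recomputed:
f <- summary(m2)$fstatistic                # named vector: value, numdf = k, dendf = n - k - 1
pf(f[1], f[2], f[3], lower.tail = FALSE)   # p-value for H0: all slopes are 0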
t-test
Assess whether a given predictor is significant, given that all other predictors are in the model (the p-value associated with each predictor is conditional on the other variables being included)
H0: β1 = 0, given all other variables are included in the model
H1: β1 != 0, given all other variables are included in the model
calculate the p-value: use the t-value under a t distribution with n − k − 1 df
CI: bk ± t*_df × SE_bk, with df = n − k − 1
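A sketch of that calculation for one slope of the hypothetical m2 (predictor x1):
tval <- coef(summary(m2))["x1", "t value"]
n <- nrow(d2); k <- 3
2 * pt(abs(tval), df = n - k - 1, lower.tail = FALSE)   # two-sided p-value, matches summary(m2)
confint(m2, "x1")                                       # CI for the slope of x1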
1.2 Linear Regression w/ 1 predictor
predict by y^ = b0 + b1*x
x should be in the range of the observed data, unless we are confident that linear relationship continues
extrapolation: applying a model estimate to values outside the realm of the original data
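A prediction sketch with the hypothetical m/d from the earlier section; checking the observed range first guards against extrapolation:
range(d$x)                                 # predicting outside this range would be extrapolation
predict(m, newdata = data.frame(x = 15))   # y^ = b0 + b1*15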
R^2: R squared
% of the variability in the response variable explained by the explanatory variable
calculated as the square of the correlation coefficient
the closer R² is to 100%, the stronger the linear relationship and the more of the variability in y the model explains
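Checking this on the hypothetical m/d sketch:
cor(d$x, d$y)^2        # square of the correlation coefficient
summary(m)$r.squared   # same number, as reported by lm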
1.1 Relationship between 2 numerical variables
y = β0 + β1*x
Association
Direction: positive or negative
Form: linear or non-linear (correlation refers only to linear relationships)
strength: determined by the scatter around the underlying relationship
estimates of β0, β1: b0, b1
correlation coefficient (R, Pearson's R)
R = 0 means no linear relationship, but not necessarily no association (e.g. a strong non-linear pattern can still have R = 0)
measure the strength of the linear association between 2 numerical variables
unitless, R won't change after unit conversion
Corr(X,Y)=Corr(Y, X)
sensitive to outliers
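These properties are easy to verify on the hypothetical d:
cor(d$x, d$y)         # Pearson's R
cor(d$y, d$x)         # symmetric: identical value
cor(d$x * 100, d$y)   # unitless: unchanged by rescaling x (e.g. a unit conversion)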
Least Squares Line
Conditions
Linearity
diagnostic: residual plots (m1 below is the fitted lm object)
Code: library(ggplot2); ggplot(data = m1, aes(x = .fitted, y = .resid)) + geom_point() + geom_hline(yintercept = 0, linetype = "dashed") + xlab("Fitted values") + ylab("Residuals")
Normality in Residual
Normal probability plot/quantile-quantile plot 
Code: ggplot(data = m1, aes(sample = .resid)) + stat_qq()
Histogram: 
Code: ggplot(data = m1, aes(x = .resid)) + geom_histogram(binwidth = 25) + xlab("Residuals")
Constant Variability
also checked with the residual plots: look for a constant spread of residuals around 0
Residual: ei = yi - yi^
ei < 0: the model overestimates (yi^ > yi)
ei > 0: the model underestimates (yi^ < yi)
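On the hypothetical m/d sketch:
residuals(m)[1]         # e1 as reported by the fit
d$y[1] - fitted(m)[1]   # same thing: y1 minus the prediction y1^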
Estimate
slope: b1 = R × sy/sx (s: standard deviation)
intercept: b0 = ybar − b1 × xbar (ybar, xbar: sample means)
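Verifying both formulas against lm on the hypothetical d:
b1 <- cor(d$x, d$y) * sd(d$y) / sd(d$x)   # slope from R, sy, sx
b0 <- mean(d$y) - b1 * mean(d$x)          # intercept from the sample means
c(b0, b1)
coef(m)                                   # matches lm's estimates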
Scatterplot with the regression line
Code: ggplot(data = mlb11, aes(x = at_bats, y = runs)) + geom_point() + stat_smooth(method = "lm", se = FALSE)
Interpret
Numerical x: For each unit increase in x, we would expect y to be lower/higher on average by |b1| units
Categorical x: the response variable is expected to be |b1| units lower/higher for the other level than for the baseline level of the explanatory variable
Numerical x: when x = 0, we would expect y to be b0 on average
Categorical x: The expected average value of response variable for the reference level of explanatory variable is b0