August 29, 2017

Learning Objectives

  • Review simple linear regression
  • Error term \(\varepsilon_i\)
  • Least squares estimation
  • Fitted values and residuals
  • Regression and correlation
  • Evaluating the regression model
  • Residual plots and outliers
  • Forecasting with regression
  • Nonlinear functions
  • Time trend series regression

Simple linear regression

The basic concept is that we forecast variable \(y\) assuming it has a linear relationship with variable \(x\).

\[y = \beta_0 + \beta_1 x + \varepsilon .\]

The model is called simple regression as we only allow one predictor variable \(x\). The forecast variable \(y\) is sometimes also called the regressand, dependent or explained variable. The predictor variable \(x\) is sometimes also called the regressor, independent or explanatory variable.

The parameters \(\beta_0\) and \(\beta_1\) determine the intercept and the slope of the line respectively. The intercept \(\beta_0\) represents the predicted value of \(y\) when \(x=0\). The slope \(\beta_1\) represents the predicted increase in \(y\) resulting from a one unit increase in \(x\).


We can think of each observation \(y_i\) as consisting of the systematic or explained part of the model, \(\beta_0+\beta_1x_i\), and the random error, \(\varepsilon_i\).

The error \(\varepsilon_i\)

The error term captures anything that may affect \(y_i\) other than \(x_i\). We assume that these errors:

  • have mean zero; otherwise the forecasts will be systematically biased.
  • are not autocorrelated; otherwise the forecasts will be inefficient as there is more information to be exploited in the data.
  • are unrelated to the predictor variable; otherwise there would be more information that should be included in the systematic part of the model.

It is also useful to have the errors normally distributed with constant variance in order to produce prediction intervals and to perform statistical inference.

The error \(\varepsilon_i\)

Another important assumption in the simple linear model is that \(x\) is not a random variable.

If we were performing a controlled experiment in a laboratory, we could control the values of \(x\) (so they would not be random) and observe the resulting values of \(y\).

With observational data (including most data in business and economics) it is not possible to control the value of \(x\), and hence we make this an assumption.

Least squares estimation

In practice, we have a collection of observations but we do not know the values of \(\beta_0\) and \(\beta_1\). These need to be estimated from the data. We call this fitting a line through the data.

There are many possible choices for \(\beta_0\) and \(\beta_1\), each choice giving a different line.

The least squares principle provides a way of choosing \(\beta_0\) and \(\beta_1\) by minimizing the sum of the squared errors. That is, the values of \(\beta_0\) and \(\beta_1\) are chosen so that they minimize

\[\sum_{i=1}^N \varepsilon_i^2 = \sum_{i=1}^N (y_i - \beta_0 - \beta_1x_i)^2. \]

Using calculus, it can be shown that the resulting least squares estimators are

\[\hat{\beta}_1=\frac{ \sum_{i=1}^{N}(y_i-\bar{y})(x_i-\bar{x})}{\sum_{i=1}^{N}(x_i-\bar{x})^2} \]


\[\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}, \]

where \(\bar{x}\) is the average of the \(x\) observations and \(\bar{y}\) is the average of the \(y\) observations. The estimated line is known as the regression line.
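
As a rough check on these formulas, the estimators can be computed directly and compared with `lm()`. The sketch below uses simulated data: the "true" values 2 and 0.5, the sample size, the seed and the name `fit_sim` are arbitrary assumptions for illustration, not part of the lecture data.

```r
# Simulated data (an assumption for illustration only)
set.seed(1)
N <- 50
x <- runif(N, 0, 10)
y <- 2 + 0.5 * x + rnorm(N, sd = 1)

# Least squares estimators computed from the formulas above
b1 <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# lm() minimises the same sum of squared errors
fit_sim <- lm(y ~ x)

# The two sets of estimates should agree up to floating-point error
all.equal(unname(coef(fit_sim)), c(b0, b1))
```

Because the data are random, the estimates will be close to, but not exactly equal to, the values 2 and 0.5 used in the simulation.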

We do not know \(\beta_0\) and \(\beta_1\) for the true line \(y=\beta_0+\beta_1x\), so we cannot use this line for forecasting.

Therefore we obtain estimates \(\hat{\beta}_0\) and \(\hat{\beta}_1\) from the observed data to give the regression line used for forecasting: for each value of \(x\), we can forecast a corresponding value of \(y\) using \(\hat{y}=\hat{\beta}_0+\hat{\beta}_1x\).

Fitted values and residuals

The forecast values of \(y\) obtained from the observed \(x\) values are called fitted values: \(\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_i\), for \(i=1,\dots,N\). Each \(\hat{y}_i\) is the point on the regression line corresponding to \(x_i\).

The differences between the observed \(y\) values and the corresponding fitted values are the residuals:

\[e_i = y_i - \hat{y}_i = y_i -\hat{\beta}_0-\hat{\beta}_1x_i. \]

The residuals have some useful properties including the following two:

\[\sum_{i=1}^{N}{e_i}=0 \quad\text{and}\quad \sum_{i=1}^{N}{x_ie_i}=0. \]
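
These two properties are the first-order conditions of the least squares problem: setting the derivatives of the sum of squared errors with respect to \(\beta_0\) and \(\beta_1\) to zero gives exactly these equations. A minimal numerical check, assuming simulated data (all values and the name `fit_sim` below are made up for illustration):

```r
# Simulated data (an assumption for illustration only)
set.seed(2)
x <- rnorm(40, mean = 20, sd = 5)
y <- 10 - 0.3 * x + rnorm(40)

fit_sim <- lm(y ~ x)
e <- residuals(fit_sim)

# Both sums should be zero up to rounding error
sum(e)
sum(x * e)
```

The second property implies that the residuals are uncorrelated with the predictor in the sample used for fitting.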

Regression and correlation

Recall that the correlation coefficient \(r\) measures the strength and the direction of the linear relationship between the two variables. The stronger the linear relationship, the closer the observed data points will cluster around a straight line.

The slope coefficient \(\hat{\beta}_1\) can also be expressed as \[\hat{\beta}_1=r\frac{s_{y}}{s_x}, \] where \(s_y\) is the standard deviation of the \(y\) observations and \(s_x\) is the standard deviation of the \(x\) observations.
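
The identity \(\hat{\beta}_1 = r\,s_y/s_x\) can be checked numerically. The sketch below assumes simulated data; the numbers and variable names are illustrative only.

```r
# Simulated data (an assumption for illustration only)
set.seed(3)
x <- runif(60, 10, 40)
y <- 12 - 0.2 * x + rnorm(60, sd = 0.5)

# Slope from the correlation coefficient and the two standard deviations
r <- cor(x, y)
slope_from_r <- r * sd(y) / sd(x)

# Slope from the fitted regression
slope_from_lm <- unname(coef(lm(y ~ x))[2])

# The two should agree up to floating-point error
all.equal(slope_from_r, slope_from_lm)
```

Since \(s_y\) and \(s_x\) are positive, the slope always has the same sign as \(r\).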

The advantage of a regression model over correlation is that it asserts a predictive relationship between the two variables (\(x\) predicts \(y\)) and quantifies this in a useful way for forecasting.


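library(fpp)  # the fuel data set is assumed to come from the fpp package
plot(jitter(Carbon) ~ jitter(City), xlab="City (mpg)", las=1, pch="+",
  ylab="Carbon footprint (tons per year)", col="blue", data=fuel)
fit <- lm(Carbon ~ City, data=fuel)  # fit the simple linear regression
abline(fit, col="red")               # add the fitted line to the scatterplot
summary(fit)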

## Call:
## lm(formula = Carbon ~ City, data = fuel)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7014 -0.3643 -0.1062  0.1938  2.0809 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.525647   0.199232   62.87   <2e-16 ***
## City        -0.220970   0.008878  -24.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.4703 on 132 degrees of freedom
## Multiple R-squared:  0.8244, Adjusted R-squared:  0.823 
## F-statistic: 619.5 on 1 and 132 DF,  p-value: < 2.2e-16


library(stargazer)
stargazer(fit, type = "html")
                     Dependent variable: Carbon
City                 -0.221***
Constant             12.526***
Observations         134
R²                   0.824
Adjusted R²          0.823
Residual Std. Error  0.470 (df = 132)
F Statistic          619.524*** (df = 1; 132)
Note: *p<0.1; **p<0.05; ***p<0.01


The estimated regression line is: \(\hat{y}=12.53-0.22x.\)

Intercept: \(\hat{\beta}_0=12.53\). A car with a fuel economy of \(0\) mpg in city driving conditions can expect an average carbon footprint of \(12.53\) tonnes per year. The interpretation of the intercept requires that a value of \(x=0\) makes sense. But even when \(x=0\) does not make sense, the intercept is an important part of the model.

Slope: \(\hat{\beta}_1=-0.22\). For every extra mile per gallon, a car's carbon footprint will decrease on average by 0.22 tonnes per year. Alternatively, if the fuel economies of two cars differ by 1 mpg in city driving conditions, their carbon footprints will differ, on average, by 0.22 tonnes per year.
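
To make the interpretation concrete, the estimated line can be evaluated at a few values of \(x\). The sketch below copies the coefficients from the printed summary output rather than refitting the model; the fuel economies 20 and 30 mpg and the variable names are hypothetical choices for illustration.

```r
# Coefficients copied from the regression output above
b0 <- 12.525647
b1 <- -0.220970

# Hypothetical city fuel economies (mpg), chosen for illustration
city_mpg <- c(20, 30)

# Forecast carbon footprint from the estimated regression line
carbon_hat <- b0 + b1 * city_mpg
carbon_hat

# A 1 mpg increase changes the forecast by exactly the slope
slope_check <- (b0 + b1 * 31) - (b0 + b1 * 30)
slope_check
```

Under these assumptions the forecasts are roughly 8.11 and 5.90 tonnes per year. With the fitted `fit` object available, `predict(fit, newdata = data.frame(City = c(20, 30)))` should give the same values up to rounding of the copied coefficients.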

Evaluating the regression model

Recall that each residual is the unpredictable random component of each observation and is defined as

\[e_i=y_i-\hat{y}_i, \]

for \(i=1,\ldots,N\).

We would expect the residuals to be randomly scattered without showing any systematic patterns. A simple and quick way for a first check is to examine a scatterplot of the residuals against the predictor variable.

A non-random pattern may indicate that the relationship is non-linear, that some heteroscedasticity is present (i.e., the residuals show non-constant variance), or that there is some leftover serial correlation (relevant only when the data are time series).

Residual plots

res <- residuals(fit)
plot(jitter(res)~jitter(City), ylab="Residuals", xlab="City", 
     las = 1, col = "blue", pch = "+", data=fuel)
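
For comparison, here is what a clearly non-random residual pattern can look like. This is a sketch on simulated data in which the true relationship is quadratic but a straight line is fitted; the data-generating values and the names `fit_lin` and `res_lin` are assumptions for illustration.

```r
# Simulated data with a curved (quadratic) relationship: illustration only
set.seed(4)
x <- seq(1, 20, length.out = 80)
y <- 5 + 0.3 * x^2 + rnorm(80, sd = 3)

# Fit a straight line even though the true relationship is curved
fit_lin <- lm(y ~ x)
res_lin <- residuals(fit_lin)

# Residuals against the predictor: a U shape suggests a missed nonlinearity
plot(x, res_lin, xlab = "x", ylab = "Residuals",
     las = 1, col = "blue", pch = "+")
abline(h = 0, col = "red")
```

In this setup the residuals tend to be positive at both ends of the \(x\) range and negative in the middle, which is the kind of systematic pattern a well-specified model should not leave behind.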

Outliers and influential observations

Observations that take on extreme values compared to the majority of the data are called outliers. Observations that have a large influence on the estimation results of a regression model are called influential observations. Usually, influential observations are also outliers that are extreme in the \(x\) direction.

Example: Predict the weight of 7-year-old children by regressing weight against height. Consider two samples that are identical except for a single outlier:

  1. A child who weighs 35 kg and is 120 cm tall.
  2. A child who also weighs 35 kg but is much taller at 150 cm (so more extreme in the \(x\) direction).
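
The two cases can be sketched in R. The base sample below is entirely made up (heights of 115 to 129 cm and a hypothetical weight relationship), so only the qualitative comparison matters. Leverage, computed with `hatvalues()`, depends only on the \(x\) values and measures how much potential an observation has to pull the fitted line.

```r
# Made-up base sample of 30 children: NOT real data
set.seed(5)
height <- rep(115:129, each = 2)                    # cm
weight <- -20 + 0.36 * height + rnorm(30, sd = 2)   # kg, hypothetical relationship

# Sample 1 adds a 35 kg child at 120 cm; sample 2 adds a 35 kg child at 150 cm
s1 <- data.frame(height = c(height, 120), weight = c(weight, 35))
s2 <- data.frame(height = c(height, 150), weight = c(weight, 35))

fit1 <- lm(weight ~ height, data = s1)
fit2 <- lm(weight ~ height, data = s2)

# Compare the fitted coefficients of the two samples
rbind(sample1 = coef(fit1), sample2 = coef(fit2))

# Leverage of the added child (observation 31) in each sample
lev1 <- unname(hatvalues(fit1)[31])
lev2 <- unname(hatvalues(fit2)[31])
c(sample1 = lev1, sample2 = lev2)
```

The 150 cm child has far higher leverage than the 120 cm child because that height lies well outside the range of the other observations. How far the fitted line actually moves also depends on how far the child's weight is from the line implied by the rest of the data.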