Linear regression
Linear Regression is a method of data analysis intended to be used with a set of paired observations on two variables on the same set of statistical units. Conventionally, we refer to one of the variables as independent (usually labeled x) and the other as dependent (labeled y). The notion of an independent variable often (but not always) implies the ability to choose the levels of the independent variable and that the dependent variable will respond naturally as in the stimulus-response model. The independent variable x may be a scalar or a vector. In the former case we may write one of the simplest linear-regression models as follows:
- y_i = α + βx_i + ε_i
where ε_i is a random "error". Historically, in applications to measurements in astronomy, the "error" was actually a random measurement error, but in many applications, ε is merely the amount by which the individual y-value differs from the average y-value among individuals having the same x-value. The average value of the random "error" ε is zero. Often in linear regression problems statisticians rely on the Gauss-Markov assumptions:
- The random errors ε_i have expected value 0.
- The random errors ε_i are uncorrelated (this is weaker than an assumption of probabilistic independence).
- The random errors ε_i are "homoscedastic", i.e., they all have the same variance.
Sometimes stronger assumptions are relied on (a simulation sketch under these assumptions follows the list):
- The random errors ε_i have expected value 0.
- They are independent.
- They are normally distributed.
- They all have the same variance.
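For example, here is a minimal Python sketch (with arbitrarily chosen values α = 2, β = 0.5, σ = 1, purely for illustration) that simulates data satisfying the stronger assumptions: independent, normally distributed errors with mean 0 and a common variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "true" parameter values, chosen only for illustration.
alpha, beta, sigma = 2.0, 0.5, 1.0
n = 50

x = rng.uniform(0, 10, size=n)      # levels of the independent variable
eps = rng.normal(0, sigma, size=n)  # independent normal errors, mean 0, common variance
y = alpha + beta * x + eps          # y_i = alpha + beta*x_i + epsilon_i
```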
If x_i is a vector we can take the product βx_i to be a "dot-product".
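A small sketch of the vector case (all values here are hypothetical and chosen only to illustrate the dot product):

```python
import numpy as np

alpha = 2.0
beta = np.array([0.5, -1.2, 3.0])  # hypothetical coefficient vector
x_i = np.array([2.0, 1.0, 0.5])    # one vector-valued observation of x

# The product beta*x_i taken as a dot product gives the mean of y_i.
mean_y_i = alpha + np.dot(beta, x_i)
```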
It is often erroneously thought that the reason the technique is called "linear regression" is that the graph of y = α + βx is a line. But in fact, if the model is
- y_i = α + βx_i + γx_i² + ε_i
(in which case we have put the vector (x_i, x_i²) in the role formerly played by "x_i" and the vector (β, γ) in the role formerly played by β), then the problem is still one of linear regression, even though the graph is not a straight line. The model is still "linear" because it is linear in the unknown parameters β and γ, not because its graph is a line.
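To see this concretely, here is a minimal sketch (with simulated data, chosen only for illustration) that fits y_i = α + βx_i + γx_i² by least squares. The design matrix has columns 1, x, x², and the fit is linear in the parameters (α, β, γ) even though the fitted curve is not a straight line.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(0, 1, size=50)  # illustrative data

# The vector (x_i, x_i**2) plays the role formerly played by x_i:
# the design matrix has columns 1, x, x**2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Least squares is linear in the parameters (alpha, beta, gamma).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a_hat, b_hat, g_hat = coef
```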
A statistician will usually estimate the unobservable values of the parameters α and β by the method of least squares, which consists of finding the values of a and b that minimize the sum of squares of the residuals e_i = y_i - (a + bx_i). Those values are the "least-squares estimates". The residuals may be regarded as estimates of the errors. Notice that, whereas the errors are independent, the residuals cannot be independent, because the use of least-squares estimates implies that the sum of the residuals must be 0, and the dot-product of the vector of residuals with the vector of x-values must be 0, i.e., we must have
- e_1 + ... + e_n = 0 and
- e_1x_1 + ... + e_nx_n = 0.
These two linear constraints imply that the vector of residuals must lie within a certain (n-2)-dimensional subspace of R^n; hence we say that there are "n-2 degrees of freedom for error". It can be shown to follow that
- the sum of squares of residuals e_1² + ... + e_n² is distributed as σ²χ²_{n-2}, i.e., the sum of squares, divided by the error-variance σ², has a chi-square distribution with n-2 degrees of freedom, and
- the sum of squares of residuals is actually probabilistically independent of the estimates a, b of the parameters α and β.
These facts make it possible to use Student's t-distribution (so named in honor of the pseudonymous "Student") to find confidence intervals for α and β.
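The following sketch (using simulated data with arbitrary "true" values, and SciPy only for the Student-t quantile) computes the least-squares estimates, checks the two linear constraints on the residuals, estimates the error variance with n-2 degrees of freedom, and forms a t-based confidence interval for β.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)  # illustrative data

# Least-squares estimates a, b of alpha, beta.
Sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
a = y.mean() - b * x.mean()

e = y - (a + b * x)            # residuals
print(e.sum(), np.dot(e, x))   # both are ~0: the two linear constraints

# n - 2 degrees of freedom for error.
s2 = np.sum(e ** 2) / (n - 2)  # estimate of the error variance sigma**2
se_b = np.sqrt(s2 / Sxx)       # standard error of b

# 95% confidence interval for beta, using Student's t with n - 2 df.
t = stats.t.ppf(0.975, df=n - 2)
ci_beta = (b - t * se_b, b + t * se_b)
```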
- Note: A useful alternative to linear regression is robust regression in which mean absolute error is minimized instead of mean squared error as in linear regression. Robust regression is computationally much more intensive than linear regression and is somewhat more difficult to implement as well.
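One way to carry out the minimization of the mean absolute error numerically is sketched below; the use of a general-purpose optimizer (scipy.optimize.minimize with Nelder-Mead) is an illustration of the idea, not the standard algorithm, and the data are simulated with one deliberate outlier.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=40)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=40)
y[0] += 15.0  # one gross outlier, to show the point of the robust fit

# Minimize the mean absolute error instead of the mean squared error.
def mae(params):
    a, b = params
    return np.mean(np.abs(y - (a + b * x)))

res = minimize(mae, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
a_robust, b_robust = res.x
```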
- Summarizing the data
- We sum the observations, the squares of the Y's and X's, and the products X*Y to obtain the following quantities (a code sketch of these calculations follows the estimate of alpha below).
- SX = X1 + X2 + ... + Xn and SY similarly
- SXX = X1² + X2² + ... + Xn² and SYY similarly
- SXY = X1Y1 + X2Y2 + ... + XnYn
- Estimating beta
- We use the summary statistics above to calculate b, the estimate of beta.
- b = (n*SXY - SX*SY) / (n*SXX - SX*SX)
- Estimating alpha
- We use the estimate of beta and the other statistics to estimate alpha by:
- a = (SY - b*SX)/n
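- A sketch of these calculations in Python (the names SX, SXX, SXY, SYY mirror the quantities above; the data are simulated only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=25)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=25)  # illustrative data
n = len(X)

# Summary statistics.
SX, SY = X.sum(), Y.sum()
SXX, SYY = (X ** 2).sum(), (Y ** 2).sum()
SXY = (X * Y).sum()

# Estimate of beta, then of alpha.
b = (n * SXY - SX * SY) / (n * SXX - SX * SX)
a = (SY - b * SX) / n
```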
- Displaying the residuals
- The first method of displaying the residuals uses a histogram or cumulative distribution to depict the similarity (or lack thereof) to a normal distribution. Non-normality suggests that the model may not be a good summary description of the data.
- We plot the residuals, Y - a - bX, against the independent variable, X; a plotting sketch follows after this list. There should be no discernible trend or pattern if the model is satisfactory for these data. Some of the possible problems are:
- Residuals increase (or decrease) as the independent variable increases -- indicates mistakes in the calculations -- find the mistakes and correct them.
- Residuals first rise and then fall (or first fall and then rise) -- indicates that the appropriate model is (at least) quadratic. See polynomial regression.
- One residual is much larger than the others and opposite in sign -- suggests that there is one unusual observation which is distorting the fit --
- Verify its value before publishing or
- Eliminate it, document your decision to do so, and recalculate the statistics.
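- A minimal plotting sketch (matplotlib; the data and estimates are simulated as in the earlier sketch, only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=25)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=25)  # illustrative data
n = len(X)
b = (n * (X * Y).sum() - X.sum() * Y.sum()) / (n * (X ** 2).sum() - X.sum() ** 2)
a = (Y.sum() - b * X.sum()) / n

residuals = Y - a - b * X

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram of the residuals: look for rough agreement with a normal shape.
ax1.hist(residuals, bins=10)
ax1.set_xlabel("residual")

# Residuals against X: there should be no discernible trend or pattern.
ax2.scatter(X, residuals)
ax2.axhline(0.0)
ax2.set_xlabel("X")
ax2.set_ylabel("residual")

plt.show()
```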
- Ancillary statistics
- The sum of squared deviations can be partitioned as in ANOVA to indicate what part of the dispersion of the dependent variable is explained by the independent variable.
- The correlation coefficient, r, can be calculated by
- r = (n*SXY - SX*SY) / sqrt[(n*SXX - SX²) * (n*SYY - SY²)]
- This statistic is a measure of how well a straight line describes the data. Values near zero suggest that the model is ineffective. r² is frequently interpreted as the fraction of the variability explained by the independent variable, X.
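- A sketch of the formula for r (reusing the simulated data and summary statistics from the sketch above, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=25)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=25)  # illustrative data
n = len(X)

SX, SY = X.sum(), Y.sum()
SXX, SYY = (X ** 2).sum(), (Y ** 2).sum()
SXY = (X * Y).sum()

r = (n * SXY - SX * SY) / np.sqrt((n * SXX - SX ** 2) * (n * SYY - SY ** 2))
r_squared = r ** 2  # fraction of the variability in Y explained by X
```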