Regression

In regression we are always modelling the conditional behaviour of $Y$ given $\bm X$, and not just $Y$, since we are actually using the explanatory variables to fit $\hat y$.

The MSE is minimized by the conditional expectation $E[Y \mid \bm X = \bm x]$. But we don't really know how to obtain this, so we take a step back, assume a linear form for the regression function in the risk minimization problem (ordinary least squares: minimizing the residual sum of squares), and then solve for the coefficients using matrix methods.
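A one-line justification that the conditional expectation is the minimizer: for any predictor $g(\bm x)$,

\[
E\big[(Y - g(\bm X))^2\big] = E\big[(Y - E[Y \mid \bm X])^2\big] + E\big[(E[Y \mid \bm X] - g(\bm X))^2\big] \geq E\big[(Y - E[Y \mid \bm X])^2\big],
\]

since the cross term vanishes after conditioning on $\bm X$, so $E[Y \mid \bm X]$ attains the minimum MSE.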

In multiple regression we have some simplifying assumptions (linear mean, independent homoscedastic normal errors) such that

\[
\bm y = \bm X \bm\beta + \bm\varepsilon, \qquad \bm\varepsilon \sim N(\bm 0, \sigma^2 \bm I),
\]

where $\bm X$ is the $n \times (p+1)$ design matrix.
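For concreteness, with an intercept and $p$ variates the design matrix stacks one row per observation:

\[
\bm X = \begin{pmatrix}
1 & x_{11} & \cdots & x_{1p} \\
1 & x_{21} & \cdots & x_{2p} \\
\vdots & \vdots & & \vdots \\
1 & x_{n1} & \cdots & x_{np}
\end{pmatrix}.
\]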

OLS

We call the risk under the assumed linear model the \textbf{Residual Sum of Squares}. We estimate the vector of parameters $\bm\beta$ from the training data, where $\bm x_i^T$ represents a row of the design matrix and each column is a variate (rooms, bathrooms, parking spots) or an attribute (bathrooms squared) of the $i$th observation/unit.

As a function of the coefficient vector, our estimate for $\bm\beta$ in quadratic matrix form is:

\[
\hat{\bm\beta} = \argmin_{\bm\beta} \, (\bm y - \bm X \bm\beta)^T (\bm y - \bm X \bm\beta)
\]

The OLS estimate for the coefficient vector is:

\[
\hat{\bm\beta} = (\bm X^T \bm X)^{-1} \bm X^T \bm y
\]
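This is obtained by setting the gradient of the RSS to zero, which gives the normal equations:

\[
\frac{\partial}{\partial \bm\beta}(\bm y - \bm X\bm\beta)^T(\bm y - \bm X\bm\beta) = -2\,\bm X^T(\bm y - \bm X\bm\beta) = \bm 0
\quad\Longrightarrow\quad
\bm X^T\bm X\,\hat{\bm\beta} = \bm X^T\bm y,
\]

which has the unique solution above whenever $\bm X^T\bm X$ is invertible (i.e. $\bm X$ has full column rank).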

\paragraph{Sampling distribution}

The sampling distribution of the OLS coefficient estimator is

\[
\hat{\bm\beta} \sim N\!\left(\bm\beta,\; \sigma^2 (\bm X^T \bm X)^{-1}\right).
\]
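This follows because $\hat{\bm\beta}$ is a linear function of the normally distributed $\bm y$:

\[
E[\hat{\bm\beta}] = (\bm X^T\bm X)^{-1}\bm X^T E[\bm y] = \bm\beta,
\qquad
\operatorname{Var}(\hat{\bm\beta}) = (\bm X^T\bm X)^{-1}\bm X^T(\sigma^2\bm I)\bm X(\bm X^T\bm X)^{-1} = \sigma^2(\bm X^T\bm X)^{-1}.
\]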

The variance estimator also has a sampling distribution, a scaled chi-squared:

\[
\frac{(n-p-1)\,\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-p-1}.
\]

The test of hypothesis for $H_0: \beta_j = 0$ uses the pivotal quantity

\[
z_j = \frac{\hat \beta_j}{\hat \sigma \sqrt{v_j}} \sim t_{n-p-1}
\]

where $v_j$ is the $j$th diagonal element of $(\bm X^T \bm X)^{-1}$.

Properties of the OLS estimator

\begin{itemize}

\item it is unbiased, i.e. $E[\hat{\bm\beta}] = \bm\beta$.

\item its variance-covariance matrix is $\sigma^2 (\bm X^T \bm X)^{-1}$. This means that the variability, and thus the length of confidence and prediction intervals, depends on $\sigma^2$ (which we don't control that much) and on the design matrix, which we can indeed control. The size of the variance will explode if $(\bm X^T \bm X)^{-1}$ explodes. This happens when columns of $\bm X$ are \textit{correlated}, i.e. you have \textit{redundant} information in the design matrix.

\item $\sigma^2$ is estimated with the unbiased estimator $\hat\sigma^2 = \frac{1}{n-p-1}\sum_{i=1}^n (y_i - \hat y_i)^2$; recall $p$ is the number of explanatory variates.

\item a $100(1-\alpha)\%$ \textit{confidence interval} for $\beta_j$ is $\hat\beta_j \pm t_{n-p-1,\,\alpha/2}\, \mathrm{SE}(\hat\beta_j)$, where the standard error of the estimate is the square root of the corresponding diagonal element of the estimated variance-covariance matrix $\hat\sigma^2 (\bm X^T \bm X)^{-1}$ (see the R sketch after this list).

\end{itemize}
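A minimal R sketch of these quantities on simulated data (the variable names \texttt{x1}, \texttt{x2}, etc.\ are hypothetical, not from the notes); the hand-computed values should agree with \texttt{lm}:

\begin{verbatim}
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 2)

X    <- cbind(1, x1, x2)          # design matrix with an intercept column
XtXi <- solve(t(X) %*% X)         # (X^T X)^{-1}
bhat <- XtXi %*% t(X) %*% y       # OLS estimate of the coefficient vector
res  <- y - X %*% bhat            # estimated residuals
p    <- ncol(X) - 1               # number of explanatory variates
s2   <- sum(res^2) / (n - p - 1)  # unbiased estimate of sigma^2
se   <- sqrt(s2 * diag(XtXi))     # standard errors of the coefficients
ci   <- cbind(bhat - qt(0.975, n - p - 1) * se,
              bhat + qt(0.975, n - p - 1) * se)  # 95% confidence intervals

fit <- lm(y ~ x1 + x2)            # should match bhat, se and ci
coef(summary(fit))
confint(fit)
\end{verbatim}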

You can also derive the OLS estimate via maximum likelihood estimation under the normal error assumption.
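Sketch: under $\bm y \sim N(\bm X\bm\beta, \sigma^2\bm I)$ the log-likelihood is

\[
\ell(\bm\beta, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(\bm y - \bm X\bm\beta)^T(\bm y - \bm X\bm\beta),
\]

so for any fixed $\sigma^2$ maximizing over $\bm\beta$ is equivalent to minimizing the RSS, and the MLE of $\bm\beta$ coincides with the OLS estimate. (The MLE of $\sigma^2$ divides the RSS by $n$ rather than $n-p-1$ and is therefore biased.)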

\paragraph{Sum of squares decomposition}

We can show that the total sum of squares decomposes as $\mathrm{TSS} = \mathrm{SSReg} + \mathrm{SSE}$, i.e.

\[
\sum_{i=1}^n (y_i - \bar y)^2 = \sum_{i=1}^n (\hat y_i - \bar y)^2 + \sum_{i=1}^n (y_i - \hat y_i)^2.
\]

We consider the hat matrix $\bm H = \bm X (\bm X^T \bm X)^{-1} \bm X^T$, which:

\begin{itemize}

\item is an \textit{orthogonal projection} matrix, since it is symmetric and idempotent.

\item is idempotent, so $\bm H \bm H = \bm H$.

\item the fitted values are the projection of the observed values $\bm y$ in the training set onto the linear subspace spanned by the columns of $\bm X$, i.e. $\hat{\bm y} = \bm H \bm y$.

\item the estimated residuals are the corresponding orthogonal complement, $\hat{\bm\varepsilon} = (\bm I - \bm H)\,\bm y$ (see the numerical check after this list).

\end{itemize}
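A quick numerical check of these properties and of the sum of squares decomposition, reusing the simulated \texttt{X} and \texttt{y} from the sketch above:

\begin{verbatim}
H    <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix
yhat <- H %*% y                           # fitted values: projection of y
ehat <- y - yhat                          # residuals: orthogonal complement

max(abs(H %*% H - H))  # ~ 0: H is idempotent
max(abs(t(H) - H))     # ~ 0: H is symmetric
sum(yhat * ehat)       # ~ 0: fitted values orthogonal to residuals

TSS   <- sum((y - mean(y))^2)     # total sum of squares
SSReg <- sum((yhat - mean(y))^2)  # regression sum of squares
SSE   <- sum(ehat^2)              # residual sum of squares
TSS - (SSReg + SSE)               # ~ 0: the decomposition holds
\end{verbatim}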

Regression in R

\paragraph{lm}

Consider the output of \texttt{summary(lm(...))}. The $p$-values of the coefficients test the hypotheses $H_0: \beta_j = 0$, so small values mean that the corresponding variates are useful/necessary, and large ones mean that the variable is not adding much to the prediction, or maybe its information is also captured by another correlated variable.

The $p$-value of the $F$ statistic tests the hypothesis that all coefficients (except the intercept $\beta_0$) are zero, i.e. that we could model $y$ just as well with only a constant $\bar y$.

If many coefficient $p$-values are large but the $p$-value of the $F$ statistic is small, this is a sign of collinearity, as the simulation below illustrates.
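A small simulation illustrating this, with a hypothetical variate \texttt{b} that is nearly a copy of \texttt{a}:

\begin{verbatim}
set.seed(2)
n <- 50
a <- rnorm(n)
b <- a + rnorm(n, sd = 0.01)  # nearly identical to a: redundant information
y <- 1 + 3 * a + rnorm(n)

summary(lm(y ~ a + b))
# typical outcome: the coefficient p-values for a and b are both large,
# while the overall F statistic (and R-squared) is clearly significant
\end{verbatim}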

\begin{itemize}

\item \textbf{Residual standard error}: the estimate $\hat\sigma$ (e.g.\ ``2.92'') on $n-p-1$ degrees of freedom.

\item Multiple $R$-squared: the proportion of variability explained by the model, $R^2 = \mathrm{SSReg}/\mathrm{TSS} = 1 - \mathrm{SSE}/\mathrm{TSS}$.

\item Adjusted $R$-squared: $R^2$ penalized for the number of explanatory variates, $1 - \frac{\mathrm{SSE}/(n-p-1)}{\mathrm{TSS}/(n-1)}$.

\item $F$ statistic and its $p$-value: the overall significance of the model (these fields are recovered by hand in the sketch after this list).

\end{itemize}
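As a check on these definitions, the corresponding fields can be pulled out of the fitted object, continuing the simulated collinearity example above:

\begin{verbatim}
fit <- lm(y ~ a + b)  # the simulated fit from above
s   <- summary(fit)

s$sigma          # residual standard error = sqrt(SSE / (n - p - 1))
s$r.squared      # multiple R-squared      = 1 - SSE / TSS
s$adj.r.squared  # adjusted R-squared      = 1 - (SSE/(n-p-1)) / (TSS/(n-1))
s$fstatistic     # overall F statistic with its degrees of freedom
\end{verbatim}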

Polynomial