Principal component analysis
In statistics, principal components analysis (PCA), a technique introduced by Karl Pearson in 1901, can be used to simplify a dataset. More formally, it is a linear transformation that chooses a new coordinate system for the data set such that the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component), the second greatest variance on the second axis, and so on. PCA can be used for reducing dimensionality in a dataset while retaining those characteristics of the dataset that contribute most to its variance, by discarding the later principal components (a more or less heuristic decision). These retained characteristics may be the "most important", but this is not necessarily the case, depending on the application.
PCA is also called the Karhunen-Loève transform (named after Kari Karhunen and Michel Loève) or the Hotelling transform (in honor of Harold Hotelling). PCA has the distinction of being the optimal linear transformation for keeping the subspace that has the largest variance. This advantage, however, comes at the price of greater computational requirements compared with, for example, the discrete cosine transform. Unlike other linear transforms, PCA does not have a fixed set of basis vectors; its basis vectors depend on the data set.
Assuming zero empirical mean (the empirical mean of the distribution has been subtracted from the data set), the first principal component w1 of a dataset x can be defined as:
- $\mathbf{w}_1 = \arg\max_{\|\mathbf{w}\|=1} \operatorname{E}\!\left[(\mathbf{w}^{\mathsf{T}}\mathbf{x})^2\right]$
(See arg max for the notation.) With the first k − 1 components, the k-th component can be found by subtracting the first k − 1 principal components from x:
- $\hat{\mathbf{x}}_{k-1} = \mathbf{x} - \sum_{i=1}^{k-1} \mathbf{w}_i \mathbf{w}_i^{\mathsf{T}} \mathbf{x}$
and by substituting this as the new dataset in which to find a principal component:
- $\mathbf{w}_k = \arg\max_{\|\mathbf{w}\|=1} \operatorname{E}\!\left[(\mathbf{w}^{\mathsf{T}}\hat{\mathbf{x}}_{k-1})^2\right]$.
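A minimal numerical sketch of this iterative definition (assuming NumPy, with one observation per row of x; the helper leading_direction and the toy data are illustrative, not part of any standard library): the leading direction is approximated by power iteration on the empirical covariance, and later components are obtained by deflating the data and repeating.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data set: 200 samples of a 3-dimensional vector, one sample per row.
x = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])
x = x - x.mean(axis=0)                     # enforce zero empirical mean

def leading_direction(data, n_iter=500):
    """Approximate argmax over unit w of E[(w^T x)^2] by power iteration on the covariance."""
    cov = data.T @ data / (len(data) - 1)
    w = np.ones(data.shape[1]) / np.sqrt(data.shape[1])
    for _ in range(n_iter):
        w = cov @ w
        w /= np.linalg.norm(w)
    return w

components = []
residual = x.copy()
for _ in range(x.shape[1]):
    w = leading_direction(residual)
    components.append(w)
    # Deflation: remove the projection onto w, then repeat on the residual.
    residual = residual - np.outer(residual @ w, w)

print(np.array(components))                # one principal direction per row
```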
A simpler way to calculate the components wi uses the empirical covariance matrix of x, the measurement vector. By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest correlation in the dataset. The original measurements are finally projected onto the reduced vector space. Note that the eigenvectors of the covariance matrix are the columns of the matrix V, where $X = U L V'$ is the singular value decomposition of X (here with one observation per row of X).
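This equivalence can be checked with a small sketch (assuming NumPy and a mean-centred data matrix with one observation per row; variable names are illustrative): both the eigendecomposition of the covariance matrix and the SVD of the data matrix yield the same principal directions up to the sign of each vector.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))          # 100 observations, 4 variables
X = X - X.mean(axis=0)                 # centre the data

# Route 1: eigenvectors of the empirical covariance matrix.
C = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
eigvecs = eigvecs[:, order]

# Route 2: right singular vectors of the centred data matrix (X = U L V').
U, L_, Vt = np.linalg.svd(X, full_matrices=False)

# The two bases agree up to the sign of each vector.
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T), atol=1e-8))
```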
PCA is equivalent to empirical orthogonal functions (EOF).
PCA is a popular technique in pattern recognition. However, PCA is not optimized for class separability. An alternative is the linear discriminant analysis, which does take this into account. PCA optimally minimizes reconstruction error under the L2 norm.
Algorithm details
Table of symbols and abbreviations
Symbol | Meaning | Dimensions | Indices
---|---|---|---
X = [X[m,n]] | data matrix, consisting of the set of all data vectors, one vector per column | M × N | m = 1..M, n = 1..N
N | the number of column vectors in the data set | scalar |
M | the number of elements in each column vector | scalar |
L | the number of dimensions in the dimensionally reduced subspace, 1 ≤ L ≤ M | scalar |
u = [u[m]] | vector of empirical means, one mean for each row m of the data matrix | M × 1 | m = 1..M
s = [s[m]] | vector of empirical standard deviations, one standard deviation for each row m of the data matrix | M × 1 | m = 1..M
h = [h[n]] | vector of all 1's | 1 × N | n = 1..N
Y = [Y[m,n]] | deviations from the mean of each row m of the data matrix | M × N | m = 1..M, n = 1..N
Z = [Z[m,n]] | z-scores, computed using the mean and standard deviation for each row m of the data matrix | M × N | m = 1..M, n = 1..N
C = [C[p,q]] | covariance matrix | M × M | p = 1..M, q = 1..M
R = [R[p,q]] | correlation matrix | M × M | p = 1..M, q = 1..M
V = [V[p,q]] | matrix consisting of the set of all eigenvectors of C, one eigenvector per column | M × M | p = 1..M, q = 1..M
D = [D[p,q]] | diagonal matrix consisting of the set of all eigenvalues of C along its principal diagonal, and 0 for all other elements | M × M | p = 1..M, q = 1..M
W = [W[p,q]] | matrix consisting of a subset of the eigenvectors of C, one eigenvector per column | M × L | p = 1..M, q = 1..L
Find the basis vectors
Following is a detailed description of PCA using the covariance method. Suppose you have N data vectors $x_1, \ldots, x_N$, each of length M, and you want to project the data onto an L-dimensional subspace. A short code sketch illustrating these steps follows the list.
1. Organize your data as column vectors, so that you end up with an M × N matrix X.
2. Find the empirical mean along each dimension m = 1, …, M, so that you end up with an M × 1 empirical mean vector u:
   - $u[m] = \frac{1}{N} \sum_{n=1}^{N} X[m,n]$
3. Subtract the empirical mean vector u from each column of the data matrix X and store the mean-subtracted data in the M × N matrix Y:
   - $Y = X - u\,h$
   where h is a 1 × N row vector of all 1's: $h[n] = 1$ for $n = 1, \ldots, N$.
4. Find the M × M empirical covariance matrix C from the matrix Y:
   - $C = \frac{1}{N-1}\, Y\, Y^{\mathsf{T}}$
   (the normalization 1/N is also used by some authors).
5. Create an M × 1 empirical standard deviation vector s from the square root of each element along the main diagonal of the covariance matrix C:
   - $s[m] = \sqrt{C[m,m]}$
6. Compute the matrix V of eigenvectors and the diagonal matrix D of eigenvalues of the covariance matrix C, and sort the columns of V (together with the corresponding eigenvalues in D) by decreasing eigenvalue.
7. Save the mean vector u. Save the first L columns of V as the M × L matrix W, where
   - $W[p,q] = V[p,q]$ for $p = 1, \ldots, M$ and $q = 1, \ldots, L$.
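As a concrete illustration of steps 1 to 7, here is a minimal NumPy sketch following the table's conventions (one data vector per column of X); the function name covariance_pca and the toy data are illustrative, and the 1/(N − 1) normalization matches step 4 as written above.

```python
import numpy as np

def covariance_pca(X, L):
    """PCA by the covariance method; X is M x N with one data vector per column."""
    M, N = X.shape
    u = X.mean(axis=1, keepdims=True)          # step 2: empirical mean of each row
    Y = X - u                                  # step 3: subtract the mean (u times h)
    C = Y @ Y.T / (N - 1)                      # step 4: empirical covariance matrix
    s = np.sqrt(np.diag(C))                    # step 5: standard deviation of each row
    eigvals, V = np.linalg.eigh(C)             # step 6: eigenvalues/eigenvectors of C
    order = np.argsort(eigvals)[::-1]          #         sort by decreasing eigenvalue
    D = np.diag(eigvals[order])
    V = V[:, order]
    W = V[:, :L]                               # step 7: keep the first L eigenvectors
    return u, s, D, W

# Example: 500 three-dimensional data vectors, reduced to L = 2 dimensions.
rng = np.random.default_rng(2)
X = rng.normal(size=(3, 500)) * np.array([[5.0], [2.0], [0.1]])
u, s, D, W = covariance_pca(X, L=2)
projected = W.T @ (X - u)                      # L x N matrix of projected data
print(W.shape, projected.shape)
```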
Observation
Using the covariance matrix C and the standard deviation vector s, compute the correlation matrix R as:
- $R[p,q] = \dfrac{C[p,q]}{s[p]\, s[q]}$
The matrix R is symmetric (like the covariance matrix), its values lie between −1 and 1, and every element on its main diagonal equals 1, so the sum of the elements along the main diagonal is M. In other words, the trace of the matrix R is M:
- $\operatorname{tr}(R) = \sum_{m=1}^{M} R[m,m] = M$.
When the different dimensions of the input data are measured in different units, using the matrix C means forming linear combinations of quantities on different scales; in that case it makes more sense to use the normalized matrix R.
In that case steps 6 and 7 become:
6. Compute the matrix V of eigenvectors and the diagonal matrix D of eigenvalues of R, and sort the columns of V by decreasing eigenvalue.
7. Save the mean vector u and the standard deviation vector s. Save the first L columns of V as the M × L matrix W, where
- $W[p,q] = V[p,q]$ for $p = 1, \ldots, M$ and $q = 1, \ldots, L$.
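A small sketch of this correlation-matrix variant (assuming NumPy; the toy rows deliberately use very different scales, and all names are illustrative), showing the construction of R and the modified steps 6 and 7:

```python
import numpy as np

rng = np.random.default_rng(3)
# Three rows measured in wildly different units (e.g. metres, millimetres, kilometres).
X = np.vstack([rng.normal(size=500),
               1000.0 * rng.normal(size=500),
               0.001 * rng.normal(size=500)])

u = X.mean(axis=1, keepdims=True)
Y = X - u
C = Y @ Y.T / (X.shape[1] - 1)
s = np.sqrt(np.diag(C))

# Correlation matrix: R[p, q] = C[p, q] / (s[p] * s[q]).
R = C / np.outer(s, s)
print(np.allclose(np.trace(R), X.shape[0]))    # trace of R equals M

# Steps 6 and 7 applied to R instead of C.
eigvals, V = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
W = V[:, order[:2]]                            # keep the first L = 2 eigenvectors
```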
Projecting new data
Suppose you have an M×1 data vector D. Then the L×1 projected vector is $v = W^{\mathsf{T}}(D - u)$.
If the correlation matrix R has been used instead of the covariance matrix C, the elements of the input vector should first be normalized by the empirical standard deviations: $z[m] = (D[m] - u[m]) / s[m]$. Then the projected vector is $v = W^{\mathsf{T}} z$.
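A minimal sketch of the projection step under both conventions (assuming NumPy; the training data, the new vector d, and the choice L = 2 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 200))                  # training data: M = 3, N = 200
u = X.mean(axis=1, keepdims=True)
Y = X - u
C = Y @ Y.T / (X.shape[1] - 1)
s = np.sqrt(np.diag(C))[:, None]               # M x 1 vector of standard deviations

# Basis from the covariance matrix C ...
eC, VC = np.linalg.eigh(C)
W_cov = VC[:, np.argsort(eC)[::-1][:2]]
# ... and from the correlation matrix R.
R = C / (s @ s.T)
eR, VR = np.linalg.eigh(R)
W_corr = VR[:, np.argsort(eR)[::-1][:2]]

d = rng.normal(size=(3, 1))                    # a new M x 1 data vector D

v_cov = W_cov.T @ (d - u)                      # projected vector when C was used
v_corr = W_corr.T @ ((d - u) / s)              # normalize by s when R was used
print(v_cov.ravel(), v_corr.ravel())
```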
Derivation of PCA using the covariance method
Let X be a d-dimensional random vector expressed as a column vector. Without loss of generality, assume X has zero empirical mean. We want to find a d × d orthonormal projection matrix P such that
- $Y = P^{\mathsf{T}} X$
with the constraint that
- $\operatorname{cov}(Y)$ is a diagonal matrix and $P^{-1} = P^{\mathsf{T}}$.
By substitution and matrix algebra, we obtain:
- $\operatorname{cov}(Y) = \operatorname{E}[Y Y^{\mathsf{T}}] = \operatorname{E}[(P^{\mathsf{T}} X)(P^{\mathsf{T}} X)^{\mathsf{T}}] = \operatorname{E}[P^{\mathsf{T}} X X^{\mathsf{T}} P] = P^{\mathsf{T}} \operatorname{E}[X X^{\mathsf{T}}] P = P^{\mathsf{T}} \operatorname{cov}(X)\, P$.
We now have:
- $P \operatorname{cov}(Y) = P P^{\mathsf{T}} \operatorname{cov}(X)\, P = \operatorname{cov}(X)\, P$.
Rewrite P as d column vectors, so
- $P = [P_1, P_2, \ldots, P_d]$,
and $\operatorname{cov}(Y)$ as:
- $\operatorname{cov}(Y) = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$.
Substituting into the equation above, we obtain:
- $[\lambda_1 P_1, \lambda_2 P_2, \ldots, \lambda_d P_d] = [\operatorname{cov}(X) P_1, \operatorname{cov}(X) P_2, \ldots, \operatorname{cov}(X) P_d]$.
Notice that in $\operatorname{cov}(X) P_i = \lambda_i P_i$, $P_i$ is an eigenvector of the covariance matrix of X. Therefore, by finding the eigenvectors of the covariance matrix of X, we find a projection matrix P that satisfies the original constraints.
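A small numerical check of this result (a sketch assuming NumPy; the toy data are illustrative): building P from the eigenvectors of the covariance matrix of X makes the covariance of Y = PᵀX diagonal, with the eigenvalues of cov(X) on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(5)
# d = 4 dimensional random vectors, stored one per column, with zero empirical mean.
X = rng.normal(size=(4, 1000))
X = X - X.mean(axis=1, keepdims=True)

cov_X = X @ X.T / (X.shape[1] - 1)
eigvals, P = np.linalg.eigh(cov_X)             # columns of P are eigenvectors of cov(X)

Y = P.T @ X                                    # Y = P^T X
cov_Y = Y @ Y.T / (Y.shape[1] - 1)

# Off-diagonal elements of cov(Y) vanish (up to floating-point error),
# and its diagonal holds the eigenvalues of cov(X).
print(np.allclose(cov_Y, np.diag(eigvals), atol=1e-10))
```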
Correspondence analysis
Correspondence analysis is conceptually similar to PCA, but scales the data (which must be positive) so that rows and columns are treated equivalently. It is traditionally applied to contingency tables where Pearson's chi-square test has shown a relationship between rows and columns.
See also
- eigenface
- transform coding
- independent components analysis
- singular value decomposition
- factor analysis