# Principal Component Analysis / Factor Analysis

## Chong Ho (Alex) Yu, Ph.D., CNE, MCSE

The objective of this article is to explain the concepts of eigenvector, eigenvalue, variable space, and subject space, as well as the application of these concepts in factor analysis and regression analysis. The Gabriel biplot (Gabriel, 1981; Jacoby, 1998) in SAS/JMP will be used as an example. You may come across terms such as eigenvalue and eigenvector in factor analysis and principal component analysis. What do they mean? Are they from an alien language?

No, they are from earth. We deal with numbers every day. A mathematical object with a numeric/quantitative value is called a **scalar**. A mathematical object that has both a numeric value and a direction is called a **vector**. If I just tell you to drive 10 miles to reach my home, this instruction is definitely useless. I must say something like, "From Tempe drive 10 miles west to Phoenix." This example shows how essential it is to have both quantitative and directional information.

If you are familiar with computer networking, you may know that the **Distance Vector protocol** is used by a network router to determine which path is the best way to transmit data. Again, the router must know two things: distance (how far is the destination from the source?) and vector (in what direction should the data travel?).

Another example can be found in computer graphics. There is a form of computer graphics called vector-based graphics, which is used in Adobe Illustrator, Macromedia Flash, and Paint Shop Pro. In vector-based graphics, the image is defined by the **relationships** among vectors instead of the composition of pixels. For example, to construct a shape, the software stores information like "Start from point A, draw a straight line at 45 degrees, stop at 10 units, draw another line at 35 degrees..." In short, the scalars and vectors of vector-based graphics define the characteristics of an image.

In the context of statistical analysis, vectors help us to understand the **relationships** among variables. "Eigen" is a German word, which means **characteristic**. An eigenvalue has a numeric property while an eigenvector has a directional property. These properties together define the characteristics of a variable.
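To make this concrete, here is a minimal Python sketch (not from the original article) using a hypothetical 2 × 2 correlation matrix. An eigendecomposition returns exactly the two pieces of information described above: a magnitude (the eigenvalue) and a direction (the eigenvector). It also previews a point made later: the larger an eigenvalue, the more variance that direction explains.

```python
import numpy as np

# Hypothetical correlation matrix for two moderately related tests
R = np.array([[1.0, 0.6],
              [0.6, 1.0]])

# eigh returns eigenvalues (magnitudes) and eigenvectors (directions)
eigenvalues, eigenvectors = np.linalg.eigh(R)

# For a correlation matrix the eigenvalues sum to the number of variables,
# so each eigenvalue divided by that sum is the proportion of variance
# explained along its eigenvector.
proportion_explained = eigenvalues / eigenvalues.sum()
```

Here the eigenvalues are 0.4 and 1.6, so the dominant eigenvector alone accounts for 1.6/2.0 = 80% of the total variance.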

| | GRE-Verbal | GRE-Quant |
| --- | --- | --- |
| David | 550 | 575 |
| Sandra | 600 | 580 |

The above data can be viewed as the following matrix:

    550 575
    600 580
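In NumPy terms (a sketch, not from the original article), variable space and subject space are just two ways of reading the same matrix: rows are subjects and columns are variables.

```python
import numpy as np

# Rows = subjects (David, Sandra); columns = variables (GRE-V, GRE-Q)
scores = np.array([[550, 575],
                   [600, 580]])

# Variable space: each row is one person's point (x = GRE-V, y = GRE-Q)
david = scores[0]      # David's point: (550, 575)

# Subject space: each column is one variable's vector (axes = David, Sandra)
gre_v = scores[:, 0]   # the GRE-V vector: (550, 600)
```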

In a scatterplot we deal with the variable space. In such a plot, GRE-V lies on the X-axis whereas GRE-Q lies on the Y-axis, and the data points are the scores of David and Sandra. With only two data points, the regression line is a perfect fit, of course.

A subject-space plot reverses these roles: the X-axis and Y-axis represent Sandra and David. On GRE-V, David scores 550 and Sandra scores 600, so a vector is drawn from the origin to the point where Sandra's and David's scores meet (to keep the rest of the graph visible, such a plot would start the axes at 500 rather than 0, so the scale is not in true proportion). The vector for GRE-Q is constructed in the same manner.

In reality, a research project almost always involves more than two variables and two subjects. In a multi-dimensional hyperspace, the vectors in the subject space can be combined to form an **eigenvector**, whose length depicts the **eigenvalue**. The longer the eigenvector, the higher the eigenvalue and the more variance it can explain.

For example, assume that you are questioning whether you should use GRE-V and GRE-Q together to predict GPA. In a two-subject case, you can examine the relationship between GRE-Q and GRE-V by looking at the proximity of the two vectors. When the angle between the two vectors is large, both GRE-Q and GRE-V can be retained in the model. But if the two vectors exactly overlap or almost overlap each other, then the regression model must be refined.
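The link between the angle and collinearity can be sketched as follows (the five examinees' scores here are hypothetical): for mean-centered variable vectors in subject space, the cosine of the angle between them equals the Pearson correlation, so vectors that nearly overlap signal collinear predictors.

```python
import numpy as np

# Hypothetical GRE scores for five examinees
gre_v = np.array([550., 600., 520., 640., 580.])
gre_q = np.array([575., 580., 540., 650., 560.])

# Center each variable, then treat it as a vector in subject space
v = gre_v - gre_v.mean()
q = gre_q - gre_q.mean()

# Cosine of the angle between the two centered vectors
cos_angle = (v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))

# It equals the Pearson correlation: a cosine near 1 (angle near 0)
# means the vectors nearly overlap, i.e., the predictors are collinear.
r = np.corrcoef(gre_v, gre_q)[0, 1]
```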

As mentioned before, the size of the eigenvalues can also tell us the strength of association between the variables (variance explained). When computing a regression model in the variable space, you can use the **Variance Inflation Factor (VIF)** to detect the presence of collinearity. The eigenvalue can be conceptualized as a subject-space counterpart of the VIF.
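As a sketch (again with hypothetical two-predictor data), the VIF for a predictor is 1/(1 − R²), where R² comes from regressing that predictor on the remaining predictors; with only two predictors, that R² reduces to the squared correlation between them.

```python
import numpy as np

# Hypothetical predictor scores for five examinees
gre_v = np.array([550., 600., 520., 640., 580.])
gre_q = np.array([575., 580., 540., 650., 560.])

# With two predictors, the R^2 of one regressed on the other is r^2
r = np.corrcoef(gre_v, gre_q)[0, 1]
vif = 1.0 / (1.0 - r ** 2)

# A common rule of thumb flags VIF values above 10 as serious collinearity
```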

Because many people are familiar with regression analysis, in the following, regression is used as a metaphor to illustrate these concepts.

We usually depict regression in the variable space. In the variable space the data points are people, and the purpose is to fit the regression line to the people. In other words, we want the regression line to pass through as many people as possible with the least distance between the regression line and the data points. This criterion is called least squares: minimizing the sum of the squared residuals.
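The least-squares criterion can be sketched with NumPy (the x/y data here are hypothetical): the fitted line makes the sum of squared residuals at least as small as that of any other line.

```python
import numpy as np

# Hypothetical data: four people's predictor and outcome scores
x = np.array([1., 2., 3., 4.])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Fit a straight line by least squares
slope, intercept = np.polyfit(x, y, 1)

# Sum of squared residuals for the fitted line
rss = ((y - (slope * x + intercept)) ** 2).sum()

# Any other line, e.g. y = 2x + 0.1, cannot do better
rss_other = ((y - (2 * x + 0.1)) ** 2).sum()
```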

In factor analysis and principal component analysis, we jump from the variable space into the subject space. In the subject space we fit the factor to the variables. The fit is based upon the factor loading, the variable-factor correlation. The sum of the squared factor loadings is the eigenvalue. According to Kaiser's rule, we should retain factors with an eigenvalue of one or above.

| | Variable space | Subject space |
| --- | --- | --- |
| Graphical representation | The axes are variables whereas the data points are people. | The axes are people whereas the data points are variables. |
| Reduction | The purpose of regression analysis is to reduce a large number of people's responses into a small, manageable number of trends called regression lines. | The purpose of factor analysis is to reduce a large number of variables into a small, manageable number of factors, which are represented by eigenvectors. |
| Fit | This reduction of people's responses essentially makes the scattered data form a meaningful pattern. To find the pattern in variable space we "fit" the regression line to the people's responses. In statistical jargon we call it the best fit. | In subject space we look for the fit between the variables and the factors. We want each variable to "load" onto the factor most related to it. In statistical jargon we call this factor loading. |
| Criterion | In regression we sum the squares of the residuals and make the best fit based on least squares. These are the criteria used to make the reduction and the fit. | In factor analysis we sum the squares of the factor loadings to get the eigenvalue. The size of the eigenvalues determines how many factors are "extracted" from the variables. |
| Structure | In regression we want the regression line to pass through as many points as possible. | In factor analysis the eigenvalue is geometrically expressed in the eigenvector. We want the eigenvector to pass through as many points as possible. In statistical jargon we call this simple structure, which will be explained later. |
| Equation | In regression the relationship between the outcome variable and the predictor variables can be expressed as a weighted linear combination such as Y = a + b_{1}X_{1} + b_{2}X_{2} + e. | In factor analysis the relationship between the latent variable (factor) and the observed variables can also be expressed as a weighted linear combination such as Y = b_{1}X_{1} + b_{2}X_{2}, except that there is no intercept in the equation. |

## Positive Manifold and Simple Structure

**Positive manifold**: The data may turn out to have large positive and negative loadings. If you know that your factors are bipolar (e.g., introverted versus extroverted personality), that is acceptable. But if your factors measure quantitative intelligence and verbal intelligence, they may be weakly correlated, yet they should not point in opposite directions. In other words, students who score very well in math may not perform equally well in English, but they should not score extremely poorly in English. In this case, you had better rotate the factors to get as many positive loadings as possible.

**Simple structure**: Simple structure suggests that any one variable should be highly related to only one factor and that most of the loadings on any one factor should be small. If some variables have high loadings on several factors, the researcher must rotate the factors. For instance, in the following case most variables load on Factor A, and variables 3 and 5 have high loadings on both Factor A and Factor B.

| | Factor A | Factor B |
| --- | --- | --- |
| Variable 1 | .75 | .32 |
| Variable 2 | .79 | .21 |
| Variable 3 | .64 | .67 |
| Variable 4 | .10 | .50 |
| Variable 5 | .55 | .57 |
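Using the unrotated loadings in the table above, the eigenvalue of each factor is the sum of its squared loadings, and Kaiser's rule retains factors whose eigenvalue is at least one. A NumPy sketch:

```python
import numpy as np

# Unrotated loadings from the table: variables 1-5 on Factors A and B
loadings = np.array([[.75, .32],
                     [.79, .21],
                     [.64, .67],
                     [.10, .50],
                     [.55, .57]])

# Eigenvalue of each factor = sum of squared loadings in its column
eigenvalues = (loadings ** 2).sum(axis=0)   # [1.9087, 1.1703]

# Kaiser's rule: retain factors with an eigenvalue of one or above
retain = eigenvalues >= 1.0                 # both factors retained here
```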

After rotation the structure should be less messy and simpler. In the following case, variables 1, 3, and 5 load on Factor A while variables 2 and 4 load on Factor B:

| | Factor A | Factor B |
| --- | --- | --- |
| Variable 1 | .63 | .39 |
| Variable 2 | .49 | .66 |
| Variable 3 | .77 | .27 |
| Variable 4 | .03 | .70 |
| Variable 5 | .75 | .33 |

## Combining subject space and variable space

Jacoby, W. G. (1998). *Statistical graphics for visualizing multivariate data*. Thousand Oaks, CA: Sage Publications.