Principal Component Analysis, or PCA, is a dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the original set. It works by computing the principal components and using them to perform a change of basis on the data, often keeping only the first few principal components and discarding the rest.
PCA is used in exploratory data analysis and for building predictive models. It is most commonly used for dimensionality reduction: each data point is projected onto only the first few principal components, yielding lower-dimensional data that preserves as much of the data's variation as possible.
1. Normalize the data: PCA identifies the components with the maximum variance, and each variable's contribution to a component depends on the magnitude of its variance. It is best to normalize the data before conducting a PCA, because unscaled features with different measurement units distort the relative comparison of variance across features.
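A minimal sketch of this normalization step, using a small hypothetical two-feature dataset (e.g. height in cm and income in dollars) where the scales differ by orders of magnitude:

```python
import numpy as np

# Hypothetical dataset: two features on very different scales
X = np.array([[170.0, 65000.0],
              [160.0, 48000.0],
              [180.0, 72000.0],
              [175.0, 51000.0]])

# Standardize each column to zero mean and unit variance,
# so no feature dominates the covariance purely by its units
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately [0. 0.]
print(X_std.std(axis=0))   # approximately [1. 1.]
```

Without this step, the income column's huge numeric variance would dominate the first principal component regardless of any real structure in the data.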
2. Create a covariance matrix for eigendecomposition: A useful way to capture all the possible relationships between the different dimensions is to calculate the covariance between every pair of them and collect the results in a covariance matrix, which represents these relationships in the data. Eigendecomposition of this matrix yields the principal components and the variance each one captures, and understanding the cumulative percentage of variance captured by each principal component is an integral part of reducing the feature set.
3. Select the optimal number of principal components: The optimal number of principal components is determined by examining the cumulative explained variance ratio as a function of the number of components, typically choosing the point beyond which additional components add little variance. The choice of PCs ultimately depends on the tradeoff between dimensionality reduction and information loss.
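The selection step above can be sketched as follows, using hypothetical eigenvalues and an illustrative 85% variance threshold (both are assumptions; the threshold is a judgment call in practice):

```python
import numpy as np

# Hypothetical eigenvalues of a covariance matrix, sorted descending
eigvals = np.array([4.2, 2.1, 0.9, 0.5, 0.3])

# Fraction of total variance explained by each component,
# and the running (cumulative) total
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 85%
k = int(np.argmax(cumulative >= 0.85) + 1)
print(k)  # 3
```

Plotting `cumulative` against the component index gives the familiar "elbow" plot used to justify the chosen `k`.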
- Feature reduction: PCA replaces the original features with a smaller set of uncorrelated components, discarding the low-variance directions.
- Data visualization: PCA allows you to project high-dimensional data into two or three dimensions so it can be plotted and inspected.
- Partial least squares: PCA-derived features can serve as the basis for a linear model in partial least squares regression.
- Outlier detection (improving data quality): Projecting the variables into fewer dimensions highlights anomalous values that deviate from the main structure of the data.
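Tying the steps together, here is a minimal sketch of the visualization use case: reducing hypothetical correlated 5-dimensional data to two dimensions suitable for a scatter plot (the data generation is entirely illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 5-D data that actually lives near a 2-D subspace
Z = rng.normal(size=(200, 2))
X = Z @ rng.normal(size=(2, 5)) + rng.normal(scale=0.1, size=(200, 5))

# Steps 1-2: standardize, then eigendecompose the covariance matrix
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(eigvals)[::-1]

# Step 3 (here fixed at two components for plotting):
# project onto the first two principal axes
X_2d = X_std @ eigvecs[:, order[:2]]
print(X_2d.shape)  # (200, 2)
```

The two columns of `X_2d` can then be fed directly to a scatter plot; points far from the main cloud in this projection are also candidate outliers.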
- Model performance: PCA can reduce model performance on datasets with little or no feature correlation, or on datasets that do not meet the assumption of linearity.
- Classification accuracy: The variance-based PCA framework does not consider the differentiating characteristics of the classes; the information that distinguishes one class from another may lie in the low-variance components and be discarded.
- Outliers: PCA is sensitive to outliers, so normalizing the data needs to be an essential step in any workflow.
- Interpretability: Each principal component is a linear combination of the original features, so the importance of individual features is hard to recognize.