
The Data Science Lab

Principal Component Analysis (PCA) from Scratch Using the Classical Technique with C#
Transforming a dataset into one with fewer columns is more difficult than it might sound, explains Dr. James McCaffrey of Microsoft Research in this full-code, step-by-step machine learning tutorial.

By James McCaffrey
01/17/2024

Principal component analysis (PCA) is a classical machine learning technique. The goal of PCA is to transform a dataset into one with fewer columns. This is called dimensionality reduction. The transformed data can be used for visualization or as the basis for prediction using ML techniques that can't deal with a large number of predictor columns.

There are two main ways to implement PCA. The first technique computes eigenvalues and eigenvectors from a covariance matrix derived from the source data. The second PCA technique sidesteps the covariance matrix and computes a singular value decomposition (SVD) of the source data. Both techniques give the same results, subject to very small differences.

This article presents a from-scratch C# implementation of the first technique: compute eigenvalues and eigenvectors from the covariance matrix. If you're not familiar with PCA, most of the terminology (covariance matrix, eigenvalues and eigenvectors, and so on) sounds quite mysterious. But the ideas will become clear shortly.

Figure 1: PCA Using the Classical Technique with C# in Action

A good way to see where this article is headed is to take a look at the screenshot of a demo program in Figure 1.

For simplicity, the demo uses just nine data items:

[0] 5.1000 3.5000 1.4000 0.2000
[1] 4.9000 3.1000 1.5000 0.1000
[2] 5.1000 3.8000 1.5000 0.3000
[3] 7.0000 3.2000 4.7000 1.4000
[4] 5.0000 2.0000 3.5000 1.0000
[5] 5.9000 3.2000 4.8000 1.8000
[6] 6.3000 3.3000 6.0000 2.5000
[7] 6.5000 3.2000 5.1000 2.0000
[8] 6.9000 3.2000 5.7000 2.3000

The nine data items are one-based items 1, 10, 20, 51, 61, 71, 101, 111 and 121 of the 150-item Iris dataset. Each line represents an iris flower. The four columns are sepal length, sepal width, petal length and petal width (a sepal is a green leaf-like structure). Because there are four columns, the data is said to have dimension = 4.

The leading index values in brackets are not part of the data. Each data item is one of three species: 0 = setosa, 1 = versicolor, 2 = virginica. The first three items are setosa, the next three items are versicolor and the last three items are virginica. The species labels were removed because they're not used for PCA dimensionality reduction.

PCA is intended for use with strictly numeric data. Mathematically, the technique works with Boolean variables (0-1 encoded) and with one-hot encoded categorical data. Conceptually, however, applying PCA to non-numeric data is questionable, and there's very little research on the topic.

The demo program applies z-score normalization to the source data. Then a 4-by-4 covariance matrix is computed from the normalized data. Next, four eigenvalues and four eigenvectors are computed from the covariance matrix (using the Jacobi algorithm). Then the demo uses the eigenvectors to compute the transformed data:

[0] -2.0787 0.6736 -0.0425 0.0435
[1] -2.2108 -0.2266 -0.1065 -0.0212
[2] -2.0071 1.2920 0.1883 0.0025
[3] 1.1557 0.3503 -0.9194 -0.0785
[4] -0.7678 -2.6702 0.0074 0.0115
[5] 0.7029 -0.0028 0.4066 -0.0736
[6] 1.8359 0.2191 0.6407 -0.0424
[7] 1.3476 0.1482 -0.0201 0.0621
[8] 2.0224 0.2163 -0.1545 0.0960

At this point, in a non-demo scenario, you could extract the first two or first three columns of the transformed data to use as a surrogate for the source data. For example, if you used the first two columns, you could graph the data with the first column acting as x-values and the second column acting as y-values.

The demo program concludes by programmatically reconstructing the source data from the transformed data. The idea is to verify that the PCA worked correctly, and also to illustrate that the transformed data contains all the information in the source data.
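Looking ahead, the reconstruction step amounts to reversing the projection and then undoing the normalization. Here is a minimal sketch of the idea, under the assumption that the transformed data was computed as the normalized data times a matrix W whose columns are the sorted eigenvectors; the demo's actual reconstruction code may differ:

// illustrative sketch, not the demo's exact code; assumes
// trans = stdX * W where the columns of W are the sorted
// (orthonormal) eigenvectors
static double[][] Reconstruct(double[][] trans, double[][] W,
  double[] means, double[] stds)
{
  int n = trans.Length;   // number of data items
  int dim = W.Length;     // number of variables
  double[][] result = new double[n][];
  for (int i = 0; i < n; ++i)
  {
    result[i] = new double[dim];
    for (int j = 0; j < dim; ++j)
    {
      double x = 0.0;  // row i of trans times col j of W-transpose
      for (int k = 0; k < dim; ++k)
        x += trans[i][k] * W[j][k];
      result[i][j] = (x * stds[j]) + means[j];  // undo z-score
    }
  }
  return result;
}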

This article assumes you have intermediate or better programming skill but doesn't assume you know anything about principal component analysis. The demo is implemented using C#, but you should be able to refactor the demo code to another C-family language if you wish. All normal error checking has been removed to keep the main ideas as clear as possible.

The source code for the demo program is too long to be presented in its entirety in this article. The complete code is available in the accompanying file download, and is also available online.

Understanding Principal Component Analysis
The best way to understand PCA is to take a closer look at the demo program. The demo has eight steps:

1. load the data into memory
2. normalize the data
3. compute a covariance matrix from the normalized data
4. compute the eigenvalues and eigenvectors from the covariance matrix
5. sort the eigenvectors based on the eigenvalues
6. compute the variance explained by each eigenvector
7. use the eigenvectors to compute the transformed data
8. verify by reconstructing the source data

The demo loads the source data into memory using a helper MatLoad() function:

Console.WriteLine("Loading Iris-9 data ");
string dataFile = "..\\..\\..\\Data\\iris_9.txt";
double[][] X = MatLoad(dataFile, new int[] { 0, 1, 2, 3 },
  ',', "#");
Console.WriteLine("Source data: ");
MatShow(X, 4, 9, true);

The arguments to MatLoad() mean fetch columns 0, 1, 2, 3 of the comma-separated data, where a "#" character indicates a comment line. The arguments to MatShow() mean display using 4 decimals, with a field width of 9 per element, and show the row indices.
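MatLoad() is program-defined. A minimal sketch of what such a loader might look like is shown below; the demo's actual implementation may differ in details (for example, the handling of blank lines is an assumption here):

using System;
using System.IO;
using System.Collections.Generic;

// illustrative MatLoad() sketch: skip comment and blank lines,
// split on the separator, keep only the requested columns
static double[][] MatLoad(string fn, int[] usecols,
  char sep, string comment)
{
  List<double[]> rows = new List<double[]>();
  foreach (string line in File.ReadLines(fn))
  {
    if (line.Trim().Length == 0 || line.StartsWith(comment))
      continue;
    string[] tokens = line.Split(sep);
    double[] row = new double[usecols.Length];
    for (int j = 0; j < usecols.Length; ++j)
      row[j] = double.Parse(tokens[usecols[j]]);
    rows.Add(row);
  }
  return rows.ToArray();
}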

Normalizing the Data
Each column of the data is normalized using these statements:

Console.WriteLine("Standardizing (z-score, biased) ");
double[] means;
double[] stds;
double[][] stdX = MatStandardize(X, out means, out stds);
Console.WriteLine("Standardized data items ");
MatShow(stdX, 4, 9, nRows: 3);

The four means and standard deviations for each column are saved because they're needed to reconstruct the original source data. The terms standardization and normalization are technically different but are often used interchangeably. Standardization is a specific type of normalization (z-score). PCA documentation tends to use the term standardization, but I often use the more general term normalization.

Each value is normalized using the formula x' = (x - u) / s, where x' is the normalized value, x is the original value, u is the column mean, and s is the column biased standard deviation (divide the sum of squared deviations by n, rather than n-1, where n is the number of values in the column).

For example, suppose a column of some data has just n = 3 values: (4, 15, 8). The mean of the column is (4 + 15 + 8) / 3 = 9.0. The biased standard deviation of the column is sqrt( ((4 - 9.0)^2 + (15 - 9.0)^2 + (8 - 9.0)^2) / n ) = sqrt( (25.0 + 36.0 + 1.0) / 3 ) = sqrt(62.0 / 3) = 4.55.

The normalized values for (4, 15, 8) are:

4: (4 - 9.0) / 4.55 = -1.10
15: (15 - 9.0) / 4.55 = 1.32
8: (8 - 9.0) / 4.55 = -0.22

Normalized values that are negative are less than the column mean, and normalized values that are positive are greater than the mean. For the demo data, the normalized data is:

-0.9410 0.7255 -1.3515 -1.2410
-1.1901 -0.1451 -1.2952 -1.3550
-0.9410 1.3784 -1.2952 -1.1270
1.4253 0.0725 0.5068 0.1266
-1.0655 -2.5392 -0.1689 -0.3292
0.0554 0.0725 0.5631 0.5825
0.5535 0.2902 1.2389 1.3803
0.8026 0.0725 0.7321 0.8105
1.3008 0.0725 1.0700 1.1524

After normalization, each value in a column will typically be between -3.0 and +3.0. Normalized values that are less than -6.0 or greater than +6.0 are extreme and should be investigated. The idea behind normalization is to prevent columns with large values (such as employee annual salaries) from overwhelming columns with small values (such as employee ages).
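To make the computation concrete, here is a minimal sketch of a MatStandardize()-style helper that applies the biased z-score formula column by column. It is illustrative; the demo's actual MatStandardize() may differ:

// illustrative biased z-score standardization: x' = (x - u) / s,
// where s divides the sum of squared deviations by n, not n-1
static double[][] MatStandardize(double[][] X,
  out double[] means, out double[] stds)
{
  int n = X.Length; int dim = X[0].Length;
  means = new double[dim]; stds = new double[dim];
  for (int j = 0; j < dim; ++j)
  {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += X[i][j];
    means[j] = sum / n;  // column mean
    double ss = 0.0;     // sum of squared deviations
    for (int i = 0; i < n; ++i)
      ss += (X[i][j] - means[j]) * (X[i][j] - means[j]);
    stds[j] = Math.Sqrt(ss / n);  // biased: divide by n
  }
  double[][] result = new double[n][];
  for (int i = 0; i < n; ++i)
  {
    result[i] = new double[dim];
    for (int j = 0; j < dim; ++j)
      result[i][j] = (X[i][j] - means[j]) / stds[j];
  }
  return result;
}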

Computing the Covariance Matrix
The covariance matrix of the normalized source data is computed like so:

Console.WriteLine("Computing covariance matrix: ");
double[][] covarMat = CovarMatrix(stdX, false);
MatShow(covarMat, 4, 9, false);

The "false" argument in the call to CovarMatrix() means to treat the data as if it is organized so that each variable (sepal length, sepal width, petal length and petal width) is in a column. A "true" argument means the data variables are organized as rows. I use this interface to match that of the Python numpy.cov() library function.

The covariance of two vectors is a measure of the joint variability of the vectors. For example, if v1 = [4, 9, 8] and v2 = [6, 8, 1], then covariance(v1, v2) = -0.50. Loosely speaking, the closer the covariance is to 0, the less related the two vectors are. The larger the magnitude of the covariance, the stronger the association. The sign of the covariance indicates the direction of the association. There's no upper limit on the magnitude of a covariance because it can increase as the number of elements in the vectors increases. See my article, "Computing the Covariance of Two Vectors Using C#," for a worked example.
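To make the idea concrete, here is a minimal illustrative two-vector covariance function. It uses the n-1 sample version, which reproduces the -0.50 result for v1 and v2; it is not necessarily the exact code used by the demo's helpers:

// illustrative sample covariance (divide by n-1)
// Covariance(new double[] { 4, 9, 8 },
//   new double[] { 6, 8, 1 }) returns -0.50
static double Covariance(double[] v1, double[] v2)
{
  int n = v1.Length;
  double mean1 = 0.0, mean2 = 0.0;
  for (int i = 0; i < n; ++i) { mean1 += v1[i]; mean2 += v2[i]; }
  mean1 /= n; mean2 /= n;
  double sum = 0.0;
  for (int i = 0; i < n; ++i)
    sum += (v1[i] - mean1) * (v2[i] - mean2);
  return sum / (n - 1);
}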

A covariance matrix stores the covariances between all pairs of columns in the normalized data. For example, cell [0][3] of the covariance matrix holds the covariance between column 0 and column 3. For the demo data, the covariance matrix is:

1.1250 0.1649 0.9538 0.9147
0.1649 1.1250 -0.1976 -0.1034
0.9538 -0.1976 1.1250 1.1095
0.9147 -0.1034 1.1095 1.1250

Notice that the covariance matrix is symmetric because covariance(x, y) = covariance(y, x). The values on the diagonal of the covariance matrix are all the same (1.1250) because of the z-score normalization.
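A covariance matrix can be assembled by applying a two-vector covariance function to every pair of columns. Here is a minimal sketch for the rowVar = false case, reusing the illustrative Covariance() helper shown earlier; the demo's CovarMatrix() may be implemented differently:

// illustrative CovarMatrix() for rowVar = false only
// (each variable is a column); cell [i][j] holds the
// covariance between column i and column j
static double[][] CovarMatrix(double[][] data, bool rowVar)
{
  int n = data.Length;       // number of data items
  int dim = data[0].Length;  // number of variables
  double[][] result = new double[dim][];
  for (int i = 0; i < dim; ++i)
    result[i] = new double[dim];
  for (int i = 0; i < dim; ++i)
  {
    for (int j = i; j < dim; ++j)
    {
      double[] colI = new double[n];
      double[] colJ = new double[n];
      for (int k = 0; k < n; ++k)
      {
        colI[k] = data[k][i]; colJ[k] = data[k][j];
      }
      result[i][j] = result[j][i] = Covariance(colI, colJ);
    }
  }
  return result;
}

The 1.1250 values on the diagonal check out: each z-score normalized column has biased variance 1.0, so its n-1 sample variance is n / (n-1) = 9 / 8 = 1.1250.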

Computing the Eigenvalues and Eigenvectors
The eigenvalues and eigenvectors of the covariance matrix are computed using this code:

Console.WriteLine("Computing eigenvalues and eigenvectors ");
double[] eigenVals;
double[][] eigenVecs;
Eigen(covarMat, out eigenVals, out eigenVecs);
Console.WriteLine("Eigenvectors (not sorted, as cols): ");
MatShow(eigenVecs, 4, 9, false);

For the demo data that has dim = 4, the covariance matrix of the normalized data has shape 4-by-4. There are four eigenvalues (single values like 1.234) and four eigenvectors, each with four values. The eigenvalues and eigenvectors are paired.

For the demo data, the four eigenvectors (as columns) are:

0.5498 0.2395 -0.7823 0.1683
-0.0438 0.9637 0.2420 -0.1035
0.5939 -0.1102 0.2188 -0.7663
0.5857 -0.0410 0.5306 0.6113

The program-defined Eigen() function returns the eigenvectors as columns, so the first eigenvector is (0.5498, -0.0438, 0.5939, 0.5857). The associated eigenvalue is 3.1167. For PCA it doesn't help to overthink the deep mathematics of eigenvalues and eigenvectors. You can think of eigenvalues and eigenvectors as a kind of factorization of a matrix that captures all the information in the matrix in a sequential way.

Computing eigenvalues and eigenvectors is one of the most difficult problems in numerical programming. Because a covariance matrix is mathematically symmetric positive semidefinite (think "relatively simple structure"), it's possible to use a technique called the Jacobi algorithm to find the eigenvalues and eigenvectors. The majority of the demo code is the Eigen() function and its helpers.

Eigenvectors from a matrix are not unique in the sense that the sign is arbitrary. For example, an eigenvector (0.20, -0.50, 0.30) is equivalent to (-0.20, 0.50, -0.30). The eigenvector sign issue doesn't affect PCA.
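One way to sanity-check an eigenpair is to verify that multiplying the matrix by the eigenvector gives the same result as scaling the eigenvector by its eigenvalue. Here is a minimal illustrative check, not part of the demo program:

// illustrative check: A * v should equal lambda * v, up to
// rounding; a sign-flipped v passes too, since -v is an
// equally valid eigenvector
static bool IsEigenPair(double[][] A, double[] vec,
  double val, double tol)
{
  int n = A.Length;
  for (int i = 0; i < n; ++i)
  {
    double av = 0.0;  // row i of A times vec
    for (int j = 0; j < n; ++j)
      av += A[i][j] * vec[j];
    if (Math.Abs(av - (val * vec[i])) > tol)
      return false;
  }
  return true;
}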

Sorting the Eigenvalues and Eigenvectors
The next step in PCA is to sort the eigenvalues and their associated eigenvectors from largest to smallest. Some library eigen() functions return eigenvalues and eigenvectors already sorted, and some eigen() functions don't. The demo program assumes that the eigenvalues and eigenvectors are not necessarily sorted.

First, the four eigenvalues are sorted in a way that saves the ordering so that the paired eigenvectors can be sorted in the same order:

int[] idxs = ArgSort(eigenVals);  // to sort evecs
Array.Reverse(idxs);  // used later
Array.Sort(eigenVals);
Array.Reverse(eigenVals);
Console.WriteLine("Eigenvalues (sorted): ");
VecShow(eigenVals, 4, 9);

For the demo data, the sorted eigenvalues are (3.1167, 1.1930, 0.1868, 0.0036). The program-defined ArgSort() function doesn't sort, but instead returns the order in which to sort from smallest to largest. The eigenvalues are sorted from smallest to largest using the built-in Array.Sort() function and then reversed to largest-to-smallest using the built-in Array.Reverse() function.
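A minimal sketch of an ArgSort()-style helper is shown below; the demo's version may be implemented differently:

// illustrative ArgSort(): returns the indices that would order
// the values from smallest to largest, without modifying them
static int[] ArgSort(double[] values)
{
  int n = values.Length;
  int[] idxs = new int[n];
  for (int i = 0; i < n; ++i) idxs[i] = i;
  double[] copy = (double[])values.Clone();  // preserve input
  Array.Sort(copy, idxs);  // sorts idxs in parallel with copy
  return idxs;
}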

The eigenvectors are sorted using these statements:

eigenVecs = MatExtractCols(eigenVecs, idxs);  // sort
eigenVecs = MatTranspose(eigenVecs);  // as rows
Console.WriteLine("Eigenvectors (sorted as rows):");
MatShow(eigenVecs, 4, 9, false);

The MatExtractCols() function pulls out each column of the eigenvectors matrix, ordered by the idxs array, which holds the order in which to sort from largest to smallest. The now-sorted eigenvectors are converted from columns to rows using the MatTranspose() function only because that makes them a bit easier to read. The eigenvectors must be converted back to columns later. The point is that when you're looking at PCA documentation or working with PCA code, it's important to keep track of whether the eigenvectors are stored as columns (usually) or as rows (sometimes).
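For reference, here are minimal sketches of the two helpers used above; the demo's versions may differ in details:

// illustrative MatExtractCols(): new matrix whose columns are
// the columns of m, taken in the order given by idxs
static double[][] MatExtractCols(double[][] m, int[] idxs)
{
  int nRows = m.Length; int nCols = idxs.Length;
  double[][] result = new double[nRows][];
  for (int i = 0; i < nRows; ++i)
  {
    result[i] = new double[nCols];
    for (int j = 0; j < nCols; ++j)
      result[i][j] = m[i][idxs[j]];
  }
  return result;
}

// illustrative MatTranspose(): rows become columns
static double[][] MatTranspose(double[][] m)
{
  int nRows = m.Length; int nCols = m[0].Length;
  double[][] result = new double[nCols][];
  for (int i = 0; i < nCols; ++i)
  {
    result[i] = new double[nRows];
    for (int j = 0; j < nRows; ++j)
      result[i][j] = m[j][i];
  }
  return result;
}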

https://visualstudiomagazine.com/articles/2024/01/17/principal-component-analysis.aspx
