Principal Components in TypeScript (Part 2): PCA's Core Mechanics Explained

For data professionals wrestling with complex, high-dimensional datasets, Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are powerful tools. They cut through noise, reveal underlying structures, and can transform unwieldy information into actionable insights. Yet, the elegance of these mathematical techniques can sometimes obscure practical challenges, particularly when translated into real-world software implementations.

The Power and the Pitfall: PCA in Practice

Consider the classic problem of normalizing data: a common scenario in everything from financial modeling to customer segmentation. The original article illustrates this with a simple, relatable example: a high school teacher trying to fairly grade students across three exams of varying difficulty. Averages alone don't tell the full story. Student 4 might have a slightly higher average than Student 2, but was that due to genuine understanding or just excelling on the easiest test?

This is where PCA steps in. By normalizing the data (here, centering it by subtracting each exam's mean from its scores) and then analyzing the variances and covariances, PCA can identify the core, uncorrelated factors influencing performance. In our exam scenario, it could uncover that Exam 1 was far more discriminating, meaning it truly separated high and low performers, while Exam 2 clustered scores more tightly. PCA assigns mathematical weights to these factors, allowing for a more nuanced and "fairer" composite score. In the example, Student 4 ultimately scores higher than Student 2 once exam difficulty is accounted for, moving from a 63.33 average to a score of 53, compared to Student 2's 44. This kind of clarity is invaluable in any field trying to assess performance or patterns accurately.

The calculations involve creating a Deviation Matrix, then multiplying its transpose by itself and dividing by the number of observations (or by one less, for a sample estimate) to get a Variance-Covariance matrix. This matrix is then subjected to SVD, which decomposes it into components that reveal characteristic directions and their importance. For reducing the number of data columns, we look at the 'V' matrix (eigenvectors) and 'Σ' (eigenvalues; for a covariance matrix, the singular values and eigenvalues coincide). Eigenvalues tell you how much variance each eigenvector explains, allowing you to select the most significant components to represent your data. When the columns are strongly correlated, the first component or two often explain the bulk of the variance, which the article likens to the Pareto principle, but the actual share depends on the dataset, so check the eigenvalues rather than assuming a figure.
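As a rough illustration of the covariance step, the sketch below builds a variance-covariance matrix from a deviation matrix that has already been centered. The function name `covarianceFromDeviations`, the `Matrix` alias, and the sample numbers are assumptions made for this example and are not taken from `pca-js`. It divides by n by default (population covariance) and by n - 1 when `sample` is true.

```typescript
type Matrix = number[][];

// Hypothetical helper: covariance = (Dᵀ · D) / divisor, where D is a deviation
// matrix whose columns already have their means subtracted.
function covarianceFromDeviations(d: Matrix, sample = false): Matrix {
  const n = d.length;
  const m = d[0].length;
  const divisor = sample ? n - 1 : n;
  const result: Matrix = [];
  for (let a = 0; a < m; a++) {
    const outRow: number[] = [];
    for (let b = 0; b < m; b++) {
      // Sum of products of column a and column b over all rows.
      let sum = 0;
      for (const row of d) sum += row[a] * row[b];
      outRow.push(sum / divisor);
    }
    result.push(outRow);
  }
  return result;
}

// Made-up deviations for 4 students across 3 exams (each column sums to zero).
const deviations: Matrix = [
  [10, 2, -5],
  [-10, -2, 5],
  [20, 1, 0],
  [-20, -1, 0],
];
const covariance = covarianceFromDeviations(deviations);
console.log(covariance); // diagonal: variances, off-diagonal: covariances
```

The work here grows with n·m² (rows times columns squared), so the row count only enters linearly.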

This process lets you compress data by selecting just a few key eigenvectors. If you start with 4 columns and 3 million rows, and you can effectively represent it with 2 columns, you've dramatically reduced your data footprint without losing too much critical information. While the "uncompressed" data is an approximation, the lossy nature is often an acceptable trade-off for the gains in efficiency and clarity.
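To make the compression step concrete, here is a minimal sketch of projecting centered rows onto a few chosen components and then rebuilding an approximation from the scores. It assumes the eigenvectors are orthonormal and are passed as an array of vectors, one per kept component. The names `project` and `reconstruct` and the sample values are hypothetical, chosen for this example only.

```typescript
type Matrix = number[][];

// Scores: one value per kept component for each centered row (dot products).
function project(centered: Matrix, components: Matrix): Matrix {
  return centered.map(row =>
    components.map(vec => vec.reduce((sum, w, j) => sum + w * row[j], 0))
  );
}

// Approximation of the centered rows rebuilt from the scores alone.
function reconstruct(scores: Matrix, components: Matrix): Matrix {
  return scores.map(s =>
    components[0].map((_, j) =>
      components.reduce((sum, vec, c) => sum + s[c] * vec[j], 0)
    )
  );
}

// Made-up example: keep 1 component out of 2 columns.
const kept: Matrix = [[Math.SQRT1_2, Math.SQRT1_2]];
const centeredRows: Matrix = [
  [2, 2],  // lies along the kept component
  [1, -1], // orthogonal to the kept component
];
const compressed = project(centeredRows, kept);  // 2 rows x 1 column
const approximation = reconstruct(compressed, kept);
// Row 1 is recovered (up to rounding); row 2 collapses to zeros,
// which is the information lost by dropping the second component.
console.log(compressed, approximation);
```

To get back to the original scale, the column means removed during centering would need to be added to each reconstructed row.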

Elegance vs. Optimization: A Developer's Dilemma

But the journey from elegant math to performant code isn't always smooth. The `pca-js` JavaScript library, for instance, provides an implementation that, by its own admission, "skews for elegance instead of optimization." This philosophy is visible in how it handles core operations like calculating the Deviation Matrix.

The library's `multiplyAndScale` function, central to its data normalization, uses three nested loops, so its running time grows with aRows × aCols × bCols. In the deviation calculation the first argument is the n×n matrix from `unitSquareMatrix(matrix.length)` and the second is the n×m data matrix, where n is the number of rows and m the number of columns. That makes the step O(n²·m) in time and, more importantly, O(n²) in memory for the unit matrix alone. For smaller datasets, this might be negligible. However, for "moderately large datasets" (and in today's data-rich environment, "moderate" can still mean millions of records) quadratic growth in the row count rapidly becomes a bottleneck and can end in Out Of Memory (OOM) errors. The original author directly raises this concern, questioning whether the issue is truly fixed or whether OOM will persist.

let unit = unitSquareMatrix(matrix.length);
let deviationMatrix = subtract(matrix, multiplyAndScale(unit, matrix, 1 / matrix.length));
const D = deviationMatrix;
// multiplyAndScale is a matrix multiplication with the scaling folded in; the 1 / n factor turns column sums into means.
// The snippet below is taken directly from the library.
/**
 * Fix for #11, OOM on moderately large datasets, fuses scale and multiply into a single operation to save memory
 *
 * @param {Matrix} a
 * @param {Matrix} b
 * @param {number} factor
 * @returns
 */
export function multiplyAndScale(a: Matrix, b: Matrix, factor: number): Matrix {
  assertValidMatrices(a, b, "a", "b")
  const aRows = a.length;
  const aCols = a[0].length;
  const bCols = b[0].length;
  const flat = new Float64Array(aRows * bCols);
  for (let i = 0; i < aRows; i++) {
    for (let k = 0; k < aCols; k++) {
      const aVal = a[i][k] * factor;
      const iOffset = i * bCols;
      for (let j = 0; j < bCols; j++) {
        flat[iOffset + j] += aVal * b[k][j];
      }
    }
  }
  const result: Matrix = [];
  for (let i = 0; i < aRows; i++) {
    result[i] = Array.from(flat.subarray(i * bCols, (i + 1) * bCols));
  }
  return result;
}

The justification offered for this design is that the library "assumes that you would eventually be doing this on the GPU." While GPU acceleration can indeed make parallel operations significantly faster for high-volume data, this assumption effectively pushes the burden of optimization onto the user's infrastructure. It means that if your environment isn't GPU-accelerated, or if you're dealing with "moderately large" but not "massive" datasets that might not warrant a full GPU setup, you could hit performance walls surprisingly quickly. A simpler approach for mean calculation, such as direct mapping and reduction on matrix columns, could circumvent this issue for CPU-bound applications.
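A minimal sketch of that simpler approach follows, assuming the data is a row-major `number[][]`. The helper names `columnMeans` and `centerColumns` are hypothetical and not part of `pca-js`. The point is only that centering needs one pass to sum each column and one pass to subtract the means, with no n×n helper matrix.

```typescript
type Matrix = number[][];

// Mean of each column in a single pass: O(n·m) time, O(m) extra memory.
function columnMeans(matrix: Matrix): number[] {
  const n = matrix.length;
  if (n === 0) return [];
  const sums = matrix[0].map(() => 0);
  for (const row of matrix) {
    for (let j = 0; j < sums.length; j++) sums[j] += row[j];
  }
  return sums.map(s => s / n);
}

// Deviation matrix: every value minus the mean of its column.
function centerColumns(matrix: Matrix): Matrix {
  const means = columnMeans(matrix);
  return matrix.map(row => row.map((v, j) => v - means[j]));
}

// Made-up scores: 3 students, 2 exams.
const examScores: Matrix = [
  [40, 70],
  [50, 80],
  [90, 60],
];
console.log(columnMeans(examScores));   // [60, 70]
console.log(centerColumns(examScores)); // [[-20, 0], [-10, 10], [30, -10]]
```

For a dataset with millions of rows and a handful of columns, this keeps both time and memory linear in the row count.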

This tension between elegant, theoretically sound code and real-world performance is a constant challenge for library developers and the data professionals who rely on their tools. The `pca-js` project's source code directly reveals these design choices.

Beyond the Numbers: The Interpretation Challenge

While PCA excels at dimension reduction and pattern identification, it's not a silver bullet for insight into inherently distinct variables. The original article bluntly labels PCA's interpretability in such cases as "Wishy Washy AF." If our exams were not just different difficulties of the "same subject" but entirely different subjects—say, Physics, Chemistry, and Math—then a combined "score" becomes less about a student's overall aptitude and more about an opaque weighting. Is a high-scoring student a "physical chemist" or a "mathematical physicist"? The labels blur because the underlying concepts are too disparate to merge into a single, meaningful composite.

This highlights a crucial distinction: PCA is phenomenal when you want to reduce redundant features, visualize high-dimensional data, or preprocess data for machine learning models. It helps in situations where you believe there are underlying, unobserved factors that explain correlations between your variables (e.g., "exam difficulty" as an unobserved factor influencing all exam scores). But when your variables represent fundamentally different domains, asking PCA to combine them into a single, easily interpretable "super-variable" might lead to more confusion than clarity.

[
  [var(f1),    cov(f1,f2), cov(f1,f3)],
  [cov(f2,f1), var(f2),    cov(f2,f3)],
  [cov(f3,f1), cov(f3,f2), var(f3)   ]
]

The mathematical operations themselves are clear: we are calculating variances and covariances, then decomposing these into characteristic vectors and their importance scores. The interpretation of these "eigen-characteristics" (eigen is German for "own," implying inherent properties) then becomes the crucial, often subjective, step. When applied to exam scores, the highest-weighted component makes sense as "difficulty." When applied to customer purchasing habits, it might be "price sensitivity" or "brand loyalty." But without a clear conceptual link between the original variables, the principal components can become abstract and difficult to name convincingly.

percentage_explained = Σ(selected eigenvalues) / Σ(all eigenvalues)
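The ratio above can be sketched in a few lines. The function names `explainedVariance` and `componentsNeeded` and the eigenvalues used are assumptions for illustration and do not come from the library. The result is a fraction between 0 and 1, so multiply by 100 for a percentage.

```typescript
// Fraction of total variance captured by the k largest eigenvalues.
function explainedVariance(eigenvalues: number[], k: number): number {
  const sorted = eigenvalues.slice().sort((a, b) => b - a);
  const total = sorted.reduce((sum, v) => sum + v, 0);
  if (total === 0) return 0;
  const keptSum = sorted.slice(0, k).reduce((sum, v) => sum + v, 0);
  return keptSum / total;
}

// Smallest number of components whose cumulative share reaches `threshold`.
function componentsNeeded(eigenvalues: number[], threshold: number): number {
  for (let k = 1; k <= eigenvalues.length; k++) {
    if (explainedVariance(eigenvalues, k) >= threshold) return k;
  }
  return eigenvalues.length;
}

// Made-up eigenvalues for illustration only.
const exampleEigenvalues = [6, 3, 1];
console.log(explainedVariance(exampleEigenvalues, 1)); // 0.6
console.log(componentsNeeded(exampleEigenvalues, 0.8)); // 2
```

Picking a threshold this way turns "how many components should I keep?" into an explicit, checkable choice instead of a rule of thumb.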

Making Smart Choices as a Data Practitioner

For informed industry professionals, the takeaway here is nuanced. PCA and SVD remain indispensable tools for data scientists and analysts. They are invaluable for tackling the challenges of massive datasets, improving model performance through dimensionality reduction, and uncovering latent patterns that simpler methods miss. However, the choice of implementation matters. When selecting a library, especially for core numerical operations, it's not enough to know *what* it does; you need to understand *how* it does it. Investigate its underlying algorithms, its stated design philosophy, and its performance characteristics. Benchmark with your own data, especially if you anticipate working with "moderately large" datasets that could push a triple-nested-loop implementation, whose cost grows quadratically with the row count, to its limits.

Moreover, approach the interpretation of PCA results with a critical eye. While powerful for compression and identifying underlying factors within related variables, be cautious about over-interpreting combined components when your original variables are conceptually disparate. PCA reduces complexity, but it doesn't magically create new, easily explainable insights where none existed before.