Copyright 2018 The TensorFlow Authors.
Licensed under the Apache License, Version 2.0 (the "License");
Probabilistic PCA
Probabilistic principal components analysis (PCA) is a dimensionality reduction technique that analyzes data via a lower dimensional latent space (Tipping and Bishop 1999). It is often used when there are missing values in the data or for multidimensional scaling.
Imports
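The sketches below assume a TF1-era TensorFlow Probability that still ships the `edward2` module; a typical set of imports would be:

```python
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability import edward2 as ed
```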
The Model
Consider a data set $\mathbf{X} = \{\mathbf{x}_n\}$ of $N$ data points, where each data point is $D$-dimensional, $\mathbf{x}_n \in \mathbb{R}^D$. We aim to represent each $\mathbf{x}_n$ under a latent variable $\mathbf{z}_n \in \mathbb{R}^K$ with lower dimension, $K < D$. The set of principal axes $\mathbf{W}$ relates the latent variables to the data.
Specifically, we assume that each latent variable is normally distributed,

$$\mathbf{z}_n \sim N(\mathbf{0}, \mathbf{I}).$$

The corresponding data point is generated via a projection,

$$\mathbf{x}_n \mid \mathbf{z}_n \sim N(\mathbf{W}\mathbf{z}_n, \sigma^2\mathbf{I}),$$

where the columns of the matrix $\mathbf{W} \in \mathbb{R}^{D \times K}$ are known as the principal axes. In probabilistic PCA, we are typically interested in estimating the principal axes $\mathbf{W}$ and the noise term $\sigma^2$.
Probabilistic PCA generalizes classical PCA. Marginalizing out the latent variable, the distribution of each data point is

$$\mathbf{x}_n \sim N(\mathbf{0}, \mathbf{W}\mathbf{W}^\top + \sigma^2\mathbf{I}).$$
Classical PCA is the specific case of probabilistic PCA when the covariance of the noise becomes infinitesimally small, $\sigma^2 \to 0$.
We set up our model below. In our analysis, we assume $\sigma$ is known, and instead of point estimating $\mathbf{W}$ as a model parameter, we place a prior over it in order to infer a distribution over principal axes.
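One possible sketch writes the model as an Edward2 program; the function name `probabilistic_pca` and the prior scale of 2 on the principal axes are illustrative choices, not prescribed by the text:

```python
def probabilistic_pca(data_dim, latent_dim, num_datapoints, stddv_datapoints):
  """Generative model for probabilistic PCA with a prior over the axes W."""
  w = ed.Normal(loc=tf.zeros([data_dim, latent_dim]),
                scale=2.0 * tf.ones([data_dim, latent_dim]),
                name="w")  # prior over principal axes
  z = ed.Normal(loc=tf.zeros([latent_dim, num_datapoints]),
                scale=tf.ones([latent_dim, num_datapoints]),
                name="z")  # latent variables
  x = ed.Normal(loc=tf.matmul(w, z),
                scale=stddv_datapoints * tf.ones([data_dim, num_datapoints]),
                name="x")  # observed data
  return x, (w, z)
```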
The Data
We can use the Edward2 model to generate data.
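For instance, a small synthetic dataset can be drawn from the forward model in a TF1-style session; the sizes below are illustrative:

```python
num_datapoints = 5000
data_dim = 2
latent_dim = 1
stddv_datapoints = 0.5

# Build the generative model and fetch one joint sample.
x_rv, (w_rv, z_rv) = probabilistic_pca(data_dim, latent_dim,
                                        num_datapoints, stddv_datapoints)
with tf.Session() as sess:
  x_train, actual_w, actual_z = sess.run([x_rv.value, w_rv.value, z_rv.value])
```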
We visualize the dataset.
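For a 2-dimensional data space, as in the sketch above, a simple scatter plot suffices:

```python
plt.scatter(x_train[0, :], x_train[1, :], color='blue', alpha=0.1)
plt.axis([-20, 20, -20, 20])
plt.title("Data set")
plt.show()
```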
Maximum a Posteriori Inference
We first search for the point estimate of latent variables that maximizes the posterior probability density. This is known as maximum a posteriori (MAP) inference, and is done by calculating the values of $\mathbf{W}$ and $\mathbf{Z}$ that maximize the posterior density $p(\mathbf{W}, \mathbf{Z} \mid \mathbf{X})$.
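One way to do this with Edward2 is to build the model's log joint density and ascend it with a gradient-based optimizer. The sketch below assumes `ed.make_log_joint_fn` and a TF1-style training loop; the step count and learning rate are illustrative:

```python
log_joint = ed.make_log_joint_fn(probabilistic_pca)

def target(w, z):
  """Unnormalized posterior density log p(W, Z, X) with X fixed at the data."""
  return log_joint(data_dim=data_dim,
                   latent_dim=latent_dim,
                   num_datapoints=num_datapoints,
                   stddv_datapoints=stddv_datapoints,
                   w=w, z=z, x=x_train)

# Free variables for the point estimates, initialized at random.
w = tf.Variable(np.random.randn(data_dim, latent_dim).astype(np.float32))
z = tf.Variable(np.random.randn(latent_dim, num_datapoints).astype(np.float32))

energy = -target(w, z)
train_op = tf.train.AdamOptimizer(learning_rate=0.05).minimize(energy)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for _ in range(200):
    sess.run(train_op)
  w_hat, z_hat = sess.run([w, z])
```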
We can use the Edward2 model to sample data at the inferred values of $\mathbf{W}$ and $\mathbf{Z}$, and compare to the actual dataset we conditioned on.
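For example, one can push the MAP estimates through the likelihood by hand and plot the resulting samples against the training data (a sketch; Edward2's interception mechanism would be the more general route):

```python
# Draw data from the likelihood with W and Z fixed at their MAP estimates.
x_generated = (np.dot(w_hat, z_hat) +
               stddv_datapoints * np.random.randn(data_dim, num_datapoints))

plt.scatter(x_generated[0, :], x_generated[1, :], color='red', alpha=0.1)
plt.title("Data generated at the MAP estimate")
plt.show()
```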
Variational Inference
MAP can be used to find the mode (or one of the modes) of the posterior distribution, but it does not provide any other insights about it. We next use variational inference, where the posterior distribution $p(\mathbf{W}, \mathbf{Z} \mid \mathbf{X})$ is approximated using a variational distribution $q(\mathbf{W}, \mathbf{Z})$ parametrized by $\boldsymbol{\lambda}$. The aim is to find the variational parameters $\boldsymbol{\lambda}$ that minimize the KL divergence between $q$ and the posterior, $\mathrm{KL}\big(q(\mathbf{W}, \mathbf{Z}) \,\|\, p(\mathbf{W}, \mathbf{Z} \mid \mathbf{X})\big)$, or equivalently, that maximize the evidence lower bound, $\mathbb{E}_{q(\mathbf{W},\mathbf{Z};\boldsymbol{\lambda})}\left[\log p(\mathbf{X}, \mathbf{W}, \mathbf{Z}) - \log q(\mathbf{W}, \mathbf{Z}; \boldsymbol{\lambda})\right]$.
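A minimal mean-field sketch defines a factorized normal family over $\mathbf{W}$ and $\mathbf{Z}$ and maximizes a single-sample Monte Carlo estimate of the ELBO, reusing the `target` log-joint function from the MAP sketch above; the parameter names and hyperparameters are illustrative:

```python
# Variational parameters: means and softplus-transformed scales.
qw_mean = tf.Variable(np.random.randn(data_dim, latent_dim).astype(np.float32))
qz_mean = tf.Variable(np.random.randn(latent_dim, num_datapoints).astype(np.float32))
qw_stddv = tf.nn.softplus(
    tf.Variable(-4. * np.ones([data_dim, latent_dim], dtype=np.float32)))
qz_stddv = tf.nn.softplus(
    tf.Variable(-4. * np.ones([latent_dim, num_datapoints], dtype=np.float32)))

qw = ed.Normal(loc=qw_mean, scale=qw_stddv, name="qw")
qz = ed.Normal(loc=qz_mean, scale=qz_stddv, name="qz")

# Single-sample Monte Carlo estimate of the evidence lower bound:
# E_q[log p(X, W, Z) - log q(W, Z)].
energy = target(qw, qz)
entropy = -(tf.reduce_sum(qw.distribution.log_prob(qw)) +
            tf.reduce_sum(qz.distribution.log_prob(qz)))
elbo = energy + entropy

train_op = tf.train.AdamOptimizer(learning_rate=0.05).minimize(-elbo)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for _ in range(500):
    sess.run(train_op)
  w_mean_hat, z_mean_hat = sess.run([qw_mean, qz_mean])
```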
Acknowledgements
This tutorial was originally written in Edward 1.0 (source). We thank all contributors for writing and revising that version.