Review 15: Predictive Coding, Variational Autoencoders, and Biological Connections Pt 1
Predictive Coding, Variational Autoencoders, and Biological Connections by Joseph Marino
Marino, Joseph. “Predictive coding, variational autoencoders, and biological connections.” Neural Computation 34.1 (2021): 1-44.
- Predictive coding from the perspectives of theoretical neuroscience and machine learning.
- Informal definition of predictive coding: neural circuits act as probabilistic models of other neurons.
- The idea originated in circuits for sensory processing, like the retina and visual pathways.
- Feedback connections carry predictions, and the mismatch with incoming activity produces prediction error signals.
- Rao, Ballard 1999
- On the neuroscience side, cybernetics, the Helmholtz machine, and predictive coding inspired Friston's 2005 and 2008 work on free energy and active inference.
- On the machine learning side, earlier work on variational inference and encoder-decoder models culminated in the variational autoencoder (VAE).
- Lots of conceptual overlap, but the research in the two fields has remained largely divided.
- The next section is titled "Connecting Predictive Coding and VAEs." I take it that these two concepts form the bridge between the fields.
- Just from the intro, I like that this paper points to two very similar concepts in two different fields and then describes the relationship between them. I feel there's potential for a lot of discovery in understanding the union and intersection of these two concepts.
- The paper posits two possible correspondences between ML and neuroscience:
- Dendrites of pyramidal neurons and deep neural networks.
- Lateral inhibition and normalizing flows.
- Background info
- MLE
- How can we approximate the data distribution $p_{data}(\bf{x})$ given samples $\bf{x} \sim \hat{p}_{data}(\bf{x})$ from the empirical distribution?
- By maximizing the expected log-likelihood of the samples under the model: \(\theta^* \longleftarrow \arg\max_{\theta} \mathbb{E}_{\bf{x} \sim \hat{p}_{data}(\bf{x})} [\log p_{\theta}(\bf{x})]\)
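- A minimal sketch of this objective (my own toy example, not from the paper): fit a univariate Gaussian $p_{\theta}(x) = \mathcal{N}(x; \mu, \sigma^2)$ by gradient ascent on the average log-likelihood of the samples.

```python
# MLE by gradient ascent on the average log-likelihood (toy illustration).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)  # samples from the "data" distribution

mu, log_sigma = 0.0, 0.0                       # theta = (mu, log_sigma)
lr = 0.05
for _ in range(500):
    sigma = np.exp(log_sigma)
    # log p_theta(x) = -0.5*log(2*pi) - log(sigma) - (x - mu)^2 / (2*sigma^2)
    grad_mu = np.mean((x - mu) / sigma**2)                  # d/d(mu) of mean log-likelihood
    grad_log_sigma = np.mean((x - mu)**2 / sigma**2 - 1.0)  # d/d(log sigma)
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))  # should approach the true (2.0, 0.5)
```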
- Probabilistic models
- Autoregressive models: $p_{\theta}(\bf{x}) = \prod_{j=1}^{m} p_{\theta}(x_{j} \mid \bf{x}_{< j})$
- The joint over $\bf{x}$ is factored into a series of conditional distributions over its dimensions, $j = 1, \dots, m$.
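- Writing this factorization out for a small case (my own expansion, $m = 3$): \(p_{\theta}(\bf{x}) = p_{\theta}(x_1)\, p_{\theta}(x_2 \mid x_1)\, p_{\theta}(x_3 \mid x_1, x_2)\), so \(\log p_{\theta}(\bf{x}) = \sum_{j=1}^{3} \log p_{\theta}(x_j \mid \bf{x}_{<j})\).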
- Latent variable models (LVMs) with latent variables $\bf{z}$ have the joint distribution \(p_{\theta}(\bf{x}, \bf{z}) = p_{\theta}(\bf{x} \mid \bf{z}) p_{\theta}(\bf{z})\)
- $\bf{z}$ is being used here to help describe $\bf{x}$, which incurs a computational cost: the marginal likelihood $p_{\theta}(\bf{x}) = \int p_{\theta}(\bf{x}, \bf{z}) \, d\bf{z}$ is generally intractable (see the sketch below).
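- A toy illustration of that cost (my own sketch, with an assumed scalar linear-Gaussian decoder): without a closed form, even a single evaluation of the marginal likelihood needs many samples of $\bf{z}$.

```python
# Naive Monte Carlo estimate of an LVM marginal likelihood:
#   p_theta(x) = E_{p(z)}[ p_theta(x | z) ],  z ~ N(0, 1)
import numpy as np

rng = np.random.default_rng(0)

def log_p_x_given_z(x, z):
    # assumed toy decoder: x | z ~ N(2*z + 1, 0.5^2)
    mu, sigma = 2.0 * z + 1.0, 0.5
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_marginal(x, n_samples=10_000):
    z = rng.normal(size=n_samples)                          # samples from the prior p(z)
    log_w = log_p_x_given_z(x, z)
    return np.logaddexp.reduce(log_w) - np.log(n_samples)   # log of mean_k p(x | z_k)

print(log_marginal(1.5))  # thousands of prior samples for a single scalar x
```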
- Flow-based LVMs use the change-of-variables formula: \(p_{\theta}(\bf{x}) = p_{\theta}(\bf{z}) \left| \det\left(\frac{\partial \bf{x}}{\partial \bf{z}}\right) \right|^{-1}\)
- Distilling this down: there is some invertible function $f_{\theta}$ that transforms $f_{\theta}(\bf{x}) = \bf{z}$ and $f_{\theta}^{-1}(\bf{z}) = \bf{x}$.
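- A minimal sanity check of the formula (my own sketch, using an assumed scalar affine flow $x = f_{\theta}^{-1}(z) = a z + b$ with $z \sim \mathcal{N}(0, 1)$): the flow density matches the direct Gaussian density $\mathcal{N}(x; b, a^2)$.

```python
# Change of variables for a scalar affine flow: p(x) = p(z) * |dx/dz|^{-1}.
import numpy as np

a, b, x = 1.7, -0.3, 2.0            # assumed flow parameters and a test point
z = (x - b) / a                     # z = f_theta(x)

# log-density via the flow formula: log N(z; 0, 1) - log|dx/dz|
log_p_flow = -0.5 * np.log(2 * np.pi) - 0.5 * z**2 - np.log(abs(a))

# log-density written directly as N(x; b, a^2)
log_p_direct = -0.5 * np.log(2 * np.pi * a**2) - (x - b) ** 2 / (2 * a**2)

print(np.isclose(log_p_flow, log_p_direct))  # True
```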
- These techniques can be combined to form hierarchical LVMs, sequential LVMs, etc.
- An interesting example is stacking latent variables $\bf{z}^{1:L} = [\bf{z}^{1}, \dots, \bf{z}^{L}]$ with \(p_{\theta}(\bf{x}, \bf{z}^{1:L}) = p_{\theta}(\bf{x} \mid \bf{z}^{1:L}) \prod_{\ell=1}^{L} p_{\theta}(\bf{z}^{\ell} \mid \bf{z}^{\ell+1:L})\)
- I am confused by the conditional $p_{\theta}(\bf{z}^{\ell} \mid \bf{z}^{\ell+1:L})$ and the directionality of the hierarchy. Why wouldn't it go the other way, i.e., $p_{\theta}(\bf{z}^{\ell} \mid \bf{z}^{1:\ell-1})$?
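- One way to see the directionality (my own toy sketch, simplifying the conditionals to the Markov case $p_{\theta}(\bf{z}^{\ell} \mid \bf{z}^{\ell+1})$): generation runs top-down by ancestral sampling, from the most abstract latent $\bf{z}^{L}$ down to $\bf{x}$, so each level is conditioned on the levels above it.

```python
# Ancestral sampling from a toy hierarchical LVM: z^L -> ... -> z^1 -> x.
import numpy as np

rng = np.random.default_rng(0)
L = 3

def sample_hierarchy():
    z = [None] * (L + 1)                        # index 0 unused; z[l] holds z^l
    z[L] = rng.normal()                         # top-level prior p(z^L) = N(0, 1)
    for level in range(L - 1, 0, -1):
        # assumed conditional p(z^l | z^{l+1}) = N(0.8 * z^{l+1}, 1)
        z[level] = rng.normal(loc=0.8 * z[level + 1], scale=1.0)
    x = rng.normal(loc=z[1], scale=0.5)         # observation model p(x | z^1)
    return x, z[1:]

x, latents = sample_hierarchy()
print(x, latents)                               # one draw of (x, [z^1, z^2, z^3])
```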
- Fitting these models
- The log-density of a Gaussian with fixed unit variance reduces to a (negative) mean squared error term, up to an additive constant.
- A simple univariate autoregressive model can be formulated as $ p_{\theta}(x_j \mid x_{< j}) = \mathcal{N}(x_j; \mu_{\theta}(x_{< j}), \sigma^2_{\theta}(x_{< j})) $ .
- I think there’s a minor notation error here that the $x_{< j}$ should be boldface since it is still a vector.
- Training uses the gradient of the log-likelihood: $\nabla_{\theta}\mathbb{E}_{\bf{x} \sim p_{data}}[\log p_{\theta}(\bf{x})]$ (see the fitting sketch below).
- Deep autoregressive and variational models can be distinguished by how their latent variables enter the objective, with some interesting results.
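- A toy fitting sketch (my own example, with an assumed linear predictor $\mu_{\theta}(\bf{x}_{<j}) = \theta x_{j-1}$ and fixed unit variance): gradient ascent on the Gaussian autoregressive log-likelihood, where each term is a negative squared error up to a constant.

```python
# Fit p_theta(x_j | x_{<j}) = N(x_j; theta * x_{j-1}, 1) by gradient ascent
# on the log-likelihood of synthetic sequences.
import numpy as np

rng = np.random.default_rng(0)

# synthetic sequences from an AR(1) process with true coefficient 0.9
X = np.zeros((256, 50))
for j in range(1, 50):
    X[:, j] = 0.9 * X[:, j - 1] + 0.3 * rng.normal(size=256)

theta, lr = 0.0, 0.1
for _ in range(200):
    mu = theta * X[:, :-1]          # predicted means for each next step
    err = X[:, 1:] - mu             # prediction errors
    # d/d(theta) of the mean of the log N(x_j; theta*x_{j-1}, 1) terms
    grad = np.mean(err * X[:, :-1])
    theta += lr * grad

print(theta)                        # should approach the true coefficient 0.9
```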
- Variational Inference
- TODO: Left off here section 2.3