Review 15: Predictive Coding, Variational Autoencoders, and Biological Connections Pt 1
Predictive Coding, Variational Autoencoders, and Biological Connections by Joseph Marino
Marino, Joseph. “Predictive coding, variational autoencoders, and biological connections.” Neural Computation 34.1 (2021): 1-44.
- Predictive coding from the perspectives of theoretical neuroscience and machine learning.
- Informal definition of predictive coding: neural circuits act as probabilistic models of other neurons.
- The idea originated in circuits for sensory processing, like the retina and visual pathways.
- Feedback connections carry predictions, and the mismatch with incoming activity produces prediction error signals.
- Rao, Ballard 1999
- On the neuroscience side, cybernetics, the Helmholtz machine, and predictive coding inspired Friston's 2005 and 2008 work on free energy and active inference.
- On the machine learning side, earlier work on variational inference and encoder-decoder models culminated in the variational autoencoder (VAE).
- Lots of conceptual overlap, but the research in the two fields has remained largely divided.
- The next section is titled "Connecting Predictive Coding and VAEs." I take it that these two concepts form the bridge between the fields.
- Just from the intro, I like that this paper points to two very similar concepts in two different fields and then describes the relationship between them. I feel there's potential for a lot of discovery in understanding the union and intersection of these two concepts.
- The paper posits two possible correspondences between ML and neuroscience:
- Dendrites of pyramidal neurons and deep neural networks.
- Lateral inhibition and normalizing flows.
- Background info
- MLE
- How can we approximate the data distribution $p_{data}(\bf{x})$ given samples $\bf{x} \sim \hat{p}_{data}(\bf{x})$ from the empirical distribution?
- By maximizing the expected log-likelihood of the samples under the model: \(\theta^* \longleftarrow \arg\max_{\theta} \mathbb{E}_{\bf{x} \sim \hat{p}_{data}(\bf{x})} [\log p_{\theta}(\bf{x})]\)
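- A minimal sketch of this objective (my own toy example, not from the paper): fit a univariate Gaussian $p_{\theta}(x) = \mathcal{N}(x; \mu, \sigma^2)$ by gradient ascent on the average log-likelihood of the samples.

```python
# MLE by gradient ascent on the average log-likelihood (toy illustration).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=1000)  # samples from the "data" distribution

mu, log_sigma = 0.0, 0.0                       # theta = (mu, log_sigma)
lr = 0.05
for _ in range(500):
    sigma = np.exp(log_sigma)
    # log p_theta(x) = -0.5*log(2*pi) - log(sigma) - (x - mu)^2 / (2*sigma^2)
    grad_mu = np.mean((x - mu) / sigma**2)                  # d/d(mu) of mean log-likelihood
    grad_log_sigma = np.mean((x - mu)**2 / sigma**2 - 1.0)  # d/d(log sigma)
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))  # should approach the true (2.0, 0.5)
```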
- Probabilistic models
- Autoregressive models: $p_{\theta}(\bf{x}) = \prod_{j=1}^{m} p_{\theta}(x_{j} \mid \bf{x}_{< j})$
- The joint over $\bf{x}$ is factored into a series of conditional distributions over its dimensions, $j = 1, \dots, m$.
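- Writing this factorization out for a small case (my own expansion, $m = 3$): \(p_{\theta}(\bf{x}) = p_{\theta}(x_1)\, p_{\theta}(x_2 \mid x_1)\, p_{\theta}(x_3 \mid x_1, x_2)\), so \(\log p_{\theta}(\bf{x}) = \sum_{j=1}^{3} \log p_{\theta}(x_j \mid \bf{x}_{<j})\).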
- Latent variable models (LVMs) with latent variables $\bf{z}$ have the joint distribution \(p_{\theta}(\bf{x}, \bf{z}) = p_{\theta}(\bf{x} \mid \bf{z}) p_{\theta}(\bf{z})\)
- $\bf{z}$ is being used here to help describe $\bf{x}$, which incurs a computational cost: the marginal likelihood $p_{\theta}(\bf{x}) = \int p_{\theta}(\bf{x}, \bf{z}) \, d\bf{z}$ is generally intractable (see the sketch below).
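- A toy illustration of that cost (my own sketch, with an assumed scalar linear-Gaussian decoder): without a closed form, even a single evaluation of the marginal likelihood needs many samples of $\bf{z}$.

```python
# Naive Monte Carlo estimate of an LVM marginal likelihood:
#   p_theta(x) = E_{p(z)}[ p_theta(x | z) ],  z ~ N(0, 1)
import numpy as np

rng = np.random.default_rng(0)

def log_p_x_given_z(x, z):
    # assumed toy decoder: x | z ~ N(2*z + 1, 0.5^2)
    mu, sigma = 2.0 * z + 1.0, 0.5
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_marginal(x, n_samples=10_000):
    z = rng.normal(size=n_samples)                          # samples from the prior p(z)
    log_w = log_p_x_given_z(x, z)
    return np.logaddexp.reduce(log_w) - np.log(n_samples)   # log of mean_k p(x | z_k)

print(log_marginal(1.5))  # thousands of prior samples for a single scalar x
```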
- Flow-based LVMs use the change-of-variables formula: \(p_{\theta}(\bf{x}) = p_{\theta}(\bf{z}) \left| \det\left(\frac{\partial \bf{x}}{\partial \bf{z}}\right) \right|^{-1}\)
- Distilling this down: there is some invertible function $f_{\theta}$ that transforms $f_{\theta}(\bf{x}) = \bf{z}$ and $f_{\theta}^{-1}(\bf{z}) = \bf{x}$.
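- A minimal sanity check of the formula (my own sketch, using an assumed scalar affine flow $x = f_{\theta}^{-1}(z) = a z + b$ with $z \sim \mathcal{N}(0, 1)$): the flow density matches the direct Gaussian density $\mathcal{N}(x; b, a^2)$.

```python
# Change of variables for a scalar affine flow: p(x) = p(z) * |dx/dz|^{-1}.
import numpy as np

a, b, x = 1.7, -0.3, 2.0            # assumed flow parameters and a test point
z = (x - b) / a                     # z = f_theta(x)

# log-density via the flow formula: log N(z; 0, 1) - log|dx/dz|
log_p_flow = -0.5 * np.log(2 * np.pi) - 0.5 * z**2 - np.log(abs(a))

# log-density written directly as N(x; b, a^2)
log_p_direct = -0.5 * np.log(2 * np.pi * a**2) - (x - b) ** 2 / (2 * a**2)

print(np.isclose(log_p_flow, log_p_direct))  # True
```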
- These techniques can be combined to form hierarchical LVMs, sequential LVMs, etc.
- An interesting example is stacking latent variables $\bf{z}^{1:L} = [\bf{z}^{1}, \dots, \bf{z}^{L}]$ with \(p_{\theta}(\bf{x}, \bf{z}^{1:L}) = p_{\theta}(\bf{x} \mid \bf{z}^{1:L}) \prod_{\ell=1}^{L} p_{\theta}(\bf{z}^{\ell} \mid \bf{z}^{\ell+1:L})\)
- I am confused by the conditional $p_{\theta}(\bf{z}^{\ell} \mid \bf{z}^{\ell+1:L})$ and the directionality of the hierarchy. Why wouldn't it go the other way, i.e., $p_{\theta}(\bf{z}^{\ell} \mid \bf{z}^{1:\ell-1})$?
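- One way to see the directionality (my own toy sketch, simplifying the conditionals to the Markov case $p_{\theta}(\bf{z}^{\ell} \mid \bf{z}^{\ell+1})$): generation runs top-down by ancestral sampling, from the most abstract latent $\bf{z}^{L}$ down to $\bf{x}$, so each level is conditioned on the levels above it.

```python
# Ancestral sampling from a toy hierarchical LVM: z^L -> ... -> z^1 -> x.
import numpy as np

rng = np.random.default_rng(0)
L = 3

def sample_hierarchy():
    z = [None] * (L + 1)                        # index 0 unused; z[l] holds z^l
    z[L] = rng.normal()                         # top-level prior p(z^L) = N(0, 1)
    for level in range(L - 1, 0, -1):
        # assumed conditional p(z^l | z^{l+1}) = N(0.8 * z^{l+1}, 1)
        z[level] = rng.normal(loc=0.8 * z[level + 1], scale=1.0)
    x = rng.normal(loc=z[1], scale=0.5)         # observation model p(x | z^1)
    return x, z[1:]

x, latents = sample_hierarchy()
print(x, latents)                               # one draw of (x, [z^1, z^2, z^3])
```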
- Fitting these models
- The log-density of a Gaussian with fixed unit variance reduces to a (negative) mean squared error term, up to an additive constant.
- A simple univariate autoregressive model can be formulated as $ p_{\theta}(x_j \mid x_{< j}) = \mathcal{N}(x_j; \mu_{\theta}(x_{< j}), \sigma^2_{\theta}(x_{< j})) $ .
- I think there’s a minor notation error here that the $x_{< j}$ should be boldface since it is still a vector.
- Training uses the gradient of the log-likelihood: $\nabla_{\theta}\mathbb{E}_{\bf{x} \sim p_{data}}[\log p_{\theta}(\bf{x})]$ (see the fitting sketch below).
- Deep autoregressive and variational models can be distinguished by how their latent variables enter the objective, with some interesting results.
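- A toy fitting sketch (my own example, with an assumed linear predictor $\mu_{\theta}(\bf{x}_{<j}) = \theta x_{j-1}$ and fixed unit variance): gradient ascent on the Gaussian autoregressive log-likelihood, where each term is a negative squared error up to a constant.

```python
# Fit p_theta(x_j | x_{<j}) = N(x_j; theta * x_{j-1}, 1) by gradient ascent
# on the log-likelihood of synthetic sequences.
import numpy as np

rng = np.random.default_rng(0)

# synthetic sequences from an AR(1) process with true coefficient 0.9
X = np.zeros((256, 50))
for j in range(1, 50):
    X[:, j] = 0.9 * X[:, j - 1] + 0.3 * rng.normal(size=256)

theta, lr = 0.0, 0.1
for _ in range(200):
    mu = theta * X[:, :-1]          # predicted means for each next step
    err = X[:, 1:] - mu             # prediction errors
    # d/d(theta) of the mean of the log N(x_j; theta*x_{j-1}, 1) terms
    grad = np.mean(err * X[:, :-1])
    theta += lr * grad

print(theta)                        # should approach the true coefficient 0.9
```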
- Variational Inference
- TODO: Left off here section 2.3