  • Making RL Tractable by Learning More Informative Reward Functions: Example-Based Control, Meta-Learning, and Normalized Maximum Likelihood.

    • A limiting factor in RL is having reward functions that accurately represent the intended goal of the actions. There’s a gap between what the reward function specifies and what we actually want it to specify.
    • There is also a trade-off between reward functions that are specific to one environment and reward functions that generalize well.
    • The authors start from a basic reward function: a classifier that labels an action as successful or unsuccessful.
      • Works well in limited settings where little exploration is required.
      • The downside is that neural-network-based classifiers, even with some kind of regularization, can lead to convergence on suboptimal policies.
    • They describe this downside as “overconfidence”.
    • They use conditional normalized maximum likelihood (CNML) to add uncertainty to the model.
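      • Given a query point $x_q$ and data $\mathcal{D} = \{(x_0, y_0), \dots, (x_N, y_N)\}$, CNML solves, for each class $\mathcal{C}_i$, \begin{equation} \theta_i = \arg\max_{\theta} \mathbb{E}_{(x, y) \sim \mathcal{D} \cup \{(x_q, \mathcal{C}_i)\}} \left[ \mathcal{L}(f_{\theta}(x), y) \right] \end{equation} where $\mathcal{C}_i \in \{\mathcal{C}_1, \dots, \mathcal{C}_k\}$, so there are $k$ possible classes, and $\mathcal{L}$ is the objective being maximized (a log-likelihood, as I read it). In words: add the query point to the data with label $\mathcal{C}_i$ and refit. This gives one model, and so one prediction, per class (the calculation above is repeated for $i = 1, \dots, k$), and these predictions need not sum to one. Writing $f_{\theta_i}(x_q)$ for the probability that model $\theta_i$ assigns to its own class $\mathcal{C}_i$ at the query, they suggest normalizing:

      \(p_{CNML}(\mathcal{C}_i \mid x_q) = \frac { f_{\theta_i}(x_q)} {\sum\limits_{j=1}^{k}f_{\theta_j}(x_q)}\)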

      • This reward classifier is better equipped to handle exploration in uncertain situations.
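
To make the refit-then-normalize step concrete, here is a minimal sketch of CNML in a tabular setting. The tabular model (one independent label frequency per discrete state), the state names, and the helper names `mle_prob` and `cnml_probs` are my assumptions for illustration, not the paper's neural-network setup. I picked it because the maximum likelihood refit has a closed form there, so each per-label refit is exact.

```python
from collections import Counter
from fractions import Fraction


def mle_prob(data, state, label):
    """MLE of p(label | state) for a tabular model: the empirical frequency.

    Assumes `state` appears at least once in `data`.
    """
    counts = Counter(y for s, y in data if s == state)
    total = sum(counts.values())
    return Fraction(counts[label], total)


def cnml_probs(data, query, classes=(0, 1)):
    """CNML distribution over `classes` at `query`.

    For each candidate label, add (query, label) to the data, refit by
    maximum likelihood, and record the probability the refit model gives
    to that label. Then normalize the per-label scores.
    """
    scores = []
    for c in classes:
        augmented = data + [(query, c)]
        scores.append(mle_prob(augmented, query, c))
    z = sum(scores)
    return [s / z for s in scores]


# Hypothetical data: "start" was seen 10 times with label 0 (unsuccessful),
# "goal" was seen 3 times with label 1 (successful).
data = [("start", 0)] * 10 + [("goal", 1)] * 3

print(cnml_probs(data, "unseen"))  # [Fraction(1, 2), Fraction(1, 2)]
print(cnml_probs(data, "start"))   # [Fraction(11, 12), Fraction(1, 12)]
print(cnml_probs(data, "goal"))    # [Fraction(1, 5), Fraction(4, 5)]
print(mle_prob(data, "start", 1))  # 0: the plain MLE is fully confident
```

In this toy setting an unseen state gets probability 1/2 instead of a confident guess, and a state seen 10 times with label 0 still gets 1/12 where the plain MLE says 0. That residual probability is what keeps exploration alive in uncertain regions.
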
    • The authors then discuss generalization to continuous valued states.
    • The authors discuss the tractability of solving $k$ NML optimization problems (one refit per class) for every potential action. This is slow.
    • Then they introduce the meta-learning algorithm.
      • I kind of lost the thread here.
      • Train another model to predict the outcomes for $2N$ actions, corresponding to the positive and negative likelihoods.
    • Final thoughts.
      • The core algorithm is really simple, which is awesome, but I am not entirely sure about the meta-learning part. Here’s my interpretation:
        • You sample $n$ points from the replay buffer and try to train another model to predict what the NML will be, given that the point is classified as helpful and as unhelpful.
        • For the maze problem, this should start building an NML surrogate model that has a good spatial understanding of how different helpful and unhelpful point layouts shape the map.
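
One way to make the interpretation above concrete is to write out the training set such a surrogate would see. This is only a sketch of my reading, not the paper's meta-learning algorithm: the function names `build_tasks` and `task_target`, the grid-cell states, and the tabular frequency model used to compute targets are all assumptions for illustration. Each of the $n$ sampled points is paired with both labels, giving $2n$ tasks, and each task's target is the likelihood of its label after refitting on the data plus that labeled point.

```python
import random
from fractions import Fraction


def build_tasks(buffer, n, rng):
    """Sample n points from the buffer and pair each with both labels (2n tasks)."""
    points = rng.sample(buffer, n)
    return [(x, label) for x in points for label in (1, 0)]


def task_target(data, task):
    """Likelihood of the task's label at its point after refitting on data + task.

    The refit here is a tabular frequency model (an assumption), so it is
    just the fraction of labels at that point that match the task's label.
    """
    x, label = task
    augmented = data + [task]
    labels_at_x = [y for s, y in augmented if s == x]
    return Fraction(labels_at_x.count(label), len(labels_at_x))


# Hypothetical maze data: grid cells labeled 1 (helpful) or 0 (unhelpful).
data = [((0, 0), 0), ((0, 0), 0), ((0, 1), 0), ((3, 3), 1)]
buffer = [(0, 0), (0, 1), (1, 1), (3, 3)]

rng = random.Random(0)
tasks = build_tasks(buffer, 3, rng)

# (input, target) pairs that a surrogate model would be trained to reproduce.
training_set = [(task, task_target(data, task)) for task in tasks]
for task, target in training_set:
    print(task, target)
```

A surrogate fit to pairs like these would only ever approximate the per-label refits, so how well it generalizes across the map is exactly the part I am unsure about.
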