[Coral_current] generalization in RL

Tim Oates oates at cs.umbc.edu
Mon Mar 28 13:52:30 EDT 2022


This is mostly for Anjali, because we're looking at reducing sample
complexity in RL systems, but I'd like to see if anyone else has any
thoughts.

Consider Q-learning.  Tabular Q-learning is slow because it cannot
generalize across states.  RL methods that use function approximators
can generalize across states.  But I would claim that the nature of
that generalization is poorly understood.  For example, the "standard"
approach to deep learning these days is to start with a very large
network, train it forever on the training data, and hope that magic
happens (which often does) and that it generalizes well to the full
data distribution.
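
To make that concrete, here is a rough sketch of the contrast I have
in mind (PyTorch, using cart-pole's 4-dimensional state and 2 actions;
a toy illustration, not code from any of our projects).  A tabular
update touches only the single (s, a) cell that was visited, while one
gradient step on a Q-network also moves the estimates for every state
the network happens to treat as similar; that spillover is the
generalization whose nature I'm claiming we don't understand.

import torch
import torch.nn as nn

# Tabular: Q is a (num_states, num_actions) array over *discrete* states.
# No generalization: only the visited cell Q[s, a] changes.
def tabular_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Function approximation: one gradient step on a single transition also
# shifts Q(s', .) for states s' the network considers similar to s.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def approx_update(s, a, r, s_next, gamma=0.99):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    with torch.no_grad():
        target = r + gamma * q_net(s_next).max()
    loss = (q_net(s)[a] - target) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()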

Does that happen in settings like RL that are not supervised?  What is
the right way to maximally leverage (generalize over) limited
experience?  I propose that we perform a detailed empirical study of
what happens as deep networks generalize during RL.  Here is a
concrete set of steps.

(1) Pick some domain with a simple real-valued state space, such as
cart-pole or swing-up.  We want the states to be continuous so that we
can see generalization in neural networks, and we want them to be
simple so that we can visualize what has been learned.
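
For concreteness, something like the following could give us a fixed
"probe set" of states to visualize against (a sketch assuming
Gymnasium's CartPole-v1 and a random behavior policy to scatter the
states; swing-up would just be a different environment id):

import numpy as np
import gymnasium as gym

def collect_probe_states(env_id="CartPole-v1", n_states=2000, seed=0):
    """Roll out a random policy and keep the visited states as a fixed
    probe set for all of the visualizations below."""
    env = gym.make(env_id)
    rng = np.random.default_rng(seed)
    states = []
    obs, _ = env.reset(seed=seed)
    while len(states) < n_states:
        states.append(obs)
        action = int(rng.integers(env.action_space.n))
        obs, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            obs, _ = env.reset()
    env.close()
    return np.array(states)  # shape (n_states, 4) for cart-pole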

(2) Pick a modern deep RL method and train to convergence.  Develop a
visualization of the policy.  That could take a few forms, like doing
a t-SNE embedding in 2D for all of the states and coloring them by the
action (left or right in the two domains above).  Importantly, the
visualization should indicate something like "confidence" in the
action, or how much more the best action is preferred to the second
best or alternative action(s).
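
One concrete version of that plot is sketched below, under the
assumption that whatever agent we pick exposes a q_net mapping states
to Q-values (e.g. a DQN-style head).  "Confidence" here is simply the
gap between the best and second-best Q-value, shown as point size:

import numpy as np
import matplotlib.pyplot as plt
import torch
from sklearn.manifold import TSNE

def embed_states(states, seed=0):
    # Fit t-SNE once on the probe states; the same 2D embedding is
    # reused for every later snapshot so the plots stay comparable.
    return TSNE(n_components=2, random_state=seed).fit_transform(states)

def plot_policy(embedding, states, q_net, title="converged policy"):
    with torch.no_grad():
        q = q_net(torch.as_tensor(states, dtype=torch.float32)).numpy()
    greedy = q.argmax(axis=1)                    # color: chosen action
    top2 = np.sort(q, axis=1)[:, -2:]
    confidence = top2[:, 1] - top2[:, 0]         # best minus second best
    size = 5 + 40 * confidence / (confidence.max() + 1e-8)
    plt.scatter(embedding[:, 0], embedding[:, 1], c=greedy, s=size)
    plt.colorbar(label="greedy action")
    plt.title(title)
    plt.show()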

(3) Run RL again from scratch, but this time generate the plot periodically
during training (using the same states and embedding from (2)) with the
current confidence values.  This will let us watch how the algorithm
generalizes over time.
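
Sketched below; agent.train_for and agent.q_net are placeholders for
whatever interface the chosen method actually has, and plot_policy /
embed_states are the helpers from the previous sketch:

def track_generalization(agent, probe_states, embedding,
                         snapshots=20, steps_per_snapshot=5_000):
    # Re-plot the *same* probe states on the *same* embedding as training
    # proceeds, so changes in color and point size show how (and how fast)
    # the network's action preferences spread across the state space.
    for i in range(snapshots):
        agent.train_for(steps_per_snapshot)      # hypothetical agent API
        plot_policy(embedding, probe_states, agent.q_net,
                    title=f"after {(i + 1) * steps_per_snapshot} env steps")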

(4) Explore the space of architectures and hyperparameters a bit to
see how they influence generalization.  What makes generalization
slower or faster?  To what extent?  Can we get ideas from what we're
seeing to inform ways of training the networks to speed appropriate
generalization?
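
At its simplest this could just be a grid, reusing the probe set and
embedding so the resulting plots are directly comparable; make_agent
is a hypothetical factory for whichever deep RL method we settle on:

from itertools import product

widths = [32, 128, 512]
depths = [1, 2, 3]
learning_rates = [1e-4, 1e-3]

for width, depth, lr in product(widths, depths, learning_rates):
    agent = make_agent(width=width, depth=depth, lr=lr)  # hypothetical
    track_generalization(agent, probe_states, embedding)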

Has anyone done this kind of exploration before?  I suspect that what
works for classification will be different from what works for RL.
Thoughts?

   - tim

---------------------------------------
Tim Oates, Professor
Department of CS and EE
University of Maryland Baltimore County
(410) 455-3082
https://coral-lab.umbc.edu/oates/

