Abstract:
|
In the presence of non-admissible heuristics, A* and
other best-first algorithms can be converted into anytime
optimal algorithms over OR graphs, by simply continuing
the search after the first solution is found. The same
trick, however, does not work for best-first algorithms
over AND/OR graphs, that must be able to expand leaf
nodes of the explicit graph that are not necessarily part
of the best partial solution. Anytime optimal variants
of AO* must thus address an exploration-exploitation
tradeoff: they cannot just ”exploit”, they must keep exploring
as well. In this work, we develop one such variant
of AO* and apply it to finite-horizon MDPs. This
Anytime AO* algorithm eventually delivers an optimal
policy while using non-admissible random heuristics
that can be sampled, as when the heuristic is the cost
of a base policy that can be sampled with rollouts. We
then test Anytime AO* for action selection over large
infinite-horizon MDPs that cannot be solved with existing
off-line heuristic search and dynamic programming
algorithms, and compare it with UCT. |