We propose a way to differentiate through MDP planning for Restless Multi-Armed Bandits. Using Decision-Focused Learning, we apply this approach to better learn the transition matrices from the "features" associated with each arm.
We propose a way to optimally differentiate through Reinforcement Learning. Specifically, we introduce two optimality conditions that hold at convergence and show how to use them to (approximately) compute gradients.
We propose two online model-free algorithms to learn the Whittle Index associated with *multi-action* Restless Multi-Armed Bandits.