Yaqi Duan is a postdoctoral researcher at the Laboratory for Information & Decision Systems at the Massachusetts Institute of Technology, working with Professor Martin J. Wainwright. Her research interests lie broadly at the intersection of statistics and machine learning, in particular data-driven sequential decision making and reinforcement learning. In fall 2023, she will join the Stern School of Business at New York University as an Assistant Professor in the Department of Technology, Operations, and Statistics. Yaqi graduated with a PhD from the Department of Operations Research and Financial Engineering at Princeton University. Prior to that, she received a B.S. in Mathematics from Peking University. Yaqi will give this talk in the Lawrence Brown PhD Student Award session at JSM Toronto.

Optimal policy evaluation using kernel-based temporal difference methods

Policy evaluation in reinforcement learning refers to assessing the performance of a decision policy from a previously collected batch of data. The quality of a given policy is quantified by its value function, that is, the expected sum of (discounted) rewards along a trajectory generated by running that policy. Policy evaluation is central to many applications. For example, in clinical treatment, the value function might correspond to the expected long-term survival rate of patients, whereas in inventory management it measures a company's profits and losses over time.
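In symbols (the notation below is illustrative rather than taken from the talk), the value function of a policy π started at state s is

```latex
% Discounted value function of a policy \pi (illustrative notation):
% r_t is the reward at step t of a trajectory generated by running \pi,
% and \gamma \in (0,1) is the discount factor.
V^{\pi}(s) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t}\, r_t \;\middle|\; s_0 = s \right].
```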

In practice, policy evaluation is rendered challenging by the complexity of the underlying state space, which can be of finite size but prohibitively large, or continuous in nature. In most cases of interest, it is essential to use some type of function approximation to compute what is known as a projected fixed point associated with the Bellman operator. In particular, we study projected fixed point approximations that are based on non-parametric regression using reproducing kernel Hilbert spaces (RKHSs).
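Schematically, writing T for the Bellman operator of the Markov reward process and Π_F for the projection onto the approximating function class F (here a ball in an RKHS), the projected fixed point solves

```latex
% Bellman operator and projected fixed point (schematic; notation assumed):
% r(s) is the expected immediate reward and s' the next state under the policy.
(\mathcal{T} V)(s) \;=\; r(s) + \gamma\, \mathbb{E}\bigl[V(s') \mid s\bigr],
\qquad
\bar{V} \;=\; \Pi_{\mathcal{F}}\, \mathcal{T}(\bar{V}),
```

where the projection is taken in a suitable weighted L² sense; the true value function is the unprojected fixed point V^π = T(V^π).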

We provide non-asymptotic characterizations of the statistical properties of regularized kernel-based least-squares temporal difference (LSTD) estimators for policy evaluation. We study the case of an infinite-horizon γ-discounted Markov reward process (MRP). The difference between the empirical and population estimators is measured in the L²(μ)-norm, with μ denoting the stationary distribution. At a high level, the main contributions of our work are sharp and partially instance-dependent analyses of this estimation error.
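As a rough illustration of the kind of estimator involved, the following sketch implements one common ridge-regularized kernelization of the LSTD fixed-point equations from transition pairs. The RBF kernel, the regularization weight, and the particular normal-equation form are assumptions made for illustration, not necessarily the exact estimator analyzed in the talk.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_lstd(states, next_states, rewards, gamma, reg=1e-2, bandwidth=1.0):
    """Ridge-regularized kernel LSTD from transition pairs (s_i, r_i, s_i').

    Returns a callable estimate f(s) = sum_j alpha_j * k(s_j, s).  This is one
    common kernelization of the LSTD fixed-point equations, written purely as
    an illustrative sketch.
    """
    n = states.shape[0]
    K = rbf_kernel(states, states, bandwidth)             # K[i, j] = k(s_i, s_j)
    K_next = rbf_kernel(next_states, states, bandwidth)   # k(s_i', s_j)
    # Empirical TD (projected fixed-point) condition with ridge regularization:
    #   (K - gamma * K_next + n * reg * I) alpha = rewards
    A = K - gamma * K_next + n * reg * np.eye(n)
    alpha = np.linalg.solve(A, rewards)

    def value_estimate(query_states):
        return rbf_kernel(query_states, states, bandwidth) @ alpha

    return value_estimate

# Tiny synthetic example: a one-dimensional state space with noisy drift.
rng = np.random.default_rng(0)
S = rng.uniform(0.0, 1.0, size=(200, 1))
S_next = np.clip(S + 0.1 * rng.standard_normal(S.shape), 0.0, 1.0)
R = S.ravel()                       # reward equal to the current state
f_hat = kernel_lstd(S, S_next, R, gamma=0.9)
print(f_hat(np.array([[0.25], [0.75]])))
```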

We use empirical process theory techniques to derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman fluctuation. In addition, we prove minimax lower bounds over sub-classes of MRPs, which show that our rate is optimal in terms of the sample size n and the effective horizon H = 1/(1−γ). Whereas existing worst-case theory predicts cubic scaling (H³) in the effective horizon, our theory reveals that there is in fact a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman fluctuation. Notably, it is only parametric and near-parametric problems that can ever achieve the worst-case cubic scaling.
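The term "effective horizon" reflects the geometric series for the discount factor: discounted rewards accumulate over roughly H steps, since

```latex
% Effective horizon of a \gamma-discounted problem:
H \;=\; \frac{1}{1-\gamma} \;=\; \sum_{t=0}^{\infty} \gamma^{t}.
```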

We further carry out a more refined analysis of how temporal dependencies and model mis-specification affect the estimation error, in particular when the dataset takes the form of one or more trajectories collected by applying the policy of interest. Such trajectory-based data opens the possibility of using more sophisticated multi-step kernel LSTD methods, including canonical k-step look-ahead TD for k = 1, 2, … and the TD(λ) family for λ ∈ [0,1] as special cases.
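In schematic terms (notation assumed for illustration), the k-step look-ahead methods replace the one-step Bellman operator with its k-step analogue, and TD(λ) geometrically averages over all look-aheads:

```latex
% k-step Bellman operator and TD(\lambda) averaging (schematic):
(\mathcal{T}^{(k)} V)(s_0) \;=\;
\mathbb{E}\!\left[\,\sum_{t=0}^{k-1} \gamma^{t} r_t \;+\; \gamma^{k} V(s_k) \;\middle|\; s_0\right],
\qquad
\mathcal{T}^{(\lambda)} \;=\; (1-\lambda)\sum_{k=1}^{\infty} \lambda^{k-1}\, \mathcal{T}^{(k)}.
```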

Our non-asymptotic upper bounds on the estimation error reveal some delicate interactions between mixing time and model mis-specification, and provide guidance on the choice of look-ahead that defines the multi-step estimator. For a given LSTD method applied to a well-specified model, the statistical error under trajectory data is similar to that under i.i.d. sample transition pairs, whereas under mis-specification, temporal dependence in the data inflates the statistical error. However, any such deterioration can be mitigated by increasing the look-ahead. Our minimax lower bounds establish the optimality of multi-step kernel LSTD with appropriately chosen look-ahead, and reveal some fundamental differences between value function estimation and ordinary non-parametric regression.

This talk is based on collaborations with Martin J. Wainwright and Mengdi Wang.