On integrating doubly robust estimators with MCTS
In this post, I will explore the theoretical justification for integrating doubly robust estimators into Monte Carlo Tree Search (MCTS). I welcome any feedback or comments on this post.
Problem Statement
Monte Carlo Tree Search (MCTS) has emerged as a powerful algorithm for decision-making in complex, large-scale state spaces. Its success is evident in groundbreaking AI systems such as AlphaGo, AlphaZero, and MuZero. With the recent advancements of large language models (LLMs) such as GPT, Claude, and Llama in the NLP community, there has been growing interest in leveraging these models within MCTS frameworks.
Recent research (Zhao et al. 2023; Zhou et al. 2024) has explored the integration of LLMs into MCTS. It has been speculated that OpenAI’s o1 model employs MCTS, using the model’s confidence to guide the search and expand the solution space (Zhao et al. 2024). The core idea is to use the LLM’s world model to provide a prior belief over states, enabling more informed decision-making.
Traditionally, MCTS relies on a rollout policy to estimate the value function, which guides the search towards relevant parts of the tree. However, this approach can be less effective in high-dimensional state spaces, where accurate rollouts are difficult to obtain. Moreover, because MCTS is primarily an on-policy algorithm, techniques such as importance sampling (IS) and doubly robust (DR) estimators, which are typically associated with off-policy evaluation, have not been widely applied to it.
Naturally, the question arises: Can we leverage the strengths of DR estimators to enhance value function estimation in MCTS?
Why Doubly Robust Estimators?
Doubly robust (DR) estimators have gained significant popularity in causal inference due to their unique properties. Originally developed for missing data problems, DR estimators have found widespread use in observational studies and clinical trials. The key advantage of DR estimators is their ability to provide unbiased estimates if either the outcome model or the propensity score model is correctly specified, offering a safeguard against model misspecification.
In causal inference, DR estimators combine two approaches:
- An outcome regression model that predicts the outcome based on covariates and treatment.
- A propensity score model that estimates the probability of receiving treatment given covariates.
This dual modeling approach provides a “double” chance of getting the correct estimate, hence the term “doubly robust.”
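To make the “double chance” concrete, here is a minimal sketch of the augmented inverse-propensity-weighted (AIPW) form of a DR estimator for an average treatment effect. It uses only numpy; the fitted outcome predictions (`mu1_hat`, `mu0_hat`) and propensity scores (`e_hat`) are assumed to come from models trained elsewhere and are illustrative placeholders, not part of any particular library.

```python
import numpy as np

def aipw_ate(y, t, mu1_hat, mu0_hat, e_hat):
    """Doubly robust (AIPW) estimate of the average treatment effect.

    y       : observed outcomes, shape (n,)
    t       : binary treatment indicators, shape (n,)
    mu1_hat : outcome-model predictions E[Y | X, T=1], shape (n,)
    mu0_hat : outcome-model predictions E[Y | X, T=0], shape (n,)
    e_hat   : propensity-score predictions P(T=1 | X), shape (n,)
    """
    # Outcome-model (direct) prediction plus an inverse-propensity-weighted residual.
    dr1 = mu1_hat + t * (y - mu1_hat) / e_hat
    dr0 = mu0_hat + (1 - t) * (y - mu0_hat) / (1 - e_hat)
    return float(np.mean(dr1 - dr0))
```

The residual term vanishes in expectation if the outcome model is correct, and re-weights the data correctly if the propensity model is correct, which is exactly the double protection described above.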
The success of DR estimators in causal inference has inspired researchers to adapt these methods to the reinforcement learning (RL) domain. Two seminal papers have been instrumental in this transition:
- Jiang and Li (2016) introduced the idea of using DR estimators for off-policy evaluation in RL. Their work demonstrated how DR estimators could be applied to estimate the value of a target policy using data collected from a different behavior policy. This approach showed promising results in reducing variance and bias compared to traditional importance sampling methods.
- Thomas and Brunskill (2016) further developed this concept by proposing a more data-efficient DR estimator for off-policy evaluation. Their method, known as the weighted doubly robust estimator, incorporated adaptive importance sampling techniques to improve estimation accuracy, especially in scenarios with limited data.
These pioneering works established DR estimators as a powerful tool in RL, particularly for off-policy evaluation. The success of DR estimators in this context stems from their ability to leverage both the direct method (using a model of the environment) and importance sampling, combining the strengths of both approaches while mitigating their individual weaknesses.
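To ground this in code, below is a minimal sketch of the per-decision DR value estimate for a single trajectory collected under the behavior policy, in the spirit of these two papers. The trajectory format and the `q_hat`/`v_hat` callables standing in for a fitted value model are illustrative assumptions; the same expression is written out formally in the proof section below.

```python
def dr_value(h0, trajectory, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """Per-decision doubly robust estimate of the target policy's value at h0.

    trajectory : list of (h_t, a_t, r_t, h_next) tuples sampled under pi_b
    pi_e, pi_b : callables mapping (h, a) to action probabilities
    q_hat      : estimated action-value function, used as a control variate
    v_hat      : estimated value function; should return 0.0 at the horizon
    """
    value, rho = v_hat(h0), 1.0
    for t, (h, a, r, h_next) in enumerate(trajectory):
        rho *= pi_e(h, a) / pi_b(h, a)               # cumulative importance ratio
        td_error = r + gamma * v_hat(h_next) - q_hat(h, a)
        value += (gamma ** t) * rho * td_error       # importance-weighted correction
    return value
```

Setting `q_hat` and `v_hat` to zero recovers plain per-decision importance sampling; a good value model shrinks the correction terms and hence the variance.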
Integrating Doubly Robust Estimators with MCTS
We hypothesize that these desirable properties of DR estimators can be preserved when they are integrated with MCTS. In particular, we propose a hybrid estimator that combines the standard MCTS rollout estimate with a DR estimate of the value function. This hybrid estimator leverages the strengths of both approaches:
- The rollout policy provides a quick estimate of the value function, guiding the search towards promising states.
- The DR estimator refines the value function estimate, reducing bias and variance in the estimation process.
Below is my attempt to rigorously justify the integration of DR estimators with MCTS.
Preliminaries
Let \(\mathcal{S}\) be the state space, \(\mathcal{A}\) the action space, and \(\mathcal{H}\) the space of action-observation histories. We define the following:
- \(\pi_e: \mathcal{H} \times \mathcal{A} \rightarrow [0,1]\): the target policy (MCTS policy)
- \(\pi_b: \mathcal{H} \times \mathcal{A} \rightarrow [0,1]\): the behavior policy (LLM-based heuristic policy)
- \(Q: \mathcal{H} \times \mathcal{A} \rightarrow \mathbb{R}\): the true state-action value function
- \(\hat{Q}: \mathcal{H} \times \mathcal{A} \rightarrow \mathbb{R}\): the estimated state-action value function
- \(\hat{V}: \mathcal{H} \rightarrow \mathbb{R}\): the estimated state value function
- \(\gamma \in [0,1]\): the discount factor
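To fix these objects in code, they can be represented as plain callables; a hypothetical typing sketch, not tied to any particular library:

```python
from typing import Callable, Sequence

History = Sequence[str]   # an action-observation history h (the representation is an assumption)
Action = int

Policy = Callable[[History, Action], float]      # pi(a | h), returns a probability
QFunction = Callable[[History, Action], float]   # Q(h, a) or its estimate Q-hat
VFunction = Callable[[History], float]           # V(h) or its estimate V-hat
```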
The target policy \(\pi_e\) is defined using a softmax over Q-values:
\begin{equation} \pi_e(a|h) = \frac{\exp(Q(h,a)/\tau)}{\sum_{a' \in \mathcal{A}} \exp(Q(h,a')/\tau)} \end{equation}
where \(\tau\) is a temperature parameter controlling the exploration-exploitation trade-off.
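In code, the target policy is simply a temperature-scaled softmax over the Q-values available at a node; a minimal sketch assuming they are stored in a numpy array indexed by action:

```python
import numpy as np

def softmax_policy(q_values, tau=1.0):
    """pi_e(. | h): softmax over Q(h, .) with temperature tau."""
    logits = q_values / tau
    logits = logits - logits.max()   # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()
```

Small \(\tau\) concentrates the policy on the greedy action, while large \(\tau\) approaches uniform exploration.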
Hybrid Estimator
We define our hybrid estimator as a combination of the standard MCTS rollout estimate and the DR estimate:
\begin{equation} V_{\text{hybrid}}(h) = \beta V_{\text{MCTS}}(h) + (1-\beta) V_{DR}(h) \end{equation}
where \(\beta \in [0,1]\) is a weighting parameter, \(V_{\text{MCTS}}(h)\) is the standard MCTS rollout estimate, and \(V_{DR}(h)\) is the DR estimate.
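The hybrid estimate itself is a single convex combination; a minimal sketch, assuming the two component estimates have already been computed for the node:

```python
def hybrid_value(v_mcts, v_dr, beta=0.5):
    """V_hybrid(h) = beta * V_MCTS(h) + (1 - beta) * V_DR(h)."""
    assert 0.0 <= beta <= 1.0
    return beta * v_mcts + (1.0 - beta) * v_dr
```

Setting \(\beta = 1\) recovers standard MCTS backups, while \(\beta = 0\) relies entirely on the DR estimate.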
Unbiasedness of the hybrid estimator
Theorem: The hybrid estimator is unbiased for estimating the value of the target policy \(\pi_e\) in the LLM-guided MCTS.
Proof: The rollout estimate \(V_{\text{MCTS}}(h)\) is unbiased by standard properties of Monte Carlo estimation, provided the rollouts that back it up are generated by the target policy \(\pi_e\). For \(V_{DR}(h)\), we can express it as:
\begin{equation} V_{DR}(h) = \hat{V}(h) + \sum_{t=0}^{H-1} \gamma^t \rho_{0:t} \left(r_t + \gamma \hat{V}(h_{t+1}) - \hat{Q}(h_t, a_t)\right) \end{equation}
where \(\rho_{0:t} = \prod_{k=0}^t \frac{\pi_e(a_k|h_k)}{\pi_b(a_k|h_k)}\) is the cumulative importance ratio up to and including step \(t\), \(h_0 = h\), and \(\hat{V}(h_t) = \sum_{a \in \mathcal{A}} \pi_e(a|h_t)\, \hat{Q}(h_t, a)\) with \(\hat{V}(h_H) = 0\) at the horizon.
Taking the expectation with respect to the behavior policy \(\pi_b\):
\[\begin{align} \mathbb{E}_{\pi_b}[V_{DR}(h)] &= \hat{V}(h) + \sum_{t=0}^{H-1} \gamma^t \, \mathbb{E}_{\pi_b}\left[\rho_{0:t} \left(r_t + \gamma \hat{V}(h_{t+1}) - \hat{Q}(h_t, a_t)\right)\right] \\ &= \hat{V}(h) + \sum_{t=0}^{H-1} \gamma^t \, \mathbb{E}_{\pi_e}\left[r_t + \gamma \hat{V}(h_{t+1}) - \hat{Q}(h_t, a_t)\right] \\ &= \hat{V}(h) + \mathbb{E}_{\pi_e}\left[\sum_{t=0}^{H-1} \gamma^t r_t\right] + \sum_{t=0}^{H-1} \left(\gamma^{t+1} \mathbb{E}_{\pi_e}[\hat{V}(h_{t+1})] - \gamma^t \mathbb{E}_{\pi_e}[\hat{V}(h_t)]\right) \\ &= \hat{V}(h) + V(h) + \gamma^H \mathbb{E}_{\pi_e}[\hat{V}(h_H)] - \hat{V}(h) \\ &= V(h) \end{align}\]
The second equality is the change of measure performed by the cumulative importance ratio; the third uses \(\mathbb{E}_{\pi_e}[\hat{Q}(h_t, a_t) \mid h_t] = \hat{V}(h_t)\); the telescoping sum then collapses, and \(\hat{V}(h_H) = 0\) at the horizon. Therefore, both \(V_{\text{MCTS}}(h)\) and \(V_{DR}(h)\) are unbiased estimators of \(V(h)\). Since \(V_{\text{hybrid}}(h)\) is a linear combination of unbiased estimators, it is also unbiased:
\[\begin{align} \mathbb{E}[V_{\text{hybrid}}(h)] &= \mathbb{E}[\beta V_{\text{MCTS}}(h) + (1-\beta) V_{DR}(h)] \\ &= \beta \mathbb{E}[V_{\text{MCTS}}(h)] + (1-\beta) \mathbb{E}[V_{DR}(h)] \\ &= \beta V(h) + (1-\beta) V(h) \\ &= V(h) \end{align}\]Thus, the hybrid estimator remains unbiased in the LLM-guided MCTS context.
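As a quick numerical sanity check (not a proof), the following toy one-step problem with made-up probabilities and a deliberately misspecified \(\hat{Q}\) shows the DR average converging to the true target-policy value, as long as the importance weights use the true behavior probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-step problem: two actions with noisy rewards.
true_mean_reward = np.array([1.0, 0.2])      # E[r | a]
pi_b = np.array([0.7, 0.3])                  # behavior policy (e.g. an LLM heuristic)
pi_e = np.array([0.2, 0.8])                  # target policy
q_hat = np.array([0.8, 0.3])                 # deliberately misspecified Q-hat
v_hat = float(pi_e @ q_hat)                  # V-hat consistent with Q-hat under pi_e

true_value = float(pi_e @ true_mean_reward)  # the quantity we want to estimate

n = 200_000
actions = rng.choice(2, size=n, p=pi_b)
rewards = true_mean_reward[actions] + rng.normal(0.0, 0.1, size=n)

rho = pi_e[actions] / pi_b[actions]          # importance ratios
v_dr_samples = v_hat + rho * (rewards - q_hat[actions])

print(f"true V under pi_e : {true_value:.3f}")
print(f"DR estimate       : {v_dr_samples.mean():.3f}")   # close to the true value despite the wrong Q-hat
```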
Variance Reduction of the hybrid estimator
Theorem: Under certain conditions, the hybrid estimator has lower variance than the standard MCTS value estimator in the LLM-guided MCTS.
Proof: We aim to show that \(\text{Var}(V_{\text{hybrid}}(h)) \leq \text{Var}(V_{\text{MCTS}}(h))\) under certain conditions.
The variance of the hybrid estimator is:
\[\begin{align} \text{Var}(V_{\text{hybrid}}(h)) &= \text{Var}(\beta V_{\text{MCTS}}(h) + (1-\beta) V_{DR}(h)) \\ &= \beta^2 \text{Var}(V_{\text{MCTS}}(h)) + (1-\beta)^2 \text{Var}(V_{DR}(h)) + 2\beta(1-\beta)\text{Cov}(V_{\text{MCTS}}(h), V_{DR}(h)) \end{align}\]
We can write \(V_{DR}(h) = V_{\text{MCTS}}(h) + \Delta(h)\), where \(\Delta(h) := V_{DR}(h) - V_{\text{MCTS}}(h)\) collects the model-based baseline and the importance-weighted correction terms, net of the rollout return.
Substituting this into the variance equation:
\[\begin{align} \text{Var}(V_{\text{hybrid}}(h)) &= \beta^2 \text{Var}(V_{\text{MCTS}}(h)) + (1-\beta)^2 \text{Var}(V_{\text{MCTS}}(h) + \Delta(h)) \\ &\quad + 2\beta(1-\beta)\text{Cov}(V_{\text{MCTS}}(h), V_{\text{MCTS}}(h) + \Delta(h)) \\ &= \text{Var}(V_{\text{MCTS}}(h)) + (1-\beta)^2 \text{Var}(\Delta(h)) + 2(1-\beta)\text{Cov}(V_{\text{MCTS}}(h), \Delta(h)) \end{align}\]
To show that \(\text{Var}(V_{\text{hybrid}}(h)) \leq \text{Var}(V_{\text{MCTS}}(h))\), it suffices to show that:
\begin{equation} (1-\beta)^2 \text{Var}(\Delta(h)) + 2(1-\beta)\text{Cov}(V_{\text{MCTS}}(h), \Delta(h)) \leq 0 \end{equation}
Examining \(\text{Cov}(V_{\text{MCTS}}(h), \Delta(h))\):
\[\begin{align} \text{Cov}(V_{\text{MCTS}}(h), \Delta(h)) &= \mathbb{E}[V_{\text{MCTS}}(h)\Delta(h)] - \mathbb{E}[V_{\text{MCTS}}(h)]\,\mathbb{E}[\Delta(h)] \\ &= \mathbb{E}[V_{\text{MCTS}}(h)\Delta(h)] \quad \text{(since $\mathbb{E}[\Delta(h)] = 0$, both estimators being unbiased)} \\ &\approx -\mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^{2t} \rho_{0:t}^2 \left(Q(h_t, a_t) - \hat{Q}(h_t, a_t)\right)^2\right] \end{align}\]
The last step is a heuristic approximation: cross-terms between different time steps are assumed to approximately cancel under the Markov structure of the rollouts.
Similarly, ignoring cross-terms between time steps:
\begin{equation} \text{Var}(\Delta(h)) \approx \mathbb{E}\left[\sum_{t=0}^{H-1} \gamma^{2t} \rho_{0:t}^2 \left(r_t + \gamma \hat{V}(h_{t+1}) - \hat{Q}(h_t, a_t)\right)^2\right] \end{equation}
Therefore, since the covariance term is negative under this approximation, we obtain
\begin{equation} (1-\beta)^2 \text{Var}(\Delta(h)) + 2(1-\beta)\text{Cov}(V_{\text{MCTS}}(h), \Delta(h)) \leq 0 \end{equation}
whenever \((1-\beta)\,\text{Var}(\Delta(h)) \leq 2\,\lvert\text{Cov}(V_{\text{MCTS}}(h), \Delta(h))\rvert\), that is, whenever \(\beta\) is chosen close enough to one relative to the variance of the correction term. In the context of LLM-guided MCTS, where estimates of \(Q(h,a)\) are available from both the tree search and the LLM policy, \(\hat{Q}\) can be expected to track \(Q\) reasonably well, keeping \(\text{Var}(\Delta(h))\) moderate and this condition attainable.
Thus, under these conditions, we have shown that \(\text{Var}(V_{\text{hybrid}}(h)) \leq \text{Var}(V_{\text{MCTS}}(h))\), demonstrating the variance reduction property of the hybrid estimator compared to the standard MCTS value estimator in the LLM-guided MCTS context.
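Continuing the toy example above, one can also eyeball the variance behaviour by sweeping \(\beta\). Here independent on-policy rollouts stand in for \(V_{\text{MCTS}}(h)\), so this only illustrates the qualitative claim (in this toy setting the hybrid's variance never exceeds the rollout variance, and its mean stays at the true value), not the exact covariance argument used in the proof:

```python
# Hypothetical on-policy rollout returns under pi_e, standing in for V_MCTS(h).
actions_e = rng.choice(2, size=n, p=pi_e)
v_mcts_samples = true_mean_reward[actions_e] + rng.normal(0.0, 0.1, size=n)

for beta in (0.0, 0.25, 0.5, 0.75, 1.0):
    v_hybrid = beta * v_mcts_samples + (1.0 - beta) * v_dr_samples
    print(f"beta={beta:4.2f}  mean={v_hybrid.mean():.3f}  var={v_hybrid.var():.4f}")
```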
Advantages of the hybrid estimator
We speculate that the hybrid estimator offers several advantages over the standard MCTS value estimator:
- Bias-Variance Trade-off: The hybrid estimator strikes a balance between the low bias of the MCTS rollout estimate and the low variance of the DR estimate, potentially leading to more stable and accurate value function estimates.
- Data Efficiency: By leveraging the DR estimator, the hybrid estimator may require fewer samples to achieve the same level of accuracy as the MCTS rollout estimate, making it more data-efficient.
- Robustness: The hybrid estimator is robust to model misspecification. In particular, if the rollout policy is inaccurate, the DR estimator can correct for this bias via importance sampling and the outcome model (value function estimate).
Conclusion
In this post, we have explored the theoretical justification for integrating doubly robust estimators with MCTS. We have proposed a hybrid estimator that combines the strengths of MCTS rollouts and DR estimators to enhance value function estimation in the LLM-guided MCTS framework. Our analysis suggests that the hybrid estimator is unbiased and can potentially reduce variance compared to the standard MCTS value estimator. We believe that this integration could lead to more robust, data-efficient decision-making in complex state spaces, particularly when using large language models like GPTs to guide the search. We are excited about the potential of this approach and look forward to empirical validation in future work in the context of LLM-guided MCTS.
If you found this useful, please cite this as:
Liu, Manqing (Jan 2025). On integrating doubly robust estimators with MCTS. https://ManqingLiu.github.io.
or as a BibTeX entry:
@article{liu2025on-integrating-doubly-robust-estimators-with-mcts,
title = {On integrating doubly robust estimators with MCTS},
author = {Liu, Manqing},
year = {2025},
month = {Jan},
url = {https://ManqingLiu.github.io/blog/2025/DR-MCTS/}
}