mcts-simulate
npx machina-cli add skill NewJerseyStyle/plugin-mcts/mcts-simulate --openclaw
MCTS Simulation Phase
You are executing the SIMULATION (rollout) phase of Monte Carlo Tree Search.
LLM as Heuristic Policy
Use your knowledge to:
- Guide the rollout toward realistic outcomes
- Evaluate terminal states with meaningful scores
- Detect dead ends early to save computation
Simulation Algorithm
1. Start from the expanded node
2. Roll out to a terminal state:
   - Select actions using the LLM policy (not random!)
   - Simulate state transitions
   - Continue until terminal or max depth is reached
3. Evaluate the outcome:
   - Success: full reward (e.g., 1.0)
   - Partial success: proportional reward (e.g., 0.5)
   - Failure: zero or near-zero reward
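The rollout loop above can be sketched as follows. This is a minimal sketch, not the plugin's implementation: `llm_policy`, `transition`, `is_terminal`, and `evaluate` are hypothetical stand-ins for the LLM-driven steps the skill performs.

```python
def simulate(state, llm_policy, transition, is_terminal, evaluate, max_depth=10):
    """Roll out from `state` until terminal or max_depth, then score the result.

    All four callables are hypothetical placeholders for LLM-backed logic:
    - llm_policy(state) -> action   (informed choice, not random)
    - transition(state, action) -> next state
    - is_terminal(state) -> bool
    - evaluate(state) -> reward in [0, 1]
    """
    path = []
    for _ in range(max_depth):
        if is_terminal(state):
            break
        action = llm_policy(state)        # informed action selection
        path.append(action)
        state = transition(state, action)
    reward = evaluate(state)              # terminal-state evaluation
    return state, reward, path
```

With toy callables (a counter that terminates at 3), the loop returns the terminal state, its reward, and the action sequence for backpropagation.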
Using MCP Tools
Call mcts_simulate with:
- node_id: The node to simulate from
- max_depth: Maximum rollout depth (default: 10)
- evaluation_criteria: What constitutes success
The tool returns:
- terminal_state: The final state reached
- reward: Numerical evaluation in [0, 1]
- rollout_path: Sequence of actions taken
- reasoning: Explanation of the evaluation
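For illustration, the request and response payloads might look like the sketch below. The field names come from the section above; all field *values* are made up, and the actual MCP call mechanism depends on your client.

```python
# Hypothetical example payloads for the mcts_simulate tool.
# Field names match the skill doc; values are illustrative only.

request = {
    "node_id": "node-42",                  # node to simulate from
    "max_depth": 10,                       # default rollout depth
    "evaluation_criteria": "plan meets the goal within budget",
}

response = {
    "terminal_state": "plan complete, 85% of budget used",
    "reward": 0.85,                        # numerical evaluation in [0, 1]
    "rollout_path": ["allocate budget", "schedule tasks", "review risks"],
    "reasoning": "Goal achieved with margin; minor scheduling risk remains.",
}
```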
Simulation Strategy
For the current context: $ARGUMENTS
Rollout Policy
Instead of random rollout, use informed policy:
- At each step, consider 2-3 likely actions
- Choose based on domain knowledge
- Prefer actions that lead to decisive outcomes
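The shortlist-and-score idea above can be sketched as a small policy function. This is an assumption-laden sketch: `candidate_actions` and `score_action` are hypothetical stand-ins for LLM calls that propose and judge actions.

```python
def informed_policy(state, candidate_actions, score_action, k=3):
    """Pick the best of the top-k candidate actions instead of sampling at random.

    candidate_actions(state) -> ordered list of plausible actions (hypothetical LLM call)
    score_action(state, action) -> float preference score (hypothetical LLM call)
    """
    shortlist = candidate_actions(state)[:k]  # consider only 2-3 likely actions
    return max(shortlist, key=lambda a: score_action(state, a))
```

Capping the shortlist at 2-3 actions keeps each rollout step cheap while still letting domain knowledge steer toward decisive outcomes.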
Evaluation Criteria
For Research:
- Does the path lead to valid conclusions?
- Is evidence sufficient and reliable?
- Are there logical gaps?
For Planning:
- Does the plan achieve the goal?
- Are resources within budget?
- Are there critical risks?
For Coding:
- Does the solution work correctly?
- Is the code clean and maintainable?
- Are edge cases handled?
Reward Assignment
reward = completeness * correctness * efficiency
Where each factor is in [0, 1]:
- completeness: how much of the goal is achieved
- correctness: how valid the solution is
- efficiency: how elegant/optimal the solution is
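The multiplicative formula above can be written directly; because the factors multiply, any single weak factor pulls the overall reward down sharply.

```python
def compute_reward(completeness, correctness, efficiency):
    """reward = completeness * correctness * efficiency, each factor in [0, 1]."""
    for factor in (completeness, correctness, efficiency):
        if not 0.0 <= factor <= 1.0:
            raise ValueError("each factor must be in [0, 1]")
    return completeness * correctness * efficiency
```

For example, a fully complete (1.0), mostly correct (0.8), but inefficient (0.5) rollout scores 0.4, reflecting the penalty the product applies to its weakest aspects.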
Output
After simulation, report:
- Terminal state reached
- Reward value with breakdown
- Key insights from the rollout
- Any observations to record
Proceed to BACKPROPAGATION with the reward.
Source
git clone https://github.com/NewJerseyStyle/plugin-mcts
Skill definition: skills/mcts-simulate/SKILL.md
Overview
This skill executes the Simulation (rollout) phase of Monte Carlo Tree Search using an LLM as the heuristic policy. It guides rollouts, evaluates terminal states with meaningful scores, and detects dead ends to save computation. Results feed back into backpropagation to improve search guidance.
How This Skill Works
Start from an expanded node and, at each step, select 2-3 likely actions using an LLM policy instead of random moves. Simulate state transitions until reaching a terminal state or hitting max depth, then compute a reward based on completeness, correctness, and efficiency. The rollout data (terminal state, reward, rollout_path, reasoning) is prepared for backpropagation.
When to Use It
- Large search spaces where random rollouts are ineffective
- Tasks requiring domain-informed rollout trajectories
- Situations needing meaningful terminal-state rewards (0-1 scale)
- Limited rollout depth where early dead-end detection saves compute
- Backpropagation that benefits from rollout path and reasoning data
Quick Start
- Step 1: Call mcts_simulate with node_id and max_depth (default 10)
- Step 2: At each step, select 2-3 likely actions using LLM policy and simulate
- Step 3: On terminal or max depth, compute reward and collect rollout_path and reasoning; prepare for backpropagation
Best Practices
- Use 2-3 likely actions per step instead of random choices
- Ground the LLM policy in domain knowledge to guide outcomes
- Record terminal_state, reward, rollout_path, and reasoning for backpropagation
- Normalize and report reward as completeness * correctness * efficiency (0-1 each)
- Review rollout insights and observations before updating the tree
Example Use Cases
- AI planning tool estimating a project plan under budget
- Code synthesis workflow evaluating correctness and maintainability
- Research assistant validating a hypothesis with evidence chain
- Strategy game AI testing sequences to reach decisive outcomes
- Robotics task planner checking feasibility under time/resource limits