# What is Monte Carlo tree search used for?


#### Monte Carlo Tree Search is a method, most often used in games, for predicting the path (the sequence of moves) a policy should take to reach the final winning state.

Now let's look at a use case for Monte Carlo Tree Search in gaming.

We implemented an agent for a game called Colosseum Survival using three approaches: Monte Carlo Tree Search, a chasing algorithm, and ensemble learning for calculating the heuristic. The third approach was chosen as our final approach for this project, with win percentages of 100% against the random agent, 100% against the first approach, and 95% against the second approach, each over 1000 runs of the game. Four base heuristics were involved in our final approach, and they were combined to form the final heuristic that gives the policy for choosing our next move. Our final approach is more robust than the previous ones, but it can still be improved by adding heuristics that consider future moves of our agent and the adversary. In the future, we would like to build a model with more layers to simulate more steps based on our heuristic and backpropagate them to the current step, so that our agent will be much stronger.

#### 2. Introduction

The main task of this project is to propose different AI algorithms to find a winning strategy for a turn-based board game involving two players called Colosseum Survival. The game starts with an M x M chessboard (M ranging from 6 to 12) with randomly positioned barriers. On each turn, one of the players moves vertically or horizontally on the chessboard with a maximum step number of K = ⌊(M+1)/2⌋. Then, the player must put a barrier in one of the four directions: up, right, down, or left. The players are not allowed to move diagonally, cross barriers, or pass through or take the position of the adversary. The game ends when the two players are separated by the barriers into two bounded regions, and the score of each player is the number of blocks in its region. Our goal for this project was to build an agent with an algorithm that maximized the number of blocks in our region and thus maximized our score. The agent that we implemented was named "DuanJuan agent", a combination of the last name and first name of our group members respectively.
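To make the movement rule concrete, the set of squares a player can reach in one turn can be computed with a breadth-first search. This is an illustrative sketch, not the course's starter code: the barrier encoding (a set of (row, col, direction) edges) and all function names here are our own assumptions.

```python
from collections import deque

# Moves: up, right, down, left (no diagonals allowed by the rules)
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def reachable(board_size, barriers, start, adv_pos, max_step):
    """Return the set of cells reachable from `start` within `max_step`
    moves, without crossing barriers or passing through the adversary.

    `barriers` is a set of (row, col, dir) edges, where dir indexes
    MOVES; this edge encoding is an assumption, not the game's format."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        (r, c), steps = frontier.popleft()
        if steps == max_step:
            continue
        for d, (dr, dc) in enumerate(MOVES):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < board_size and 0 <= nc < board_size):
                continue
            if (r, c, d) in barriers:          # a barrier blocks this edge
                continue
            if (nr, nc) == adv_pos or (nr, nc) in seen:
                continue
            seen.add((nr, nc))
            frontier.append(((nr, nc), steps + 1))
    return seen
```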

#### 2.1 Motivation

The motivation for this project came from the AI algorithms for game playing covered in lectures. Typical game-playing algorithms include Minimax search, alpha-beta pruning and Monte Carlo Tree Search (MCTS). For our first trial, we chose MCTS instead of Minimax search for three reasons. First, as covered by Bogdan Mazoure in the lecture "Monte Carlo Tree Search" [2], the time complexity of Minimax search is O(b^m), where b is the branching factor and m is the depth of the tree; it is very computationally expensive even with alpha-beta pruning. Second, Minimax search places high demands on the reasonableness and practicality of its evaluation function. Third, Minimax search strongly depends on the assumption that both our player and the adversary play optimally with respect to the same evaluation function, which is not guaranteed in this project since all groups implement agents with different strategies and thus different evaluation functions. In contrast, MCTS combines tree expansion with random simulations, so it works well in both deterministic and stochastic game environments and does not rely on a specific evaluation function. Thus, MCTS is more suitable for this project.

After implementing MCTS and playing it against the random agent, we found that the win percentage was around 51% (at most 55%) over 1000 runs of the game, which means that the agent did not beat the random agent most of the time. This was because the time limit for each move, during which the Monte Carlo trees were built, did not allow us to expand the tree to a great depth; most of the time we still moved very randomly. By playing several rounds using the human agent, we discovered that the strategy of chasing the adversary and putting barriers around it was a promising approach for defeating the random agent. The implemented agent, called the chasing agent, had a win percentage of 96.5% over 1000 runs of the game against the random agent.

With the motivation to add more functionality to the chasing agent, we applied ensemble learning with multiple base heuristic algorithms. In addition to the chasing heuristic, which prompts the agent to move closer to its adversary, heuristics such as moving the agent toward the center and putting more barriers around the adversary and fewer around our agent were added to the algorithm. We also avoided moving to a position from which the adversary could win in the next step. The improved agent had win percentages of 100% and 95% against the random agent and the chasing agent respectively over 1000 runs of the game. Therefore, we used the approach that applied ensemble learning to defining the heuristic as our final approach for this project.

#### 2.2.1 Monte Carlo Tree Search

For our initial approach of finding a winning strategy to the Colosseum Survival game, we used the algorithm of Monte Carlo Tree Search (MCTS).

As shown in the lecture [2], MCTS is based on a search tree and incorporates the idea of choosing the more promising next move using the results of random simulations of the game. It combines search-tree expansion with Monte Carlo simulations. MCTS applies to both deterministic and stochastic game environments and does not depend on a specific evaluation function. The MCTS procedure can be divided into four processes:

#### 1. Selection

In the selection process, MCTS selects a promising path in the search tree by recursively selecting a promising node at each depth until it reaches the frontier of the tree, i.e., a leaf node. It selects nodes based on the tree policy, which uses the Upper Confidence Trees (UCT) score defined as:

$$Q(s, a) = Q(s, a) + c\sqrt{\frac{2\log n(s)}{n(s, a)}} \tag{1}$$

Final project COMP 424, McGill University

where Q(s, a) is the value of taking action a in state s,

n(s, a) is the number of times we have taken action a in state s,

n(s) is the number of times we have visited s in simulations, and

c is the scaling constant that balances exploration against exploitation.

The UCT policy considers both the exploitation process, which lets the algorithm select a promising node, and the exploration process, which gives nodes that have not participated in any simulation an opportunity to be picked and explored.
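The UCT formula above translates directly into code. The following is a minimal sketch; the function names, the tie-breaking, the default scaling constant, and treating unvisited actions as infinitely attractive are our own conventions, not the report's implementation.

```python
import math

def uct(q, n_sa, n_s, c=1.0):
    """UCT score: current value plus exploration bonus.
    Unvisited actions (n_sa == 0) get +inf so each child is tried once.
    The default c = 1.0 is a placeholder, not the report's constant."""
    if n_sa == 0:
        return math.inf
    return q + c * math.sqrt(2 * math.log(n_s) / n_sa)

def select_child(children):
    """Index of the child with the highest UCT score.
    `children` is a list of (value, visit_count) pairs for one node;
    the parent visit count n(s) is the sum of the children's visits."""
    n_s = sum(n for _, n in children)
    return max(range(len(children)),
               key=lambda i: uct(children[i][0], children[i][1], n_s))
```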

#### 2. Expansion

If the leaf node reached by the selection process can have children, i.e., it is not a terminal node that ends the game, then it is the node to be expanded. In the expansion process, a random child of that leaf node is added to the search tree.

#### 3. Simulation

Starting from the newly expanded node, the algorithm simulates the game several times, performing sample rollouts for both the player and the adversary using the default policy, which simply lets each player choose a random move at each step.

#### 4. Backpropagation

After each rollout, the algorithm updates the visit count and score of the nodes visited during selection and expansion according to the result of that rollout. In other words, the algorithm increases the score by one on a win and leaves it unchanged on a loss; it always increases the visit count by one.
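This update can be sketched in a few lines. The field names num_visit and score echo the attributes the report describes for its MCTS states; everything else here is an illustrative assumption.

```python
class Node:
    """Minimal MCTS node: just the statistics that backpropagation touches."""
    def __init__(self):
        self.num_visit = 0   # n(s): times this node was visited
        self.score = 0       # wins accumulated through this node

def backpropagate(path, won):
    """After a rollout, walk the selection/expansion path and update
    each node: visit count always +1, score +1 only on a win."""
    for node in path:
        node.num_visit += 1
        if won:
            node.score += 1
```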

#### 2.2.2 Ensemble Learning

Ensemble learning combines the predictions of base models with various weights [1], and there are three kinds of ensemble learning: bagging, stacking, and boosting. Bagging involves a single algorithm trained on different samples of the same dataset; it then makes the final prediction by voting or averaging using statistical methods. Stacking uses different types of algorithms to fit the data and a single model to combine the predictions from the base algorithms into the final output. Boosting makes the model focus on the training samples for which previous models made wrong predictions by changing the training dataset. For this project, we calculated the heuristic of our model using the concept of stacking ensemble learning, and the flowchart for our heuristic is shown in Figure 1.

As shown in the figure, the input was a valid move, i.e., a state with a position and the direction of the barrier. There were four base heuristics in our algorithm, and their outputs were merged with different weights into the combined heuristic. The final combined heuristic value was returned by the heuristic function.

#### 3. Method

Three approaches were tried in this project, using MCTS, the chasing algorithm and ensemble learning respectively. The detailed working principles of each approach are explained in this section. After experiments, we chose the third approach, which calculated the heuristic with ensemble learning and had the highest win percentage, as our final approach for this project.

#### 3.1.1 Monte Carlo Tree Search

Before we implemented the tree structure of MCTS, we first built a class called MCTSState to record all the information of the states in MCTS nodes. MCTSState instances had the attributes "position" (current position of the player), "num_visit" (visit count of the MCTS node), "score" (score of the MCTS node computed by the UCT policy) and "dir" (direction of the barrier put by the player). Then, we built another class called MCTSNode to create nodes in the Monte Carlo trees. MCTSNode instances had the attributes "state" to store the MCTSState instance, "parent" to store the parent node and "children" to store the list of children. The function to get the list of children was all_legal_next_state in the MCTSState class, which checked all the positions on the chessboard to see if they could be reached from the current position within the maximum step by calling the function check_valid_step. The code that built the Monte Carlo tree and ran the four processes of MCTS (selection, expansion, simulation and backpropagation) is in the mcts.py file. To use MCTS, the optimal_move function should be called in the step function in the file student_agent.py.
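A minimal sketch of the two classes described above might look as follows; the constructor signatures and the add_child helper are our own assumptions, since the report only names the attributes.

```python
class MCTSState:
    """Holds the information the report attaches to a search-tree state."""
    def __init__(self, position, direction):
        self.position = position   # (row, col) of the player
        self.dir = direction       # side of the cell the barrier is placed on
        self.num_visit = 0         # visit count used by the UCT policy
        self.score = 0             # accumulated rollout score

class MCTSNode:
    """Tree node linking a state to its parent and children."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []

    def add_child(self, state):
        """Attach and return a new child node wrapping `state`."""
        child = MCTSNode(state, parent=self)
        self.children.append(child)
        return child
```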

Initially, we built a Monte Carlo tree for each move of the agent to ensure that the tree fitted the newest state of the chessboard, including our position, the adversary's position and the barrier positions, and then searched it for the promising next move. However, since the time limit for each move except the first was two seconds, very few nodes of the tree were expanded. In most cases the tree root had hundreds of children and the subsequent nodes also had tens or hundreds of children, so in effect the tree selected moves almost randomly. We then tried building a large tree during the first thirty seconds, storing it in the StudentAgent class and searching it for the next move, but this did not solve the problem, so we tried other approaches.

#### 3.1.2 Chasing Algorithm

Our second approach implemented the chasing algorithm, which let our agent chase the adversary during the game. In this algorithm, our agent always went to the state closest to the adversary and put a barrier towards it, provided our agent could move there within the maximum step. If there was no valid move that could reach the adversary, our agent would fall back to a random walk. This algorithm was implemented in the function chase_adv in the file mcts.py. To use this algorithm as the policy for moving, the function optimal_move2 should be called in the step function in the file student_agent.py.
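The chasing policy can be sketched as follows, assuming the caller has already enumerated the valid moves for this turn. The adjacency test, the tie-breaking, and the helper names are our own simplifications, not the report's chase_adv code; barrier_towards approximates placing the barrier facing the adversary.

```python
import random

def manhattan(a, b):
    """Manhattan distance between two (row, col) positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def barrier_towards(pos, adv_pos):
    """Direction index (0 up, 1 right, 2 down, 3 left) pointing from
    `pos` toward the adversary; row-first tie-breaking is an assumption."""
    dr, dc = adv_pos[0] - pos[0], adv_pos[1] - pos[1]
    if abs(dr) >= abs(dc):
        return 2 if dr > 0 else 0
    return 1 if dc > 0 else 3

def chase_adv(valid_moves, adv_pos, rng=random):
    """Move to a valid position adjacent to the adversary and face the
    barrier toward it; otherwise fall back to a random walk, as the
    report describes."""
    adjacent = [m for m in valid_moves if manhattan(m, adv_pos) == 1]
    if adjacent:
        pos = adjacent[0]
        return pos, barrier_towards(pos, adv_pos)
    pos = rng.choice(valid_moves)
    return pos, rng.randrange(4)
```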

#### 3.1.3 Calculating Heuristic using Ensemble Learning

Our final approach used the concept of stacking ensemble learning to calculate the heuristic for choosing our next step. There were four base heuristics in this algorithm. The first one was about chasing the adversary: it calculated the distance between the adversary and the chosen move. The larger the distance, the lower the heuristic value the algorithm would output, so that our agent could choose the move closer to the adversary by using this heuristic. This base heuristic was implemented in the function heuristic_adv in the mcts.py file. The goal of the second base heuristic was to penalize moves far away from the center: the distance between the valid move and the center was measured, and a lower value was returned for a larger distance.

The function heuristic_center in the mcts.py file implemented this heuristic. The third base heuristic gave a higher value to a state that set the barrier towards the adversary. In this algorithm, we also calculated the number of barriers around our next move and the number around the adversary, giving a lower value to a state in which the number of barriers around our position was larger than that around the adversary's position. Part of this algorithm was implemented in the heuristic_barrier function, and the rest was implemented in the combined heuristic function heuristic_all in the mcts.py file. The fourth base heuristic checked whether the game would be over after the valid move: if the game was over and our agent won, the returned value was very large, but if the adversary won, the heuristic was a very large negative number.
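The four base heuristics can be sketched as simple scoring functions. These are illustrative reconstructions from the descriptions above: the distance metric (Manhattan), the magnitudes, and the signatures are our assumptions, not the report's actual heuristic_adv, heuristic_center and heuristic_barrier code.

```python
def heuristic_adv(move_pos, adv_pos):
    """Base heuristic 1: the farther the move is from the adversary
    (Manhattan distance, our assumption), the lower the value."""
    return -(abs(move_pos[0] - adv_pos[0]) + abs(move_pos[1] - adv_pos[1]))

def heuristic_center(move_pos, board_size):
    """Base heuristic 2: penalize moves far from the board center."""
    center = (board_size - 1) / 2
    return -(abs(move_pos[0] - center) + abs(move_pos[1] - center))

def heuristic_barrier(barriers_around_us, barriers_around_adv):
    """Base heuristic 3 (in part): prefer having fewer barriers around
    our position than around the adversary's."""
    return barriers_around_adv - barriers_around_us

WIN_SCORE = 10**6   # the magnitude is a placeholder, not the report's value

def heuristic_endgame(game_over, we_win):
    """Base heuristic 4: dominate every other term when the move ends
    the game, positively on a win and negatively on a loss."""
    if not game_over:
        return 0
    return WIN_SCORE if we_win else -WIN_SCORE
```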

Finally, all base heuristics were combined with different weights in the function heuristic_all, and the final combined heuristic value was returned. After obtaining the state with the best heuristic, the next optimal step of the adversary was also checked: if the adversary could win the game, our agent would skip this state and run the same check on the state with the second-best heuristic. The state satisfying this constraint with the highest heuristic value would be returned. The function optimal_move3 should be called in the step function in the student_agent.py file to use this policy for moving.
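Combining the base heuristics and applying the safety check can be sketched as below. The generic move representation, the placeholder weights, and the fallback when every move is unsafe are our own assumptions, not the report's heuristic_all implementation.

```python
def heuristic_all(move, weights, base_heuristics):
    """Stacking-style combination: weighted sum of the base heuristic
    values for one candidate move."""
    return sum(w * h(move) for w, h in zip(weights, base_heuristics))

def best_safe_move(valid_moves, score, adversary_wins_after):
    """Rank moves by combined heuristic, then skip any move after which
    the adversary could win on its next turn (the report's safety
    check). Taking the best-scoring move anyway when nothing is safe is
    our own assumption."""
    for move in sorted(valid_moves, key=score, reverse=True):
        if not adversary_wins_after(move):
            return move
    return max(valid_moves, key=score)
```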

#### 4. Result and Discussion

In this section, we first discuss and compare the performance of different approaches. Then, the advantages and disadvantages of our final approach are investigated, and the situation in which our agent will lose the game is also discussed. Moreover, we summarize the weakness of our proposed approach. The last subsection explores the possible future work that we can do to improve the performance of our current model.

#### 4.1 Comparison of Approaches

In this subsection, we compare the performance of our final approach against the random agent and our previous algorithms. For this project, we wrote three approaches: the first used the MCTS policy; the second chased the adversary; and the last calculated and combined several heuristic algorithms to find the valid step with the highest heuristic. Our last method outperformed the first two proposed algorithms and was chosen as the final approach.

| Approach 1 | Approach 2 | Win Percentage with 1000 runs |
| --- | --- | --- |
| Random Walk | Final Approach | 0% vs. 100% |
| MCTS Algorithm | Random Walk | 51% vs. 49% |
| MCTS Algorithm | Final Approach | 0% vs. 100% |
| Chasing Algorithm | Random Walk | 96.5% vs. 3.5% |
| Chasing Algorithm | Final Approach | 5% vs. 95% |

#### 4.1.1 Comparison with the Random Agent

We first compared our final approach with the random agent using the "autoplay" mode of the file simulator.py. The simulator was run 5 times, each for 1000 runs of the game, and our agent beat the random agent with a win percentage of 100% every time. Thus, our agent could completely defeat the random agent.

#### 4.1.2 Comparison with Our First Approach (Monte Carlo Tree Search)

Our first approach used Monte Carlo Tree Search to find the best choice for the next move. Its performance was first compared with that of the random walk method by running the "autoplay" mode of the simulator.py file against the random agent; its win percentage was only 51%, which implied that it could not beat the random agent most of the time. When this approach then played against our final approach, the final approach reached a win percentage of 100% over 1000 simulations. The reason for the weak performance of our first approach might be that two seconds was not enough for the tree search to find an appropriate move with its policy, since there were many valid moves and it needed to run many simulations to score each one. We also tried to build an MCTS tree at the beginning of the game within 30 seconds, but the performance of that variant was also very poor. Thus, our first approach performed similarly to random walk, and the final proposed method could beat it with a win percentage of 100%.

#### 4.1.3 Comparison with Our Second Approach (Chasing)

For the second approach, we implemented an algorithm called chasing. When it played against the random agent, it won with a percentage of 96.5% on average over 1000 runs, which indicated that the chasing algorithm could win the game most of the time. However, when the chasing agent played against our final approach, its win percentage was only 5% over 1000 runs. This might be because our final approach considered a combination of different heuristics, similar to the concept of ensemble learning, while the second approach only used the chasing algorithm to block the adversary. Therefore, our final approach was the best one, with the highest win percentage.

#### 4.2 Advantages and Disadvantages

Our final approach has many advantages. For instance, our agent can detect and move to a state that makes it win the game immediately, and it can also chase and block the adversary. Additionally, if our agent has a valid step after which the adversary could block our agent in the next turn, our approach can forecast that and avoid moving there. When our agent can move to a state next to the adversary, it prefers to set the barrier in the direction of the adversary. Besides, our agent will never move to a state that would make the adversary win the game, and it will always move to a state that allows us to win.

There are also several disadvantages to our approach. For instance, we cannot evaluate the heuristic of future moves beyond the next step, which may make us lose the game within a few steps. Also, when we choose a next state that cannot reach the adversary, the heuristic values of that position are the same for all barrier directions, which can be improved in the future. Moreover, if the adversary plays optimally and there is only one exit, which we cannot reach within the maximum step but the adversary can reach to block us, we will be defeated; this is our expected failure mode.

#### 4.3 Future Work

For future work, we would like to consider more than one step ahead when choosing the next move, so that more exploitation can be done. To this end, we can implement a tree model and calculate and update the heuristics of possible child steps according to the heuristics of future steps (i.e., deeper nodes). We may need to apply alpha-beta pruning to the tree implementation so that it works more efficiently and can decide the best child within the two-second limit each time.
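As a sketch of this proposed extension, a generic depth-limited minimax with alpha-beta pruning could evaluate the combined heuristic at the depth limit. This is the standard textbook formulation, not code from the project; all names here are our own.

```python
import math

def alphabeta(state, depth, alpha, beta, maximizing, children, evaluate):
    """Depth-limited minimax with alpha-beta pruning.
    `children(state, maximizing)` yields successor states and
    `evaluate(state)` is the combined heuristic at the depth limit."""
    succ = children(state, maximizing)
    if depth == 0 or not succ:
        return evaluate(state)
    if maximizing:
        value = -math.inf
        for child in succ:
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:
                break   # beta cutoff: the minimizer never allows this line
        return value
    value = math.inf
    for child in succ:
        value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                     True, children, evaluate))
        beta = min(beta, value)
        if alpha >= beta:
            break       # alpha cutoff
    return value
```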

Moreover, even though we used an ensemble learning approach for our heuristic calculation, we can add more base heuristic algorithms, such as forecasting the actions of the adversary, into our current heuristic with different weights, so that the combined heuristic becomes more robust than the current one and the agent can choose a better move each round.

#### 5. Conclusion

In brief, our final approach, which uses ensemble learning to calculate the heuristic, outperformed our previous MCTS approach, the chasing algorithm, and the default random walk. Our proposed agent could completely defeat the random agent and the MCTS approach, and it reached a win percentage of 95% over 1000 runs of the game when playing against the chasing algorithm. Looking forward, we can use the starting 30 seconds to explore further steps ahead, which would improve our policy for choosing the next move. All in all, we anticipate that our agent can beat other agents and the human agent in the future.

#### References

[1] Jason Brownlee. A gentle introduction to ensemble learning algorithms. April 2021.

[2] Bogdan Mazoure. Monte Carlo tree search. February 2022.