Play, Learn, Conquer: The Journey to a Self-Improving Game AI
Mastering Gomoku (Five-in-a-Row) and Othello with the AlphaZero Algorithm in PyTorch through Self-Play.
Most of us played board games like chess or Go when we were kids, and back then few believed AI could ever compete with humans in such deep-strategy contests. Yet thanks to breakthroughs in hardware and machine learning, AI now goes head-to-head with (or even defeats) the world’s best players in Go, chess, and more. This milestone shows how machines have learned to think several moves ahead and plan over the long term.
Imagine a board game as an arena governed by clear rules and a fixed grid. Each move not only changes the board but also spawns a whole tree of future possibilities — every response opens up another set of options, and each of those leads to yet more. In a tiny game like tic-tac-toe, that tree is small enough to fully explore, so you can brute-force every move and guarantee at least a draw.
However, in larger games such as Go or chess, the search tree explodes. In a 19×19 Go game, each player faces roughly 250 legal moves per turn, and a full game takes about 150 moves — resulting in on the order of 250¹⁵⁰ possible trajectories, far beyond brute-force reach.
To break through this computational wall, researchers turned to reinforcement learning algorithms. In 2016, DeepMind’s AlphaGo fused policy and value networks with Monte Carlo Tree Search: it first learned from human professional games via supervised learning, then sharpened itself with self-play RL to beat Lee Sedol (one of the world’s top players at the time) 4–1.
A year later, AlphaGo Zero went further — starting from pure random play with no human data, a single ResNet learned both move selection and position evaluation entirely through self-play + MCTS, and within days it surpassed every previous AlphaGo model. In this article, I’ll cover the basics of Monte Carlo Tree Search (MCTS) and the AlphaZero algorithm, and show you how to apply them to Gomoku and Othello.
Monte Carlo Tree Search
In board games, we choose the next move by weighing the current position against possible opponent responses, aiming to maximise both immediate and future rewards. However, as I mentioned earlier, when the board becomes larger and more complex, the number of possible moves explodes and brute-force search quickly becomes impractical.
A common way to tackle this challenge is Monte Carlo Tree Search (MCTS). Instead of exhaustively expanding every branch, MCTS runs randomised simulations — called playouts — only along the most promising or under-explored branches, giving a reliable win-rate estimate without expanding every subtree.
It also applies the UCT formula to balance exploration (trying less-visited moves) with exploitation (focusing on high-win-rate moves), so computation is concentrated where it will have the greatest impact.

MCTS can be divided into four phases: Selection, Expansion, Simulation, and Backpropagation. I will use Tic-Tac-Toe as an example to explain how each phase works. To visualise the MCTS process, you can also try this web GUI: link.
Upper Confidence Bound for Tree (UCT)
UCT scores each child by combining its average reward with an exploration term, so MCTS consistently balances exploiting high-value moves and exploring less-visited branches.
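Written out, the score used in the worked example below is (the select() implementation later in this article uses the closely related PUCT variant, c · P(s, a) · √N(s) / (1 + N(s, a))):

$$\mathrm{UCT}(s, a) = Q(s, a) + c \cdot P(s, a) \cdot \sqrt{\frac{\ln N(s)}{N(s, a)}}$$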
Q(s, a): average reward, Q = W / N
N(s): how many times state ‘s’ has been visited.
N(s, a): how many times action ‘a’ was tried from ‘s’.
P(s, a): probability of taking move ‘a’ (prior)
c is a constant that balances exploration versus exploitation.
Selection
At the beginning, set the current game board as the root node. Each node represents a unique game state; no children exist initially.
From the root, repeatedly select the action ‘a’ that maximises UCT and move to the corresponding child node.
Stop when you reach a node that is not fully expanded (i.e., one with at least one untried legal move).
Now, let’s compute the UCT for node ‘A’. First, its average reward is Q = 0.75 (3 wins out of 4 visits). Then we add the exploration bonus √(ln 20 / 4) ≈ 0.866, since the parent has been visited 20 times and ‘A’ itself 4 times. With the prior ‘P’ and the exploration constant ‘c’ both set to 1, the final UCT score for node ‘A’ is 0.75 + 0.866 = 1.616.
Note: In the selection phase, the player switches at each level (for example, if it’s X’s turn in state A, then it’s O’s turn in state A1), ensuring that each node correctly represents whose turn it is when evaluating and back propagating values.
Expansion
If the selected node corresponds to a non-terminal state with untried moves, expand by adding one new child node:
Pick one empty cell (legal move) that hasn’t been expanded yet.
Create a child node representing the board after placing the current player’s move in that cell.
Initialise visit count N(⋅) = 0 and total value W(⋅) = 0 for the new child.
Simulation (Rollout)
From the newly added node (or from a terminal node if already expanded), play a random game until it ends:
Alternate turns between the two players.
At each step, choose one of the remaining empty cells uniformly at random.
Continue until one player wins or the board is full (draw).
Record the result as the reward z = 1 if the simulated game ends in a win for the player who just moved at the expanded node, -1 if the opponent wins, else 0 (draw). For example, starting from leaf node ‘A1a’, you simulate until the game’s end and then record ‘z’ for ‘A1a’ based on which player wins.
Backpropagation
Propagate the simulation result ‘z’ back up the tree to the root. For each state–action (s,a) on the path:
Visit count: N(s) += 1
Total value: W(s,a) += z (or −z depending on perspective)
Below is an example of how backpropagation works (assume z = +1 at node A1a). Because the active player alternates at each level, the reward contribution switches sign when updating parent nodes: A1a is credited +1, its parent A1 receives −1, A receives +1, and so on up to the root.
MCTS Implementation
The following is an implementation of the Tic-Tac-Toe game: it stores the board state, the current player, and the available moves.
do_move: places a mark and switches players.
game_end: checks all win lines or a full board, returning whether the game is over and who won (or 0 for a draw).
import copy
import math
import random
import numpy as np
# Tic-Tac-Toe game state
class TicTacToe:
def __init__(self):
self.states = [0] * 9 # 0=empty, 1=X, -1=O
self.current_player = 1 # X starts
self.availables = list(range(9))
def do_move(self, move):
self.states[move] = self.current_player
self.availables.remove(move)
self.current_player *= -1 # switch player
def game_end(self):
lines = [(0,1,2),(3,4,5),(6,7,8),
(0,3,6),(1,4,7),(2,5,8),
(0,4,8),(2,4,6)]
for i, j, k in lines:
s = self.states[i] + self.states[j] + self.states[k]
if s == 3: return True, 1 # X wins
if s == -3: return True, -1 # O wins
if not self.availables:
return True, 0
return False, None
def softmax(x):
e = np.exp(x - np.max(x))
return e / np.sum(e)
class TreeNode:
def __init__(self, parent=None, prior=1.0):
self.parent = parent
self.children = {} # action -> node
self.n_visits = 0
self.Q = 0.0 # mean value
self.P = prior # prior prob
def expand(self, action_priors):
for a, p in action_priors:
if a not in self.children:
self.children[a] = TreeNode(self, p)
def select(self, c_puct):
# pick child with highest UCT
return max(self.children.items(),
key=lambda act_node: act_node[1].Q +
c_puct * act_node[1].P *
math.sqrt(self.n_visits) /
(1 + act_node[1].n_visits))
    def simulate(self, state):
        """
        Perform a random rollout from `state` until the game ends,
        returning +1 if the player to move at `state` wins,
        -1 if they lose, or 0 for a draw.
        """
        player = state.current_player  # remember whose turn it is at the leaf
        cur = state
        while True:
            end, winner = cur.game_end()
            if end:
                if winner == 0:
                    return 0
                return 1 if winner == player else -1
            a = random.choice(cur.availables)
            cur.do_move(a)
def update_recursive(self, value):
if self.parent:
self.parent.update_recursive(-value)
self.n_visits += 1
self.Q += (value - self.Q) / self.n_visits
class MCTS:
def __init__(self, policy_fn, c_puct=1.4, n_playout=1000):
self.root = TreeNode(prior=1.0)
self.policy_fn = policy_fn
self.c_puct = c_puct
self.n_playout = n_playout
def _playout(self, state):
node = self.root
# move down until leaf
while node.children:
action, node = node.select(self.c_puct)
state.do_move(action)
end, winner = state.game_end()
if not end:
action_priors, leaf_value = self.policy_fn(state)
node.expand(action_priors)
# Comment out this line when using AlphaZero (no random rollout)
leaf_value = node.simulate(state)
        else:
            # terminal value from the current player's view
            # (TicTacToe reports a draw as winner == 0, the Board class below as winner == -1)
            players = getattr(state, 'players', [1, -1])
            if winner == state.current_player:
                leaf_value = 1.0
            elif winner in players:
                leaf_value = -1.0
            else:
                leaf_value = 0.0  # draw
node.update_recursive(-leaf_value)
    def get_move_probs(self, state, temp=1e-3):
        for _ in range(self.n_playout):
            # Work on a deep copy so the real game state is untouched
            # (works for TicTacToe and for the Gomoku Board class below)
            cpy = copy.deepcopy(state)
            self._playout(cpy)
acts, visits = zip(*[(a, n.n_visits)
for a, n in self.root.children.items()])
probs = softmax(np.log(np.array(visits) + 1e-10) / temp)
return list(acts), probs
    def update_with_move(self, last_move):
        # reuse subtree after actual move
        if last_move in self.root.children:
            child = self.root.children[last_move]
            child.parent = None
            self.root = child
        else:
            self.root = TreeNode(prior=1.0)

    def reset(self):
        # discard the search tree (used between self-play games)
        self.root = TreeNode(prior=1.0)
The _playout method runs one iteration of the MCTS process: selection, expansion, simulation, and backpropagation. By repeating these playouts many times, we build up increasingly accurate UCT scores to guide the next move.
The policy function returns the estimated action priors and a leaf value. For a pure MCTS model, it returns a uniform distribution over the legal moves as the priors and sets the leaf value to zero. You can use the following code to play the game.
# uniform random policy
def policy_fn_uniform(state: TicTacToe):
moves = state.availables
prob = 1 / len(moves)
return [(m, prob) for m in moves], 0
def play_vs_human(c_puct=1.4, n_playout=1000):
game = TicTacToe()
mcts = MCTS(policy_fn=policy_fn_uniform, c_puct=c_puct, n_playout=n_playout)
human = 0
while human not in (1, -1):
human = int(input("Choose side: X(1) or O(-1): "))
symbols = {1:'X', -1:'O', 0:'.'}
print("Board: 0-8 positions\n")
while True:
end, winner = game.game_end()
if end:
print("Result:", "Draw" if winner==0 else f"{symbols[winner]} wins")
break
if game.current_player == human:
move = -1
while move not in game.availables:
move = int(input(f"Your move {game.availables}: "))
game.do_move(move)
mcts.update_with_move(move)
else:
print("AI thinking...")
acts, probs = mcts.get_move_probs(game)
ai_move = acts[np.argmax(probs)]
print("AI move:", ai_move)
game.do_move(ai_move)
mcts.update_with_move(ai_move)
for i in range(3): # display board
print(' '.join(symbols[game.states[3*i+j]] for j in range(3)))
print()
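To try it out (more playouts make the pure-MCTS agent noticeably stronger, at the cost of thinking time):

# Start a terminal game against the pure-MCTS agent
play_vs_human(c_puct=1.4, n_playout=2000)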
AlphaZero
AlphaZero, introduced by DeepMind in 2017, is a general reinforcement-learning system that learns solely through self-play without any human game data. It uses a single neural network to predict both move probabilities (the policy) and state values.
Note: AlphaZero is a general version of AlphaGo Zero that can be applied to games other than Go, such as chess and shogi.

During gameplay, it employs a lightweight MCTS: instead of running many random rollouts to evaluate moves, it relies on the policy network to guide the search and on the value network to evaluate positions directly. This approach makes the search far more accurate and efficient, enabling AlphaZero to rapidly surpass human-level performance in games such as Go, chess, and shogi.
Gomoku Game Implementation
Now let’s implement AlphaZero to play a Gomoku game. The rules are simple: each player takes turns placing a stone of their colour on an empty intersection, and the first to form an unbroken line of five stones — horizontally, vertically, or diagonally — wins.
class Board:
def __init__(self, width=8, height=8, n_in_row=5):
self.width = width
self.height = height
self.n_in_row = n_in_row
self.players = [1, 2] # player IDs
self.states = {} # {move: player}
self.availables = [] # list of legal moves
self.current_player = None
self.last_move = -1
def init_board(self, start_player=0):
if self.width < self.n_in_row or self.height < self.n_in_row:
raise ValueError(f"Board must be at least {self.n_in_row} in both dimensions.")
self.current_player = self.players[start_player]
self.availables = list(range(self.width * self.height))
self.states.clear()
self.last_move = -1
def move_to_location(self, move):
"""Convert move index to (row, col)."""
return divmod(move, self.width)
def location_to_move(self, loc):
if len(loc) != 2:
return -1
move = loc[0] * self.width + loc[1]
return move if 0 <= move < self.width * self.height else -1
def current_state(self):
        state = np.zeros((4, self.height, self.width))  # (channels, rows, cols)
for m, p in self.states.items():
r, c = self.move_to_location(m)
plane = 0 if p == self.current_player else 1
state[plane, r, c] = 1.0
if self.last_move >= 0:
r, c = self.move_to_location(self.last_move)
state[2, r, c] = 1.0
# plane 3 indicates which player's turn (first player → all ones on even moves)
if len(self.states) % 2 == 0:
state[3, :, :] = 1.0
return state[:, ::-1, :]
def do_move(self, move):
self.states[move] = self.current_player
self.availables.remove(move)
self.last_move = move
self.current_player = (
self.players[1] if self.current_player == self.players[0]
else self.players[0]
)
def has_a_winner(self):
if len(self.states) < 2 * self.n_in_row - 1:
return False, -1
w, h, n = self.width, self.height, self.n_in_row
for m, p in self.states.items():
r, c = divmod(m, w)
if c <= w - n and all(self.states.get(m + i, -1) == p for i in range(n)):
return True, p # horizontal
if r <= h - n and all(self.states.get(m + i * w, -1) == p for i in range(n)):
return True, p # vertical
if r <= h - n and c <= w - n and all(self.states.get(m + i * (w + 1), -1) == p for i in range(n)):
return True, p # diag down-right
if r <= h - n and c >= n - 1 and all(self.states.get(m + i * (w - 1), -1) == p for i in range(n)):
return True, p # diag down-left
return False, -1
def game_end(self):
win, winner = self.has_a_winner()
if win:
return True, winner
if not self.availables:
return True, -1 # draw
return False, -1
def get_current_player(self):
return self.current_player
def graphic(self):
lines = []
for i in range(self.height - 1, -1, -1):
row = []
for j in range(self.width):
move = i * self.width + j
p = self.states.get(move, -1)
if p == self.players[0]: row.append('X')
elif p == self.players[1]: row.append('O')
else: row.append('.')
lines.append(' '.join(row))
return "\n".join(lines)
def __str__(self):
return self.graphic()
Note: I strongly recommend starting with a 6×6 board and n_in_row = 4, as it converges much faster than an 8×8 board.
By default this environment uses an 8×8 board with n_in_row = 5 (you can tweak both to try other setups). The neural network takes a NumPy array of shape (batch, 4, height, width) as input, where each board is split into four channels:
Channel 0: current player’s stones
Channel 1: opponent’s stones
Channel 2: a mark on the last move
Channel 3: a constant plane set to 1 if it’s player 1’s turn
This compact 4-channel encoding fully captures the game state for the network. (Note: to visualise the board, you can call the graphic() function.)
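As a quick sanity check of the encoding (a minimal sketch; the two moves are arbitrary):

board = Board(width=6, height=6, n_in_row=4)
board.init_board()
board.do_move(14)   # player 1 plays square 14
board.do_move(15)   # player 2 plays square 15
print(board)                           # ASCII view via graphic()
print(board.current_state().shape)     # (4, 6, 6): the four input planes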
Differences between MCTS & AlphaZero
Classic MCTS uses random rollouts to evaluate leaf nodes and manually sets prior probabilities, but AlphaZero swaps both for a single neural network: its value head instantly estimates a position’s value and its policy head provides move priors.
We can reuse the same MCTS framework — just replace full-game simulations with the value head and handcrafted priors ‘P’ with the policy head — to achieve far stronger play with fewer playouts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """A 2-layer residual block."""
def __init__(self, channels):
super().__init__()
self.net = nn.Sequential(
nn.Conv2d(channels, channels, 3, padding=1, bias=False),
nn.GroupNorm(4, channels),
nn.ReLU(inplace=True),
nn.Conv2d(channels, channels, 3, padding=1, bias=False),
nn.GroupNorm(4, channels)
)
def forward(self, x):
return F.relu(self.net(x) + x)
class AlphaNet(nn.Module):
"""
In: (B, 4, H, W)
Out: policy log-probs (B, H*W), value (B,)
"""
def __init__(self, width=8, height=8):
super().__init__()
self.stem = nn.Sequential(
nn.Conv2d(4, 128, 3, padding=1, bias=False),
nn.GroupNorm(4, 128),
nn.ReLU(inplace=True),
            *[ResidualBlock(128) for _ in range(4)]  # four independent residual blocks
)
# policy head
self.policy = nn.Sequential(
nn.Conv2d(128, 2, 1, bias=False),
nn.GroupNorm(2, 2),
nn.ReLU(inplace=True),
nn.Flatten(),
nn.Linear(2 * width * height, width * height),
nn.LogSoftmax(dim=1)
)
# value head
self.value = nn.Sequential(
nn.Conv2d(128, 1, 1, bias=False),
nn.GroupNorm(1, 1),
nn.ReLU(inplace=True),
nn.Flatten(),
nn.Linear(width * height, 128),
nn.ReLU(inplace=True),
nn.Linear(128, 1),
nn.Tanh()
)
def forward(self, x):
out = self.stem(x)
policy = self.policy(out)
value = self.value(out).view(-1)
return policy, value
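A quick shape check of the two heads (a minimal sketch with a dummy input batch):

net = AlphaNet(width=6, height=6)
dummy = torch.zeros(2, 4, 6, 6)          # (batch, channels, height, width)
log_probs, value = net(dummy)
print(log_probs.shape, value.shape)      # torch.Size([2, 36]) torch.Size([2])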
Self Game Play & Dirichlet Distribution
To collect training data, we let the model play against itself, recording each position as a (state, MCTS move probabilities, outcome) triple. However, pure self-play can quickly converge on a narrow set of moves and risk overfitting to its own strategy.
To maintain exploration, AlphaZero injects Dirichlet noise into the root-node priors — mixing network-generated probabilities with random samples — so the search stays diverse and uncovers stronger lines over time.
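Concretely, the noisy priors take the form below; in the code that follows, ε = 0.25 and α = 0.3, and (as a simplification of the original scheme, which perturbs the root-node priors directly) the noise is mixed into the MCTS move distribution at sampling time:

$$P'(s, a) = (1 - \varepsilon)\, p_a + \varepsilon\, \eta_a, \qquad \boldsymbol{\eta} \sim \mathrm{Dir}(\alpha)$$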
def policy_value_fn(board, net):
state = board.current_state()[None, ...] # (1,4,w,h)
state_tensor = torch.from_numpy(state).float().to(next(net.parameters()).device)
net.eval()
with torch.no_grad():
log_probs, value = net(state_tensor)
probs = np.exp(log_probs.cpu().numpy().flatten())
action_priors = [(m, probs[m]) for m in board.availables]
return action_priors, value.item()
def start_self_play(net, board_width, board_height, n_in_row,
playouts, c_puct, temp=1e-3):
board = Board(board_width, board_height, n_in_row)
board.init_board()
mcts = MCTS(policy_fn=lambda b: policy_value_fn(b, net),
c_puct=c_puct, n_playout=playouts)
states, mcts_probs, players = [], [], []
while True:
acts, probs = mcts.get_move_probs(board, temp)
prob_array = np.zeros(board.width * board.height)
prob_array[list(acts)] = probs
# Apply Dirichlet Noise
noise = np.random.dirichlet([0.3] * len(probs))
move = np.random.choice(acts, p=0.75 * probs + 0.25 * noise)
states.append(board.current_state())
mcts_probs.append(prob_array)
players.append(board.get_current_player())
board.do_move(move)
mcts.update_with_move(move)
end, winner = board.game_end()
if end:
# Compute Z: win +1, lose -1, draw 0
z = np.zeros(len(players))
if winner != -1:
z = np.where(np.array(players) == winner, 1.0, -1.0)
mcts.reset()
return winner, list(zip(states, mcts_probs, z))
We can also accelerate the data-collection process by running the environment in parallel — for example, if your machine has 12 CPU cores, you could launch 12 workers to gather data simultaneously.
def _self_play_worker(net_state, board_cfg, playouts, c_puct, temp):
net = AlphaNet(board_cfg["w"], board_cfg["h"]).cpu()
net.load_state_dict(net_state)
winner, play_data = start_self_play(
net,
board_width=board_cfg["w"],
board_height=board_cfg["h"],
n_in_row=board_cfg["n"],
playouts=playouts,
c_puct=c_puct,
temp=temp
)
return winner, play_data
from joblib import Parallel, delayed
def collect_selfplay_data(net, data_buffer, board_cfg, playouts, c_puct, temp=1.0, n_games=12):
net_state = {k: v.cpu() for k, v in net.state_dict().items()} # Move weight to cpu
args_list = [
(net_state, board_cfg, playouts, c_puct, temp)
for _ in range(n_games)
]
results = Parallel(n_jobs=n_games, backend="loky")(delayed(_self_play_worker)(*args) for args in args_list)
for winner, play_data in results:
for state, mcts_prob, z in play_data:
data_buffer.append((state, mcts_prob, z))
During training, we use a separate loss for each head: v_loss is the MSE between the predicted value and the actual game outcome, and p_loss is the cross-entropy between the predicted priors and the MCTS target priors (equivalently, maximising the log-likelihood of the targets).
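Together these terms match the AlphaZero training objective, where z is the game outcome, v the value-head prediction, π the MCTS visit distribution, and p the policy-head output (the L2 regularisation term corresponds to the weight_decay=1e-4 passed to Adam in the training loop below):

$$\ell = (z - v)^2 \;-\; \boldsymbol{\pi}^{\top} \log \mathbf{p} \;+\; c\,\lVert \theta \rVert^{2}$$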
def train_step(net, optimizer, scheduler, scaler, device, loader, epochs):
net.train()
data_iter = iter(loader)
total_loss, total_entropy = 0.0, 0.0
for _ in range(epochs):
try:
state_b, prob_b, winner_b = next(data_iter)
except StopIteration:
data_iter = iter(loader)
state_b, prob_b, winner_b = next(data_iter)
state_b = state_b.float().to(device)
prob_b = prob_b.float().to(device)
winner_b= winner_b.float().to(device)
optimizer.zero_grad()
with torch.amp.autocast(device_type=device.type):
log_probs, value = net(state_b)
v_loss = F.mse_loss(value.view(-1), winner_b)
p_loss = -torch.mean((prob_b * log_probs).sum(dim=1))
loss = v_loss + p_loss
if scaler: # mixed precision
scaler.scale(loss).backward()
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
scaler.step(optimizer)
scaler.update()
else:
loss.backward()
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
optimizer.step()
entropy = -torch.mean((torch.exp(log_probs) * log_probs).sum(dim=1))
total_loss += loss.item()
total_entropy += entropy.item()
scheduler.step()
return total_loss / epochs, total_entropy / epochs
Training Loop & Tips for AlphaZero
Finally, here is the main training loop. It will take around 12 hours to train on a Colab L4 GPU. To further improve the model’s performance, here are some tips you can try:
Dynamic playout budget: start with fewer MCTS playouts per move in early training and ramp up as the network strengthens — focus compute where it helps most.
Adaptive c_puct: decay your exploration constant over time so you explore heavily at first, then shift toward exploitation as policy/value accuracy improves.
Pure MCTS benchmark: periodically play your trained network against a pure-MCTS (no neural net) opponent to directly measure the model’s strategic strength.
Temperature annealing: start self-play with a higher action-sampling temperature (e.g. 1.0) and gradually lower it toward 0.1 as training progresses to sharpen your policy (see the sketch after this list).
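As a rough illustration of the playout and temperature schedules mentioned above, they could look something like this (a minimal sketch; the breakpoints and rates are placeholders, not tuned values):

def playout_schedule(iteration):
    """Dynamic playout budget: spend more MCTS playouts as the network strengthens."""
    if iteration < 500:
        return 100
    if iteration < 1000:
        return 200
    return 400

def temperature_schedule(iteration, total_iterations=2000):
    """Anneal the self-play sampling temperature from 1.0 down to 0.1."""
    frac = iteration / total_iterations
    return max(0.1, 1.0 - 0.9 * frac)

# Inside the training loop you could then call, for example:
#   playouts = playout_schedule(it)
#   temp = temperature_schedule(it, num_iterations)
# before collect_selfplay_data(...).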
import os
import math
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from collections import deque
from joblib import Parallel, delayed
output_dir = 'Put your dir here'
os.makedirs(output_dir, exist_ok=True)
c_puct = 3
epochs = 3
playouts = 200
batch_size = 256
learn_rate = 1e-3
buffer_size = 5000
num_iterations = 2000
loss_history = []
board_cfg = {'w': 6, 'h': 6, 'n': 4}
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
net = AlphaNet(board_cfg["w"], board_cfg["h"]).to(device)
optimizer = torch.optim.Adam(net.parameters(), lr=learn_rate, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[500, 1000], gamma=0.5)
scaler = torch.amp.GradScaler('cuda') if device.type == 'cuda' else None  # mixed precision only on GPU
data_buffer = deque(maxlen=buffer_size)
for it in range(1, num_iterations + 1):
collect_selfplay_data(net, data_buffer, board_cfg,
playouts, c_puct, temp=1.0)
if len(data_buffer) < batch_size:
continue
loader = DataLoader(list(data_buffer), batch_size=batch_size, shuffle=True)
loss, entropy = train_step(net, optimizer, scheduler, scaler, device, loader, epochs)
if loss is not None:
loss_history.append(loss)
if it % 200 == 0:
snap_path = os.path.join(output_dir, f"snapshot_it{it}.pth")
torch.save(net.state_dict(), snap_path)
print(f"[Iter {it}] Snapshot saved to {snap_path}; Loss = {loss}")
print("Training finished!")
You can use the following code to play against the AI. Note that you can choose a relatively large n_playout during inference; this lets the model evaluate more future positions and further increases its win rate.
Note: using a lower temperature (e.g., 0.001) sharply peaks the action probability distribution, making move selection nearly deterministic.
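To see the effect, compare the same visit counts under two temperatures (a small standalone illustration with made-up counts; the transform mirrors the one in get_move_probs):

def visits_to_probs(visits, temp):
    logits = np.log(np.array(visits, dtype=float) + 1e-10) / temp
    e = np.exp(logits - logits.max())
    return e / e.sum()

visits = [40, 30, 20, 10]
print(visits_to_probs(visits, temp=1.0))    # ~[0.40, 0.30, 0.20, 0.10], proportional to visits
print(visits_to_probs(visits, temp=0.001))  # ~[1.00, 0.00, 0.00, 0.00], essentially argmax

With that in mind, here is the interactive play loop: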
def human_vs_ai(model_path, board_width=6, board_height=6,\
n_in_row=4, c_puct=3, n_playout=500):
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
net = AlphaNet(board_width, board_height).to(device)
net.load_state_dict(torch.load(model_path, map_location=device))
net.eval()
board = Board(board_width, board_height, n_in_row)
board.init_board(start_player=0)
mcts = MCTS(lambda b: policy_value_fn(b, net), c_puct=c_puct, n_playout=n_playout)
while True:
print("\nCurrent board:")
print(board.graphic())
end, winner = board.game_end()
if end:
if winner == -1:
print("Game over: Draw") # -1 indicates no winner
else:
print(f"Game over: Player {winner} wins!")
break
# Human player's turn
if board.get_current_player() == board.players[0]:
while True:
try:
# Prompt user for move input in "row,col" format
raw = input("Enter your move (row,col), e.g. 2,3: ")
row, col = map(int, raw.strip().split(','))
move = board.location_to_move((row, col))
if move in board.availables:
break
else:
print("That position is not available. Try again.")
except Exception:
print("Invalid input format. Please enter two integers as 'row,col'.")
board.do_move(move)
mcts.update_with_move(move)
# AI player's turn
else:
# Get move probabilities from MCTS
# Choose move with highest probability
acts, probs = mcts.get_move_probs(board, temp=1e-3)
move = acts[np.argmax(probs)]
print(f"AI chooses move: {board.move_to_location(move)}")
board.do_move(move)
mcts.update_with_move(move)
Practice: Othello and Other Board Games
In this article, I introduced how to leverage Monte Carlo Tree Search (MCTS) to play tic-tac-toe and built an AlphaZero agent for Gomoku. But the power of AlphaZero is not limited to these games; in fact, any board game that can be expressed as a grid of states — whether it’s Othello (Reversi), Checkers, Chess, or even more exotic turn-based contests — can be tackled with the same neural-network and MCTS framework.
By setting up the game environment and redefining the model’s input/output representation, you can repurpose this general algorithm for new games in just a few lines of code (a minimal Othello interface sketch follows the list below). Here are some game-environment libraries you can explore:
OpenAI Gym — Classic control and board-game environments (e.g. Tic-Tac-Toe, Connect Four).
PettingZoo — A wide range of multi-agent environments, including many turn-based games.
RLCard — Specialised for card and board games (e.g. Blackjack, Uno, Mahjong).
OpenSpiel — DeepMind’s collection of environments for general game-playing research.
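For Othello specifically, a possible starting point is a board class that exposes the same interface the MCTS and self-play code above expects (availables, current_player, do_move, game_end, current_state). The main extra wrinkle compared with Gomoku is that legal moves depend on which discs would be flipped, and a player with no legal move must pass. The skeleton below is only a sketch: the class name and method bodies are placeholders for you to fill in.

class OthelloBoard:
    """Skeleton of an Othello environment mirroring the Board interface above."""
    def __init__(self, size=8):
        self.size = size
        self.players = [1, 2]
        self.states = {}         # {move: player}
        self.availables = []     # must be recomputed after every move
        self.current_player = None
        self.last_move = -1

    def init_board(self, start_player=0):
        # Place the four starting discs in the centre, set current_player,
        # and compute the initial list of legal moves.
        raise NotImplementedError

    def current_state(self):
        # Return a (4, size, size) array with the same channel layout as Gomoku:
        # own discs, opponent discs, last move, and a turn-indicator plane.
        raise NotImplementedError

    def do_move(self, move):
        # Place a disc, flip every captured line, switch players,
        # and handle the case where the next player has no legal move (pass).
        raise NotImplementedError

    def game_end(self):
        # The game ends when neither player can move;
        # the winner is the player with more discs (or -1 for a draw).
        raise NotImplementedError

    def get_current_player(self):
        return self.current_player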
Finally, I hope you enjoyed this article. I will write more articles related to AI, including explanations of their underlying principles and how to implement them.
If that sounds interesting to you, feel free to follow me. 👏 😁