蒙特卡洛树搜索：Tic-Tac-Toe的实施-Java 学习之路

编辑：如果您想查看是否可以让AI更好地执行，请上传完整的源代码：https://www.dropbox.com/s/ous72hidygbnqv6/MCTS_TTT.rar

编辑：搜索搜索空间并找到导致丢失的移动 . 但是由于UCT算法，不会经常访问导致损失的移动 .

要了解MCTS（蒙特卡罗树搜索），我已经使用该算法为经典的井字游戏制作AI . 我使用以下设计实现了算法：

MCTS stages
树策略基于UCT，默认策略是在游戏结束前执行随机移动 . 我在实现中观察到的是，计算机有时会进行错误的移动，因为它无法直接导致特定的移动 .

例如：
Tic Tac Toe example
注意动作6（红色方块）的值略高于蓝色方块，因此计算机标记了这个位置 . 我认为这是因为游戏政策是基于随机移动的，因此很有可能人类不会将"2"放在蓝框中 . 如果玩家没有在蓝色框中放置2，那么计算机就会赢得胜利 .

My Questions

1）这是MCTS的已知问题还是实施失败的结果？

2）有什么可能的解决方案？我正在考虑将这些动作限制在选择阶段，但我不确定:-)

核心MCTS的代码：

//THE EXECUTING FUNCTION
    public unsafe byte GetBestMove(Game game, int player, TreeView tv)
    {

        //Setup root and initial variables
        Node root = new Node(null, 0, Opponent(player));
        int startPlayer = player;

        helper.CopyBytes(root.state, game.board);

        //four phases: descent, roll-out, update and growth done iteratively X times
        //-----------------------------------------------------------------------------------------------------
        for (int iteration = 0; iteration < 1000; iteration++)
        {
            Node current = Selection(root, game);
            int value = Rollout(current, game, startPlayer);
            Update(current, value);
        }

        //Restore game state and return move with highest value
        helper.CopyBytes(game.board, root.state);

        //Draw tree
        DrawTree(tv, root);

        //return root.children.Aggregate((i1, i2) => i1.visits > i2.visits ? i1 : i2).action;
        return BestChildUCB(root, 0).action;
    }

    //#1. Select a node if 1: we have more valid feasible moves or 2: it is terminal 
    public Node Selection(Node current, Game game)
    {
        while (!game.IsTerminal(current.state))
        {
            List<byte> validMoves = game.GetValidMoves(current.state);

            if (validMoves.Count > current.children.Count)
                return Expand(current, game);
            else
                current = BestChildUCB(current, 1.44);
        }

        return current;
    }

    //#1. Helper
    public Node BestChildUCB(Node current, double C)
    {
        Node bestChild = null;
        double best = double.NegativeInfinity;

        foreach (Node child in current.children)
        {
            double UCB1 = ((double)child.value / (double)child.visits) + C * Math.Sqrt((2.0 * Math.Log((double)current.visits)) / (double)child.visits);

            if (UCB1 > best)
            {
                bestChild = child;
                best = UCB1;
            }
        }

        return bestChild;
    }

    //#2. Expand a node by creating a new move and returning the node
    public Node Expand(Node current, Game game)
    {
        //Copy current state to the game
        helper.CopyBytes(game.board, current.state);

        List<byte> validMoves = game.GetValidMoves(current.state);

        for (int i = 0; i < validMoves.Count; i++)
        {
            //We already have evaluated this move
            if (current.children.Exists(a => a.action == validMoves[i]))
                continue;

            int playerActing = Opponent(current.PlayerTookAction);

            Node node = new Node(current, validMoves[i], playerActing);
            current.children.Add(node);

            //Do the move in the game and save it to the child node
            game.Mark(playerActing, validMoves[i]);
            helper.CopyBytes(node.state, game.board);

            //Return to the previous game state
            helper.CopyBytes(game.board, current.state);

            return node;
        }

        throw new Exception("Error");
    }

    //#3. Roll-out. Simulate a game with a given policy and return the value
    public int Rollout(Node current, Game game, int startPlayer)
    {
        Random r = new Random(1337);
        helper.CopyBytes(game.board, current.state);
        int player = Opponent(current.PlayerTookAction);

        //Do the policy until a winner is found for the first (change?) node added
        while (game.GetWinner() == 0)
        {
            //Random
            List<byte> moves = game.GetValidMoves();
            byte move = moves[r.Next(0, moves.Count)];
            game.Mark(player, move);
            player = Opponent(player);
        }

        if (game.GetWinner() == startPlayer)
            return 1;

        return 0;
    }

    //#4. Update
    public unsafe void Update(Node current, int value)
    {
        do
        {
            current.visits++;
            current.value += value;
            current = current.parent;
        }
        while (current != null);
    }

4 回答

6

因此，在任何基于随机的启发式中，您都无法搜索游戏空间的代表性样本 . 例如 . 从理论上讲，你可以随机抽样100次完全相同的序列，完全忽略丢失的相邻分支 . 这使得它与试图找到每一步的更典型的搜索算法区别开来 .

但是，这更有可能是实施失败 . 蜱虫的游戏树不是很大，约9！在移动一个，并迅速收缩，因此树搜索不能搜索每次移动以进行合理数量的迭代，因此不可能找到最佳移动 .

没有你的代码，我无法真正提供进一步的评论 .

如果我要猜测，我会说，也许你选择的是基于最大胜利次数的移动，而不是胜利的最大部分，因此通常偏向大多数时候搜索的移动选择 .

回复于 2024-05-03T22:41:12+08:00
5

我的第一个猜测是，你的算法的工作方式，选择最有可能赢得比赛的步骤（在终端节点中获胜最多） .

如果我是正确的，那么显示AI'失败'的示例不是“错误” . 这种估价方式会从敌人随机移动中获得收益 . 这种逻辑失败了，因为对于玩家而言，显然需要采取一步来赢得比赛 .

因此，您应该擦除包含播放器获胜的下一个节点的所有节点 .

也许我错了，只是第一次猜测......

回复于 2024-05-03T22:41:12+08:00
2
好的，我通过添加代码解决了这个问题：
```
//If this move is terminal and the opponent wins, this means we have 
        //previously made a move where the opponent can always find a move to win.. not good
        if (game.GetWinner() == Opponent(startPlayer))
        {
            current.parent.value = int.MinValue;
            return 0;
        }
```
我认为问题是搜索空间太小了 . 这确保即使选择确实选择实际上是终端的移动，也不会选择此移动，而是使用资源来探索其他移动:) .

现在人工智能与人工智能总是打平，艾未未像人类一样击败:-)
回复于 2024-05-03T22:41:12+08:00
1

我认为您的答案不应被标记为已被接受 . 对于Tic-Tac-Toe，搜索空间相对较小，应在合理的迭代次数内找到最佳动作 .

看起来您的更新功能（反向传播）会向不同树级别的节点添加相同数量的奖励 . 这是不正确的，因为当前的参与者在不同的树级别上是不同的 .

我建议你从这个例子中看看UCT方法中的反向传播：http://mcts.ai/code/python.html

您应该根据前一个玩家在特定级别计算的奖励更新节点的总奖励（示例中为node.playerJustMoved） .

回复于 2024-05-03T22:41:12+08:00

蒙特卡洛树搜索：Tic-Tac-Toe的实施

4 回答

相关问题