Carnegie Mellon University, in collaboration with Facebook, developed an AI application called Pluribus that reliably beat five professional poker players playing simultaneously in the same game, or one pro pitted against five independent copies of itself. It’s a major leap forward in capability for the machines, and amazingly, it’s also far more efficient than previous agents. One-on-one poker is a weird game, but its zero-sum nature makes it susceptible to certain strategies by which a computer that can calculate far enough ahead can put itself at an advantage. But add four more players into the mix and things get real complex, real fast. With six players, the possible hands, bets and outcomes are so numerous that it is effectively impossible to account for all of them, especially with a minute or less to decide. It’d be like trying to exhaustively document every grain of sand on a beach between waves. Yet over 10,000 hands played against champions, Pluribus managed to win money at a steady rate, exposing no weaknesses or habits that its opponents could take advantage of. The secret is consistent randomness.
Pluribus was trained, like many
game-playing AI agents these days, not by studying how humans play but by
playing against itself. The training program used something called Monte Carlo
counterfactual regret minimization. It sounds like what happens when you have whiskey for breakfast after losing your shirt at the casino, and in a way it is, machine learning-style. Regret minimization just means that when the system finished a hand, it would then play that hand out again in different ways,
exploring what might have happened had it checked here instead of raised,
folded instead of called and so on. A Monte Carlo tree is a way of organizing
and evaluating lots of possibilities, akin to climbing a tree of them branch by
branch and noting the quality of each leaf you find, then picking the best one
once you think you’ve climbed enough. If you do it ahead of time (this is done in chess, for instance), you’re searching for the best move to make. But if
you combine it with the regret function, you’re looking through a catalog of
possible ways the game could have gone and observing which would have had the
best outcome.
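
To make that replay-and-compare loop concrete, here is a minimal Python sketch of regret matching, the core update inside counterfactual regret minimization, applied to rock-paper-scissors rather than poker. Everything in it (the toy game, the function names, the iteration count) is an illustrative stand-in, not anything from Pluribus itself: after each round the agent tallies how much better every alternative action would have done, then plays future rounds in proportion to that accumulated positive regret.

```python
import random

# Toy sketch of regret matching, the update at the heart of counterfactual
# regret minimization. Rock-paper-scissors stands in for poker here; this is
# purely illustrative and not Pluribus's actual training code.

ACTIONS = ["rock", "paper", "scissors"]

def payoff(a, b):
    """+1 if action a beats b, -1 if it loses, 0 on a tie."""
    wins = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
    if a == b:
        return 0
    return 1 if (a, b) in wins else -1

def strategy_from_regrets(regrets):
    """Play each action in proportion to its accumulated positive regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total == 0:
        return [1.0 / len(ACTIONS)] * len(ACTIONS)  # no regrets yet: play uniformly
    return [p / total for p in positive]

def train(iterations=100_000):
    regrets = [0.0] * 3
    opp_regrets = [0.0] * 3
    strategy_sum = [0.0] * 3
    for _ in range(iterations):
        strat = strategy_from_regrets(regrets)
        opp_strat = strategy_from_regrets(opp_regrets)
        me = random.choices(range(3), weights=strat)[0]
        opp = random.choices(range(3), weights=opp_strat)[0]
        # "Replay" the round: for every action we could have taken, record how
        # much better (or worse) it would have done than what we actually played.
        for a in range(3):
            regrets[a] += payoff(ACTIONS[a], ACTIONS[opp]) - payoff(ACTIONS[me], ACTIONS[opp])
            opp_regrets[a] += payoff(ACTIONS[a], ACTIONS[me]) - payoff(ACTIONS[opp], ACTIONS[me])
        strategy_sum = [s + p for s, p in zip(strategy_sum, strat)]
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]

if __name__ == "__main__":
    avg = train()
    print({a: round(p, 3) for a, p in zip(ACTIONS, avg)})
```

Run long enough, the averaged strategy settles near the unexploitable one-third mix for each throw, the same kind of consistent randomness that left Pluribus’s opponents with no habits to exploit.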