Evaluating agents and tasks with Nash averaging

Jul 29th, 2020

Tags: machine learning, game theory, reinforcement learning, basketball, nash averaging


Tremendous effort has gone towards optimizing an ever-growing zoo of machine learning agents against tasks. In deep learning, thanks to the ease of collecting image and text data, we’ve seen step changes of performance on image classification and text generation. Using the very same networks paired with the endless data obtainable from simulations, we’ve trained reinforcement learning agents to superhuman performance on Atari games. See this plot from a 2017 paper:

Screenshot 2020-07-27 14.49.38

As an agent is trained, it is evaluated against a suite of 57 Atari games. Each game has a different scoring system and a different human baseline, so the agent’s performance on each game is normalized against the human baseline and then the median score over all games is reported. Using the median implies some notion of equal-weighting – each game is equally valuable. However, if you were to add 100 easy Atari games, like variants of Breakout that colored the tiles differently, you would naturally see the median human-normalized score go up. One would conclude that agent performance exceeded superhuman performance on the median game at an even earlier stage. But the agents haven’t changed at all, we simply engineered the task set to be easier.

We spend all this effort smashing new algorithms against the same set of tasks, and yet we don’t really think about how we can measure and optimize the evaluation itself. Just as human skill is built by seeking out harder challenges, we should expect machine learning innovation to be driven by migrating to harder challenges. To do so, we should demand better evaluation methods. We would expect a good evaluation method to have three properties:

  1. Invariance (adding redundant copies of agents or tasks makes no difference)
  2. Continuous (robust to small changes in data)
  3. Interpretable (subjective, but usually implies that the method is well-founded and easy to introspect)

As we’ve shown in the simple thought experiment above, the uniform averaging of tasks (implied by reporting mean or median performance) is not invariant. Can we do better?

Evaluating agents v. agents

Let’s explore an alternative method to uniform averaging (reporting mean or median performance) that works for both evaluating agents v. agents as well as the motivating example of agents v. tasks. The ideas here are inspired by the notion of Elo score, symmetric two-player games, and Nash equilibria. We begin with agents v. agents. In games like chess, Go, StarCraft II and others, we can pit agents (players) against each other in matchups of their skill and observe the likelihood of players beating each other. The key assumption is that these games are zero-sum: when one player wins, the other loses to the same degree. This assumption is to make analysis easier, we can easily generalize to general-sum games.

If we have \(N\) agents (players), we can pit each pair \(i, j\) of unique agents against each other. If we do this equally for all pairs of agents, we can construct a matrix \(A\) where \(a_{ij}\) is the number of net victories in the agent pair \((i, j)\), positive if \(j\) won more times, negative if \(i\) won more times, and zero if they’re equally matched. This matrix has \(a_{ij} = -a_{ji}\) , so it is antisymmetric by construction.

Here’s an example matrix I’ve written down for a simple game of rock-paper-scissors:

  Rock Paper Scissors
Rock 0 -1 1
Paper 1 0 -1
Scissors -1 1 0

Consider a new two-player game constructed on this matrix where Rob and Carly independently and simultaneously choose a player from \(1\) through \(N\). They accrue dollars according to the entry \(a_{ij}\) that they chose. The best strategy for this game is to find the Nash equilibrium of the game and play that. The Nash equilibrium is either a pure strategy (pick a single player) or a mixed strategy (pick different players according to some probability distribution) where Rob’s best option is to play the Nash if he knows Carly is playing it, and vice-versa. Since the mean of \(A\) is zero, the value of this game when Rob and Carly play the Nash against each other is also zero. It guarantees that the worst Rob (or Carly) could do is zero. If Rob played the Nash and Carly played a different strategy he could do better than zero, but never worse.

In the example of rock-paper-scissors above, our Nash equilibrium is simply [1/3, 1/3, 1/3]: we play each agent with equal probability. This guarantees the game’s value of 0. If we play against another agent who only plays rock, we could exploit this for a higher value by switching to a pure paper strategy, but we could in turn be exploited by a pure scissor player.

The Nash equilibrium on this new game has some interesting properties. If it has multiple agents in its support, then those agents are roughly equivalent. If the game were such that there were really a single best agent (common in high-skill games like chess and Go), the Nash equilibrium would place all of its mass on the best agent. On the other hand, if there are multiple dimensions along which agents can vary and they get compared on several of these, there might not be a single best agent.

Moreover, if we compute \(s = A p\), we get a vector where each value \(s_i\) corresponds to the value of playing the \(i\)th agent against the Nash mixture of agents, or in other words the relative skill of each agent when compared to the best possible mixture strategy. This measure of relative skill is invariant to adding agents that are similar to agents already in the population: if we add a new agent that performs like an existing one, then the reported skill will be very similar as well. With this approach, we’ve arrived at a way of evaluating agents that is invariant and interpretable. Unfortunately, since solutions to linear programs aren’t continuous, our Nash averaging approach to evaluation is not continuous.

Evaluating agents v. tasks

This concept can be extended to asymmetric two-player games. Consider the problem of evaluating agents v. task performance. If we have \(M\) agents and \(N\) tasks, we construct the \(M \times N\) matrix \(S\) , where \(s_{ij}\) is the performance of agent \(i\) on task \(j\). After normalizing the min/max range across columns (tasks) and subtracting the mean of the total matrix, we can construct a matrix

\[A = \begin{pmatrix} 0 & S \\ -S^T & 0 \end{pmatrix}\]

We can then apply our framework from earlier, except in this case we modify the game where Rob and Carly independently and simultaneously choose both an agent as well as a task. Rob’s agent is evaluated against Carly’s task, and Carly’s agent is evaluated against Rob’s task. As such, the Nash equilibrium is actually two probability distributions, one over the agents and one over the tasks.

Example: 2018-2019 NBA teams

Let’s try this out on some real-world data. Data on the 2018-2019 NBA season’s team v. team matchup win and loss rate can be found here. Converting this into net wins for each entry, we get an antisymmetric \(A\) matrix of size \(30 \times 30\) .

The maximum-entropy Nash equilibrium is a vector \(p\) of size \(30 \times 1\). To find it, we solve two linear programs, the first for the value of the game to the row player:

\[\begin{array}{ll@{}ll} \text{maximize} & v &\\ \text{subject to}& p^T A \geq v &\\ & \sum_i p_i = 1 & \\ & p_{i} \in [0, 1], &i=1 ,..., 30 \end{array}\]

and the second to find the Nash (using the \(v^*\) from the solution to the first):

\[\begin{array}{ll@{}ll} \text{maximize} & -\sum_i p_i \log p_i &\\ \text{subject to}& p^T A \geq v^* &\\ & \sum_i p_i = 1 & \\ & p_{i} \in [0, 1], &i=1 ,..., 30 \end{array}\]

Using cvxpy to solve the aforementioned programs, we find the following as our NBA Nash super-team:

Team Nash Probability %
San Antonio Spurs 30.5
Milwaukee Bucks 29.3
Oklahoma City Thunder 18.3
Denver Nuggets 8.5
Houston Rockets 7.3
Golden State Warriors 3.7
Sacramento Kings 2.4

If we were to sort the teams by skills using both the usual W/L % as well as the Nash skill ranking, we would get the following (higher teams are better, left is W/L % sorting, right is Nash sorting, and blue/red corresponds to East/West conferences respectively):

Screenshot 2020-07-27 21.09.59

Notice how there is a significant reordering of red (Western) conference teams towards the top in the Nash skill ranking. Because the league of 30 teams is divided into Eastern and Western conferences, and the Western conference is notoriously more competitive, we can see that a few Western conference teams with poor rankings (due to playing against other good Western conference teams) are actually some of the highest-skill teams in the league.

Most surprisingly, the Toronto Raptors, the team that ended up winning the championship that year, didn’t make it into the Nash equilibrium or the top 10 teams based on Nash ranking, despite being the 2nd highest ranked team by W/L %. Going into the championship series against the Golden State Warriors, the Raptors were the underdogs at an implied probability of winning of ~25%.

The reason this procedure even provides something meaningful for the Nash equilibrium is due to the nature of basketball. Due to non-transitive relations between team strategies and random noise from limited observations, we don’t observe a Nash that collapses to one team. In fact, there’s quite a few teams that are at the top of the league, and the winning strategy (to maximize absolute wins) is to randomly mix between them.

Example: reinforcement learning agents v. Atari games

This framework was also used to resolve the problem I presented at the very beginning: how can we do better evaluation of reinforcement learning agent performance?

In Fig 1A, we see the Nash mixture over the agents (reinforcement learning algorithms + human baseline) and the tasks (Atari games). It’s immediately clear that humans are still competitive: they have the most weight in the agent mixture, surpassing 30%. Of course, humans aren’t better on all counts, which is why we have mixes of the other autonomous agents. In Fig 1B, we can also see that the Nash mixture over tasks only considers 5 of the hardest games.

Screenshot 2020-07-28 22.49.43

Using the found Nash equilibria, we can also compute the agent skills and task difficulties by weighting the achieved scores on each agent/task by the weight assigned to it.

Screenshot 2020-07-28 22.50.31

These figures were taken from Balduzzi et al’s Re-evaluating evaluation paper.


This analysis has been quite oversimplified from the initial game theory research that spawned this. There’s a lot more depth to be explored here, such as using Nash equilibria to perform a so-called “Nash clustering” of the agents, visualizing the strategy landscape using a Schur decomposition, and what the geometry of real-world games looks like.

Here are the papers I referenced that are immediately relevant to the examples here:

And others that might be of interest: