What is the best way of ranking ultimate teams based on their
performance? This fundamental question can be asked both in the context of a whole
season and for a single tournament. While USA Ultimate's use of a season-based ranking
algorithm has led to quite some discussion,
I will focus in this blog post on the simpler case of ranking teams at
tournaments, where factors such as
home-field advantage and changing rosters over a season play a
smaller role.
One possibility currently used in
the Swissdraw
format is that teams earn "Swiss points" at the completion of
every game. The number of Swiss points awarded depends on the point
differential (also called the margin) of the game. So far, we have been
using the following table to convert point differentials into Swiss
points:
There are a number of advantages to using the innovative Swissdraw format
compared to the more common pool format. All teams can potentially
play each other, and the format is designed so that teams of
similar strength match up quickly: within a few rounds, the ranking
of the teams reflects their level of play. This guarantees
attractive games against different opponents of comparable strength.
While I am personally a big fan of the Swissdraw format, I will argue that
the currently used system to rank teams has the problem that teams are
awarded the same number of Swiss points for a given point
differential, independent of the strength of the opponent. This
drawback has particularly bad consequences in big divisions with a
wide spread in level of play, where even in later rounds of the Swissdraw, teams
can still make big jumps in the ranking by winning or losing by large
margins. In this post, I suggest another method of
ranking the teams which will make the Swissdraw format work even
better.
Power Rankings
This method was suggested back in 1976 by Leake in the
context of ranking American college football teams. A nice
mathematical explanation of it can be found in Chapter 4 of Ken Massey's
undergrad thesis.
The basic assumption is that every team can be assigned a numerical
value representing its strength (or power) so that the point
differential in a game is the difference in strength of the
participating teams. For example, if Team Alice wins against Team
Danny with a score of 15-10, this result could be explained by
assigning a strength of +2.5 to Team Alice and a strength of −2.5 to
Team Danny. (Of course, any two numbers with a difference of +5 would
work, but let us try to keep the numbers as small as possible in
absolute value.)
If there are more teams with many games played among them, it will
become more difficult to assign strengths to the teams, but we can
nevertheless try to optimize these numbers so that they fit the
outcomes as well as possible. In fact, this problem is well-studied in
the area of mathematical regression.
In more mathematical terms, we assume that the game outcome yᵢⱼ
between teams i and j is the difference of their respective strengths
βᵢ − βⱼ plus some error term ϵᵢⱼ which is independent
and identically normally distributed for every game. Expressed in
matrix form, we can write

y = Xβ + ϵ
where y is a
column vector containing all game margins, and each row of the matrix X
corresponds to one game: it is all-zero except for a +1 in the column of
team i and a −1 in the column of team j. The column vector β contains the
strengths of the teams, and ϵ is the column vector of the
normally distributed error terms. There exist efficient
methods (such as the least-squares method)
to compute a strength vector β that minimizes the squared
error
∥y − Xβ∥².
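To make this concrete, here is a minimal sketch in Python (using numpy) of how such a strength vector could be computed. The game data is made up for illustration, except for Alice's 15-10 win over Danny mentioned above.

```python
import numpy as np

teams = ["Alice", "Bob", "Charlie", "Danny"]
index = {name: i for i, name in enumerate(teams)}

# Each game is (winner, loser, margin); only the Alice-Danny margin of +5
# comes from the example above, the other results are invented.
games = [("Alice", "Danny", 5), ("Bob", "Charlie", 3), ("Charlie", "Danny", 5)]

# Build the design matrix X and margin vector y described above:
# one row per game, +1 in the winner's column, -1 in the loser's column.
X = np.zeros((len(games), len(teams)))
y = np.zeros(len(games))
for row, (winner, loser, margin) in enumerate(games):
    X[row, index[winner]] = +1.0
    X[row, index[loser]] = -1.0
    y[row] = margin

# X is rank-deficient (adding a constant to all strengths changes nothing),
# so lstsq returns the minimum-norm solution, which matches the "keep the
# numbers as small as possible" convention from the Alice/Danny example.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, strength in sorted(zip(teams, beta), key=lambda t: -t[1]):
    print(f"{name:8s} {strength:+.2f}")
```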
Example
Let us consider a simple example based on a
made-up tournament with six teams. The game outcomes reflect the
imaginary fact that the level of play is spread out about evenly
among the top three teams and among the bottom three teams, with Team Alice being the best and Team Fred the worst team.
The results of the first round are as follows:
resulting in the following strength values and swiss points:
PowerRank denotes a team's rank according to its strength, whereas
SwissRank is the team's rank according to the number of Swiss
points earned so far.
All game outcomes can be perfectly explained with those strengths.
After the first round (and assuming no prior knowledge of the strength
of these teams), it is impossible to compare Team Alice with Team Bob,
because there is no connection between them yet.
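This notion of "connection" can be checked mechanically. Below is a small, hypothetical helper (not part of the ranking method itself) that groups teams into components of the "who played whom" graph: two teams are comparable only once a chain of games links them. The round-1 pairings are invented for illustration.

```python
# Hypothetical helper: union-find over the "who played whom" graph.
def connected_components(teams, games):
    parent = {t: t for t in teams}

    def find(t):
        # Follow parent pointers to the root, halving paths as we go.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b, _margin in games:
        parent[find(a)] = find(b)  # merge the two opponents' components

    groups = {}
    for t in teams:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())

# Illustrative round-1 pairings: Alice and Bob end up in different
# components, so their strengths cannot be compared yet.
round1 = [("Alice", "Danny", 5), ("Bob", "Charlie", 3)]
print(connected_components(["Alice", "Bob", "Charlie", "Danny"], round1))
```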
After a second round with the following results:
we can compute the following strength values, Swiss points and
corresponding ranks:
Notice that Team Charlie would be ranked first if sorted according to
Swiss points, as it had the largest margin (+5) in the second round
among the winners of the first round. Analogously, Team Danny would be
ranked last. However, the new method re-evaluates all previous games
from the point of view of the latest results, with the (correct) outcome that
assigning the biggest strength to Alice gives the best explanation of
the results. In fact, the power ranks after only two rounds already reflect the order of
teams we had in mind when making up the results.
The seventh column (entitled "predicted margin") is the difference in
current strength of the teams involved in a particular game, which can be
interpreted as the margin predicted by the strengths. The values in the
last column are the squared differences between the actually observed and
predicted margins. If such a value is high, the model could not predict
this game's outcome well. Hence, big values stand for surprising game
outcomes. The least-squares procedure tries to find strength
values that minimize the sum of the surprise values in the last
column.
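Continuing the numpy sketch from above, the predicted margins and surprise values take only a couple of lines; sorting by surprise reproduces an upset table like the one discussed further below.

```python
# Predicted margin for each game is beta_i - beta_j, i.e. one matrix product.
predictions = X @ beta
surprise = (y - predictions) ** 2  # squared difference: the "surprise" value

# List the games from most to least surprising.
for (winner, loser, margin), pred, s in sorted(
        zip(games, predictions, surprise), key=lambda g: -g[2]):
    print(f"{winner} over {loser}: observed {margin:+d}, "
          f"predicted {pred:+.2f}, surprise {s:.2f}")
```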
Playing one more round with results:
gives:
By now the teams are clearly separated in strength. However, notice
that the Swiss points still do not reflect the strengths of the teams
correctly. Sorting according to the number of wins as first criterion (and Swiss points
as second) would put Alice
in first place (she is the only one with three wins), but it would still place Team Eve ahead of Team Danny
(both have one win and two losses).
Let us examine the example graphically in the following
chart. Clicking on a series in the legend toggles its
visibility. Clicking on a particular point in the chart shows a
detailed explanation of how that strength was obtained.
For comparison, let us consider the graph of average Swiss points:
The Swiss score reflects the
correct order of the teams neither after round 2 nor after round
3. In contrast, the power ranking "gets it right" already after two
rounds. The frequent crossings of the lines indicate that teams make
jumps in their placement, as illustrated here (click on a series in the
legend to toggle the visibility of the power ranks):
Here are the final evaluations of the games, based on the teams'
strengths after Round 3, sorted by the
upset/surprise value.
The first line can be read as follows: based on the strengths computed
from all game results, the biggest surprise of all games happened
in the round-2 game where Team Charlie won against Team Danny with a
margin of +5, whereas the model predicted a margin of only +3.73.
Equipped with all this knowledge, you can dive into the power scores for
Windmill Windup and Wisconsin Swiss provided in the Further Analysis section at the end.
Conclusion
There exists a wide variety
of sports
ranking systems. Most of these systems do not reveal the details
of their algorithms. The power-rating system presented here is a very
basic variant that I suggest using for ranking teams, e.g. in the
Swissdraw phase of an ultimate tournament. As outlined in Chapter 4 of
Massey's
thesis, the system could be extended in various ways to account for
things like home-field advantage, blow-out scores etc. As illustrated in the
example above, it converges to the real ranking of the teams faster than the Swiss-point system.
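As a side note, one way such an extension could look is to append a home-advantage column to the design matrix X and fit it along with the strengths. This is my own sketch continuing the earlier numpy example, not Massey's exact formulation, and the home flags are made up.

```python
# Made-up per-game flags: +1 if the first-listed (winning) team was at home.
home = np.array([[1.0], [0.0], [1.0]])
X_home = np.hstack([X, home])

# The last entry of the solution estimates a common home-field advantage.
# With only three games this is underdetermined; real data has many more.
beta_home, *_ = np.linalg.lstsq(X_home, y, rcond=None)
print("estimated home advantage:", beta_home[-1])
```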
Here is a list of pros and cons compared to the currently used
Swiss-point system:
Pros:
- Your strength depends on the performance of your opponents: a
large win against a strong opponent counts more than one against a weak opponent.
- It converges faster to the "real" ranking, with smaller jumps in the rankings
from one round to the next.
- Strengths say more about teams than Swiss points do; e.g. the
difference in strength of two teams directly predicts the game outcome and point
differential. This also allows us to get a sense of "surprising"
results, which might be of interest to spectators live at a tournament
or when reporting about it afterwards.
Cons:
- Difficult to understand.
- Games of previous rounds are "re-evaluated", so you never have a
certain number of points for sure.
- Your strength depends on the performance of your opponents.
I think that the first concern can be mitigated by providing
interactive graphs like the ones above that explain to the teams how their
strength was computed. The other two disadvantages are inherent to the
system.
I am very curious to hear what you think about the suggested power ratings in
Ultimate. Do you see further advantages or disadvantages? Please leave
your comments below.
Further Analysis
To see how these power rankings apply to past events, please take a look at some in-depth analysis of 5 popular tournaments:
If you want to get involved in discussions about ranking teams and devising optimal ranking algorithms, please take a second to join this new Google Group:
https://groups.google.com/forum/#!forum/sport-stats
References:
Leake, R. J. (1976), "A Method for Ranking Teams with an Application
to 1974 College Football," Management Science in Sports, North-Holland.
Massey, Kenneth (1997), "Statistical Models Applied to the Rating of
Sports Teams", Honors Project: Mathematics, Bluefield College. http://masseyratings.com/theory/massey97.pdf
Christian Schaffner has been managing the Swissdraw schedule and scores
since 2009 at Windmill Windup, Europe's largest grass tournament. Since
then, he has been involved in promoting and advancing the Swissdraw
format for Ultimate tournaments. In his everyday life, he is a
researcher in quantum cryptology at the University of Amsterdam.
Nice work and it's good to see multivariate regression analysis being applied. It may be the simplest and easiest to use of the "advanced", mathematically sophisticated ranking algorithms, but I wonder if there were other reasons why you chose this one over the others such as minimizing the absolute errors rather than the squared errors, maximum likelihood, etc. 9:49 p.m. on October 11th, 2012
Hi Mike, thanks for your comment! The main reason was (as you say) that least squares seems to be the simplest and best studied of the methods. There are also a number of statistical reasons why it performs well (e.g. under some normality assumptions, least squares is actually the same as maximum likelihood).
Of course, other methods have been suggested and might work as well, see e.g. a paper by Bassett (http://www.jstor.org/stable/2685396) for minimizing the absolute errors. 8:46 a.m. on October 12th, 2012
Hi! I realize I'm six years too late to the party, but I played my first tournament under this system this weekend, so I thought I'd chip in in case someone is still monitoring the comment fields :-) There's a lot to say, but I'll keep it to two comments:
First of all, I think it's very interesting and positive that people are starting to use statistical methods like this. There's a subtlety in that I think “classical” systems are trying to find the team that did the best (by setting up points as prizes) and a statistical system tries to find the best team—both are fine, they're just different, and which one you prefer is a matter of taste. Teaching people exactly what's going on is a challenge, but OK, most people won't be understanding tiebreaks in regular systems either (and in this one, they nearly never happen).
However, I do find the underlying normality assumption too simplistic. (I also question the independence in a tournament where fatigue plays in, but it's probably harder to do anything about.) Goals in ultimate are probably a result of some kind of Poisson process (at least that's a somewhat more solid approximation), which doesn't map well to the Gaussian distribution when there are few goals, e.g. one team only manages to score 2. Especially when there are lopsided team strengths and/or results, the transitivity-of-goodness assumption also breaks down, and in particular, the system gets too sensitive to outliers—in extreme cases, you could even be asked to get impossible spreads in order not to lose score. (An interesting side note here is that if you're playing a statistical system, where win or loss doesn't really matter, you should allow draws!)
I don't have a definite answer here, and you'd certainly get a model that's even harder for people to understand, but I'd love to see something with more explicit statistical modelling, ie., set up some sort of distribution for each match and maximize the total likelihood. You no longer get the nice closed-form expression, so you'd have to go for some sort of hill-climbing (or minorize-maximization), but the functions tend to be free from local minima and converge really quickly even with thousands of teams involved.
Thanks! 8:43 a.m. on March 5th, 2018