How the rating system works and why high ratings act strangely

**shannong** · 04-12-2011, 08:07 PM

Player complaints about the rating system used in Shadow Era periodically arise in the forums, so it seems useful to summarize in one place a "plain English" explanation of how the rating system works. Hopefully Kyle will see fit to sticky this thread.

Part 1 - Plain English summary of how the system works (without too much math)

The system is very similar to the well-proven TrueSkill system used in environments such as Xbox Live games. TrueSkill is conceptually a close relative of the Glicko rating system, which was itself an improvement on the ELO rating system used by most chess federations and by many tournament CCG environments, such as Magic: The Gathering. Use Wikipedia and Google if you want the gory details of these systems.

The system is designed to create a "normal distribution" (aka, a symmetrical "bell curve" with a very specific shape) of ratings throughout the entire Shadow Era player population. This means that not everyone can be an equally high-rated player. Essentially, only a small percentage of the total population can be "at the top". As the population grows (or shrinks), the "at the top" set might contain more players, but the total RATIO of "top players" to total population should remain essentially the same.
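To make the "same RATIO at the top" point concrete, here is a minimal sketch in Python. The mean of 250 comes from this post. The spread (`ASSUMED_SD`) and the function names are my own assumptions for illustration, not Shadow Era's real parameters.

```python
from statistics import NormalDist

MEAN_RATING = 250.0      # population mean quoted in this post
ASSUMED_SD = 250.0 / 3   # ASSUMPTION: puts 0 and 500 three standard deviations out

ratings = NormalDist(mu=MEAN_RATING, sigma=ASSUMED_SD)


def share_above(threshold: float) -> float:
    """Fraction of the player population expected to sit above `threshold`."""
    return 1.0 - ratings.cdf(threshold)


def expected_count_above(threshold: float, population: int) -> float:
    """Expected number of players above `threshold` for a given population size."""
    return share_above(threshold) * population


if __name__ == "__main__":
    for population in (1_000, 10_000, 100_000):
        count = expected_count_above(400, population)
        print(f"{population:>7} players -> about {count:.0f} above 400 "
              f"({share_above(400):.1%} of the population)")
```

The head count above 400 grows with the population, but the percentage printed on every line is the same, which is exactly the behavior described above.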

This means two very important things:

- Your rating is NOT a "score".
- Your rating cannot climb without limit: the higher it gets, the less it moves forward with each "win", even against a higher-rated opponent, and the more it will "fall" when you lose. The closer you get to the right side of the "bell curve", the harder the system works to slow your rightward movement and push you back towards the mean rating of the entire population (a rating of roughly 250).

The system is currently designed to have a rating range of 0-500. Players at the very top can "push past" the 500 mark, but not by much, because the closer you get to 500 and beyond, the greater the pressure pushing you back towards the mean.

This is counterintuitive to many players who have a passing familiarity with the ELO system used in Chess, MtG, and so on, because in the ELO system your rating always changes by a noticeable amount after every single match. The math behind the ELO system is greatly simplified and overlooks some real statistical constraints so that it can be easily calculated by hand with pen and paper. The ELO system dates from around 1960, long before cheap computing, and its creator knew full well where its weaknesses were.
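For reference, the standard ELO update fits in a few lines, which is why it could be done by hand. This is a sketch of the textbook formula. The K-factor of 32 is a common choice that I'm assuming here, and the function names are mine.

```python
def elo_expected(rating: float, opponent: float) -> float:
    """Expected score (between 0 and 1) for `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def elo_update(rating: float, opponent: float, score: float, k: float = 32.0) -> float:
    """New rating after one game; score is 1.0 for a win, 0.5 draw, 0.0 loss.

    k=32 is an ASSUMED K-factor; federations use different values.
    """
    return rating + k * (score - elo_expected(rating, opponent))


if __name__ == "__main__":
    # A heavy favorite still gains something for a win...
    print(elo_update(2000, 1500, 1.0) - 2000)
    # ...and loses much more for an upset loss.
    print(elo_update(2000, 1500, 0.0) - 2000)
```

Notice there is only one number per player here. A win always adds something and a loss always subtracts something, no matter how many games you've played.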

By contrast, in the TrueSkill system (a close cousin of the Glicko system), your rating actually comprises two values, one of which is hidden. The visible value is your rating, known as the "mu" value in TrueSkill. The hidden value is, in plain English, a "confidence" value: how confident the system is about your current rating. The TrueSkill system calls this value "sigma", and the Glicko system calls it the "ratings deviation" or "RD". In both systems a smaller value means the system is MORE sure of your rating.

Now, the perception that your Shadow Era rating is doing "weird" or "wrong" or "counterintuitive" things, especially if you have a comparatively high rating above 400, arises because every match compares both the mu and sigma values for yourself and your opponent, and the total result can seem "odd" compared to the straightforward simplicity of the ELO system. Some examples:

- You have a rating of 450. You win 5 games in a row and your rating (mu) doesn't move at all. You then lose one game and your rating (mu) drops by a noticeable amount. In the ELO system, you would still have earned some points for every win and lost some points for the loss.
- You have a rating of 450. You win a game against a lower-ranked opponent and yet your rating actually FALLS by 1 point to 449! omgwtfbbq??!!?! In the ELO system there's no way you would ever have lost points for a win.

All of these "odd" results make perfect sense within the TrueSkill/Glicko systems, and they are actually MORE accurate than the ELO system. I really want to drive that home: ELO might SEEM more intuitively "fair" and "accurate", but in fact it is neither, and the creator of the system knew it. Back when it was designed, it was just too messy to do the math to provide more accurate ratings by hand. Nowadays we have computers to do all this messy math in milliseconds.

To understand why the two "odd" effects in the bullet list above can happen, let's look at a Wikipedia quote about how the Glicko system works. Emphasis mine.

The Glicko rating system and the Glicko-2 rating system are chess rating systems similar to the Elo rating system: a method for assessing a player's strength in games of skill such as chess. It was invented by Mark Glickman as an improvement of the Elo rating system. The main idea is the introduction of a measurement for the ratings reliability called RD for ratings deviation.

Both Glicko and Glicko-2 rating systems are under public domain and found implemented on game servers online (like Free Internet Chess Server, Chess.com and SchemingMind). The formulas used for the systems can be found on the Glicko website.

The RD measures the accuracy of a player's rating. For example, a player with a rating of 1500 and an RD of 50 has a real strength between 1400 and 1600 with 95% confidence. Twice the RD is added and subtracted from their rating to calculate this range. After a game, the amount the rating changes depends on the RD: the change is smaller when the player's RD is low (since their rating is already considered accurate), and also when their opponent's RD is high (since the opponent's true rating is not well known, so little information is being gained). The RD itself decreases after playing a game, but it will increase slowly over time of inactivity.

The bit that I highlighted is equally true in the TrueSkill system. If your rating is very high, you might win lots of games in a row without your rating increasing even one point because of this effect, if, for example, all of your opponents were lower rated than you and/or had a very wide sigma value (the RD value, aka "confidence"). To grossly oversimplify, the system only gives you a higher rating when it DOES NOT EXPECT YOU TO WIN, and the higher-ranked you are, the MORE the system expects you to win. So you win 7 straight games, your rating does not budge because of this effect, and then you lose one match and your rating suddenly falls drastically. Why? Because the system ALWAYS lowers your rating if you were expected to win but you did not! These two behaviors are what I refer to as "increasing pressure to always push high-rated players back towards the mean". The closer you get to the right tail of the bell curve (the closer you get to 500 rating or higher), the stronger this pressure becomes.
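For anyone who wants to poke at the numbers, here is a rough sketch of a single-game update using the published Glicko-1 formulas. To be clear about assumptions: this is NOT Shadow Era's actual code, the ratings are on Glicko's chess-style scale (centered around 1500) rather than the 0-500 scale, and the function names are my own.

```python
import math

Q = math.log(10) / 400.0  # Glicko-1 scaling constant


def g(rd: float) -> float:
    """Dampening factor: shrinks toward 0 as the opponent's RD grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q * Q * rd * rd / math.pi ** 2)


def expected(r: float, r_opp: float, rd_opp: float) -> float:
    """Expected score against an opponent, discounted by the opponent's RD."""
    return 1.0 / (1.0 + 10 ** (-g(rd_opp) * (r - r_opp) / 400.0))


def glicko_update(r, rd, r_opp, rd_opp, score):
    """Return (new_rating, new_rd) after one game; score is 1.0, 0.5 or 0.0."""
    g_opp = g(rd_opp)
    e = expected(r, r_opp, rd_opp)
    d2_inv = Q * Q * g_opp * g_opp * e * (1.0 - e)   # this is 1 / d^2
    denom = 1.0 / (rd * rd) + d2_inv
    new_r = r + (Q / denom) * g_opp * (score - e)
    new_rd = math.sqrt(1.0 / denom)
    return new_r, new_rd


def interval95(r, rd):
    """The 'rating plus or minus twice the RD' range from the quote."""
    return (r - 2 * rd, r + 2 * rd)


if __name__ == "__main__":
    print(interval95(1500, 50))  # the example range from the Wikipedia quote
    # An established favorite (low RD) against a much lower-rated opponent:
    win_r, _ = glicko_update(1900, 30, 1500, 100, 1.0)
    loss_r, _ = glicko_update(1900, 30, 1500, 100, 0.0)
    print(f"favorite wins:  {win_r - 1900:+.2f}")
    print(f"favorite loses: {loss_r - 1900:+.2f}")
    # Evenly matched players: a shakier opponent (high RD) is worth less.
    sure_r, _ = glicko_update(1500, 30, 1500, 50, 1.0)
    shaky_r, _ = glicko_update(1500, 30, 1500, 300, 1.0)
    print(f"win vs RD 50:  {sure_r - 1500:+.2f}")
    print(f"win vs RD 300: {shaky_r - 1500:+.2f}")
```

In plain Glicko-1 the expected win still earns a small fraction of a point, so "doesn't budge" in practice presumably means the gain is too small to show up in the displayed integer. The upset loss costs several times more than the win earns, which is the lopsided pressure described above.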

As for the second bullet point where you WIN a game but your rating actually falls by 1 point? That can happen because every single game you play narrows the hidden sigma ("confidence") value for you, which pushes your mu (rating) value a fractional amount in either direction. Because your real underlying mu (rating) value isn't an integer, the small change might cause the real value to move in a direction where it gets rounded down to the next-lower integer. In plain English, it's like the system wasn't really sure whether your rating was 449 or 450. But after playing one more game, the system became slightly more sure that you're probably 449. It didn't matter whether you won or lost. If the system strongly EXPECTED you to win, because, say, the opponent was 150 points below your own rating, it wasn't going to increase your mu (rating) value anyway. But because its "confidence" grew stronger, it still adjusted your mu rating slightly downward EVEN THOUGH YOU WON.
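The rounding part of that explanation is easy to see in a toy example. Everything here is hypothetical: the hidden values, the size of the nudge, and the assumption that the displayed rating is the hidden mu rounded to the nearest integer.

```python
def displayed(hidden_mu: float) -> int:
    """ASSUMPTION: the rating you see is the hidden mu rounded to the nearest integer."""
    return round(hidden_mu)


hidden_before = 449.6   # hypothetical hidden mu before the game
nudge = -0.2            # hypothetical fractional adjustment after the game
hidden_after = hidden_before + nudge

if __name__ == "__main__":
    print(displayed(hidden_before), "->", displayed(hidden_after))
```

The hidden value only moved by a fifth of a point, but the number on screen dropped by a whole point because the hidden value crossed the rounding boundary.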