Rating intransitivity in round robins

BB+ · Post by **BB+** » Mon Jun 02, 2014 7:56 pm

Suppose that you have 30 engines that play a round robin of 100 games against all other opponents (2900 total games each).

There are two main engines, A and B.

A beats everyone but B by 60-40, and loses to B by 30-70, for 1710 points
B beats everyone but A by 58-42, and beats A by 70-30, for 1694 points.

Who should be rated higher by a rating list, A or B? Explain your answer, discussing contempt if you wish.
Similarly for the following situations (and any others you deem relevant):

A beats everyone but B by 75-25, and loses to B by 45-55, for 2145 points
B beats everyone but A by 74-26, and beats A by 55-45, for 2127 points

A beats everyone but B by 74-26, and beats B by 65-35, for 2137 points.
B beats everyone but A by 75-25, and loses to A by 35-65, for 2135 points.

Note that you are essentially determining the A-B relation from playing 2800 games each against opponent(s) C, and 100 games head-to-head. Whether or not this is a valuable weighting is perhaps a philosophical question (and could depend on how much better A and B are than the C's).
For extra credit, in all cases compute the relevant error margins.
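As a quick check of the arithmetic, here is a minimal sketch that recomputes each total as 28 × (score against a C) + (head-to-head score). It also gives a rough standard deviation for each total. The deviation assumes independent games with no draws, which the post does not state; draws would shrink it. The names `total`, `points_sd` and `scenarios` are just for illustration.

```python
import math

OPPONENTS = 28   # the "C" engines, i.e. everyone except A and B
GAMES = 100      # games per pairing

def total(vs_field, head_to_head):
    """Round-robin total: 28 matches against the C's plus the A-B match."""
    return OPPONENTS * vs_field + head_to_head

def points_sd(vs_field, head_to_head):
    """Rough SD of the total in points (assumption: independent games, no draws)."""
    p, q = vs_field / GAMES, head_to_head / GAMES
    return math.sqrt(OPPONENTS * GAMES * p * (1 - p) + GAMES * q * (1 - q))

# (score vs each C, score vs the other main engine), for A and then B
scenarios = {
    "first":  ((60, 30), (58, 70)),
    "second": ((75, 45), (74, 55)),
    "third":  ((74, 65), (75, 35)),
}

totals = {name: (total(*a), total(*b)) for name, (a, b) in scenarios.items()}

for name, (a, b) in scenarios.items():
    ta, tb = totals[name]
    print(f"{name}: A = {ta} (sd ~{points_sd(*a):.1f}), "
          f"B = {tb} (sd ~{points_sd(*b):.1f})")
```

Under these assumptions the deviation of each total is on the order of 25 points, so the A-B gaps in the examples are not large compared with the noise.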

BB+ · Post by **BB+** » Tue Jun 03, 2014 10:30 pm

Incidentally, I might point out that for methods with self-predicting results (e.g. Ordo) on an Elo-like scale (minimally, with a non-decreasing function from rating difference to expected score), scoring more points in a round-robin necessarily gives a higher Elo (though this was not immediately obvious to me). To prove this, first consider the case where A and B score the same number of points. One should then have both r(A)>=r(B) and r(A)<=r(B) (this follows from the self-prediction property, though it requires a bit of work), so engines that score the same number of points in a round-robin will have the same rating, independent of how they do against each other. Then apply continuity. Systems that compute ratings by other methods need not have this property. In particular, if you consider a game against an opponent near in strength to be more important, the results can differ.

Note: here self-predicting means that, for each player, the ratings satisfy \sum_{games} Expectation(EloDiff+colour_adjustments) = observed number of points (up to some numerical epsilon if one computes the ratings iteratively; the existence of such ratings follows theoretically from a multi-dimensional root-finding theorem). So one could argue that any intransitivity really comes from a departure from the Elo model (which indeed is a valid consideration: do results in real life depend only on the EloDiff and not on the particularity of the players and their style?).
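To make the self-prediction property concrete, here is a minimal sketch that solves for such ratings in the first scenario of the thread (A: 60-40 against the field and 30-70 against B; B: 58-42 against the field). Several things here are assumptions and not from the posts: the logistic expectation curve, no colour adjustment, every C-vs-C match ending 50-50, and the use of a standard Bradley-Terry fixed-point iteration. This is not Ordo's actual algorithm, and the function names are made up for the example.

```python
import math

def expected(diff):
    """Logistic Elo expectation for a rating difference `diff` (assumed model)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def self_predicting_ratings(points, iters=5000, tol=1e-12):
    """points[i][j] = points i scored against j.
    Returns mean-zero ratings whose summed expectations match each total."""
    n = len(points)
    games = [[points[i][j] + points[j][i] for j in range(n)] for i in range(n)]
    total = [sum(row) for row in points]
    gamma = [1.0] * n                      # strengths, rating = 400*log10(gamma)
    for _ in range(iters):
        new = [total[i] / sum(games[i][j] / (gamma[i] + gamma[j])
                              for j in range(n) if j != i)
               for i in range(n)]
        # Fix the free additive constant: geometric mean 1, i.e. mean rating 0.
        scale = math.exp(sum(math.log(g) for g in new) / n)
        new = [g / scale for g in new]
        converged = max(abs(a - b) for a, b in zip(new, gamma)) < tol
        gamma = new
        if converged:
            break
    return [400.0 * math.log10(g) for g in gamma]

N, A, B = 30, 0, 1
# Assumption: all C-vs-C matches end 50-50.
points = [[50.0 if i != j else 0.0 for j in range(N)] for i in range(N)]
for c in range(2, N):
    points[A][c], points[c][A] = 60.0, 40.0
    points[B][c], points[c][B] = 58.0, 42.0
points[A][B], points[B][A] = 30.0, 70.0

ratings = self_predicting_ratings(points)
predicted_A = sum(100.0 * expected(ratings[A] - ratings[j])
                  for j in range(N) if j != A)

print(f"r(A) - r(B) = {ratings[A] - ratings[B]:.2f} Elo")
print(f"A: observed {sum(points[A]):.0f} points, predicted {predicted_A:.3f}")
```

With these inputs A comes out above B despite losing the head-to-head 30-70, because A has the larger total.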

BB+ · Post by **BB+** » Wed Jun 04, 2014 10:08 am

Proving NOT( r(A)>r(B) ) when A and B score the same number of points is actually quite easy. Suppose r(A)>r(B). Then the EloDiff for A will be bigger than that for B against every opponent (including the cross-term where A plays B), so the expectation of A would be larger than that of B, a contradiction if expectations=results. Then switch the roles of A and B, and conclude r(A)=r(B) when they get the same number of points.
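Written out, the argument looks roughly like this. The notation is my own: S_A and S_B are the total points, n is the number of games per pairing, and E is the expectation function. I am also assuming no colour adjustment and that E is strictly increasing (as the logistic curve is); with a merely non-decreasing E the final inequality would only be non-strict.

```latex
% Self-prediction: each total equals the sum of expectations.
\begin{align*}
S_A &= n\,E(r_A - r_B) + n \sum_{C \neq A,B} E(r_A - r_C), \\
S_B &= n\,E(r_B - r_A) + n \sum_{C \neq A,B} E(r_B - r_C).
\end{align*}
% If r_A > r_B, then r_A - r_C > r_B - r_C for every C,
% and r_A - r_B > 0 > r_B - r_A, so every bracket below is positive:
\[
S_A - S_B
  = n\bigl[E(r_A - r_B) - E(r_B - r_A)\bigr]
  + n \sum_{C \neq A,B} \bigl[E(r_A - r_C) - E(r_B - r_C)\bigr] > 0,
\]
% which contradicts S_A = S_B.
```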

Also, one needs to ensure equal colours if a colour adjustment is used, else the true "Round-Robin" property (same opponents and situations) is not kept.

But I still dispute on a philosophical level whether beating the "weakies" should be weighted the same as the head-to-head match-up (cf. Fischer/Spassky in Mar del Plata 1960).

hyatt · Post by **hyatt** » Thu Jun 05, 2014 6:39 pm

I wonder what kind of instability that would introduce to a player's rating? I.e., if you count the strong opponent more, would the rating bounce around and be less stable overall? There is something to be said for a stable number. And for computer programs, it might be even more complex. For example, in my cluster testing, would improving my overall score vs Stockfish by 5% be a good trade for hurting my overall score against weaker opponents by 10%?

BB+ · Post by **BB+** » Sun Jun 08, 2014 4:52 pm

hyatt wrote: "if you count the strong opponent more, would the rating bounce around and be less stable overall?"

This is not a direct answer, but... Supposing a 2-result model, the per-game deviation when the superior player scores p points on average is about sqrt(p*(1-p)). In particular, if you played 10000 games and A scored 60% against B, the rating edge would be about 70.4 Elo with a deviation of about 3.55 Elo. If A scored 90% against B, it would be 381.7 Elo with a deviation of around 5.8 Elo. A small table is below. This assumes the logistic model, which itself might be faulty at larger differentials (not to mention the necessity of adding draws).

Deviation after 10K games in no-draws model under logistic distribution.

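| Score | Elo edge | Deviation (Elo) |
|-------|----------|-----------------|
| 50%   | 0.00     | 3.47            |
| 60%   | 70.44    | 3.55            |
| 70%   | 147.2    | 3.79            |
| 80%   | 240.8    | 4.34            |
| 90%   | 381.7    | 5.79            |
| 95%   | 511.5    | 7.98            |
| 99%   | 798.3    | 17.52           |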
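For anyone who wants to check or extend the figures above, here is a small sketch. The Elo edge is the usual logistic conversion 400*log10(p/(1-p)). For the deviation, the post does not say exactly how it was computed; taking half the spread between the Elo values at p+sigma and p-sigma, with sigma = sqrt(p*(1-p)/N), appears to match the listed numbers to within rounding, so that is what is assumed here. The function names are my own.

```python
import math

def elo_from_score(p):
    """Elo edge implied by an expected score p under the logistic model."""
    return 400.0 * math.log10(p / (1.0 - p))

def elo_deviation(p, games):
    """Approximate Elo deviation after `games` no-draw games.
    Assumption: half the Elo spread between p + sigma and p - sigma."""
    sigma = math.sqrt(p * (1.0 - p) / games)   # SD of the observed score
    return (elo_from_score(p + sigma) - elo_from_score(p - sigma)) / 2.0

table = {p: (elo_from_score(p), elo_deviation(p, 10_000))
         for p in (0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99)}

for p, (edge, dev) in table.items():
    print(f"{p:.0%}  {edge:7.2f}  {dev:6.2f}")
```

The deviation scales as 1/sqrt(N), so with 100 games instead of 10000 every deviation is roughly ten times larger.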

OpenChess
