On "clone testing"

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

On "clone testing"

Post by BB+ » Mon Dec 27, 2010 1:59 am

From TalkChess (Don Dailey):
I created a utility called similar which measures how different one chess program is from others. It does this by running 2000 positions from random games and noting how often the moves agree, and as output returns the percentage of moves that match.
Since this has been in the works for some time, I've had ample time to prepare my criticisms. :D I will try to leave the semantics aside (though calling it a "clone tester" cries out for nuance), and stick with scientific observations. I must say that I would find such a tool to be valuable if it is done in a scientifically proper manner, and its results parsed according to their proper scope.

Firstly, I would say that the utility measures how much the best-move output of one chess program differs from that of another. It is a different question to say how "similar" this makes them, as that seems to be a word with many possible meanings. Indeed, it seems almost tautological to say that "clone testing" (or derivative, if you prefer) is better performed by an actual examination of the executables, though perhaps this is thought too time-consuming (or for the future "rental engines", maybe it is impossible). However, the utility does serve a useful purpose if its output has nonzero correlation with clones and/or derivatives.

The first problem I have with much of the discussion is that no sense of statistical error is ever mentioned. For instance, running a 1000-position suite should give a 95% confidence interval of only plus/minus 30 positions. This is fairly easily remedied simply by appending the additional maths. :) In particular, "false positives" should appear rather frequently in a large enough pool, and robust methods to minimise their impact should be used (the numbers seem largely to be in the 550-650 range for random data, and 650-700 for semi-correlated). I can't say I am particularly enamoured by the use of techniques seen in biometry to draw putative hierarchical relationships either.
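For concreteness, here is a minimal sketch (mine, not part of the utility) of the normal-approximation binomial interval behind that plus/minus 30 figure; the function name is made up:

Code: Select all

import math

def match_count_ci(matches, n=1000, z=1.96):
    # Normal approximation to the binomial: 95% half-width, measured in positions.
    p = matches / float(n)
    half = z * math.sqrt(p * (1.0 - p) * n)
    return matches - half, matches + half

# e.g. 600 matching moves out of 1000 comes back as roughly 600 +/- 30
print(match_count_ci(600))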

Another problem is strength conflation, that is, two engines will play similar moves simply because there actually is a "best" move, and suitably strong engines will all agree. This effect is rather hard to measure, and always seems to be in the background. By way of contrast, Toby Tal was found to be a clone (or at least to share a move generator) by giving it a battery of ten mate-in-1 positions with multiple solutions, and seeing an exact match with RobboLito (or something in that family). Here is one possible way to take a first whack at the effect of strength. First test (say) 15 engines at 0.1s per move, getting 105 pairwise measurements. Then do the same at 1.0s per move. As engines should play stronger at 1s per move, presumably the typical overlap (among the 105 comparisons) should be greater. By how much is it? A little or a lot?
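A minimal sketch of the bookkeeping for such an experiment (the names and data layout are mine; it assumes the per-position move choices have already been collected for each engine at a given movetime):

Code: Select all

from itertools import combinations

def pairwise_overlaps(best_moves):
    # best_moves: dict of engine name -> list of chosen moves, one per test position.
    # Returns a dict of (engineA, engineB) -> number of positions on which they agree.
    overlaps = {}
    for a, b in combinations(sorted(best_moves), 2):
        overlaps[(a, b)] = sum(ma == mb for ma, mb in zip(best_moves[a], best_moves[b]))
    return overlaps

# Collect best_moves once at 0.1s/move and once at 1.0s/move for 15 engines
# (C(15,2) = 105 pairs each time), then compare the two distributions of overlaps.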

A third critique involves self-validation, or perhaps more generally what could be called playing style. For instance, comparing Engine X at 0.1s to itself at 1.0s is said to be a way of showing that the utility detects not strength but style, as the correlation factor is still typically quite high. Whether or not this holds for a variety of engines (those deemed "tactical" versus "positional"; or perhaps engines using MTD(f) simply change their minds more or less often than those using PVS) remains to be seen. I guess I am not so prone to agree with the statement: "I believed [...] it is far more difficult to make it play significantly different moves without making it weaker."

Finally, as noted above, the question of "move selection" versus "similar ideas" (in the sense of intellectual property) is not really resolved, as one can use many of the "same ideas" with different numerology, and get notably different play. It all depends on how much weighting you give in your sense of "clone" to the concept of the "feature set" of an evaluation function as opposed to merely the specific numerical values therein.

The prospective difficulties of drawing conclusions from these methods are seen in:
It looks to me that after Rybka 1.0 the program changed very substantially. From this I would assume he completely rewrote the program, and certainly the evaluation function.
Au contraire, a disassembly of the Rybka 2.3.2a evaluation function will show much of it to be still quite Fruit-like in its framework, with only two or three minor variations in the features from Rybka 1.0 Beta. The PST is slightly more tweaked, but my impression is that almost all the substantive changes from Rybka 1.0 Beta until LK's work with Rybka 3 were in the search (and some tuning of eval weightings, PST, and material imbalances). [Perhaps the fact that Rybka 1.0 Beta used lazy eval way too often due to a mismatch with 3399 vs 100 scalings might also play a rôle here].

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: On "clone testing"

Post by BB+ » Mon Dec 27, 2010 11:20 pm

So I'm slowly (re)doing my own data, using my own implementation. Here are some initial problems. Firstly, "go movetime 100" does not work for some engines, such as Rybka 3. I found this out when a test I expected to take ~2 minutes for 1000 positions took 5 minutes instead. :!: Indeed, Rybka 3 seems ornery when the movetime is less than a second. For instance, in the initial position a "go movetime 100" took 0.223s for me.
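For what it's worth, the sort of harness I use to time this looks roughly like the sketch below (the engine path is a placeholder, and it assumes a standard UCI engine speaking on stdin/stdout):

Code: Select all

import subprocess, time

def measure_movetime(engine_path, movetime_ms=100):
    # Start a UCI engine and time how long "go movetime <ms>" actually takes.
    p = subprocess.Popen([engine_path], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                         universal_newlines=True, bufsize=1)
    def send(cmd):
        p.stdin.write(cmd + "\n")
        p.stdin.flush()
    send("uci")
    while "uciok" not in p.stdout.readline():
        pass
    send("isready")
    while "readyok" not in p.stdout.readline():
        pass
    send("position startpos")
    start = time.time()
    send("go movetime %d" % movetime_ms)
    while not p.stdout.readline().startswith("bestmove"):
        pass
    elapsed = time.time() - start
    send("quit")
    return elapsed

# e.g. measure_movetime("./some_engine") reporting ~0.22s rather than 0.1s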

Here are some data at 100ms and 1s.

Code: Select all

IH48v.out1.1s      0 689 657 682 646 596 
R3.out1.1s       689   0 636 653 772 595 
SF191.out1.1s    657 636   0 615 615 782 
IH48v.out1.100ms 682 653 615   0 655 593 
R3.out1.100ms    646 772 615 655   0 604 
SF191.out1.100ms 596 595 782 593 604   0 
So the "self-correlation" of 10x the time of Rybka 3 is 772 [recalling the above, that "100ms" for Rybka 3 seems not possible], that of Stockfish 1.9.1 is 782, while IvanHoe 48v is only at 682. This already is a bit odd, but maybe it means that IvanHoe changes its mind at low(er) depths more readily. This could have to do with aspiration windows, or a number of things. It could just be drossy data. :)

At 100ms, Stockfish is lower than 600 against the other two, but is above 625 at 1s. Again this could be statistical variance, or maybe by 1s they are more likely to agree on the best move. :) As Don Dailey originally said, he would prefer fast(er) searches, as they tend to stress eval more.

Also, after a bit of testing (not shown here), I tend to agree with him that (in general) contorting the search is not sufficient to fool the tester.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: On "clone testing"

Post by BB+ » Tue Dec 28, 2010 1:10 am

Au contraire, a disassembly of the Rybka 2.3.2a evaluation function will show much of it to be still quite Fruit-like in its framework, with only two or three minor variations in the features from Rybka 1.0 Beta.
Since this comment of mine seems to be in a bit of dispute, I attach a briefly annotated version of the Rybka 2.3.2a evaluation disassembly. I made some attempt to make it semi-readable, with suitable cross-references to register assignments. I didn't re-order instructions when possible, though that often can help readability. I didn't proof-check all my comments, and elided things that were less relevant (like storing stuff in the DYN structure). As I've noted elsewhere, there was a rather notable attempt to obfuscate/hide the evaluation function (with "dummy" eval functions around) -- I determined that this is the one that is used via the use of breakpoints in a debugger. [There's still the possibility I am fooled somehow, though I count 8 uses of prefetch in my assembly dump of the whole code, and fully expect the evaluation function to be one of these]. I have the PST around somewhere, but not at my fingertips.

To reiterate, the main differences from Rybka 1.0 Beta in the framework are the additional feature of pawn anti-mobility and the elimination of knight mobility. The pawn endgame code is also new (I only skimmed it in the attachment), as are various underlying things (like the use of prefetch). Everything else is largely the same as in Rybka 1.0 Beta (and thus Fruit 2.1). As for numerology, the king attack values (941, 418, 666, 532) are all the same, while the mobility bonuses are slightly varied. I didn't go through the pawneval code particularly closely, though the "drawishness" consideration from pawn formations is one definite difference there.
Attachments
Ryb232eval.txt
988 lines of disassembly dump for the evaluation function of Rybka 2.3.2a
(40.42 KiB) Downloaded 349 times

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: On "clone testing"

Post by BB+ » Tue Dec 28, 2010 1:24 am

I might also note that "go movetime" is just totally broken in Rybka 1.0 Beta in any event (recall ZW mentioning this, with a factor of 1000 missing), and as it seems dodgy when the value is too small for later Rybka versions, the value of DD's test for Rybka X vs Rybka Y might be diminished. OTOH, he seems to think that allotted time is not the dominant factor.

Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: On "clone testing"

Post by Sentinel » Tue Dec 28, 2010 2:34 am

Great contribution BB.
One thing about the strength part.
If the test were performed with an engine playing perfect moves against all the other engines, the result (if all the representative positions of a chess game are included) would be nothing but an Elo table.
When noise, time controls, and all the other imperfections are removed, what remains is a "similarity" component and a strength component.
The stronger the engines you test, the more weight is on the strength component.
It is of course impossible to measure the exact influence, but my "feeling" is that for engines from the top-5 group, the strength component is more than overwhelming.

Regarding the time-control noise and the strength component, there is a way to remove them.
My suggestion would be to use fixed-depth controls in the following way (it's very tedious and impractical, but I can't help it):
First determine (by playing a bunch of fixed-depth games) the fixed-depth difference which makes the engines score close to 50%, and do this for various depths (the more the merrier, but something like at least 5 depth pairs should be tested) and for all engine pairs. E.g. engine A (depth 13) vs. engine B (depth 15) is roughly equal, then engine A (depth 17) vs. engine B (depth 20), etc. It's more important that, averaged over all the depths, we get close to 50% (maybe +/- 2-3%) than to get exactly 50% at each depth, since that is practically impossible.
Then test each pair at the various depths on all the positions and average the results (a sketch of the averaging step follows below).
In that way you would simultaneously test eval and search, avoid the strength component, and avoid some of the noise components, since an engine playing against itself would always give a 100% match.
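A minimal sketch of the averaging step (the depth pairs and numbers are hypothetical; it assumes the match counts for each calibrated depth pair have already been measured):

Code: Select all

def averaged_match_rate(per_depth_results):
    # per_depth_results: list of (depth_a, depth_b, matches, positions) for one engine
    # pair, where each (depth_a, depth_b) was pre-calibrated to score close to 50%.
    rates = [float(m) / n for (_, _, m, n) in per_depth_results]
    return sum(rates) / len(rates)

# Hypothetical calibrated depth pairs for engines A vs. B:
example = [(13, 15, 640, 1000), (17, 20, 655, 1000), (21, 25, 662, 1000)]
print(averaged_match_rate(example))  # one "similarity" number for the pair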

The way it is at the moment, Don's tool/test is just interesting for playing around with, but scientifically it is completely useless.

Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: On "clone testing"

Post by Sentinel » Tue Dec 28, 2010 2:58 am

BB+ wrote:So the "self-correlation" of 10x the time of Rybka 3 is 772 [recalling the above, that "100ms" for Rybka 3 seems not possible], that of Stockfish 1.9.1 is 782, while IvanHoe 48v is only at 682. This already is a bit odd, but maybe it means that IvanHoe changes its mind at low(er) depths more readily. This could have to do with aspiration windows, or a number of things. It could just be drossy data. :)
You omitted a very important piece of info by putting 0 for exact self-play (self-play at the same TC). Only with those numbers can we say anything more accurate about mind changing with depth.
Suppose Rybka 3's exact self-play score is 800 (I doubt it's much higher). This would give a 28-position difference for a 10x time change. Supposing the branching factor is about 2, this would be about 3.3 plies of extra depth for 1s vs. 100ms on average. So a 2.8% change over 3.3 plies, which gives about a 0.85% average move change per added ply. Not much, and certainly well below the 10% which Bob claimed for Crafty many times.
OK, on some occasions the search returns to the old best move after 2 more iterations, but still the percentage seems too low.
What would be interesting is to repeat the same test with fixed depths D and D+1 (for the same engine) and see the difference. That would give exactly the percentage of "engine mind changing". And to repeat this at various depths to see whether or not there is a trend of search stabilization with higher depths. Maybe even plot a "search stabilization" curve.
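Spelling out the arithmetic above (the 800 baseline is an assumption, as stated; the 772 is the measured 1s vs. 100ms figure from the table):

Code: Select all

import math

baseline = 800                    # assumed exact self-play matches out of 1000
cross_tc = 772                    # measured 1s vs. 100ms matches for Rybka 3
extra_plies = math.log(10, 2)     # 10x time at an effective branching factor of ~2

changed = (baseline - cross_tc) / 1000.0   # ~2.8% of positions change move
per_ply = changed / extra_plies            # roughly 0.85% move change per added ply
print(round(extra_plies, 1), changed, round(100 * per_ply, 2))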

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: On "clone testing"

Post by BB+ » Tue Dec 28, 2010 3:04 am

Suppose Rybka 3's exact self-play score is 800 (I doubt it's much higher). This would give a 28-position difference for a 10x time change. Supposing the branching factor is about 2, this would be about 3.3 plies of extra depth for 1s vs. 100ms on average. So a 2.8% change over 3.3 plies, which gives about a 0.85% average move change per added ply. Not much, and certainly well below the 10% which Bob claimed for Crafty many times.
I agree that "self-play" (re-sampling the same positions, relying on random factors in timing, or SMP if you test with it on) should be included somehow, if for nothing else than as a baseline measurement. I seem to recall it was closer to 900 the last time I did this. I think your comparison to Crafty's "mind-changing" is wrong in any case, as there is no guarantee that the 772 is a subset of the 800, so the difference is likely more.

EDIT: I guess I am wrong. I did two separate runs with the latest IH at 100ms per move, and the same move was only played 714 of 1000 times. :!: I will check to see if something broke.

Another factor to worry about is SMP. I did it all in single cpu mode, though some engines like to make the default be SMP, and you have to make sure they are really in single cpu mode.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: On "clone testing"

Post by BB+ » Tue Dec 28, 2010 3:33 am

EDIT: I guess I am wrong. I did two separate runs with the latest IH at 100ms per move, and the same move was only played 714 of 1000 times. :!: I will check to see if something broke.
No, it seems that IH just behaves like this. I got 731 and 747 matches in a third run. Stockfish, on the other hand, managed 994 matches, a model of stability. ;) Maybe it has to do with how hash is overwritten or something, or how I/O is processed with time slicing. Yet another thing to worry about when doing comparisons... :cry: [As many experimentalists know, the first task is to ensure you understand how the equipment works before starting to generate and analyse data].

EDIT: I'm beginning to get dizzy over all the things that can go wrong in such a simple test. Stockfish seems not to regard "movetime" all that strictly either:

Code: Select all

setoption name Threads value 1
go movetime 100
[...]
info nodes 143179 nps 1154669 time 124

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: On "clone testing"

Post by BB+ » Tue Dec 28, 2010 3:46 am

If my understanding is correct, the "go movetime 100" stability has to do with Stockfish polling only once every 30000 nodes, while IH does so every 4K nodes. This makes SF much more stable in short searches (in the case here, there is a nice point at which to break between 120000 and 150000 nodes, and it usually hits exactly that). Perhaps this is the sort of thing that should be "immediately obvious", as it were.
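A toy illustration of the quantisation effect (this is not how either engine's search is actually written; the polling intervals are just the figures mentioned above):

Code: Select all

import time

def toy_search(movetime_s, poll_interval):
    # Count "nodes" and only look at the clock every poll_interval of them,
    # so the search can only stop on a poll_interval boundary.
    start, nodes = time.time(), 0
    while True:
        nodes += 1
        if nodes % poll_interval == 0 and time.time() - start >= movetime_s:
            break
    return nodes

# With poll_interval = 30000 a ~100ms search nearly always stops on the same
# coarse boundary; with 4096 the cut-off sits on a much finer grid, so small
# timing jitter moves it (and occasionally the chosen move) between runs.
print(toy_search(0.1, 30000), toy_search(0.1, 4096))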
Last edited by BB+ on Tue Dec 28, 2010 3:50 am, edited 1 time in total.

Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: On "clone testing"

Post by Sentinel » Tue Dec 28, 2010 3:47 am

BB+ wrote:
Suppose Rybka 3's exact self-play score is 800 (I doubt it's much higher). This would give a 28-position difference for a 10x time change. Supposing the branching factor is about 2, this would be about 3.3 plies of extra depth for 1s vs. 100ms on average. So a 2.8% change over 3.3 plies, which gives about a 0.85% average move change per added ply. Not much, and certainly well below the 10% which Bob claimed for Crafty many times.
I agree that "self-play" (re-sampling the same positions, relying on random factors in timing, or SMP if you test with it on) should be included somehow, if for nothing else than as a baseline measurement. I seem to recall it was closer to 900 the last time I did this. I think your comparison to Crafty's "mind-changing" is wrong in any case, as there is no guarantee that the 772 is a subset of the 800, so the difference is likely more.

EDIT: I guess I am wrong. I did two separate runs with the latest IH at 100ms per move, and the same move was only played 714 of 1000 times. :!: I will check to see if something broke.

Another factor to worry about is SMP. I did it all in single cpu mode, though some engines like to make the default be SMP, and you have to make sure they are really in single cpu mode.
If you include SMP then it really becomes messy, since the "entropy" of SMP is way higher than the "entropy" of the TC.
Fixed-depth tests are the key, since they remove that uncertainty.
Regarding measuring "mind changing" as a difference in the case of fixed-time TCs, you are correct that it's very imprecise. The base number also changes, so calculating the difference is hard.
The base number at 1s would be 800, while at 100ms it would be only, for example, 785 (a smaller number is logical, since at shorter TCs the impact of OS scheduling is far more serious). Now, how to interpret the 772 between 1s and 100ms is just questionable.
