Some experimental "similarity" data

Code, algorithms, languages, construction...
BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Some experimental "similarity" data

Post by BB+ » Wed Dec 29, 2010 9:07 pm

I've decided to stick this in a more technical subforum.

Here are the results of my experiment in "bestmove" matching, à la Don Dailey. I used fixed depth for a variety of reasons, notably that some engines screw up movetime, while others have polling behaviour that can sully any data in fast searches (with SMP another worry). While I much prefer the reproducibility of 1-cpu fixed-depth searches, I don't think this should be seen as a great advance in "scientific" methodology.

I formed a suite of 8306 positions. I did this by taking a few hundred games, pruning all opening/endgame positions, then pruning those which had a move that was more than +0.25 (in a 1.0s search) above all others, and those for which the eval was more than 2.00 in absolute value. Whether this is a good method to generate positions is an open matter, but hopefully it gives some control over strength issues.
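For concreteness, here is a minimal sketch of such a filter (this is not the code actually used; the 0.25 and 2.00 thresholds are from the description above, while the opening/endgame cutoffs by move number and piece count are purely illustrative assumptions):

Code:

#include <math.h>

/* Sketch of the pruning filter (assumed form, not the actual code).
   Scores are in pawns, from a 1.0s multi-PV search of the position. */
static int keep_position(double best, double second_best, double eval,
                         int move_number, int piece_count)
{
    if (move_number < 12 || piece_count <= 10)  /* assumed opening/endgame cut */
        return 0;
    if (best - second_best > 0.25)              /* one move clearly best */
        return 0;
    if (fabs(eval) > 2.00)                      /* position too lopsided */
        return 0;
    return 1;
}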

Then I tested 10 engines at various depths. The determination of the proper "depth" is not a science, but I intended for each run to take between 2 and 4 hours (about 1 second per move). I restricted myself to the engine families listed below, as with them I understand how to ensure that the data obtained are what is desired. [With some non-negligible but feasible effort, I could also completely isolate the evaluate() function in each of these if desired, so as to see whether "bestmove" correlation and evaluate() correlation are themselves correlated.]

Here are the bestmove-matching data:

Code:

                 FR10 FR21 IH47 Ryb1 Ry12 R232 Ryb3 Gla2 SF15 SF19  Time
FR10.at.dp9         0 3920 3290 3529 3600 3581 3381 3876 3611 3528  3:36
FR21.at.dp10     3920    0 3927 4551 4478 4436 4064 4330 4248 4127  4:06
IH47c.at.dp15    3290 3927    0 4333 4423 4641 4921 3885 4370 4411  3:09
R1.at.dp10       3529 4551 4333    0 5523 5259 4552 4264 4408 4283  2:45
R12.at.dp11      3600 4478 4423 5523    0 5464 4638 4272 4468 4379  3:18
R232.at.dp11     3581 4436 4641 5259 5464    0 4840 4206 4454 4378  3:21
R3.at.dp10       3381 4064 4921 4552 4638 4840    0 4057 4434 4380  2:51
GL2.at.dp12      3876 4330 3885 4264 4272 4206 4057    0 4735 4365  2:41
SF151.at.dp13    3611 4248 4370 4408 4468 4454 4434 4735    0 5238  3:57
SF191.at.dp14    3528 4127 4411 4283 4379 4378 4380 4365 5238    0  2:35
The table has the nice feature that it confirms all "known" interactions (and/or preconceived notions). For instance, Rybka 1 through Rybka 2.3.2a match each other remarkably well, while Rybka 3 still shares a lot with them, and IvanHoe in turn shares much with R3. The Fruit 2.1 overlap with the earlier Rybkas also seems apparent (though not quite so pronounced), and this looks to disappear in R3. The 95% confidence interval for any given correlation should be about ±100, so that (for instance) the Glaurung 2 correlation with Fruit 2.1 at 4330 is distinctly less than the Rybka 1.0 Beta correlations with Fruit 2.1 and Rybka 3, at 4551 and 4552 respectively. Again I note that it is not all that clear that strength issues have been adequately addressed. Fruit 2.1 did see a complete (re)write of the Fruit 1.0 eval function, but it still seems that Fruit 1.0 might be too weak to correlate well.
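As a rough sanity check on that ±100 figure: if each of the 8306 positions were an independent match/no-match trial with match probability near the worst case p ≈ 0.5, the standard deviation of the match count would be

\[
\sigma = \sqrt{n\,p\,(1-p)} = \sqrt{8306 \cdot 0.5 \cdot 0.5} \approx 45.6,
\qquad 1.96\,\sigma \approx 89 .
\]

So ±100 is a slightly conservative rounding of the 95% interval under this simple binomial model.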

All data and programmes are in the attached 7zip archive, which is in a semi-usable form (for instance, I #define things to be 8306 in the C code, to concord with the data size). The DEPTH needs to be given at compile time, while the engine name can be given as a command-line option. As noted, the correlation data I obtained should be entirely reproducible, though it would likely be more useful to run a similar experiment on a different set of positions (possibly pruned as above). I would usually run this via commands like:

Code:

gcc -O3 -DDEPTH=\"11\" -o bestmove bestmove.c
time ./bestmove LINKS/Ryb232 < PRUNE.LIST > R232.at.dp11 &
[...]
./compare FR10.at.dp9 FR21.at.dp10 IH47c.at.dp15 R1.at.dp10 R12.at.dp11 \
          R232.at.dp11 R3.at.dp10 GL2.at.dp12 SF151.at.dp13 SF191.at.dp14
where LINKS is a (sub)directory with links to the engines in question.
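For reference, the pairwise comparison amounts to counting positions where two engines chose the same bestmove. Here is a minimal sketch of that step (not the compare.c from the archive), assuming each output file holds one bestmove string per line, in the same position order:

Code:

#include <stdio.h>
#include <string.h>

#define NPOS 8306           /* number of positions, matching the #define in the post */
#define MAXMOVE 16

int main(int argc, char **argv)
{
    static char moves[2][NPOS][MAXMOVE];
    if (argc != 3) {
        fprintf(stderr, "usage: %s file1 file2\n", argv[0]);
        return 1;
    }
    for (int f = 0; f < 2; f++) {
        FILE *fp = fopen(argv[f + 1], "r");
        if (!fp) { perror(argv[f + 1]); return 1; }
        for (int i = 0; i < NPOS; i++)
            if (fscanf(fp, "%15s", moves[f][i]) != 1) {
                fprintf(stderr, "short file: %s\n", argv[f + 1]);
                return 1;
            }
        fclose(fp);
    }
    int match = 0;
    for (int i = 0; i < NPOS; i++)
        if (strcmp(moves[0][i], moves[1][i]) == 0)
            match++;
    printf("%d of %d bestmoves match\n", match, NPOS);
    return 0;
}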
Attachments
EXPERIMENT.7z
Experimental data for 8306 positions and 10 engines.
13 files, one for each engine, one PRUNE.LIST with all 8306 positions, and 2 C programmes, bestmove.c and compare.c
(194.73 KiB) Downloaded 214 times

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Some experimental "similarity" data

Post by BB+ » Wed Dec 29, 2010 9:26 pm

I did this by taking a few hundred games, pruning all opening/endgame positions, then pruning those which had a move that was more than +0.25 (in a 1.0s search) above all others, and those for which the eval was more than 2.00 in absolute value. Whether this is a good method to generate positions is an open matter, but hopefully it gives some control over strength issues.
I might point out that there are still a number of positions (998, or 12%) where all 10 engines agree. Usually these are something like: a queen is attacked by an opposing piece, and can move to various squares -- all engines agree that square X is best, even though square Y is not more than 0.25 behind. Also, having all 10 engines agree "randomly" on a given move is not surprising, considering that many are related.
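To put a rough number on that: with a typical pairwise match rate of about 4400/8306 ≈ 0.53, a crude independence model (each of the other nine engines matching the first with probability 0.53, independently) would predict only about

\[
8306 \times 0.53^{9} \approx 8306 \times 0.0033 \approx 27
\]

unanimous positions, far below the observed 998 -- consistent with the relatedness of the engines (and with the "obvious reply" positions just described).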

kingliveson
Posts: 1388
Joined: Thu Jun 10, 2010 1:22 am
Real Name: Franklin Titus
Location: 28°32'1"N 81°22'33"W

Re: Some experimental "similarity" data

Post by kingliveson » Wed Dec 29, 2010 11:46 pm

BB+ wrote: Here are the bestmove-matching data: [...]
[embedded image]
PAWN : Knight >> Bishop >> Rook >>Queen

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Some experimental "similarity" data

Post by BB+ » Wed Dec 29, 2010 11:56 pm

Oh right, the diagonal should have 8306's on it, as Sentinel demands. :) I intend to add a few more engines to this, but one problem with older engines (Pepito, Faile, Phalanx) is figuring out what a good running time/depth is, due to the strength issue. Houdini, Komodo, and Critter will also likely be added. I found some early Glaurungs to throw in also.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Some experimental "similarity" data

Post by BB+ » Thu Dec 30, 2010 1:37 am

Another issue, perhaps an aside for the "similarity" test but a topic that shouldn't be ignored in general, is the great differential in depth seen between the engines. Even if we add 3 to that of Rybka 3 for proper normalisation, its "typical" depth in a 1-second search is nearly 2 ply below that of IvanHoe, to say nothing of Fruit 1.0, which languishes more than 6 ply behind. If there were no conflating factors, I would guess the R3/IH distinction to be almost a 100 Elo gain, so maybe half of that "speed improvement" remains when move quality is folded in. Although search depth is not a panacea, reaching large depths without tossing out too many "interesting" moves really seems (to me at least) to be what separates merely "good" engines from those at the top [I'm sure VD with his +300 Elo from eval-tuning would claim I'm all wet, but]. Of course, having a robust (or maybe stable) eval can also expedite the finding of search improvements, so the two are not entirely separable.

As I've stated elsewhere, my (partially educated) guess for the ~230-Elo gain from Rybka 1 to Rybka 3 is that ~150 of it came from search (much of which I would term "statistical pruning", for lack of a better term), and the rest from LK's eval improvements (and maybe 20 or so from bug fixes). He had a variety of estimates (some a few months before the R3 release, and some closer to when it occurred), but I think it was almost 100 Elo in fixed depth matches from R2.3.2a to R3, and the extra cpu work slows it down enough that 60-70 is probably a better (still wild) guess for eval gains from R1 to R3. Osipov concluded the material imbalance table alone was of similar worth.

Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: Some experimental "similarity" data

Post by Sentinel » Thu Dec 30, 2010 2:14 am

BB+ wrote: Another issue, perhaps an aside for the "similarity" test but a topic that shouldn't be ignored in general, is the great differential in depth seen between the engines. [...]
Null-move reduction is typically 1.5 plies bigger in IH than in R3. Moreover, LMR is also more aggressive. SF, for example (specifically 1.8), has crazy strong forward pruning, much stronger than in any other engine.
VD's claim is really funny, since removing the eval completely, by using a lazy one (so material imbalance only) everywhere except in PV nodes, costs only about 100 Elo.
As I've stated elsewhere, my (partially educated) guess for the ~230-Elo gain from Rybka 1 to Rybka 3 is that ~150 of it came from search [...] Osipov concluded the material imbalance table alone was of similar worth.
Osipov's guess (test) was an overestimate. Moreover, there is one more factor people easily miss: the contempt factor. This alone gives R3 over 30 Elo. It was removed in R4 (Vas probably wanted a better head-to-head ratio against the Ippo family by sacrificing some rating). However, as can be seen from careful adjustment (as this guy is doing with experimental versions), you can gain quite a bit against weak opponents.
Regarding material values, going from Larry's basic ones (325, 325, 500, 975, 50) without any phase adjustment to the full set of what R3 has gives less than 50 Elo.
Btw, VD is right about the material imbalance tuning. Larry gave the ideas; however, the values in R3 are completely different from what Larry suggested, and some really heavy automated tuning was done.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Some experimental "similarity" data

Post by BB+ » Thu Dec 30, 2010 2:22 am

Osipov's guess (test) was an overestimate. [...] Regarding material values, going from Larry's basic ones (325, 325, 500, 975, 50) without any phase adjustment to the full set of what R3 has gives less than 50 Elo.
I had almost written a parenthetical [but I haven't tested this myself] regarding Osipov's claim, but figured I would just do it tonight at hyper-bullet, and report back if my results differ. Of course, "68 Elo" is (quantitatively) meaningless in any event without the conditions given.

Don
Posts: 42
Joined: Thu Dec 30, 2010 12:28 am
Real Name: Don Dailey

Re: Some experimental "similarity" data

Post by Don » Thu Dec 30, 2010 10:09 pm

I have also updated my similarity tester, which you can find on the Komodo web site.

It has 10,000 positions, minus all the positions every program agrees on, which is nearly 2000 positions. It is configurable, so that you can set threads to 1 and experiment with other UCI options if you want. I also added a configurable scale factor for those who want to try to normalize for strength. I would suggest that if the scale factor needs to be more than 10 or 20, it's not that important.

I personally don't really see a point to scaling, but I added it because it was a popular request. If a program is significantly stronger than another, that ALREADY implies significant changes, and it should get credit for whatever effect those have on the similarity scoring. For example, if you believe Houdini is "strongly" related to Robbolito, it is still clearly stronger, and giving Robbolito more time is like trying to cancel out those differences, adding bias to the test. In other words, the strength of a program is ALSO a function of how similar it is to another. In fact, this is the primary argument used to "prove" that a program is different, on both sides of the "clone" arguments.

BB+ wrote: I've decided to stick this in a more technical subforum. Here are the results of my experiment in "bestmove" matching, à la Don Dailey. [...]

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Some experimental "similarity" data

Post by BB+ » Thu Dec 30, 2010 11:17 pm

The "RandomBits" of IvanHoe seem just to add some binomial noise to every evaluation. The tester is not at all fooled, giving a match (at depth 15) of about 70%. As ~75% is the "base" overlap (from doing "time" rather than "depth"), this implies that such "binomial noise" does little to really perturb the bestmove. The number was similar no matter what parameters I chose for RandomBits/RandomCount. I didn't do more than 1000 positions. This again shows how robust the PV tree is, I guess.

Then I turned to StaticWeighting, which is on a 1024-scale. I tested with 512, 768, 1280, 1536, 1792. Except for the extremes (512 vs 1792 in particular, and also 768-1792 and 512-1536 to a lesser extent), everything was above the 60% matching level. The comparison to 1024 was always at least around 65%. Again I stopped the test early, as the pattern (and magnitudes) were apparent.

There are four other useful UCI parameters in IvanHoe, MaterialWeighting, KingSafetyWeighting, MobilityWeighting, and PawnsWeighting, all again on this 1024 basis. I next tested MobilityWeighting, using the 512/768/1280/1536/1792 demarcations. Only the extremal 512-1792 comparison was below 60%. A similar story held true for PawnsWeighting, though the spread looked to be a little larger. Every comparison to the 1024 standard was again 65% or more. The same held true for KingSafetyWeighting.

If any one of these were to have a great impact, MaterialWeighting would seem the most likely candidate. Indeed, except for the closest comparisons (1024/1280 and the like), the numbers are typically below 60%. The effect of this parameter on strength remains to be seen.

Another test would be to change all five of these UCI parameters, say to 768 with MaterialWeighting at 896, and observe the effect. This seemed to be in the 65-70% range.
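For reference, a weight on this 1024 basis presumably just scales the corresponding eval component, with 1024 meaning 100% (a sketch of the idea, not IvanHoe's code):

Code:

#include <stdio.h>

/* Sketch of a 1024-based weighting: 512 = 50%, 1024 = 100%, 1792 = 175%. */
static int weight_term(int term_cp, int weight_1024)
{
    return (term_cp * weight_1024) / 1024;
}

int main(void)
{
    const int mobility_cp = 36;     /* illustrative eval term, in centipawns */
    const int weights[] = { 512, 768, 1024, 1280, 1536, 1792 };
    for (int i = 0; i < 6; i++)
        printf("weight %4d -> %3d cp\n", weights[i],
               weight_term(mobility_cp, weights[i]));
    return 0;
}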

Sentinel
Posts: 122
Joined: Thu Jun 10, 2010 12:49 am
Real Name: Milos Stanisavljevic

Re: Some experimental "similarity" data

Post by Sentinel » Fri Dec 31, 2010 2:57 am

BB+ wrote: If any one of these were to have a great impact, MaterialWeighting would seem the most likely candidate. Indeed, except for the closest comparisons (1024/1280 and the like), the numbers are typically below 60%. The effect of this parameter on strength remains to be seen.
If you keep the ratios between pieces close to the original, strength should not suffer much.
Another thing that will "trick" the test a lot is changing the PSTs. Unfortunately, to change those you have to do a bit of compiling ;).
