Even Yet Another Fast Testing Scheme

gaard
Posts: 127
Joined: Thu Jun 10, 2010 1:39 am
Real Name: Martin Wyngaarden
Location: Holland, Michigan

Even Yet Another Fast Testing Scheme

Post by gaard » Fri Aug 20, 2010 2:51 pm

For this scheme, two or more engines are tasked to solve thousands of positions. In this example, I'll be using 6400 random middle-game positions from high-level games, and Rybka 4. A will denote Rybka 4 using 900ms per position to solve the 6400 positions, and B will be Rybka 4 using 1100ms. By my estimation this gives an Elo difference of ~20. Ordinarily, thousands of games would have to be played over a long period of time, hours at least, to determine the superiority of one over the other. After solving the 6400 positions, all positions where A and B agree are discarded: positions that both fail and positions that both solve are thrown out. In this case I am left with 126 positions. Of these, A solved 44 and B solved 82, equating to a 3.54-sigma difference, or a 99.98% LOS.
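
For anyone who wants to check the arithmetic, here is a minimal Python sketch of it as a plain two-sided sign test over the disagreement positions (depending on the exact correction used, the result can differ slightly from the figures above):

    from math import erf, sqrt

    def sign_test(a_solved, b_solved):
        # Disagreement positions only: under the null hypothesis of equal
        # strength, each one is a fair coin flip between A and B.
        n = a_solved + b_solved
        sigma = abs(b_solved - a_solved) / sqrt(n)
        los = 0.5 * (1.0 + erf(sigma / sqrt(2.0)))  # Phi(sigma)
        return sigma, los

    print(sign_test(44, 82))  # the Rybka 900ms vs. 1100ms example above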

In cases where the difference in strength is greater, fewer positions are needed. For instance, on these same 6400 positions, compared with Stockfish 1.8 using 900ms per position, Rybka 4 at 900ms triumphs over Stockfish with a 4.47-sigma superiority, translating to a 99.9997% LOS. Using a cutoff that stops the search once the correct answer has been held for x ply, all of these tests were conducted in parallel in 45 minutes of CPU time. I have yet to find one instance where this method has given an answer contrary to my 4" testing scheme over thousands of games.
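
The cutoff itself is easy to script. Here is an illustrative Python sketch (the engine handling is hypothetical and the UCI handshake is elided, so don't treat it as any particular tool's API): watch the engine's "info" lines and stop the search once the expected move has headed the PV for CUTOFF consecutive plies of depth.

    import subprocess

    CUTOFF = 6  # plies the answer must be held before stopping early

    def solves(engine_path, fen, bm, movetime_ms=900):
        eng = subprocess.Popen([engine_path], stdin=subprocess.PIPE,
                               stdout=subprocess.PIPE, text=True, bufsize=1)
        send = lambda s: (eng.stdin.write(s + "\n"), eng.stdin.flush())
        send("uci"); send("position fen " + fen)
        send("go movetime %d" % movetime_ms)
        best, since = None, 0
        for line in eng.stdout:
            tok = line.split()
            if "depth" in tok and "pv" in tok:
                depth = int(tok[tok.index("depth") + 1])
                move = tok[tok.index("pv") + 1]
                if move != best:
                    best, since = move, depth
                elif best == bm and depth - since >= CUTOFF:
                    send("stop")  # answer held long enough: cut the search
            if tok and tok[0] == "bestmove":
                eng.kill()
                return tok[1] == bm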

Jarkko
Posts: 5
Joined: Thu Jun 10, 2010 2:53 pm

Re: Even Yet Another Fast Testing Scheme

Post by Jarkko » Fri Aug 20, 2010 4:48 pm

Hi,

Are these positions and solutions available?
I would like to try using these positions to tune a program.

- Jarkko

gaard
Posts: 127
Joined: Thu Jun 10, 2010 1:39 am
Real Name: Martin Wyngaarden
Location: Holland, Michigan

Re: Even Yet Another Fast Testing Scheme

Post by gaard » Fri Aug 20, 2010 10:47 pm

Jarkko wrote:Hi,

Are these positions and solutions available?
I would like to try using these positions to tune a program.

- Jarkko
I have no problem describing how I chose these positions, but for now they will remain private. Here are the criteria:

1) Taken from high-level correspondence games that end in a draw.

2) Positions from moves 30-40 have the greatest bearing on overall strength; opening and endgame positions can safely be discarded. I have another suite that I use for testing endgame prowess. Ideally the suite should include both endgame and middle-game positions, but this is a work in progress.

3) Extract the FENs from the PGNs with pgn2fen. Change "pm" to "bm" in the EPD output file, and remove from the EPD all positions that do not meet the criteria outlined in 2). (See the sketch after this list.)

4) Run the suite with your engine of choice using your chosen interface (only a few are useful for this purpose), e.g. Arena, PolyGlot, etc.
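
If you would rather roll your own than use pgn2fen, a python-chess sketch along these lines covers 1) through 3) in one pass (it writes "bm" directly instead of renaming "pm"; the file names are placeholders):

    import chess.pgn

    def build_suite(pgn_path, epd_path, first=30, last=40):
        with open(pgn_path) as pgn, open(epd_path, "w") as out:
            while True:
                game = chess.pgn.read_game(pgn)
                if game is None:
                    break
                if game.headers.get("Result") != "1/2-1/2":
                    continue  # criterion 1: drawn games only
                board = game.board()
                for move in game.mainline_moves():
                    if first <= board.fullmove_number <= last:
                        # criterion 2 window; record the played move as "bm"
                        out.write(board.epd(bm=move) + "\n")
                    board.push(move)

    build_suite("corr_games.pgn", "suite.epd")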

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: Even Yet Another Fast Testing Scheme

Post by BB+ » Sat Aug 21, 2010 1:20 am

Is this an accurate description?

You took 12800s and achieved a 3.85-sigma result between engines that are about 15-20 Elo apart (that is, log(11/9)/log(2)*N, taking a doubling of time to be worth N Elo, where N is 50 or 70).
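
In code, that back-of-the-envelope estimate is just the following (the elo_per_doubling values are the usual rough guesses, nothing more):

    from math import log2

    def elo_gain(t_fast, t_slow, elo_per_doubling):
        # a doubling of thinking time is taken to be worth elo_per_doubling Elo
        return elo_per_doubling * log2(t_slow / t_fast)

    print(elo_gain(900, 1100, 50))  # ~14.5 Elo
    print(elo_gain(900, 1100, 70))  # ~20.3 Elo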

Comparatively, if I played 1-second chess (with some minor increment, say 20 or 25ms), I would get about 1 game every 4 seconds, and thus 3200 games in the same 12800s (I am thinking of a single CPU, as SMP would never be efficient at this speed -- I'm not sure SMP would be that great even at 1s per move), which would have a sigma of about 4 Elo, or again a 4-5 sigma result.
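
The "sigma of about 4 Elo" comes from the normal approximation to the match score. A rough sketch, where the 40% draw ratio is only an assumption for blitz, not a measured figure:

    from math import log, sqrt

    def elo_sigma(games, draw_ratio=0.4):
        # per-game score variance under H0: wins/losses sit +/-0.5 from the
        # mean, draws contribute 0, so var = (1 - draw_ratio) * 0.25
        var = (1.0 - draw_ratio) * 0.25
        # slope of the Elo curve at a 50% score: 1600 / ln(10), about 695
        return (1600.0 / log(10)) * sqrt(var / games)

    print(elo_sigma(3200))  # ~4.8 Elo; a higher draw ratio pushes it toward 4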

So I conclude your method is approximately as statistically significant as 1s/game chess. One advantage of your method is that longer searches (1s per move) are done. Another is that 1s/game is not exactly feasible for some engines, due to overhead and/or time-management buffers. Do your results change much if you increase the time per move (say to 5s)? One disadvantage is that you won't uncover various bugs/problems that only appear in actual games (such as dubious endgame knowledge).

gaard
Posts: 127
Joined: Thu Jun 10, 2010 1:39 am
Real Name: Martin Wyngaarden
Location: Holland, Michigan

Re: Even Yet Another Fast Testing Scheme

Post by gaard » Sat Aug 21, 2010 2:47 am

BB+ wrote:Is this an accurate description?

You took 12800s and achieved a 3.85-sigma result between engines that are about 15-20 Elo apart (that is, log(11/9)/log(2)*N, taking a doubling of time to be worth N Elo, where N is 50 or 70).

Comparatively, if I played 1-second chess (with some minor increment, say 20 or 25ms), I would get about 1 game every 4 seconds, and thus 3200 games in the same 12800s (I am thinking of a single CPU, as SMP would never be efficient at this speed -- I'm not sure SMP would be that great even at 1s per move), which would have a sigma of about 4 Elo, or again a 4-5 sigma result.

So I conclude your method is approximately as statistically significant as 1s/game chess. One advantage of your method is that longer searches (1s per move) are done. Another is that 1s/game is not exactly feasible for some engines, due to overhead and/or time-management buffers. Do your results change much if you increase the time per move (say to 5s)? One disadvantage is that you won't uncover various bugs/problems that only appear in actual games (such as dubious endgame knowledge).
I used PolyGlot, and for Rybka "-depth-delta 6", so the average time per position is actually closer to ~600ms. I was also able to run Stockfish and both Rybkas at the same time on my quad core. I used N=70, which, if anything, is probably too large an estimate.
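
From memory, the PolyGlot epd-test invocation looks something like the line below; check the README of your build, since the option names may vary between versions (the .ini and .epd file names here are placeholders):

    polyglot epd-test rybka.ini -epd suite.epd -max-time 0.9 -depth-delta 6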

There are drawbacks: solving positions can fail to find important bugs. However, my main selling point is that I know of no better way of determining objective playing strength. Gauntlets and RRs, or even self-play, create a lot of room for inbreeding: I could play 10^5 games against some modified version of my engine, measure a 10 Elo increase at 3 sigma (to make up some numbers), and then find I have a regression after testing against a battery of unrelated engines.

I've tested at 8" and the results look mostly consistent with those at 1", or even at 400ms, which is my baseline for testing changes that are, or should be, unrelated to scaling, or that are especially time-sensitive. I have some more tests lined up at 2" that I will post later, time allowing.

gaard
Posts: 127
Joined: Thu Jun 10, 2010 1:39 am
Real Name: Martin Wyngaarden
Location: Holland, Michigan

Re: Even Yet Another Fast Testing Scheme

Post by gaard » Sat Aug 21, 2010 3:05 am

Bear in mind, too, that the methodology for selecting these positions was not incredibly elaborate. It's not inconceivable to me that, with some research, I could find a set of positions that determines superiority with, say, half the number of positions used here for two engines 20 Elo apart.

gaard
Posts: 127
Joined: Thu Jun 10, 2010 1:39 am
Real Name: Martin Wyngaarden
Location: Holland, Michigan

Re: Even Yet Another Fast Testing Scheme

Post by gaard » Sat Aug 21, 2010 7:24 am

Testing at 2" yields even better results for Rybka 4: 5.38 sigma over the same 6400 positions compared with Stockfish. According to my rating list these engines are separated by ~65 Elo, so one sigma here corresponds to ~12 Elo (65/5.38) if my math is correct, in only 3.5 hours.

gaard
Posts: 127
Joined: Thu Jun 10, 2010 1:39 am
Real Name: Martin Wyngaarden
Location: Holland, Michigan

Re: Even Yet Another Fast Testing Scheme

Post by gaard » Sun Aug 22, 2010 8:22 pm

Again, the results I've obtained from this are entirely consistent with my 4" CTPM rating list. Some of my more interesting findings:

Houdini and Rybka are incredibly close in middle-game analysis. Eerily close. Houdini has a slight edge, but it is not significant. Where Houdini's edge over Rybka is significant is in endgame analysis, where it is marginally better than even IvanHoe's.

FireBird 1.1 DD looks slightly worse than the default and in no way threatens Rybka or IvanHoe. More tests of IvanHoe with triplebases are scheduled.

Stefan
Posts: 10
Joined: Wed Jun 16, 2010 5:43 pm
Real Name: Stefan Schiffermüller

Re: Even Yet Another Fast Testing Scheme

Post by Stefan » Mon Aug 23, 2010 11:31 am

gaard wrote:Again, the results I've obtained from this are entirely consistent with my 4" CTPM rating list.
What do you mean by 'consistent'?

gaard
Posts: 127
Joined: Thu Jun 10, 2010 1:39 am
Real Name: Martin Wyngaarden
Location: Holland, Michigan

Re: Even Yet Another Fast Testing Scheme

Post by gaard » Mon Aug 23, 2010 3:51 pm

Stefan wrote:
gaard wrote:Again, the results I've obtained from this are entirely consistent with my 4" CTPM rating list.
What do you mean by 'consistent'?
WRT LOS: they agree as far as rank and likelihood of superiority are concerned. I've never had a result from this testing scheme that said A is superior to B while my 4" testing scheme said the opposite.
