Best Engine Testing Practices?

LucenaTheLucid · Post by **LucenaTheLucid** » Tue Dec 20, 2011 3:03 am

I was wondering what practices other engine testers follow. This is the setup I use:

Ponder                      : OFF
CPU                         : 2.7 GHz Phenom2 (x6)
Operating system            : Windows 7-64 bit
Cores                       : All Engines with 1 core
Hash                        : 128 MB
Nalimov/Gaviota             : 5 pieces
Cache for TBs               : 8 MB
Time control                : 30 seconds per game + 0.5 second increment
Openings                    : Default.bin
Learning                    : OFF

First I test the newest release version as the baseline, using a gauntlet of about 10 opponents, 5 weaker and 5 stronger. With this baseline version I play about 5000-6000 games. Then for every subsequent version I use the exact same settings but only play 3000 or so games. This brings the error bars down quite low.
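To give a rough idea of how the error bars shrink with the number of games, here is a minimal sketch. It assumes the standard logistic Elo formula and a normal approximation for the score; the function names and the 35/30/35 win/draw/loss split are made up for illustration and are not from any testing tool.

```python
import math

def elo_from_score(score):
    """Convert a score fraction in (0, 1) to an Elo difference."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a W/D/L record (normal approximation)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return elo_from_score(score), low, high

for games in (1000, 3000, 6000):
    # Hypothetical record: 35% wins, 30% draws, 35% losses.
    w = l = games * 35 // 100
    d = games - w - l
    elo, low, high = elo_with_margin(w, d, l)
    print(f"{games} games: {elo:+.1f} Elo, 95% interval [{low:+.1f}, {high:+.1f}]")
```

The interval width falls roughly with the square root of the number of games, so going from 3000 to 6000 games narrows it by about 30%, not by half.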

What time controls and settings do other testers use? Is there anything I should be wary of, especially regarding the time control?

Uly · Post by **Uly** » Tue Dec 20, 2011 8:41 am

LucenaTheLucid wrote: Is there anything I should be wary of? Especially considering the time control?

I would say that you should be aware of the randomness and noise you introduce when testing with a book (this happens with any book, not just Default.bin): it is unfair to the engines, because the results depend on which openings the book happens to give them.

It doesn't matter how good the out-of-book positions are for the engines; what matters is that they're different, and the engines react differently depending on the opening. So if, say, one engine randomly gets to play more of the Spanish while another gets more of the Sicilian, this causes unnecessary biases in the results, especially if the engines play those positions better.

Another problem is lack of variety. Say an engine is really good when coming out of book in an unbalanced position, but the book doesn't have any such positions. This gives the engine a lower Elo than it would have under real-life conditions. But having unbalanced positions come up at random just worsens the above problem.

A nice solution is to use a varied opening suite, one that includes all kinds of playable out-of-book positions, where all the engines get to play them from both sides against all the opponents. The only problem is getting hold of such a suite, especially when you need 3000 unique positions for 6000 games (or 4500 unique positions for a total of 9000 games). The point is that when you say "I use the same exact settings", it's not accurate if the openings are not the same for everyone, or if the openings are biased (too balanced out of book, too many Sicilians in there, etc.).

How many 1.g3 games do you have there? I think there should be at least half as many as 1.c4 games; if not, you have a book that isn't representing the major openings fairly enough, and it's probably focusing too much on specific slices of chess that don't represent real-life scenarios well.

Finally, the randomness you get rid of by using 1CPU engines is brought back by the book.

ernest · Post by **ernest** » Tue Dec 20, 2011 7:26 pm

Uly wrote:...representing the major openings fairly enough

Well, how do you ascertain that, in a practical way?
(or how do you check that some book you could use, say Sedat's Perfect2012, is "correct" in that matter)?

Uly · Post by **Uly** » Wed Dec 21, 2011 12:02 am

It's easy to do that in the extreme cases: for instance, if 90% of the games are Sicilians, then you have a problem.

In non-extreme cases it gets more difficult. I'd start by seeing how often each first move is played in the openings.

I don't know what good values look like, but I know what bad values look like: for instance, if only 1% of positions are 1.c4, then you have a problem again.

If the first-move frequencies don't look wrong, then you may want to look at the representation of ECO codes: if, for instance, your d4s and c4s end up transposing to E10, you are under-covering c4.

You have to check for potential problems and fix them. If there were known right values, I'd go and make my own opening suite and release it to the world, but in reality, taken individually, one out-of-book position is as good as another; it's when comparing them to the others that you may notice biases.
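As a rough illustration of the first check (how often each first move shows up), here is a minimal sketch that counts White's first move in a PGN file of already-played games. It uses plain regular expressions rather than a chess library, so it assumes reasonably clean PGN; the function name and the sample games are made up for the example.

```python
import re
from collections import Counter

def first_move_counts(pgn_text):
    """Count White's first move across all games in a PGN string."""
    counts = Counter()
    # Drop tag pairs such as [Event "..."] and brace comments.
    body = re.sub(r'^\[.*\]\s*$', '', pgn_text, flags=re.MULTILINE)
    body = re.sub(r'\{[^}]*\}', '', body)
    # Movetext starts with "1." followed by White's move; the lookbehind
    # keeps "11." or "21." from being mistaken for move number 1.
    for match in re.finditer(r'(?<![\d.])1\.\s*([A-Za-z][\w+#=-]*)', body):
        counts[match.group(1)] += 1
    return counts

SAMPLE = """[Event "test"]
[Result "1-0"]

1. e4 c5 2. Nf3 d6 1-0

[Event "test"]
[Result "1/2-1/2"]

1. c4 e5 2. Nc3 Nf6 1/2-1/2

[Event "test"]
[Result "0-1"]

1.e4 e5 2.Nf3 Nc6 11. Qe2 0-1
"""

counts = first_move_counts(SAMPLE)
total = sum(counts.values())
for move, n in counts.most_common():
    print(f"1.{move}: {n} games ({100.0 * n / total:.1f}%)")
```

Run on a real games file, the shares can be compared by eye against the rules of thumb above (for example, 1.c4 at only 1%, or 1.g3 far below half of 1.c4).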

kingliveson · Post by **kingliveson** » Wed Dec 21, 2011 5:22 am

What Uly said in the second post... I would add that since the CPU is hexa-core, and you are doing single-core test, you should consider ponder on. Check out this opening suite: viewtopic.php?f=3&t=53&p=225

noctiferus · Post by **noctiferus** » Wed Dec 21, 2011 9:36 am

It seems that it can't be downloaded right now...

Uly · Post by **Uly** » Wed Dec 21, 2011 12:36 pm

kingliveson wrote:I would add that since the CPU is hexa-core, and you are doing single-core test, you should consider ponder on.

No, that would actually be worse than giving 2 CPUs to each engine.

On 2CPU, an MP engine with Ponder OFF is expected to use at least 70% of the resources of the second core. With Ponder ON, an engine is expected to guess about 60% of the opponent's moves, so 40% of the CPU is wasted on ponder misses. Furthermore, on a ponder miss, the engine has overwritten useful hash entries about the move that the opponent actually played, which hurts.

The only way Ponder makes sense is if you are testing on multiple computers.

The best way to use resources on a 6-core machine is to run 12 1CPU engines simultaneously, playing against each other in 6 concurrent games (with Ponder OFF, only one engine per game is thinking at any moment). All CPUs will be at 100% load, and no time will be wasted on MP inefficiency or ponder misses.

LucenaTheLucid · Post by **LucenaTheLucid** » Wed Dec 21, 2011 1:24 pm

That is a pretty clever way to look at CPU usage. Also, if anyone knows where I can find a large opening suite, please let me know. I don't have much interest in making one. Thanks.

Uly · Post by **Uly** » Wed Dec 21, 2011 10:59 pm

LucenaTheLucid wrote:I don't have much interest in making one.

But it's easy considering you already have thousands and thousands of games played.

Say you need 4500 starting positions: just pick 4500 games played with Default.bin, cut them to 8 ply (or something), remove duplicates, replace the games you removed, and you have your opening suite.

It won't solve many of the problems I mentioned, but at least you'll get rid of randomness and noise (since future results will be based on those 4500 positions, and not unknown positions).
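The cut-and-deduplicate procedure above could be sketched roughly like this. It is only a sketch under some assumptions: it works on the SAN text of the games without a chess library, so two move orders that transpose into the same position count as different openings, and the function names and sample games are invented for the example.

```python
import re

RESULTS = ('1-0', '0-1', '1/2-1/2', '*')

def split_games(pgn_text):
    """Return the movetext of each game, with tag-pair lines removed."""
    games, current = [], []
    for line in pgn_text.splitlines():
        line = line.strip()
        if line.startswith('[Event '):
            if current:
                games.append(' '.join(current))
                current = []
        elif line and not line.startswith('['):
            current.append(line)
    if current:
        games.append(' '.join(current))
    return games

def san_moves(movetext):
    """Return the SAN moves of one game, without numbers, comments or result."""
    text = re.sub(r'\{[^}]*\}', ' ', movetext)   # brace comments
    text = re.sub(r'\d+\.(\.\.)?', ' ', text)    # move numbers: "1." or "1..."
    return [tok for tok in text.split() if tok not in RESULTS]

def build_suite(pgn_text, plies=8, wanted=4500):
    """Collect up to `wanted` distinct opening lines of exactly `plies` plies."""
    seen, suite = set(), []
    for movetext in split_games(pgn_text):
        opening = tuple(san_moves(movetext)[:plies])
        if len(opening) < plies or opening in seen:
            continue  # too short or a duplicate: the next game replaces it
        seen.add(opening)
        suite.append(opening)
        if len(suite) == wanted:
            break
    return suite

SAMPLE = """[Event "a"]
1. e4 c5 2. Nf3 d6 3. d4 cxd4 1-0

[Event "b"]
1. e4 c5 2. Nf3 d6 3. Bb5+ Bd7 0-1

[Event "c"]
1. d4 Nf6 2. c4 e6 3. Nf3 b6 1/2-1/2
"""

for line in build_suite(SAMPLE, plies=4, wanted=10):
    print(' '.join(line))
```

With `plies=4`, the first two sample games share the same opening line, so only two distinct lines come out. To play the suite, each line would still need to be written out in whatever format the tournament manager accepts (PGN or EPD), which this sketch does not cover.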
