Creating a new (and independent) rating list

Uly · Post by **Uly** » Thu Jun 24, 2010 6:54 am

BB+ wrote:For instance, should a uniform platform be adopted, or is the "benchmark and adjust" procedure sufficient? What aspects of the engine are you trying to measure (for instance, is time management important)? What interference is allowed from the GUI (for instance, is N straight moves with both at 0.00 a draw, even if no repetition has been made)? Is the focus for top engines (top 10 or 20), or for a wide variety (200+) of amateur engines?

When Banks was asked such questions he answered that all these testers play such games for fun. THAT is the sole reason they do it at all. Any restriction would make it more difficult to have fun, so the way they do it maximizes it.

BTO7 · Post by **BTO7** » Thu Jun 24, 2010 8:49 am

BB+ wrote:I'm not sure I like the methodology of any of the current groups, but then I have a very strong preference for science. For instance, some of the rating groups let the operator choose/create the book. I have no idea how much mischief this could entail, though I see anecdotes around that Engine X does (relatively) better than Engine Y with Book Z. If we want to make it a scientific venture, more discussion is needed. For instance, should a uniform platform be adopted, or is the "benchmark and adjust" procedure sufficient? What aspects of the engine are you trying to measure (for instance, is time management important)? What interference is allowed from the GUI (for instance, is N straight moves with both at 0.00 a draw, even if no repetition has been made)? Is the focus for top engines (top 10 or 20), or for a wide variety (200+) of amateur engines? If you want the latter, then you will likely have to sacrifice "science" to some degree, as to cover such a broad spectrum you will need many different testers involved.

Great point, BB. I was thinking about this myself. In my view I would like to see the engines tested computer to computer, so ponder can be on with no chance of one robbing from the other. Both machines equal, of course. No books and no TBs, with a decent medium time control. In addition, I think a nice parallel would be to run two lists the same way, one on Windows and the other on Linux, say. Being different operating systems altogether, that would even out certain factors, such as Intel vs AMD chips, much better. My few cents, anyway.

In addition, all would then run on a strategic suite, with the result somehow weighted into an overall Elo score. Their speed and such would be valued too.

Regards
BT

Rebel · Post by **Rebel** » Thu Jun 24, 2010 9:20 am

BB+ wrote:I'm not sure I like the methodology of any of the current groups, but then I have a very strong preference for science. For instance, some of the rating groups let the operator choose/create the book. I have no idea how much mischief this could entail, though I see anecdotes around that Engine X does (relatively) better than Engine Y with Book Z. If we want to make it a scientific venture, more discussion is needed. For instance, should a uniform platform be adopted, or is the "benchmark and adjust" procedure sufficient? What aspects of the engine are you trying to measure (for instance, is time management important)? What interference is allowed from the GUI (for instance, is N straight moves with both at 0.00 a draw, even if no repetition has been made)? Is the focus for top engines (top 10 or 20), or for a wide variety (200+) of amateur engines? If you want the latter, then you will likely have to sacrifice "science" to some degree, as to cover such a broad spectrum you will need many different testers involved.

Of course a reliable rating list has to be done in SSDF style: equal hardware, equal hash size, own books, strongest settings provided by the programmer. I am a bit out of date, but reading the follow-ups in this thread I get the impression that CCRL (Graham Banks, isn't it?) does not follow these requirements. What about CEGT?

Ed

Uly · Post by **Uly** » Thu Jun 24, 2010 8:05 pm

Rebel wrote:What about CEGT?

They also test with generic books.

BB+ · Post by **BB+** » Fri Jun 25, 2010 1:50 am

CCRL has the following:

Time Control: Equivalent to 40 moves in N minutes on AMD X2 4600+ at 2.4GHz. We use Crafty 19.17 BH as a benchmark to determine the equivalent time control for a particular machine.

We use repeating time control. It means that, say, in 40/40 the engines have 40 minutes for the first 40 moves. Then they get another 40 minutes for the next 40 moves, and so on.

Endgame tablebases: 4 or 5 piece tablebases.

Pondering: OFF.

Tournament format: Any format of tester's choice: Match, Round-robin, Gauntlet, Swiss, etc.

Hash size: Should be set to the same value of either 128 or 256 MB for all engines in a match or tourney. There are two exceptions: 1) Engines using 2 CPUs should have double the hash size compared to single-CPU engines in the same tourney; 4-CPU engines should have 4 times the amount of hash. 2) A smaller hash size can be used if an engine has problems with a particular hash size, or if it does not allow the hash size to be configured.

EGTB hash: 32 MB.

Tournament Interface: Any. Examples: Winboard, Arena, Shredder, Chessbase, Chess Partner.

Opening book: Any generic. Examples: remis.ctg, draw.ctg, 5moves.ctg, perfect.ctg etc. Book line length has to be limited to 12 moves per side maximum. The same book should be used for all engines in the same match or tournament.

Engines with their own books should have them disabled (deleted or switched off in parameters). Engines which can't disable their own book can't participate in CCRL testing.

Book learning: Off for all engines.

Position learning: Off for all engines.
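
The "benchmark and adjust" step, the hash rule and the repeating time control in the conditions above could be sketched roughly as follows. This is only an illustration: it assumes the equivalent time control scales linearly with how long a tester's machine needs for the benchmark relative to the reference machine, which the quoted text does not spell out. All function names and parameters are hypothetical.

```python
def equivalent_minutes(base_minutes, reference_bench_seconds, local_bench_seconds):
    """Scale a time control from the reference machine to a tester's machine.

    Assumption (not stated in the quoted conditions): linear scaling by the
    ratio of benchmark run times, so a machine that needs twice as long for
    the benchmark gets twice the minutes.
    """
    if reference_bench_seconds <= 0 or local_bench_seconds <= 0:
        raise ValueError("benchmark times must be positive")
    return base_minutes * local_bench_seconds / reference_bench_seconds


def hash_mb(base_hash_mb, cpus):
    """Hash per engine: 2-CPU engines get double, 4-CPU engines four times
    the single-CPU hash size, as in the first quoted exception."""
    if cpus not in (1, 2, 4):
        raise ValueError("only 1, 2 or 4 CPUs are covered by the quoted rule")
    return base_hash_mb * cpus


def repeating_allotment(moves_completed, moves_per_period=40, minutes_per_period=40):
    """Total minutes granted so far under a repeating control such as 40/40.

    A new period (and a new block of minutes) starts once each batch of
    `moves_per_period` moves has been completed.
    """
    periods_started = moves_completed // moves_per_period + 1
    return periods_started * minutes_per_period
```

A tester whose machine runs the benchmark in half the reference time would, under this linear assumption, play 40 moves in 20 minutes instead of 40.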

CEGT has

CEGT games are tournament time control games 40/120, medium time control games 40/20 and blitz time control games 40/4. The meaning of 40/120, 40/20 and 40/4 is 40 moves in 120, 20 and 4 minutes, and another 120, 20 and 4 minutes for moves 40 to 80, and so on. Given the different hardware of the testers, we agreed to adapt to AMD64 X2 4200+ for 40/120 and 40/20 and a 2 GHz Pentium CPU for 40/4. Hash given is usually 256 MB for each engine. The very few testers who have less RAM available are allowed to give 128 MB. Deep versions: Deep Shredder 9, Deep Fritz 8, Deep Junior 9 and others are tested on dual machines using 2 CPUs and 512 MB hash. There is an exception for Junior 9.003, which uses only 256 MB, because bugs seem to occur when giving 512 MB to this one.

Books:
In the first months of CEGT all Nunn Suite 1 and 2 positions were used and also many from Noomen Select. Currently we use mainly books like 8move.ctg, remis.ctg, Perfect books, Powerbooks, Master Elect and Arena books, mainly by Harry Schnapp. We have started to use, to a greater extent, a testsuite with 220 positions by Harry Schnapp.

Tablebases:
Most Testers use 5 men EGTB. Some use only 4 men. Testers using 5 men give 32 MB EGTB hash. Testers using 4 men give 16 MB EGTB hash.

GUIs:
All testers use one or more different GUIs. Most used are Shredder 9 GUI, Arena and Shredder Classic GUI. Chess Partner GUI and Winboard can also be used. Not used are buggy GUIs like Fritz 9, Fritz 8 with server update, and known buggy UCI.dll files.

Adjudications:
Testers and GUIs are allowed to adjudicate totally won or drawn games.

Benching: [...]

kingliveson · Post by **kingliveson** » Fri Jun 25, 2010 2:41 am

I have an Athlon 64 X2 4200+ (old, I know) and can dedicate it full time to testing for the Open Chess Rating List. To touch on some points already brought up: my preference is PGN test suites, and no books. The end result will not change, meaning that if you played 1000 games between 2 engines using a book, the top engine would still come out ahead, but the margin is distorted by the bias the book introduces. Now, are we comfortable with such bias? If so, fine by me. PGN suites can also create their own problem: a one-dimensional games database, if it is not updated regularly or if there is no large database from which to select the test positions.
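
To put a number on how far a book might shift the margin, a match score can be converted into an Elo difference with the standard logistic formula and compared between a book run and a suite run of the same pair. A minimal sketch; the function name is made up for illustration and no figures here come from any actual list.

```python
import math


def elo_diff(wins, draws, losses):
    """Elo difference implied by a match result, from the first engine's side.

    Uses the usual logistic relation: score = 1 / (1 + 10 ** (-diff / 400)).
    """
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        # A 0% or 100% score has no finite Elo difference.
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)
```

Running the same two engines once with a book and once from suite positions, then comparing the two `elo_diff` values, would show how much of the margin depends on the openings.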

As for time control, is it time to define new standards: 40/10', 40/30', 40/60' repeating? Or 10+0, 30+0, 60+0, non-repeating?
And tablebases: one could argue that they create interference. An engine may not be able to solve certain endgames without external help.

Just my 2 cents...

Peter C · Post by **Peter C** » Fri Jun 25, 2010 2:54 am

I agree with Franklin on pretty much all his points, though I think 40/4, 40/20, and 40/40 are fine time controls. Definitely test suites over opening books. I have a Q8300 @ 2.50 GHz that I can use for an OpenChess Rating List sometimes.

Peter

IWB · Post by **IWB** » Fri Jun 25, 2010 6:22 am

Hello

BB+ wrote:I'm not sure I like the methodology of any of the current groups, but then I have a very strong preference for science. For instance, some of the rating groups let the operator choose/create the book. I have no idea how much mischief this could entail, though I see anecdotes around that Engine X does (relatively) better than Engine Y with Book Z. If we want to make it a scientific venture, more discussion is needed. For instance, should a uniform platform be adopted, or is the "benchmark and adjust" procedure sufficient? What aspects of the engine are you trying to measure (for instance, is time management important)? What interference is allowed from the GUI (for instance, is N straight moves with both at 0.00 a draw, even if no repetition has been made)? Is the focus for top engines (top 10 or 20), or for a wide variety (200+) of amateur engines? If you want the latter, then you will likely have to sacrifice "science" to some degree, as to cover such a broad spectrum you will need many different testers involved.

I fully agree here, and that is one of the reasons why I decided to run the IPON years ago and finally went public last year. The IPON is the most "scientific" list, with clear and repeatable conditions, that you will find at the moment.
There is one exception, which is the "N (3) straight moves with 0.00 are judged as a draw" rule. In chess terms this might not necessarily be a draw, but at least I have had it right from the beginning for ALL engines (and it is less common in unclear positions than one might think, as most engines evaluate unclear material as 0.01).
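
A rule of that kind could be sketched as below. This is a guess at the general shape only, not the actual GUI logic: it assumes the GUI keeps one pair of evaluations (in pawns) per full move and that "0.00" means exactly zero from both engines. The function name and parameters are hypothetical.

```python
def is_zero_eval_draw(evals, n=3, tolerance=0.0):
    """Return True if both engines reported 0.00 on each of the last n moves.

    evals: list of (white_eval, black_eval) tuples, one per full move.
    tolerance: allowed deviation from 0.00; the default demands exact zeros,
    so an engine scoring an unclear position as 0.01 blocks the adjudication.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if len(evals) < n:
        return False
    return all(
        abs(white) <= tolerance and abs(black) <= tolerance
        for white, black in evals[-n:]
    )
```

With the default tolerance a single 0.01 among the last three moves is enough to keep the game going, which matches the remark that the rule triggers less often than one might expect.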

Bye
Ingo

Rebel · Post by **Rebel** » Fri Jun 25, 2010 9:14 am

BB+ wrote:CCRL has the following:
CEGT has

From the stipulations I understand the intent of both CCRL and CEGT is to measure the raw engine strength. While that choice has its merits, injustice is done to the programmer's other efforts to add extra Elo points to his brainchild. Opening books, book learning and position learning are essential parts of a chess program; they are able to fix holes, adapt, and even avoid previously made mistakes.

IMO programs should be tested as a whole as the programmer intended and not be handicapped.

This has always been the policy of the SSDF.

Ed

aiorla · Post by **aiorla** » Fri Jun 25, 2010 10:06 am

I will dedicate some time of my i5-750 to the list if it is created!
About the time control, I think that a repeating control or an incremental one sounds better than 10+0, 15+0...
About tablebases, I think they have to be included, 3-4-5 man for the Nalimov ones (7 GB is affordable), because some engines are designed to have them and lack knowledge in simple endgames. And Robbobases will have to be included too if we want a fair rating list!
