Re: Stockfish settings

Posted: Mon Jun 21, 2010 3:06 am
by LucenaTheLucid
I understand, but I don't think testing it vs. itself is optimal. I think testing vs. a wide range of opponents should do much better. Even then, testing with ponder=off isn't optimal and can give skewed results. IMHO the optimal testing conditions would be 2 different computers and a wide range of opponents.

Since this is not possible for most people, myself included, I just have to make do. =O)

However, do keep in mind that at 5/0 time controls it did better than the default settings against Rybka. I think after this is done I will run a test vs. Rybka 4, Naum, and the default settings.

The 5/0 results:

Stockfishtest13 2010

Deep Rybka 4 w32 - Stockfish 1.7.1 JA SPOON 26.0 - 24.0 +20/=12/-18 52.00%
Deep Rybka 4 w32 - Stockfish 1.7.1 JA Default 28.0 - 22.0 +23/=10/-17 56.00%

Stockfishtest14 2010

Deep Rybka 4 w32 - Stockfish 1.7.1 JA Default 70.5 - 49.5 +43/=55/-22 58.75%
Deep Rybka 4 w32 - Stockfish 1.7.1 JA SPOON 68.5 - 51.5 +49/=39/-32 57.08%

Re: Stockfish settings

Posted: Mon Jun 21, 2010 5:59 am
by mcostalba
Yes, I agree, the best test conditions are the ones most similar to actual real use of the engine.

But because a test should also be reliable (read: you need many games) and because we normally don't have unlimited CPU and time resources, we have to accept a compromise driven by experience and sensibility.

I agree self play is not always a perfect picture of reality, but it has two advantages:

1) If a version is stronger than another one in self play, then it is "very probably" also stronger against an engine pool, although it is impossible to say how much stronger it is in the second case given the result of the first.

2) Self play is the most efficient in terms of number of games played. Playing the same individual number of games against an engine pool requires much more testing time.

For instance, if you want to test a new SF setting against Rybka, you first need to test the original version and then repeat the test with the modified version. This means doubling the testing time compared with a simple self-play test.

Re: Stockfish settings

Posted: Mon Jun 21, 2010 9:31 pm
by Taner Altinsoy
Ok, 1000 1-min games completed. The default settings won against spoon by 512/488 (51.2%/48.8%), which equates to about 8 Elo. So simply there's no spoon :)

A question to developers. Do you think amateurs like us fiddling with settings have any real chance to come up with something really better than default?

regards,
Taner
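
For reference, the "8 Elo" figure above can be reproduced with the usual logistic Elo model. This is a minimal sketch, assuming that model is what was used; the function name `elo_diff` is only illustrative and not part of any tool mentioned in the thread.

```python
import math

def elo_diff(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))

# Taner's match: 512 points out of 1000 for the default settings.
print(round(elo_diff(512 / 1000), 1))
```

The result is a little over 8 Elo, which agrees with the number quoted in the post. It says nothing yet about how reliable that difference is.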

Re: Stockfish settings

Posted: Mon Jun 21, 2010 10:31 pm
by Robert Houdart
Sure, as long as you run serious tests under well-controlled conditions, play enough games (at least 1000), and are aware of the statistical relevance of your results.

For example, your 51.2% result after 1000 games is not very significant: the standard deviation of a 1000-game match lies somewhere between 1% and 1.5%, meaning that you could very easily obtain 51.2% with two engines of identical strength. More games are required to make a final judgement.

Robert
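
The 1% to 1.5% range quoted above can be checked with a short calculation. The sketch below assumes independent games scored 1, 0.5 or 0 and two engines of equal strength; the draw ratios tried are hypothetical, since the draw count of the 1000-game match was not posted, and `score_std` is an illustrative name.

```python
import math

def score_std(games: int, draw_ratio: float, mean: float = 0.5) -> float:
    """Standard deviation of the match score fraction over `games` independent games.

    Wins and losses are split so that the expected score per game equals `mean`.
    """
    win = mean - draw_ratio / 2.0
    loss = 1.0 - draw_ratio - win
    if not 0.0 <= draw_ratio <= 1.0 or win < 0.0 or loss < 0.0:
        raise ValueError("draw_ratio and mean are not compatible")
    # Var(X) = E[X^2] - E[X]^2, with X in {1, 0.5, 0}
    per_game_var = win + 0.25 * draw_ratio - mean ** 2
    return math.sqrt(per_game_var / games)

# Hypothetical draw ratios for a 1000-game match between equal engines.
for draw_ratio in (0.0, 0.3, 0.6):
    print(f"draws {draw_ratio:.0%}: sd = {score_std(1000, draw_ratio):.2%}")
```

Under these assumptions the standard deviation falls from about 1.6% with no draws to about 1.0% with 60% draws, so a 1.2% lead is roughly one standard deviation.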

Re: Stockfish settings

Posted: Mon Jun 21, 2010 11:47 pm
by LucenaTheLucid
Thanks Robert,

How about this?

Stockfishtest16-1 2010


1 Stockfish 1.7.1 JA +220/=602/-177 52.15% 521.0/999
2 Stockfish 1.7.1 JA SPOON +177/=602/-220 47.85% 478.0/999

1 minute games of course...

Re: Stockfish settings

Posted: Tue Jun 22, 2010 12:23 am
by Robert Houdart
52.1% with 1000 games is a lot more significant; I think the Stockfish team will reject the proposed change ;).

Robert

Re: Stockfish settings

Posted: Tue Jun 22, 2010 12:25 am
by LucenaTheLucid
Yes indeed Robert, back to the ole' drawing board...=O(

Re: Stockfish settings

Posted: Tue Jun 22, 2010 5:23 am
by mcostalba
Robert Houdart wrote: More games are required to make a final judgement.
Final judgment does not exist in chess engine testing. Sorry. ;-)

What does exist is a more or less reliable judgment. A result like Taner's does not give you the reliability that default is better than spoon with 99% probability, but perhaps it gives you the reliability that default is better than spoon with 80% probability. Is this enough?

Difficult question.

With an 'a posteriori' look, i.e. after knowing Lucena's result, we could say that if we had taken Taner's result as good we (probably) would have been lucky in that case, because 80% probability turned out to be enough. :-)
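
One common way to put a number on "default is better than spoon with X% probability" is the likelihood of superiority (LOS) under a normal approximation. This is only a sketch of that approach, not necessarily what was meant above. The 60% draw ratio used for Taner's match is a hypothetical value, because only the 512/488 total was posted; Lucena's figures are taken from the posted +220/=602/-177 result.

```python
import math

def los(wins: int, losses: int) -> float:
    """Approximate probability that the first engine is the stronger one.

    Normal approximation; draws cancel out and only wins and losses matter.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(decisive)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Lucena's match: default settings won 220 and lost 177.
print(f"Lucena: {los(220, 177):.3f}")

# Taner's match: 512-488 implies wins - losses = 24.
# ASSUMPTION: 60% draws, i.e. 212 wins and 188 losses (not stated in the thread).
print(f"Taner (assumed 60% draws): {los(212, 188):.3f}")
```

With these inputs Lucena's match comes out at roughly 98%, while Taner's lands in the high 80s under the assumed draw ratio and closer to 78% if there had been no draws at all, which is in the neighbourhood of the "perhaps 80%" estimate.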

Re: Stockfish settings

Posted: Tue Jun 22, 2010 5:36 am
by mcostalba
Taner Altinsoy wrote: Do you think amateurs like us fiddling with settings have any real chance to come up with something really better than default?
Theoretically it is possible, but it could quickly become frustrating because those parameters are already tuned, and the chance of finding something better with an almost random choice of values is very low.

Re: Stockfish settings

Posted: Tue Jun 22, 2010 9:51 am
by Taner Altinsoy
mcostalba wrote:
Taner Altinsoy wrote: Do you think amateurs like us fiddling with settings have any real chance to come up with something really better than default?
Theoretically it is possible, but it could quickly become frustrating because those parameters are already tuned, and the chance of finding something better with an almost random choice of values is very low.
Thank you, that is fair and clear enough :). I will still keep searching tho.

Taner