Stockfish settings
-
- Posts: 160
- Joined: Thu Jun 10, 2010 2:14 am
- Real Name: Luis Smith
Re: Stockfish settings
I understand; however, I don't think testing it vs. itself is optimal. I think testing vs. a wide range of opponents should do much better. Even then, testing with ponder=off isn't optimal and can give some skewed results. IMHO the optimal testing conditions would be 2 different computers vs. a wide range of opponents.
Since this is not possible for most, myself included, I just have to make do. =O)
However, do keep in mind that at 5/0 time controls it did better than the default settings against Rybka. I think after this is done I will run a test vs. Rybka 4, Naum, and against the default settings.
The 5/0 results:
Stockfishtest13 2010
Deep Rybka 4 w32 - Stockfish 1.7.1 JA SPOON 26.0 - 24.0 +20/=12/-18 52.00%
Deep Rybka 4 w32 - Stockfish 1.7.1 JA Default 28.0 - 22.0 +23/=10/-17 56.00%
Stockfishtest14 2010
Deep Rybka 4 w32 - Stockfish 1.7.1 JA Default 70.5 - 49.5 +43/=55/-22 58.75%
Deep Rybka 4 w32 - Stockfish 1.7.1 JA SPOON 68.5 - 51.5 +49/=39/-32 57.08%
Re: Stockfish settings
Yes, I agree, the best test conditions are the ones most similar to actual real use of the engine.
But because a test should also be reliable (read: you need many games) and because we normally don't have unlimited CPU and time resources, we have to accept a compromise driven by experience and judgment.
I agree self-play is not always a perfect picture of reality, but it has two advantages:
1) If one version is stronger than another in self-play, then "very probably" it is also stronger against an engine pool, although it is impossible to say how much stronger it is in the second case given the result of the first.
2) Self-play is the most efficient in terms of number of games played. Playing the same number of games individually against each engine in a pool requires many more.
For instance, if you want to test a new SF setting against Rybka, you first need to test the original version and then repeat the test with the modified version. That means double the testing time compared with a simple self-play test.
-
- Posts: 47
- Joined: Thu Jun 10, 2010 9:43 am
- Real Name: Taner Altinsoy
Re: Stockfish settings
Ok, 1000 1-minute games completed. The default setting won against spoon by 512/488 (51.2% / 48.8%), which equates to about 8 Elo. So simply, there's no spoon.
A question to developers. Do you think amateurs like us fiddling with settings have any real chance to come up with something really better than default?
regards,
Taner
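For anyone who wants to check the score-to-Elo conversion, here is a minimal sketch in Python. It assumes the usual logistic Elo model; the function name `elo_diff` is made up for illustration and is not taken from any testing tool.

```python
import math

def elo_diff(score: float) -> float:
    """Elo difference implied by a score fraction (logistic Elo model)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))

# 512 points out of 1000 games, as in the match above.
print(round(elo_diff(512 / 1000), 1))
```

With 512/1000 this gives a little over 8 Elo, in line with the figure quoted above.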
- Robert Houdart
- Posts: 180
- Joined: Thu Jun 10, 2010 4:55 pm
Re: Stockfish settings
Sure, as long as you make serious tests under well-controlled conditions, play enough games (at least 1000), and are aware of the statistical relevance of your results.
For example, your 51.2% result after 1000 games is not very relevant: the standard deviation of a 1000-game match lies somewhere between 1% and 1.5%, meaning that you could very easily obtain 51.2% with two engines of identical strength. More games are required to make a final judgement.
Robert
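The 1% to 1.5% range can be reproduced with a short sketch. It assumes independent games scored 1, 0.5 or 0, and the draw ratios used below (0% and 60%) are illustrative assumptions, not measured values; `score_std` is a hypothetical helper name.

```python
import math

def score_std(games: int, draw_ratio: float, score: float = 0.5) -> float:
    """Standard deviation of the match score fraction after `games` games."""
    win = score - draw_ratio / 2.0          # win probability implied by score and draws
    loss = 1.0 - win - draw_ratio
    if win < 0.0 or loss < 0.0:
        raise ValueError("draw_ratio is inconsistent with score")
    # Per-game variance: E[x^2] - E[x]^2 with x in {1, 0.5, 0}.
    variance = win + draw_ratio / 4.0 - score ** 2
    return math.sqrt(variance / games)

for draws in (0.0, 0.6):
    print(f"draw ratio {draws:.0%}: std = {score_std(1000, draws):.2%}")
```

More draws mean a smaller spread, so the same score margin is more meaningful in a draw-heavy match than in a decisive one.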
-
- Posts: 160
- Joined: Thu Jun 10, 2010 2:14 am
- Real Name: Luis Smith
Re: Stockfish settings
Thanks Robert,
How about this?
Stockfishtest16-1 2010
1 Stockfish 1.7.1 JA +220/=602/-177 52.15% 521.0/999
2 Stockfish 1.7.1 JA SPOON +177/=602/-220 47.85% 478.0/999
1 minute games of course...
- Robert Houdart
- Posts: 180
- Joined: Thu Jun 10, 2010 4:55 pm
Re: Stockfish settings
52.1% with 1000 games is a lot more significant; I think the Stockfish team will reject the proposed change.
Robert
-
- Posts: 160
- Joined: Thu Jun 10, 2010 2:14 am
- Real Name: Luis Smith
Re: Stockfish settings
Yes indeed Robert, back to the ole' drawing board...=O(
Re: Stockfish settings
Robert Houdart wrote: More games are required to make a final judgement.
Final judgment does not exist in chess engine testing. Sorry.
What does exist is a more or less reliable judgment. A result like Taner's does not tell you that default is better than spoon with 99% probability, but perhaps it tells you that default is better than spoon with 80% probability. Is this enough?
Difficult question.
With an 'a posteriori' look, i.e. after knowing Lucena's result, we could say that had we taken Taner's result as good we would (probably) have been lucky, because in that case an 80% probability turned out to be enough.
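One common way to put a number on this kind of "probability that A is better than B" is the likelihood-of-superiority (LOS) approximation. The sketch below is an assumption about how one might compute it, not something stated in this thread: it uses a normal approximation in which draws cancel out, and `los` is a hypothetical helper name. Taner's win/draw/loss split was not posted, so the example uses the +220/=602/-177 result posted above instead.

```python
import math

def los(wins: int, losses: int) -> float:
    """Approximate probability that the first engine is the stronger one."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5                      # no decisive games, no information
    z = (wins - losses) / math.sqrt(decisive)
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Default vs. SPOON, 999 games: +220 =602 -177 from default's side.
print(f"LOS = {los(220, 177):.1%}")
```

For that result the estimate comes out at roughly 98%, still short of certainty but well beyond a coin flip.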
Re: Stockfish settings
Taner Altinsoy wrote: Do you think amateurs like us fiddling with settings have any real chance to come up with something really better than default?
Theoretically it is possible, but it could quickly become frustrating because those parameters are tuned and the chance of finding something better with an almost random choice of values is very low.
-
- Posts: 47
- Joined: Thu Jun 10, 2010 9:43 am
- Real Name: Taner Altinsoy
Re: Stockfish settings
Taner Altinsoy wrote: Do you think amateurs like us fiddling with settings have any real chance to come up with something really better than default?
mcostalba wrote: Theoretically it is possible, but it could quickly become frustrating because those parameters are tuned and the chance of finding something better with an almost random choice of values is very low.
Thank you, that is fair and clear enough. I will still keep searching, though.
Taner