Rebel wrote:
BB+ wrote: CCRL has the following: [...]
CEGT has [...]
From the stipulations I understand that the intent of both CCRL and CEGT is to measure raw engine strength. While that choice has its merits, it does an injustice to the programmer's other efforts to add extra Elo points to his brainchild. Opening books, book learning and position learning are essential parts of a chess program; they are able to fix holes, adapt, and even avoid previously made mistakes.
IMO programs should be tested as a whole, as the programmer intended, and not be handicapped.
This has always been the policy of the SSDF.
Ed
I can understand that you, as a programmer, want all features that you have built into your chess program to be used in testing. However, that presents a problem for a rating list that is trying to test many engines. Comparing two engines by way of their head-to-head match does not really give an accurate idea of their relative strengths. So, a comparison of their results against other engines is also needed. If some of those other engines have the ability to learn, then accuracy in the comparison suffers.
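For what it's worth, here is a minimal sketch of why the common pool matters, assuming only the usual Elo expected-score formula; the pool ratings and scores below are invented purely for illustration, and this is not the actual method used by CCRL, CEGT or any other list:

def expected_score(own_rating, opp_rating):
    # Standard Elo expected score for a single game
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - own_rating) / 400.0))

def performance_rating(opponent_ratings, points, games):
    # Bisect for the rating whose expected total score against this fixed
    # pool matches the observed score (a crude performance estimate)
    lo, hi = 0.0, 4000.0
    games_per_opponent = games / len(opponent_ratings)
    for _ in range(60):
        mid = (lo + hi) / 2.0
        expected = sum(expected_score(mid, r) for r in opponent_ratings) * games_per_opponent
        if expected < points:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical gauntlet: four opponents, 100 games against each, 220/400 scored
pool = [2500, 2550, 2600, 2650]
print(round(performance_rating(pool, 220, 400)))   # roughly 2611 with these invented numbers

The whole calculation leans on the assumption that the opponents' strengths are fixed; if one of them is quietly getting stronger, two engines tested at different times are not really being measured against the same yardstick.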
Let's say that I play Yace against a gauntlet of engines, including ProDeo with book learning and position learning turned on. Then, some time later, I play Trace against the same gauntlet. The comparison between Yace and Trace suffers to some degree because the ProDeo that Trace played against is not the same ProDeo that Yace played against. ProDeo has evolved during the time between the two gauntlets (this is assuming ProDeo has played more games during that time interval).
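To put rough numbers on that (all of them invented), suppose Yace and Trace are in truth exactly equally strong, and ProDeo's learning has made it effectively 50 Elo stronger by the time Trace runs its gauntlet. Using the standard Elo expected-score formula:

def expected_score(own_rating, opp_rating):
    # Standard Elo expected score for a single game
    return 1.0 / (1.0 + 10.0 ** ((opp_rating - own_rating) / 400.0))

GAMES = 100
TRUE_STRENGTH = 2600             # assume Yace and Trace are identical in strength

prodeo_during_yace_run  = 2550   # ProDeo's effective strength in the first gauntlet
prodeo_during_trace_run = 2600   # hypothetically 50 Elo stronger after more learning

yace_points  = expected_score(TRUE_STRENGTH, prodeo_during_yace_run)  * GAMES
trace_points = expected_score(TRUE_STRENGTH, prodeo_during_trace_run) * GAMES

print(f"Yace  expected score vs ProDeo: {yace_points:.1f} / {GAMES}")   # about 57.1
print(f"Trace expected score vs ProDeo: {trace_points:.1f} / {GAMES}")  # exactly 50.0

From the ProDeo games alone the list would rate Trace roughly 50 Elo below Yace, even though the two engines are identical; spread over a full gauntlet the bias gets diluted, but it never averages out.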
A different method of testing is needed to show how an engine such as ProDeo improves over time.