Creating a new (and independent) rating list
Re: Creating a new (and independent) rating list
I have an old Athlon XP 2200 which will go towards this.
Re: Creating a new (and independent) rating list
I'm not strong enough in statistics to answer my own question, but here it is: it appears to me that when we use a "test suite" (the engines play the same position twice, with sides reversed), we should use a multinomial distribution (i.e. the score per position will be 0/2, 0.5/2, 1/2, 1.5/2 or 2/2) to calculate the "performance". As far as I understand, neither Bayeselo nor EloStats does that; instead, both methods treat each game as a new, independent position. Would that influence the statistical results by much?
Re: Creating a new (and independent) rating list
I thought about exactly this question a while back, and through sufficient hand-waving convinced myself that it increased the size of the error bars, but not by too much. I can try to replicate the thinking if you want. The idea is that instead of taking the variance from a {0, 0.5, 1} set on every data point, we now take it from a {0, 0.5, 1, 1.5, 2} set on pairs of data points.
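For illustration, here is a minimal sketch of that comparison on made-up results (this is not how Bayeselo or EloStats actually compute their intervals; the pair tallies are invented just to have some numbers):

Code:
import math, statistics

# Hypothetical results from 50 positions, each played twice (colours reversed):
# every tuple is engine A's score in the two games of one pair.
pair_results = ([(1, 1)] * 15 + [(0, 0)] * 10 +
                [(1, 0)] * 5 + [(0, 1)] * 5 + [(0.5, 0.5)] * 15)

games = [g for pair in pair_results for g in pair]    # flattened: 100 games
pair_means = [(a + b) / 2 for a, b in pair_results]   # 50 per-pair scores

mean = statistics.fmean(games)                        # same under either view

# (a) Every game treated as an independent draw from {0, 0.5, 1}.
se_indep = math.sqrt(statistics.pvariance(games) / len(games))

# (b) Pairs treated as independent draws from {0, 0.25, 0.5, 0.75, 1} per game,
#     i.e. the multinomial view of the 0/2 .. 2/2 outcomes.
se_paired = math.sqrt(statistics.pvariance(pair_means) / len(pair_means))

print(f"mean score {mean:.3f}")
print(f"95% error bar, games independent : +/- {1.96 * se_indep:.3f}")
print(f"95% error bar, pairs (multinomial): +/- {1.96 * se_paired:.3f}")

With these invented numbers the paired view gives the wider interval, because games from the same position tend to go the same way (lots of 2-0 and 0-2 pairs); if the two games of a pair were truly independent, the two estimates would coincide.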
Re: Creating a new (and independent) rating list
Thank you for your quick answer, BB+.
I'm not a programmer, but in my "testing" (for fun, of course) I used to use 50-position test suites. I noticed two things:
1- The multinomial thing, which you just said you don't think matters much (thanks again for the quick answer).
2- The test suites themselves. It is not uncommon (in my experience) to see engine A score 59-41 against engine B on one suite, but then 48-52 against the same engine B on another. That is another source of variance, I believe: we pick, say, 50 positions out of thousands of millions, so there must be some noise. Do you think that would matter much for the reported results (and "error bars") from the two most common algorithms? I would think so... A lot of amateur testers like me test this way; any thoughts on that?
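To get a feel for how much of that swing can come just from which 50 positions end up in the suite, here is a toy simulation; every number in it (book size, bias spread, draw rate) is a pure guess, made up only to illustrate the mechanism:

Code:
import random, statistics
random.seed(1)

BOOK_SIZE = 100_000     # hypothetical pool of opening positions
SUITE_SIZE = 50
N_SUITES = 200          # how many different 50-position suites to try

# Each opening gets its own expected score for engine A: 0.50 on average,
# with some openings suiting one engine better (the 0.08 spread is a guess).
book = [random.gauss(0.50, 0.08) for _ in range(BOOK_SIZE)]

def play(p):
    """Crude game model: 40% draws, otherwise a win for A with probability p."""
    if random.random() < 0.4:
        return 0.5
    return 1.0 if random.random() < p else 0.0

observed, built_in = [], []
for _ in range(N_SUITES):
    suite = random.sample(book, SUITE_SIZE)
    # expected score of this suite under the model above: 0.4*0.5 + 0.6*p per game
    built_in.append(statistics.fmean(0.2 + 0.6 * p for p in suite))
    # the score actually observed over 100 games (each position, both colours)
    points = sum(play(p) for p in suite for _ in range(2))
    observed.append(points / (2 * SUITE_SIZE))

print(f"observed 100-game scores : sd {statistics.pstdev(observed):.3f}")
print(f"suite built-in bias alone: sd {statistics.pstdev(built_in):.3f}")

The suite's built-in bias is a second layer of spread, on top of the ordinary game-to-game noise, that the usual error bars know nothing about; how big it is in practice depends entirely on how biased the chosen positions really are.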
Re: Creating a new (and independent) rating list
After a think,
I guess the answer is that the error bars refer to the particular test suite used. I once created a 4-ply, 50-position suite (i.e. 1.e4 e5 2.f4 exf4, 1.e4 e5 2.f4 Qh4+, 1.d4 f5 2.c4 g6, etc.) and noticed significant variation in the results compared to the usual test suites (unfortunately I no longer have the data, since my OS crashed).
I'd be curious whether someone has the time to compare results from at least 5 engines (round robin) across very different suites and post the results.
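If someone does run that, the tabulation side is easy enough; something along these lines (engine names, suites and scores below are placeholders) would show the per-pair spread in Elo between suites:

Code:
import math

def elo_diff(score):
    """Elo difference implied by a match score strictly between 0 and 1."""
    return -400 * math.log10(1 / score - 1)

# results[suite][(A, B)] = A's score against B on that suite (placeholders)
results = {
    "suite_1": {("A", "B"): 0.59, ("A", "C"): 0.52},
    "suite_2": {("A", "B"): 0.48, ("A", "C"): 0.55},
}

pairs = {p for per_suite in results.values() for p in per_suite}
for pair in sorted(pairs):
    diffs = [elo_diff(per_suite[pair]) for per_suite in results.values()]
    print(pair, [f"{d:+.0f}" for d in diffs],
          f"spread {max(diffs) - min(diffs):.0f} Elo")

For scale, the 59% vs 48% swing mentioned above is worth roughly 75-80 Elo, which is why it looks so dramatic even when a fair chunk of it is just 100-game noise.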
Cheers,