Creating a new (and independent) rating list
Re: Creating a new (and independent) rating list
I have an old Athlon XP 2200 which will go towards this.
Re: Creating a new (and independent) rating list
I'm not strong enough in statistics to answer my own question, but here it is: it appears to me that when we use a "test suite" (the engines play the same position twice, with sides reversed), we should use a multinomial distribution (i.e. the score per position will be 0/2, 0.5/2, 1/2, 1.5/2 or 2/2) to calculate the "performance". As far as I understand, neither Bayeselo nor EloStats does that; instead, both methods treat each game as a new, independent position. Would that influence the statistical results by much?
Re: Creating a new (and independent) rating list
I thought about exactly this question a while back, and through sufficient hand-waving convinced myself that it increased the size of the error bars, but not by too much. I can try to replicate the thinking if you want. The idea is that instead of taking the variance from a {0, 0.5, 1} set on every data point, we now take it from a {0, 0.5, 1, 1.5, 2} set on pairs of data points.
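For illustration, here is a minimal sketch of that comparison on made-up results (this is not how Bayeselo or EloStats actually compute their intervals; the pair tallies are invented just to have some numbers):

Code:
import math, statistics

# Hypothetical results from 50 positions, each played twice (colours reversed):
# every tuple is engine A's score in the two games of one pair.
pair_results = ([(1, 1)] * 15 + [(0, 0)] * 10 +
                [(1, 0)] * 5 + [(0, 1)] * 5 + [(0.5, 0.5)] * 15)

games = [g for pair in pair_results for g in pair]    # flattened: 100 games
pair_means = [(a + b) / 2 for a, b in pair_results]   # 50 per-pair scores

mean = statistics.fmean(games)                        # same under either view

# (a) Every game treated as an independent draw from {0, 0.5, 1}.
se_indep = math.sqrt(statistics.pvariance(games) / len(games))

# (b) Pairs treated as independent draws from {0, 0.25, 0.5, 0.75, 1} per game,
#     i.e. the multinomial view of the 0/2 .. 2/2 outcomes.
se_paired = math.sqrt(statistics.pvariance(pair_means) / len(pair_means))

print(f"mean score {mean:.3f}")
print(f"95% error bar, games independent : +/- {1.96 * se_indep:.3f}")
print(f"95% error bar, pairs (multinomial): +/- {1.96 * se_paired:.3f}")

With these invented numbers the paired view gives the wider interval, because games from the same position tend to go the same way (lots of 2-0 and 0-2 pairs); if the two games of a pair were truly independent, the two estimates would coincide.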
Re: Creating a new (and independent) rating list
Thank you for your quick answer, BB+.
I'm not a programmer, but in my "testing" (for fun, of course) I used to use 50-position test suites. I noticed two things:
1- The multinomial thing, which you just said you don't think matters much (thanks again for the quick answer).
2- The test suites themselves. It is not uncommon (in my experience) to see engine A score 59-41 against engine B on one suite, but then 48-52 against the same engine B on another. That is another source of variance, I believe: we pick, say, 50 positions out of thousands of millions, so there must be some noise. Do you think that would matter much for the reported results (and "error bars") from the two most common algorithms? I would think so... A lot of amateur testers like me test this way; any thoughts on that?
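To get a feel for how much of that swing can come just from which 50 positions end up in the suite, here is a toy simulation; every number in it (book size, bias spread, draw rate) is a pure guess, made up only to illustrate the mechanism:

Code:
import random, statistics
random.seed(1)

BOOK_SIZE = 100_000     # hypothetical pool of opening positions
SUITE_SIZE = 50
N_SUITES = 200          # how many different 50-position suites to try

# Each opening gets its own expected score for engine A: 0.50 on average,
# with some openings suiting one engine better (the 0.08 spread is a guess).
book = [random.gauss(0.50, 0.08) for _ in range(BOOK_SIZE)]

def play(p):
    """Crude game model: 40% draws, otherwise a win for A with probability p."""
    if random.random() < 0.4:
        return 0.5
    return 1.0 if random.random() < p else 0.0

observed, built_in = [], []
for _ in range(N_SUITES):
    suite = random.sample(book, SUITE_SIZE)
    # expected score of this suite under the model above: 0.4*0.5 + 0.6*p per game
    built_in.append(statistics.fmean(0.2 + 0.6 * p for p in suite))
    # the score actually observed over 100 games (each position, both colours)
    points = sum(play(p) for p in suite for _ in range(2))
    observed.append(points / (2 * SUITE_SIZE))

print(f"observed 100-game scores : sd {statistics.pstdev(observed):.3f}")
print(f"suite built-in bias alone: sd {statistics.pstdev(built_in):.3f}")

The suite's built-in bias is a second layer of spread, on top of the ordinary game-to-game noise, that the usual error bars know nothing about; how big it is in practice depends entirely on how biased the chosen positions really are.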
Re: Creating a new (and independent) rating list
After a think,
I guess the answer is that the error bars refer to the particular test suite used. I once created a 4-ply, 50-position suite (i.e. 1.e4 e5 2.f4 exf4, 1.e4 e5 2.f4 Qh4+, 1.d4 f5 2.c4 g6, etc.) and noticed significant variation in the results compared to the usual test suites (unfortunately I no longer have the data, since my OS crashed).
I'd be curious whether someone has the time to compare results from at least 5 engines (round robin) across very different suites and post the results.
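If someone does run that, the tabulation side is easy enough; something along these lines (engine names, suites and scores below are placeholders) would show the per-pair spread in Elo between suites:

Code:
import math

def elo_diff(score):
    """Elo difference implied by a match score strictly between 0 and 1."""
    return -400 * math.log10(1 / score - 1)

# results[suite][(A, B)] = A's score against B on that suite (placeholders)
results = {
    "suite_1": {("A", "B"): 0.59, ("A", "C"): 0.52},
    "suite_2": {("A", "B"): 0.48, ("A", "C"): 0.55},
}

pairs = {p for per_suite in results.values() for p in per_suite}
for pair in sorted(pairs):
    diffs = [elo_diff(per_suite[pair]) for per_suite in results.values()]
    print(pair, [f"{d:+.0f}" for d in diffs],
          f"spread {max(diffs) - min(diffs):.0f} Elo")

For scale, the 59% vs 48% swing mentioned above is worth roughly 75-80 Elo, which is why it looks so dramatic even when a fair chunk of it is just 100-game noise.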
Cheers,