Is Houdini 3 Tactical the Strongest Chess Engine?

lucasart
Posts: 201
Joined: Mon Dec 17, 2012 1:09 pm

Re: Is Houdini 3 Tactical the Strongest Chess Engine?

Post by lucasart » Tue Dec 25, 2012 3:51 am

Sorry for intervening again in this thread (I promised I wouldn't), but I really need to clarify a couple of points:
  • error bars aside: H3 is above H3 Tactical in CCRL 40/40 and CCRL 40/4. For example, I see H3 = H3 Tactical + 40 elo on CCRL 40/40, after 53 games for H3 tactical and 995 for H3.
  • for the error bars: one should look at the LOS matrix. For example in CCRL 40/40
    http://computerchess.org.uk/ccrl/4040/c ... +opponents
This means that, if we trust the BayesElo model (*), the probability of H3 being stronger than H3 Tactical is 88.1% in CCRL 40/40 conditions.

(*) and if there's no bias due to early stopping. This should be a valid assumption, as CCRL testers typically decide beforehand which match they want to play and do not stop matches early when the results look favorable or whatever.
"Talk is cheap. Show me the code." -- Linus Torvalds.

mwyoung
Posts: 43
Joined: Thu Jan 05, 2012 1:13 am
Real Name: Mark Young

Re: Is Houdini 3 Tactical the Strongest Chess Engine?

Post by mwyoung » Tue Dec 25, 2012 4:07 am

Adam Hair wrote:
mwyoung wrote:
Odeus37 wrote:The CCRL 40/40 rating you are referring to is irrelevant: only 53 games so far for Houdini Tactical...

http://www.computerchess.org.uk/ccrl/40 ... t_all.html

Code: Select all

Name                         Rating   Elo+    Elo-   Score   Average Opponent   Draws   Games	
Houdini 3 Tactical 64-bit    3252     +70     −68    65.1%   −98.0              50.9%   53
Houdini 3 64-bit             3221     +20     −19    64.4%   −106.7             41.3%   750

It is not irrelevant to CCRL, they publish the data and rating.

The results themselves are not irrelevant to the question of the thread: Is Houdini 3 Tactical the Strongest Chess Engine? You do not need hundreds of games to answer this question. You only need hundreds or thousands of games if the Elo ratings are very close, or if you are trying to get very exact Elo ratings.

What we are asking is: Is Houdini 3 Tactical the Strongest Chess Engine? If we don't care about the exact margin, the data is very relevant.

Even with only 50 games, since the Elo difference is not very close, you can say that it is much more likely that Houdini 3 Tactical is stronger at 40/40 than Houdini 3. We just can't have high confidence in the exact value after 50 games.
The part in bold is the key. We can not have high confidence in the exact value after 53 games. There is a 95% chance that if the test was repeated that the measured Elo of Houdini 3 64-bit Tactical would fall between 3184 and 3322 (unless there is a more appropriate Bayesian interpretation of these error bars). The 95% interval for Houdini 3 64-bit is contained completely inside that interval. It is very hard to say which version is stronger at this point. Furthermore, the games come from different contributors, which makes the real error larger than the reported error for several reasons.

To be honest, though I have not done the math to determine the necessary value, 31 Elo is not a big enough difference after 53 games to say with high confidence that Houdini 3 Tactical is stronger.
You are correct, all we can say right now is that it is more likely than not that Houdini 3 Tactical is stronger.


"Furthermore, the games come from different contributors, which makes the real error larger than the reported error for several reasons."

Again, what am I missing, Adam? These games don't come from different contributors. They are from the CCRL 40/40 rating pool, where both versions of Houdini 3 are being tested. The same goes for the 40/4 rating pool, where Houdini 3 is stronger.

We are comparing apples to apples... What am I missing? I don't understand why you and the other posters think that these are not rated games from the same CCRL 40/40 rating pool. They are not games from different rating pools or other rating lists.

Please explain what you mean by different contributors.

ThinkingALot
Posts: 144
Joined: Sun Jun 13, 2010 7:32 am

Re: Is Houdini 3 Tactical the Strongest Chess Engine?

Post by ThinkingALot » Tue Dec 25, 2012 7:28 am

mwyoung wrote:You are correct, all we can say is right now it is more likely than not that Houdini 3 Tactical is Stronger.
That's the conclusion one can draw from CCRL 40/40. But other tests suggest the Tactical version to be weaker by a fair margin.
mwyoung wrote:These games don't come from different contributors. They are from CCRL 40/40 rating pool.
They do come from different contributors and are compiled into a single rating list. These contributors use different hardware. Hence there's a systematic error in addition to the statistical error.

Adam Hair
Posts: 104
Joined: Fri Jun 11, 2010 4:29 am
Real Name: Adam Hair
Contact:

Re: Is Houdini 3 Tactical the Strongest Chess Engine?

Post by Adam Hair » Tue Dec 25, 2012 3:34 pm

mwyoung wrote:These games don't come from different contributors. They are from CCRL 40/40 rating pool.
ThinkingALot wrote:They do come from different contributors and are compiled into a single rating list. These contributors use different hardware. Hence there's a systematic error in addition to the statistical error.
Exactly. I personally contributed 219 of the games used to derive Houdini 3 64-bit's rating for the 40/40 list and none for Houdini 3 Tactical. As ThinkingALot stated, there is systematic error when combining the results from different computer systems together. Even though we use a benchmark to determine the amount of time needed per 40 moves for each computer, this is an inexact method of calibration.

Overall, I think that we tend to produce reasonably accurate Elo estimates applicable to the random computer chess enthusiast, but this does cause a lack in precision.

Adam Hair
Posts: 104
Joined: Fri Jun 11, 2010 4:29 am
Real Name: Adam Hair

Re: Is Houdini 3 Tactical the Strongest Chess Engine?

Post by Adam Hair » Tue Dec 25, 2012 3:45 pm

lucasart wrote:Sorry for intervening again in this thread (I promised I wouldn't), but I really need to clarify a couple of points:
  • error bars aside: H3 is above H3 Tactical in CCRL 40/40 and CCRL 40/4. For example, I see H3 = H3 Tactical + 40 elo on CCRL 40/40, after 53 games for H3 tactical and 995 for H3.
  • for the error bars: one should look at the LOS matrix. For example in CCRL 40/40
    http://computerchess.org.uk/ccrl/4040/c ... +opponents
This means that, if we trust the BayesElo model (*), the probability of H3 being stronger than H3 Tactical is 88.1% in CCRL 40/40 conditions.
You are comparing H3 4 CPU to H3 Tactical 1 CPU. H3 1 CPU has played 750 games and has a calculated LOS of 19.6% over H3 Tactical. Given the CCRL testing methods, I think that LOS is virtually meaningless after only 53 games.
lucasart wrote: (*) and if there's no bias due to early stopping. This should be a valid assumption, as CCRL testers typically decide beforehand which match they want to play and do not stop matches early when the results look favorable or whatever.
This is correct. The only time a test stops early is due to a bug or due to human error in setting up the test.
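
As a rough cross-check of the 19.6% figure, here is a minimal Python sketch. It assumes the two published ratings are independent, normally distributed estimates and that the Elo+/Elo- columns are 95% half-widths (averaged to 19.5 and 69 here). That is a simplification and not how BayesElo builds its LOS matrix; the function name `los_normal_approx` is made up for illustration.

```python
import math

def los_normal_approx(rating_a, err_a, rating_b, err_b, z95=1.96):
    """Approximate P(A is stronger than B) from two ratings.

    err_a and err_b are 95% half-widths (assumed symmetric).
    Assumes independent normal estimates, which ignores any
    correlation between the two ratings.
    """
    sigma_a = err_a / z95                       # 95% half-width -> std dev
    sigma_b = err_b / z95
    sigma_diff = math.hypot(sigma_a, sigma_b)   # std dev of the difference
    z = (rating_a - rating_b) / sigma_diff
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal CDF at z

# Figures from the CCRL 40/40 table quoted in this thread.
los_h3_over_tactical = los_normal_approx(3221, 19.5, 3252, 69.0)
print(f"LOS of H3 over H3 Tactical: {los_h3_over_tactical:.3f}")
```

Under these assumptions the result lands in the neighbourhood of 20%, which is consistent with the LOS quoted above, but it says nothing about the systematic error discussed earlier.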

mwyoung
Posts: 43
Joined: Thu Jan 05, 2012 1:13 am
Real Name: Mark Young

Re: Is Houdini 3 Tactical the Strongest Chess Engine?

Post by mwyoung » Tue Dec 25, 2012 4:34 pm

Adam Hair wrote:
mwyoung wrote:These games don't come from different contributors. They are from CCRL 40/40 rating pool.
ThinkingALot wrote:They do come from different contributors and are compiled into a single rating list. These contributors use different hardware. Hence there's a systematic error in addition to the statistical error.
Exactly. I personally contributed 219 of the games used to derive Houdini 3 64-bit's rating for the 40/40 list and none for Houdini 3 Tactical. As ThinkingALot stated, there is systematic error when combining the results from different computer systems together. Even though we use a benchmark to determine the amount of time needed per 40 moves for each computer, this is an inexact method of calibration.

Overall, I think that we tend to produce reasonably accurate Elo estimates applicable to the random computer chess enthusiast, but this does cause a lack in precision.
I understand this, but I don't know why this point was brought up in the testing of Houdini 3 and Houdini 3 Tactical, since the whole rating list uses this protocol to test every chess engine. That statement applies to every engine in the rating list.

Is the CCRL's rating list accurate, or is it not accurate when comparing engines in the same rating pool? If it is not accurate in this test, then the whole rating list is in question.

Adam Hair
Posts: 104
Joined: Fri Jun 11, 2010 4:29 am
Real Name: Adam Hair

Re: Is Houdini 3 Tactical the Strongest Chess Engine?

Post by Adam Hair » Wed Dec 26, 2012 4:02 am

mwyoung wrote:
Adam Hair wrote:
mwyoung wrote:These games don't come from different contributors. They are from CCRL 40/40 rating pool.
ThinkingALot wrote:They do come from different contributors and are compiled into a single rating list. These contributors use different hardware. Hence there's a systematic error in addition to the statistical error.
Exactly. I personally contributed 219 of the games used to derive Houdini 3 64-bit's rating for the 40/40 list and none for Houdini 3 Tactical. As ThinkingALot stated, there is systematic error when combining the results from different computer systems together. Even though we use a benchmark to determine the amount of time needed per 40 moves for each computer, this is an inexact method of calibration.

Overall, I think that we tend to produce reasonably accurate Elo estimates applicable to the random computer chess enthusiast, but this does cause a lack in precision.
I understand this, but I don't know why this was a point to bring up in the testing of Houdini 3 and Houdini 3 Tactical. Since the whole rating list uses this protocol to test every chess engine. That statement applies to ever engine in the rating list.

Is the CCRL's rating list accurate, or is it not accurate when comparing engines in the same rating pool? If it is not accurate in this test, then the whole rating list is in question.
It is accurate enough to make comparisons. The point is that, given the systematic error, the error bars should be larger than those given by Bayeselo, which are based on statistical error only. Inferences that could normally be made after N games are more suspect due to the systematic error. As more games come in from other contributors, the systematic errors tend to be negated (they average out). That has not happened yet with H3 Tactical. It does not mean you are wrong about H3 Tactical, just that more games from other people are needed.
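
To make the "systematic errors average out" argument concrete, here is a toy Python sketch. It assumes each contributor's hardware adds an independent, zero-mean rating bias with the same spread, that contributors supply equal numbers of games, and that the statistical and systematic parts combine in quadrature. The 20 Elo bias spread is a made-up number for illustration only, not a CCRL measurement, and `combined_error` is a hypothetical helper.

```python
import math

def combined_error(stat_err, bias_sd, contributors):
    """Toy model of total error with per-contributor bias.

    stat_err     -- statistical error reported by the rating tool
    bias_sd      -- assumed spread of one contributor's bias (hypothetical)
    contributors -- number of contributors with equal shares of games
    """
    if contributors < 1:
        raise ValueError("need at least one contributor")
    # Independent biases average out as 1/sqrt(number of contributors).
    sys_err = bias_sd / math.sqrt(contributors)
    # Independent error sources add in quadrature.
    return math.hypot(stat_err, sys_err)

# 69 Elo is the averaged error bar for H3 Tactical from the table;
# 20 Elo is an invented bias spread.
for k in (1, 2, 5, 10):
    print(k, round(combined_error(69.0, 20.0, k), 1))
```

In this model the total error is always larger than the statistical error alone, and it shrinks toward it as games arrive from more contributors.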

Don
Posts: 42
Joined: Thu Dec 30, 2010 12:28 am
Real Name: Don Dailey

Re: Is Houdini 3 Tactical the Strongest Chess Engine?

Post by Don » Wed Dec 26, 2012 3:38 pm

mwyoung wrote:
Odeus37 wrote:The CCRL 40/40 rating you are refering to is irrelevant : only 53 games yet for Houdini Tactical...

http://www.computerchess.org.uk/ccrl/40 ... t_all.html

Code: Select all

Name                         Rating   Elo+    Elo-   Score   Average Opponent   Draws   Games	
Houdini 3 Tactical 64-bit    3252     +70     −68    65.1%   −98.0              50.9%   53
Houdini 3 64-bit             3221     +20     −19    64.4%   −106.7             41.3%   750

It is not irrelevant to CCRL, they publish the data and rating.

The results themselves are not irrelevant to the question to the thread. Is Houdini 3 Tactical the Strongest Chess Engine? Since you do not need 100's of games to answer this question. You only need 100's or 1000's of games if the elo's are very close, or if you are trying to get very exact elo ratings.
Actually you need tens of thousands of games if the ELO's are close. For tactics you probably need thousands of positions unless the difference is very clear.

What we are asking Is Houdini 3 Tactical the Strongest Chess Engine? If we don't care by the exact amount, the data is very relevant.

Even with only 50 games and since the elo difference is not very close, you can say even after 50 games that is much more likely that Houdini 3 Tactical is stronger at 40/40 then Houdini 3. We just can't have high confidence in the exact value after 50 games.
I seriously doubt we can get a legitimate answer to the question of which program is best tactically. We have to define tactics, and we usually go by how a program does on some tactical test suite, the value of which I doubt. It's probably a good starting guess, however. In other words, it is hard to get a "number" that we can clearly agree means something.
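
The "how many games" question can be sketched numerically. The Python below assumes a direct head-to-head match (not games against a pool, as on CCRL), the standard logistic Elo expectation, independent games, a fixed draw ratio of 40%, and a 95% two-sided normal criterion. The function names and the default draw ratio are assumptions for illustration.

```python
import math

def expected_score(elo_diff):
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, draw_ratio=0.4, z=1.96):
    """Games needed before the score margin exceeds z standard errors."""
    if elo_diff <= 0:
        raise ValueError("elo_diff must be positive")
    s = expected_score(elo_diff)
    # Per-game variance of a result in {1, 0.5, 0} with mean s
    # and draw probability draw_ratio.
    var = s * (1.0 - s) - draw_ratio / 4.0
    margin = s - 0.5
    return math.ceil(z * z * var / (margin * margin))

for diff in (5, 10, 31, 100):
    print(diff, games_needed(diff))
```

Under these assumptions a 5 Elo gap needs more than ten thousand games, while a gap of about 31 Elo needs a few hundred, which is roughly the range both sides of this argument are pointing at.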

mwyoung
Posts: 43
Joined: Thu Jan 05, 2012 1:13 am
Real Name: Mark Young

Re: Is Houdini 3 Tactical the Strongest Chess Engine?

Post by mwyoung » Wed Dec 26, 2012 9:33 pm

Don wrote:
mwyoung wrote:
Odeus37 wrote:The CCRL 40/40 rating you are refering to is irrelevant : only 53 games yet for Houdini Tactical...

http://www.computerchess.org.uk/ccrl/40 ... t_all.html

Code: Select all

Name                         Rating   Elo+    Elo-   Score   Average Opponent   Draws   Games	
Houdini 3 Tactical 64-bit    3252     +70     −68    65.1%   −98.0              50.9%   53
Houdini 3 64-bit             3221     +20     −19    64.4%   −106.7             41.3%   750

It is not irrelevant to CCRL, they publish the data and rating.

The results themselves are not irrelevant to the question to the thread. Is Houdini 3 Tactical the Strongest Chess Engine? Since you do not need 100's of games to answer this question. You only need 100's or 1000's of games if the elo's are very close, or if you are trying to get very exact elo ratings.
Actually you need tens of thousands of games if the ELO's are close. For tactics you probably need thousands of positions unless the difference is very clear.

What we are asking Is Houdini 3 Tactical the Strongest Chess Engine? If we don't care by the exact amount, the data is very relevant.

Even with only 50 games and since the elo difference is not very close, you can say even after 50 games that is much more likely that Houdini 3 Tactical is stronger at 40/40 then Houdini 3. We just can't have high confidence in the exact value after 50 games.
I seriously doubt we can get a legitimate answer to the question of which program is best tactically. We have to define tactics and we usually go by how it does on some tactical test suite and the value of that I have doubt in. It's probably a good starting guess however. In other words it is hard to get a "number" that we can clearly agree means something.
No one is asking which program is best tactically, whatever people think that means.

We are asking which program plays the overall strongest chess game at longer time controls.

CCRL rating data right now suggest that it is more likely than not that Houdini 3 Tactical plays the stronger game at the longer time control of 40/40 under the CCRL testing protocol, and that is all that can be said after 53 games.

Statistics is not well understood by many here. Some think you cannot say anything about the rating data unless many hundreds or thousands of games have been played.

What is important is the question you are trying to answer with the existing data.

An extreme example of this would be if I wanted to know whether program A is stronger than program B, and I don't care by how much. I can get an answer to this question in as few as 7 games with very good confidence, for example if program A beats program B 7-0. This is a statistically meaningful result for answering that question, and in this example I can say with high confidence that program A is the stronger program.

But many here only see that it takes many hundreds or thousands of games to answer any rating questions.

In the human chess rating world this would be laughable. No one goes around saying GM Carlsen is not proven to be the best rated player because he has not played many thousands of games. And I certainly don't go around saying I could be the strongest player in the world because I have not played GM Carlsen enough games yet. That is not how the rating system was intended to work, with everything at 99.99...% confidence :)


"Fun fact: Over 10 million chess games were played for the development and tuning of Houdini 3!"

Except for maybe Robert Houdart who tested Houdini 3 with over 10 million games :)
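
The 7-0 example can be checked with a short binomial calculation. This Python sketch assumes the null hypothesis that both programs are equally strong, that every game is decisive (draws are ignored, which is a strong simplification for engine chess), and that games are independent. `p_at_least` is a hypothetical helper name.

```python
import math

def p_at_least(wins, games, p=0.5):
    """P(at least `wins` wins in `games` games) for win probability p."""
    return sum(
        math.comb(games, k) * p**k * (1.0 - p) ** (games - k)
        for k in range(wins, games + 1)
    )

# Chance of a 7-0 sweep between equal programs, decisive games only.
p_sweep = p_at_least(7, 7)
print(f"P(7-0 | equal strength) = {p_sweep:.4f}")
```

Under these assumptions a 7-0 sweep between equal programs happens less than 1% of the time, so it is strong evidence about which program is stronger, though it says very little about the size of the gap.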

Don
Posts: 42
Joined: Thu Dec 30, 2010 12:28 am
Real Name: Don Dailey

Re: Is Houdini 3 Tactical the Strongest Chess Engine?

Post by Don » Wed Dec 26, 2012 9:58 pm

mwyoung wrote:
Don wrote:
mwyoung wrote:
Odeus37 wrote:The CCRL 40/40 rating you are refering to is irrelevant : only 53 games yet for Houdini Tactical...

http://www.computerchess.org.uk/ccrl/40 ... t_all.html

Code: Select all

Name                         Rating   Elo+    Elo-   Score   Average Opponent   Draws   Games	
Houdini 3 Tactical 64-bit    3252     +70     −68    65.1%   −98.0              50.9%   53
Houdini 3 64-bit             3221     +20     −19    64.4%   −106.7             41.3%   750

It is not irrelevant to CCRL, they publish the data and rating.

The results themselves are not irrelevant to the question to the thread. Is Houdini 3 Tactical the Strongest Chess Engine? Since you do not need 100's of games to answer this question. You only need 100's or 1000's of games if the elo's are very close, or if you are trying to get very exact elo ratings.
Actually you need tens of thousands of games if the ELO's are close. For tactics you probably need thousands of positions unless the difference is very clear.

What we are asking Is Houdini 3 Tactical the Strongest Chess Engine? If we don't care by the exact amount, the data is very relevant.

Even with only 50 games and since the elo difference is not very close, you can say even after 50 games that is much more likely that Houdini 3 Tactical is stronger at 40/40 then Houdini 3. We just can't have high confidence in the exact value after 50 games.
I seriously doubt we can get a legitimate answer to the question of which program is best tactically. We have to define tactics and we usually go by how it does on some tactical test suite and the value of that I have doubt in. It's probably a good starting guess however. In other words it is hard to get a "number" that we can clearly agree means something.
No one is asking which program is best tactically, for what ever people think that means.

We are asking which program plays the overall strongest chess game at longer time controls.

CCRL Rating data right now suggest it is more likely then not, and that is all that can be said after 53 games. That Houdini 3 Tactical may play the stronger game at longer time controls of 40/40. Using the CCRL Rating testing protocol.

The understanding of statistics is not well understood by many here. Some think you not say anything about the rating data unless many hundreds or thousands of games have been played.

What is important is the question you are trying to answer with the existing data.

And extreme example of this would be if I wanted to know if program A is stronger then program B. And I don't care by how much. I can get an answer to this question in as little as 7 games with very good confidence. If program A beats program B 7-0 for example. This is a statistically meaningful result to answering that question. And in this example I can say with high confidence that program A is the stronger program.

But many here only see that it takes many hundreds or thousands of games to answer any rating questions.
Which is why I said, "Actually you need tens of thousands of games if the ELO's are close."

In the human chess rating world this we be laughable, no one goes around saying GM Carlsen is not proven to be the best rated player because he has not played many thousands of games. And I certainly don't go around saying I could be the strongest player in the world, because I have not played GM Carlsen enough games yet. That is not how the ratings system was intended to work. With everything at 99.99...% confidence :)
I don't want to burst your bubble here, but Carlsen is not best with any serious confidence. It's clear statistically that he is among the very top, but that is all you can say. That doesn't mean that he is NOT the best player in the world, it is just to say we cannot say that with a lot of confidence. Few people will deny he is best, but that is because most people go by the hype. He has had a nice record, winning tournaments in grand style (and all sorts of noise is made over this), but there are still 2 or 3 players within 20 ELO of him. Statistically you just cannot say. Once you appear on the FIDE list as rank number 1 and hold it for a little time, then it's assumed that you "must" be the strongest player in the world. But statistically that is a nonsense claim. It's sort of like the winner of Wimbledon or the Super Bowl. They are clearly declared the "best of the best" and it's believed, but in reality to win these things you have to be among the best and have some good fortune too, because the second or third guy or team had a real shot too.

Below Carlsen there is a sudden drop after the top 3 or 4. So it's VERY clear he is in the top 5. It's almost a sure thing he belongs in the top 3 and there is a "good chance" he deserves to be called the best player. But 1 or 2 tournaments could easily turn this around and he could swap places with Anand for example. We are simply victims of human perception and when things happen slowly we assign permanence to them, especially when they are hyped and hailed as being true. If there are 2 tournaments in a row and Carlsen wins them both over Anand it's easy to say that Carlsen proved his superiority but the best we can say is that yes, in these 2 events he got better results.

I would mention that Carlsen is still improving but he is not world champion. Does that prove he is not best? Nope, for the same reason.

"Fun fact: Over 10 million chess games were played for the development and tuning of Houdini 3!"

Except for maybe Robert Houdart who tested Houdini 3 with over 10 million games :)
