OpenChess

Posted: **Mon Sep 12, 2011 8:32 am**

Over the past couple of days I ran a tournament of 50 games (5 minutes per game, ponder on) between Houdini 1.5a, Houdini Pro 2.0, Stockfish 2.11 and Critter 2.1.

Houdini Pro 2.0 won the majority of its games against Critter and Stockfish, with a couple of draws (less than 10%).

Critter won its games against Stockfish, apart from a couple of draws. (I always thought Stockfish was stronger than that.)

Houdini 1.5a also won the majority of its games against Critter and Stockfish, with a couple of draws (less than 20%), but with one loss each against Critter and Stockfish.

In the 5-minute tournament games, Houdini Pro 2.0 had a final score of 63% versus 37% for Houdini 1.5a. Thus the new Houdini Pro 2.0 is clearly stronger than Houdini 1.5a (32-bit tested).

However, I then tested Houdini 1.5a against Houdini Pro 2.0 in two tournaments of 30 games each, one at 20 minutes per game and one at 30 minutes per game (ponder on).

I then got a shock! In both the 20-minute and the 30-minute games, Houdini 1.5a won the tournament!! In the 20-minute games, Houdini 1.5a won with a score of 61% versus 39% for Houdini Pro 2.0.
In the 30-minute games, Houdini 1.5a won with a score of 63% versus 37% for Houdini Pro 2.0. Thus the longer the games, the "stronger" Houdini 1.5a became, or so it looks.

These results really surprised me. Houdini Pro 2.0 clearly showed its dominance over Houdini 1.5a in the shorter games, but the tables suddenly turn as the games get longer!

I ran my tests on a PC with a dual-core CPU and on another with a quad-core CPU, and got very similar results on both.

I would love to hear from other testers out there what they find. Is something strange happening with Houdini Pro 2.0 versus Houdini 1.5a in the longer games?

It looks to me as if Robert Houdart changed something in the tuning that makes it less strong in longer games, compared to his older 1.5a version?

Posted: **Tue Sep 13, 2011 12:36 am**

Hello Pieter,

From my tests with many games I expect Houdini 2 to be about 25 Elo stronger than H1.5.
By playing only 30 games your error margin exceeds 100 Elo.
This means that from your results there are no real statistical grounds to claim that H2.0 is stronger than H1.5 at fast TC, nor the opposite at long TC. You simply haven't played enough games from a sufficient number of openings.
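
As a rough illustration of where error margins of this size come from, here is a minimal sketch. It is not Robert's actual method: it assumes two near-equal engines, the standard logistic Elo formula, a normal approximation, and a draw rate of 40% (the draw rate and the function names are assumptions, not figures from this thread).

```python
import math

def elo_from_score(score: float) -> float:
    """Convert a score fraction (0 < score < 1) to an Elo difference (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_margin_95(games: int, draw_rate: float = 0.4) -> float:
    """Approximate 95% Elo error margin for a match between near-equal engines.

    draw_rate is an assumed value, not a number taken from the thread.
    """
    # Per-game variance of the score around 0.5: wins and losses each
    # deviate by 0.5 from the mean, draws do not deviate at all.
    variance = (1.0 - draw_rate) / 4.0
    half_width = 1.96 * math.sqrt(variance / games)
    if half_width >= 0.5:
        raise ValueError("too few games for this approximation")
    return elo_from_score(0.5 + half_width)

if __name__ == "__main__":
    for n in (30, 100, 500):
        print(n, "games:", round(elo_margin_95(n)), "Elo (40% draws),",
              round(elo_margin_95(n, draw_rate=0.0)), "Elo (no draws)")
```

Under these assumptions the margin for 30 games comes out on the order of 100 Elo, and it shrinks only with the square root of the number of games.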

I'm confronted every day with tests that include an insufficient number of games. Yesterday on the Houdini Facebook page somebody reported a 6-game match and complained that Houdini 2 didn't win convincingly...

For more thorough results, there's an impressive series of tests conducted by user "Robbolito" on the Chess2u forum.
See http://www.chess2u.com/t4006-houdini-20-houdini-15a .
Also http://www.chess2u.com/t3935-houdini-2-tests .

The test results are interesting because they show the large spread of the individual matches. There's even one 80-game match in which Houdini 1.5 wins by 42-38, but most of the time Houdini 2 has the upper hand and wins, for example, by 46.5-33.5.
It's by summing up all the individual matches that you can arrive at a statistically reliable estimate of the strength, based on a sufficient number of games.

Cheers,
Robert

Posted: **Tue Sep 13, 2011 1:10 am**

P.S.
Here's another interesting link: a 100 game match at 40 min/game on strong 8-core hardware with the SilverSuite. Score was 59.5-40.5 in favor of Houdini 2 Pro, the games are available for download.

See http://rybkaforum.net/cgi-bin/rybkaforu ... ?tid=22955

As with your tests, this result in itself is not statistically conclusive: 100 games give a statistical margin of about 60 Elo points, so it needs to be viewed in the wider picture.

Robert

Posted: **Tue Sep 13, 2011 7:41 am**

Hi Robert

Thank you for your info. I have played another round of 30 games (30 min per side) and this time Houdini 1.5a won with 53% versus 47% for Houdini 2.0.
Thus a much smaller difference this time, but Houdini 1.5a still won the tournament.

Robert, how many games do you reckon would suffice to say without a doubt that one engine is actually stronger than the other?

What I found interesting is that Houdini 2.0 just blows everything else out of the water in the 5-minute (shorter) games, but in my long time control tests (even if the number of games is not sufficient) I get the "opposite"?

One would think that Houdini 2.0 should at least also have a lead there, even if small, over a run of around 30 long games? Thanks.

Posted: **Tue Sep 13, 2011 10:54 am**

Pieter,

To reliably detect a 25 Elo difference you need to play at least 500 games.

If you play only 30 games the expected score is 16-14 for Houdini 2.
The 95% confidence interval with 30 games is about +/- 4 points, which means that any result between 20-10 and 12-18 (Houdini 2's score first) is considered "statistically normal".
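
The arithmetic behind these two figures can be sketched as follows. This is only an illustration under stated assumptions: the standard logistic Elo formula, a normal approximation, and an assumed draw rate of 40% (the draw rate and the function names are not from this thread).

```python
import math

def expected_score(elo_diff: float) -> float:
    """Expected score fraction for the stronger side (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def points_interval_95(elo_diff: float, games: int, draw_rate: float = 0.4):
    """Return (expected_points, low, high) for the stronger side over `games` games.

    draw_rate is an assumed value; the variance formula treats the two
    engines as nearly equal, which is reasonable for a 25 Elo gap.
    """
    expected_points = games * expected_score(elo_diff)
    per_game_variance = (1.0 - draw_rate) / 4.0
    half_width = 1.96 * math.sqrt(games * per_game_variance)
    return expected_points, expected_points - half_width, expected_points + half_width

if __name__ == "__main__":
    expected, low, high = points_interval_95(elo_diff=25, games=30)
    print(f"expected {expected:.1f} points, 95% range {low:.1f} to {high:.1f}")
```

With a 25 Elo edge the stronger engine is expected to score about 16 points out of 30, and with the assumed draw rate the 95% range is roughly 4 points either side of that.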

Note also that the difference between your "short" 5-minute and "long" 30-minute games is relatively small; the engines will go about 2 plies deeper. Depending on your hardware it could, for example, be 21 plies against 19 plies on average. It's not expected that this would significantly change the balance of forces between the engines; there's nothing magical that happens when you go from 19 deep to 21 deep.
A more revealing test of "short" vs. "long" would be, for example, 1-minute games vs. 120-minute games.

Robert

Posted: **Tue Sep 13, 2011 8:18 pm**

Hi Robert

Thanks for the info. It looks like chess is a much, much more complex type of game (possibly the most complex of any game in the world) than, say, boxing. In boxing the best fighter usually wins the fight and is statistically "very" accurate with only one match, but with chess engines it is not so simple to decide who is the best and by how much, judging from your suggestion of at least 500 games...

My appreciation and love for the game of chess has just increased a lot.

Robert, keep up the good work. Houdini has at times had me in "shock" and admiration at the way it is not afraid to lose material in the opening, middlegame or endgame to get into a better or winning position later on, and also at the way it escapes the most difficult positions! I enjoy every moment of testing and playing against it!

Pieter

Posted: **Tue Sep 13, 2011 10:06 pm**

Pieterhb wrote: In boxing the best fighter usually wins the fight and is statistically "very" accurate with only one match

Actually, no. There's no reason to apply such statistics only to chess games and not to boxing. By the same standards it's relatively likely that the better fighter lost the match, and that would be reflected if they fought more than once.

Say one fighter is better than another, so that he has a 60% chance of defeating him in a match. Then 40% of the time the worse fighter is going to beat the better one, and it definitely won't be a case of "'very' statistically accurate". People just don't care about who is the best; they only care about who won.

This is the case with most sports, and even with guys beating coworkers who are actually better at the job. But who has the time to play 300 games between the same two baseball teams to get statistical significance about who is better? As things are, the worst baseball team has a non-zero chance of beating all the others, and it is expected that it will do so after enough seasons.
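
To put numbers on the 60% example, here is a minimal sketch that computes how often the better side wins the majority of a series. It assumes independent matches, no draws, and a fixed 60% win chance per match; the function name is made up for illustration.

```python
from math import comb

def series_win_probability(p: float, matches: int) -> float:
    """Probability that a side with per-match win chance p wins the majority
    of an odd number of independent matches (draws are ignored)."""
    if matches % 2 == 0:
        raise ValueError("use an odd number of matches so there is always a winner")
    needed = matches // 2 + 1
    # Sum the binomial probabilities of winning `needed` or more matches.
    return sum(
        comb(matches, k) * p**k * (1.0 - p) ** (matches - k)
        for k in range(needed, matches + 1)
    )

if __name__ == "__main__":
    for n in (1, 3, 11, 51, 301):
        print(n, "matches:", round(series_win_probability(0.6, n), 3))
```

A single match leaves the better fighter at 60%, a best-of-three only raises that to about 65%, and it takes a long series before the better side is close to certain of coming out ahead.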

Posted: **Wed Sep 14, 2011 7:09 am**

Robert

Yes, I totally agree with you. What I was actually trying to say is that with different sports you have different stats to decide who is the best.

It looks to me that with chess engines one has to run a lot of games, whereas with other sports you only need, for example, 5, 10, 20 or 30 games to decide who is the better, statistically speaking. Of course, time, money, logistics, etc. seldom allow society to do that, and that is why the best does not always win.

Luckily, with chess engines there are thousands of testers out there who can do a lot of runs to determine which engine is the real King of the world at the end of the day.

I guess life would have been very boring if the same stats applied to all the sports out there. This is what makes it so interesting.

Pieter

Posted: **Wed Sep 14, 2011 10:33 pm**

Pieterhb wrote: Robert

I'm not Robert, I'm Uly

Pieterhb wrote: It looks to me that with chess engines one has to run a lot of games, whereas with other sports you only need, for example, 5, 10, 20 or 30 games to decide who is the better, statistically speaking.

But that's not true: the same statistics that apply to chess engines apply to sports. 30 games aren't enough to decide who is better with statistical significance, and it doesn't matter if it's chess or some other sport. Many times the worse player or team wins the cup due to luck; people don't know this and think the player or team is actually the best.

Actually, the best player or team often has a low chance of winning first place, because there are so many opponents. Even if she has a higher chance of winning than anybody else, it's more probable that she doesn't win first place than that she does.

Posted: **Thu Sep 15, 2011 8:42 am**

Sorry Uly, my mistake.

Thanks for your insight. I think that we are both correct, but we "miss" each other in terms of how we see statistics and the way they work. I am a researcher and statistics is part of my everyday life.

For example, I do not even have to run one race against a professional athlete who runs the 100m, because I know that I will never in my life be able to beat such a person, simply because they train every day, are younger than me, etc.

The CLOSER the opponents' skills are, the longer one needs to test who is the best. For example, the Elo differences between chess engines are not that large, therefore one needs to run hundreds of games to determine which is actually better/stronger and by how much.
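
How quickly the required number of games grows as the gap shrinks can be sketched like this. It is a rough estimate only, assuming the logistic Elo formula, a normal approximation, near-equal opponents and a 40% draw rate (both the draw rate and the function names are assumptions for illustration).

```python
import math

def expected_score(elo_diff: float) -> float:
    """Expected score fraction for the stronger side (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff: float, draw_rate: float = 0.4, z: float = 1.96) -> int:
    """Rough number of games before the 95% margin is smaller than the
    stronger side's expected edge over a 50% score.

    The per-game variance assumes near-equal opponents, so the estimate
    is pessimistic for large Elo gaps.
    """
    edge = expected_score(elo_diff) - 0.5
    if edge <= 0:
        raise ValueError("elo_diff must be positive")
    per_game_variance = (1.0 - draw_rate) / 4.0
    return math.ceil(per_game_variance * (z / edge) ** 2)

if __name__ == "__main__":
    for diff in (10, 25, 50, 100, 200):
        print(f"{diff:>3} Elo difference: about {games_needed(diff)} games")
```

Because the edge appears squared in the denominator, halving the Elo difference roughly quadruples the number of games needed, which is why a 25 Elo gap calls for hundreds of games while a very lopsided pairing is settled almost at once.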

Thus the stats certainly differ a lot between sports, or anything else for that matter. One can thus never apply the exact same stats to everything in life that needs comparing. One needs more or fewer games between two opponents, depending on the situation you work with. A school rugby team playing against an international rugby team, for example, does not need to play 20 games to determine who is the better, whereas two international teams need many more games, and thus different stats apply there...

Anyhow, this is a chess forum, so let's stick to it. Robert's estimate of the number of games necessary to test his older and newer versions of Houdini is about 500. I will see if I can get such a run in, and then post the results here.

Houdini 1.5a stronger than Houdini 2.0 on long games (32bit)
