TCEC Season 6 Opening Book
Posted: Thu Apr 10, 2014 11:45 am
Season 6 opening book notes from Nelson Hernandez (nick: Cato the Younger).
TCEC is a website intended to showcase computer chess programs and provide visitors with chess games of extraordinarily high quality on top-flight hardware at long time controls, eventually producing a Seasonal 'champion'.
There are a great number of ways one might structure a sequential-stage tournament with these intentions. Veteran observers of TCEC have no doubt noticed evolutions from one season to the next in various details. Martin's chief guiding principle in all these changes has been 'entertainment first'.
With respect to openings in particular, numerous individuals have advanced or supported different proposals in the past. Some favor no opening books at all ('let the engines play!') and others favor opening books of limited or unlimited size ('let the authors match wits!'). Some favor openings being selected from an extremely short opening book of two or three moves. Some favor thematic openings, i.e. playing an opening system repeatedly during a Stage.
All of the above approaches could result in an interesting competition. However, Martin decided that the best way to promote entertainment, showcase the engines, present a fair challenge to all programs and impose the least administrative burden on everyone involved was for all games to start from pre-selected opening positions that had actually occurred in GM-level games in the past.
Obviously the implementation details of such a plan are a paramount concern. On one hand you do not want one-sided opening positions where the same color will almost always win. On the other hand you do not want extremely draw-prone opening positions, particularly in Stages when very evenly-matched, high-quality programs are playing. What you are looking for are hundreds of openings that are consistently in-between. Further, in a single round-robin (as in Stages 1 and 2) the positions need to be very similar in terms of the degree to which each color is advantaged. Since each engine gets to play only one side of a position, fairness is critical. Your white player will be justifiably upset if he is presented with an opening position where black is objectively much better.
To find these hundreds of opening positions with a high degree of quantitative precision you need lots of empirical and evaluative data, i.e. you need many millions of games in your database and you need deep comparative evaluations across different high-quality engines. There is simply no other way to do it reliably. You have a classic 'Big Data' problem.
Martin invited me to tackle this problem last November and I took it on with enthusiasm. I had one major component already in hand: a world-class private database with tens of millions of human and engine games, all carefully de-duplicated, verified for ending outcomes and adjusted as necessary. It has been a daily work in progress of mine since 2004, with contributions from many other collectors and individuals over the years.
The second component, developing a list of candidate opening positions, turned out to be easy because it was already done. Adam Hair had an eight-move book compiled from Norm Pollock's database of GM-level games. After some refinement provided by Dariusz Orzechowski and Lucas Braesch, the book contained 44,624 unique positions achieved in GM-level games throughout history. Eight moves seemed ideal to me because it is long enough to span a much fuller range of ECOs than six or fewer moves. Nine or ten moves, it seemed to me, would be taking too much liberty away from the engines.
A third component was getting deep evaluations for the candidate positions. Adam Hair and I developed these evaluations from Houdini 4, Stockfish DD and Komodo TCECr for all of the candidate positions in two months, which was all the time we had to complete the task. For Houdini and Komodo we went to 24 plies and for Stockfish we went to 28 plies (which was comparable to the others in terms of time spent per position).
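For the curious, batch evaluation at a fixed depth is easy to script against any UCI engine. Here is a minimal sketch using the python-chess library (to be clear, this sketch is illustrative and not necessarily the tooling we actually used; the engine path and depth are placeholders):

```python
# Sketch: fixed-depth evaluation of candidate FENs with a UCI engine.
# Illustrative only: the original workflow's tooling is not documented.
import chess
import chess.engine

def evaluate_positions(fens, engine_path="stockfish", depth=24):
    """Return {fen: centipawn score from White's viewpoint} at fixed depth."""
    scores = {}
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        for fen in fens:
            board = chess.Board(fen)
            info = engine.analyse(board, chess.engine.Limit(depth=depth))
            # White-relative centipawns; forced mates become large values.
            scores[fen] = info["score"].white().score(mate_score=10000)
    return scores
```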
Perhaps the key idea that emerged from this project was the recognition that every opening position has four quantifiable properties:
Frequency (or commonness) - how often has the candidate position been seen in human and engine games?
Play-balance - in empirical and evaluative terms, how closely does the relative advantage in the candidate position match that of the standard starting position?
Draw-tendency - how often has the candidate position resulted in a draw in human and engine databases, measured against the baseline draw-rate of each database?
Dissonance - to what degree do leading engines disagree in deep evaluations of the position?
My opening selection process was based on quantifying these four dimensions and converting the resulting analysis into percentile-ranks. In other words, every position was boiled down to four simple numbers on a 0-100 scale. I then assigned varying weights to each of these numbers for each Stage.
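To make the scheme concrete, here is a small sketch of percentile-ranking and Stage-weighted scoring in Python. The metric names and weights below are invented placeholders; the actual weights are deliberately not published:

```python
# Sketch: reduce each position to four 0-100 percentile ranks, then
# score it with Stage-specific weights. All weights here are placeholders.
def percentile_ranks(values):
    """Map each raw value to its 0-100 percentile rank within the sample."""
    n = len(values)
    rank = {v: i for i, v in enumerate(sorted(values))}  # ties share a rank
    return [100.0 * rank[v] / (n - 1) for v in values]

def score(pct, weights):
    """Weighted sum of a position's four percentile ranks."""
    return sum(weights[k] * pct[k] for k in weights)

# Hypothetical Stage 1 weighting: play-balance dominates, dissonance ignored.
stage1 = {"frequency": 0.3, "balance": 0.5, "draw_tendency": 0.2, "dissonance": 0.0}
position = {"frequency": 88.0, "balance": 95.0, "draw_tendency": 40.0, "dissonance": 10.0}
print(score(position, stage1))  # higher-scoring positions are preferred
```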
In all Stages, no position was selected that had not been seen at least 100 times in human and engine games; in Stage 1 the threshold was 300. I did not choose very rare openings because the play-balance and draw-tendency metrics would have been unreliable for them.
As already mentioned, play-balance is critically important in Stages 1 and 2 because both are single round-robins. Every game in those Stages will slightly favor white, just as the standard starting position does. Play-balance remains quite important through the other Stages, though less so, because engines will play both sides of each position.
Attempting to reduce the draw-rate is not important in Stage 1 because engines of widely varying strength will be competing, which by itself produces many decisive games. From Stage 2 onward, all selected openings have below-average draw-rates. Our theory, demonstrated in last Season's Superfinal, is that you can knock down draw-rates by as much as 15-20% by selecting balanced openings that have historically produced few draws.
Dissonance is irrelevant to my opening selection until Stage 4 and the Superfinal. In those Stages I purposely picked positions where the leading engines displayed sharp evaluation disagreements while not being heavily unbalanced in favor of one color. I am hoping that this will both further depress the draw-rate and lead to some really wild and exciting games. (Or it could lead to some big names getting steamrolled because they flat-out do not understand the starting position, in which case it will be very revealing!)
Q&A
Q: Can you provide us with a list of what openings will be played in each Stage?
A: I could, but Martin prefers the element of surprise and so do I.
Q: Will you know what openings any two engines will play before a game starts?
A: No. For each Stage I have provided Martin with a PGN file containing all the opening positions that will be played. His tournament GUI selects the openings at random, never repeating one that has already been played, apart from cases where engines play both sides of the same opening.
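Mechanically, 'random without repetition' is just a shuffled deck consumed in order. A rough sketch of how a GUI might implement it (the TCEC GUI's actual internals are not described here):

```python
# Sketch: serve openings in random order without repetition; in Stages
# where engines play both colors, each opening is served twice in a row.
import random

def opening_schedule(openings, both_sides=False, seed=None):
    rng = random.Random(seed)
    deck = list(openings)
    rng.shuffle(deck)            # randomize once, then deal in order
    for opening in deck:
        yield opening            # first game of the pairing
        if both_sides:
            yield opening        # reverse-color game, same opening

for line in opening_schedule(["B90 Najdorf", "D37 QGD", "E97 King's Indian"],
                             both_sides=True, seed=42):
    print(line)
```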
Q: You say you compared evaluations across the engines. How could you do this when engines have different ways of evaluating positions and scale differently?
A: Adam Hair ran a linear regression over the 44,000+ evaluations and concluded that Stockfish DD's evaluations run about 1.88x higher than Houdini 4's, while Komodo TCECr's run about 1.18x higher. We rescaled their evaluations accordingly to get a truer impression of evaluation dissonance. Correlations were quite high after this adjustment.
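For illustration, a single scale factor like the ones quoted can be derived with a least-squares fit through the origin; whether this is exactly the method Adam used is not stated, so treat this as a sketch with toy data:

```python
# Sketch: fit one multiplier k between two engines' evaluations by
# least squares through the origin, then rescale before comparing.
def scale_factor(evals_a, evals_b):
    """k minimizing sum((a - k*b)^2), i.e. how to shrink b onto a's scale."""
    return sum(a * b for a, b in zip(evals_a, evals_b)) / sum(b * b for b in evals_b)

# Toy data only: Stockfish evals roughly 1.88x Houdini's.
houdini   = [0.25, -0.10, 0.40, 0.05]
stockfish = [0.47, -0.18, 0.76, 0.09]
k = scale_factor(houdini, stockfish)        # ~0.53, i.e. 1/1.88
adjusted  = [s * k for s in stockfish]      # now directly comparable
dissonance = [abs(h, - a) if False else abs(h - a) for h, a in zip(houdini, adjusted)]
```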
Q: Is there any theme or opening system that will be seen more often than others?
A: Not by design. It is interesting, though, that 1.e4 openings predominate in Stages 1 and 2 while 1.d4 openings achieve parity after that. This outcome was happenstance.
Q: Any gambits or exotic openings in store?
A: Gambits, or positions that evaluate poorly but have compensation in human contests, will seldom be seen. I did not go looking for them. As for exotic positions, different people will see different things in a position. I picked positions entirely on the basis of their known, objective numerical properties, not my impressions of what I saw on the board, which would be very subjective and unreliable.
Q: Would you be so kind as to send me your database?
A: No.
My thanks to my collaborators, Adam Hair and Anson Williams, whose assistance and input were essential, and also to Martin Thoresen, whose encouragement and trust that we could pull it all off in time for this Season's launch certainly helped us complete this ambitious project.