Converting PGN files to Parquet on Ray
This post is part of a series showing how to use Ray for data ingestion and data analysis.
In the previous post, we used a third-party library when running a workload on Ray. Today we continue in that vein: we will use a Python chess library and Ray to accelerate the conversion of chess game files from the Portable Game Notation (PGN) format to Parquet. Converting the data to the Parquet format will significantly reduce the disk space required. As a second benefit, the Parquet format allows us to use tools like Apache Spark or Apache Arrow DataFusion to parallelize the analysis of the chess games.
Lichess is a free and open-source online chess platform. Initiated and maintained by Thibault Duplessis, Lichess is free of advertisements. Lichess patrons finance the annual operational costs of 423,000 dollars. Currently, three million games per day are played on the platform. We will tap into this source.
The Data
As of today, there are 3,013,194,626 games available for download. Compressed, those games take up 769 GB of disk space. Uncompressed, this value would be five to nine times higher. PGN is a verbose plain-text format consisting of key-value pairs inside brackets and a “movetext,” that is, the moves that took place in this specific game. Let’s have a look at an example.
[Event "Casual Blitz game"]
[Site "https://lichess.org/UMyBDDFR"]
[Date "2020.09.19"]
[White "STL_Nakamura"]
[Black "STL_Nepomniachtchi"]
[Result "1/2-1/2"]
[UTCDate "2020.09.19"]
[UTCTime "20:05:07"]
[WhiteElo "1500"]
[BlackElo "1500"]
[WhiteTitle "GM"]
[BlackTitle "GM"]
[Variant "Standard"]
[TimeControl "300+3"]
[ECO "B50"]
[Opening "Sicilian Defense: Modern Variations"]
[Termination "Normal"]
[Annotator "lichess.org"]
1. e4 c5 2. Nf3 d6 { B50 Sicilian Defense: Modern Variations } 3. Nc3 Nf6 4. h3 a6 5. a4 Nc6 6. d4 cxd4 7. Nxd4 Bd7 8. Nb3 g6 9. a5 Bg7 10. Be3 O-O 11. Be2 Be6 12. Bb6 Qc8 13. O-O Nd7 14. Na4 Bxb3 15. cxb3 Qe8 16. Qd2 Nxb6 17. Nxb6 Rd8 18. b4 Nd4 19. Bc4 e6 20. Rfe1 Qe7 21. Rad1 Rfe8 22. Re3 Qg5 23. Rd3 Qxd2 24. R3xd2 Nc6 25. Rxd6 Bxb2 26. Rxd8 { The game is a draw. } 1/2-1/2
The above blitz game between GM Ian Nepomniachtchi as black and GM Hikaru Nakamura as white ended in a draw after 26 moves. There are two annotations in the “movetext”: one indicating the opening used and the other showing the game’s result. Sometimes moves are also annotated with an evaluation score from a chess engine.
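To make the two-part structure concrete, here is a minimal sketch that separates the key-value pairs from the “movetext” using only the Python standard library. It is not the parser used in this post: the helper names `split_pgn_game` and `strip_comments` are hypothetical, the sample is a shortened version of the game above, and the regular expression assumes one tag pair per line. Unlike a real PGN parser, it does not validate any moves.

```python
import re

# Matches one tag pair such as: [Event "Casual Blitz game"]
# Assumption: each tag pair sits on its own line.
TAG_PAIR = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def split_pgn_game(game_text):
    """Return (headers, movetext) for the text of a single PGN game."""
    headers = {}
    movetext_lines = []
    for line in game_text.splitlines():
        match = TAG_PAIR.match(line)
        if match and not movetext_lines:
            # Still in the header section: store the key-value pair.
            headers[match.group(1)] = match.group(2)
        elif line.strip():
            # Any other non-blank line belongs to the movetext.
            movetext_lines.append(line.strip())
    return headers, " ".join(movetext_lines)


def strip_comments(movetext):
    """Remove { ... } annotations from the movetext."""
    return re.sub(r"\s*\{[^}]*\}", "", movetext)


# Shortened sample, for illustration only.
sample = '''[Event "Casual Blitz game"]
[White "STL_Nakamura"]
[Black "STL_Nepomniachtchi"]
[Result "1/2-1/2"]

1. e4 c5 2. Nf3 d6 { B50 Sicilian Defense: Modern Variations } 3. Nc3 Nf6 1/2-1/2
'''

headers, movetext = split_pgn_game(sample)
print(headers["White"], "vs", headers["Black"], "->", headers["Result"])
print(strip_comments(movetext))
```

The headers map naturally onto columns of a table, which is what makes a columnar format such as Parquet a good fit for this data.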
Let us start implementing the converter by downloading the PGN files from Lichess. Conveniently, there are two listings on the server that provide the URL and checksum for each file. The full source code for this post can be found on the Argodis GitHub account.
import requests
from pathlib import Path


def download(url, file_name):
    # Skip files that were already downloaded in an earlier run.
    if not Path(file_name).is_file():
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(file_name, 'wb') as fh:
                # Stream in fixed-size chunks to keep memory usage low.
                for chunk in r.iter_content(chunk_size=8192):
                    fh.write(chunk)


def process(file):
    url, checksum = file
    # Assumption: the local file name is the last path component of the URL.
    filename = url.rsplit('/', 1)[-1]
    download(url, filename)


if __name__ == "__main__":
    ...
    files = zip(urls, checksums)
    for file in files:
        process(file)
Because the files are pretty large, we stream the data from the Lichess server and write it to the local filesystem in fixed-size chunks. As a side effect, this keeps memory usage low. The final version of the converter has options that limit the amount of data it downloads. If you are only interested in the games of the last few months, this comes in handy. We will skip the next step of verifying the files here; check the version on GitHub for details.
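For readers who want a rough idea of what the verification step could look like, the following is a minimal sketch, not the version from the repository. It assumes the checksums are SHA-256 hex digests; the function names `sha256_of_file` and `verify` are hypothetical. Like the download, it reads the file in fixed-size chunks so memory usage stays low.

```python
import hashlib


def sha256_of_file(file_name, chunk_size=8192):
    """Hash a file in fixed-size chunks so it never has to fit in memory."""
    digest = hashlib.sha256()
    with open(file_name, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read() until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(file_name, expected_checksum):
    """Return True if the file's SHA-256 digest matches the expected hex string.

    Assumption: the expected checksum is a SHA-256 hex digest.
    """
    return sha256_of_file(file_name) == expected_checksum.strip().lower()
```

A call such as `verify(filename, checksum)` could follow the download inside `process`, with a failed check prompting the file to be downloaded again.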
As mentioned before, a chess game in PGN format consists of two parts: the key-value pairs and the “movetext.” Perhaps surprisingly, the “movetext” is much more compute-intensive to process, because PGN parsers verify that all moves in the “movetext” are valid and result in correct chess positions. Our strategy in the following is therefore to skip through the PGN file, find all games in it, and start parser instances for many games in parallel to saturate all available CPU resources.
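The first half of that strategy, skipping through the file to find the individual games without parsing any moves, can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the converter's actual code: the generator name `iter_games` is hypothetical, and it assumes that every game starts with an [Event ...] tag pair at the beginning of a line.

```python
import io


def iter_games(stream):
    """Yield the raw text of each game found in a PGN text stream."""
    current = []
    for line in stream:
        # Assumption: every game starts with an [Event ...] tag pair.
        if line.startswith("[Event ") and current:
            yield "".join(current).strip()
            current = []
        # Ignore blank lines that appear before the first game.
        if line.strip() or current:
            current.append(line)
    if current:
        yield "".join(current).strip()


# Two tiny made-up games, for illustration only.
pgn = io.StringIO(
    '[Event "Game A"]\n[Result "1-0"]\n\n1. e4 e5 1-0\n\n'
    '[Event "Game B"]\n[Result "0-1"]\n\n1. d4 d5 0-1\n'
)
games = list(iter_games(pgn))
print(len(games))
```

Because this scan only compares line prefixes, it is cheap compared to validating the moves. Each yielded string, or a batch of them, could then be handed to a separate parser instance running in parallel.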
Processing lichess_db_standard_rated_2013-01.pgn takes 9 minutes.