Remove [%eval] from PGN database

prof_frink

The files can all be placed in one folder if need be.

 

The files would be named something like:

 

lichess_db_standard_rated_2013-01.pgn

lichess_db_standard_rated_2013-02.pgn

lichess_db_standard_rated_2013-03.pgn

etc.

 

They're all taken from here: https://database.lichess.org/

 

For the larger files, I've split them using pgnsplit into 1GB chunks, which would be named something like:

 

lichess_db_standard_rated_2018-01.1.pgn

lichess_db_standard_rated_2018-01.2.pgn

etc.

 

Each PGN ranges in size from about 100MB-1.5GB and contains about 100,000-1,500,000 games. I'm thinking that doing them in batches of a few at a time might be best, given their massive size.

 

Truth be told, though, I'm starting to think that I should abandon this idea, at least for the time being. I'm already about half-way through my project of converting said databases to ChessBase format and compiling separate databases for each month and one massive database with everything in it. I think I'm just going to unannotate everything using 'Unannotate DB' in CB for now (makes for a much cleaner-looking readout anyway, plus the annotations file [.cba] is massive, by far the biggest file in the database—might help to shrink the size of the database somewhat). Although it still might be useful to know how to do this for future reference.

 

Anyway, if anyone has any interest in a .CBV containing 400+ million chess games, let me know and I can create a torrent or something once it's done (takes quite a bit of time and effort).

skelos

One at a time, but it won't take much memory if you only need to process them line by line. That's quite a bit of data though.

Even these days I'd be hesitant to feed a 1GB file to an editor, mind. I'd expect it to work but it might get very slow.

skelos

82GB in total. Yup, that would take a little while to process. Definitely one file at a time and line at a time or at most game at a time.

Here that would take most of my monthly data allowance just to fetch.
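The one-file-at-a-time, one-line-at-a-time idea above could be sketched roughly like this in Python. It is only an illustration: the function name and the pattern are my own, and it assumes a tag such as [%eval 0.2] never gets split across two lines. Memory use stays small because only one line is held at a time.

```python
import re
from pathlib import Path

# One eval tag such as "[%eval 0.2]" or "[%eval #-3]", plus spaces/tabs after it.
# PGN header lines like [Event "..."] do not start with "[%eval", so they are untouched.
EVAL_TAG = re.compile(r"\[%eval [^\]]*\][ \t]*")


def strip_evals_streaming(src: Path, dst: Path) -> int:
    """Copy src to dst line by line, dropping [%eval ...] tags.

    Returns the number of lines that were changed.
    """
    changed = 0
    with src.open("r", encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
        for line in fin:  # the file object yields one line at a time
            new_line = EVAL_TAG.sub("", line)
            if new_line != line:
                changed += 1
            fout.write(new_line)
    return changed
```

Note that games with only evals would be left with empty { } comments; cleaning those up would need a second substitution.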

skelos

82GB compressed. Yeah. Not real convenient to download casually in this neck of the woods.

prof_frink

Yup, yup. Granted, once it's converted to CB format, it'll be much smaller (since the format is much more space-efficient than PGN), but I'm not sure yet how usable the complete database will be (plus it requires CB's proprietary software, of course). I read a tweet a while back from a CB developer saying that they were testing a DB of all Lichess games, and another with 1.2 billion games, but I'm not sure how functional it would be in the current version (14). I guess we'll see!

I'd like to create an .SI4 database for SCID, too, but apparently ~16M is the limit for that.

skelos

Yeah ... I have wondered from time to time if the space savings of the specialised formats are worth it as game collections get larger. Maybe paying some space penalty but putting them in an RDBMS which is known to scale waaaay higher would be better.

Partly it depends on speed, and partly on access. If you insist on storing locally on fast storage, 82GB (never mind uncompressed) is an ugly lump.

If you don't need huge speed and can tolerate network (maybe LAN, maybe WAN) access speeds, 82GB isn't much at all.

BTW if you want a Python script to remove the eval stuff after all, PM me. It's not too hard but might take a go-around or two to be 100% sure it's done the right thing, and only the right thing.
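For the "right thing, and only the right thing" part, one way to check a cleaned-up game is to compare it against the original with all comments ignored. A rough sketch follows; the helper names are made up for illustration, and it assumes comments use { } braces with no nesting.

```python
import re

# Any {...} comment; assumes comments are not nested.
COMMENT = re.compile(r"\{[^}]*\}")


def moves_only(movetext: str) -> str:
    """Drop every {...} comment and collapse whitespace, leaving moves and result."""
    return " ".join(COMMENT.sub(" ", movetext).split())


def check_cleanup(before: str, after: str) -> list:
    """Return a list of problems found; an empty list means the cleanup looks safe."""
    problems = []
    if "[%eval" in after:
        problems.append("eval tag still present")
    if before.count("[%clk") != after.count("[%clk"):
        problems.append("clock tags were lost or added")
    if moves_only(before) != moves_only(after):
        problems.append("moves or result changed")
    return problems
```

Running this on a handful of games from each file before trusting a full run would catch the obvious mistakes, such as a pattern that also eats the clock tags.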

skelos

Apple wants more than AUD$5000 to increase internal SSD storage on their latest MacBook Pro to 4TB. I think the last 4TB spinning disk I bought came from an office supply store for ~AUD$200.

82GB isn't much compared to 4TB, but 4TB can be pricey or not depending on circumstances!

skelos

(I'm using my 2012 MacBook Air until it fails. The trackpad is dodgy, some of the keyboard keys have not only worn off the label but through the dark plastic to the light plastic underneath ... and I have a Windows PC bought cheap as an emergency backup for the day I need it. Best investment ever; the Mac has lasted two years beyond what I thought it would.)

prof_frink

Sure, if I decide to go ahead with that plan, I'll let you know. Just noticed there's a link to a video course on the site that @p89trd mentioned. Might be worthwhile for someone like me who's just starting out in Python. Things are really quiet at work and I have computer access, so that may be the way to go... once I wrap up some of my games here (been on vacation forever... hoping to change that this weekend).

 

Yeah, those MacBooks are pretty sweet. Been thinking about going that route for my next computer. Would be nice to be able to take my chess studies on the road with me. (Too poor at the moment, though, lol)

jassimmohd
prof_frink wrote:

 OK, so here are two samples of the database. This game just has the %eval numbers:

 

[Event "Rated Bullet game"]
[Site "https://lichess.org/hca0mb9v"]
[White "LEGENDARY_ERFAN"]
[Black "Mariss"]
[Result "0-1"]
[UTCDate "2013.01.01"]
[UTCTime "00:15:38"]
[WhiteElo "1182"]
[BlackElo "1457"]
[WhiteRatingDiff "-30"]
[BlackRatingDiff "+5"]
[ECO "C00"]
[Opening "French Defense #2"]
[TimeControl "60+0"]
[Termination "Normal"]

1. e4 { [%eval 0.2] } 1... e6 { [%eval 0.13] } 2. Bc4 { [%eval -0.31] } 2... d5 { [%eval -0.28] } 3. exd5 { [%eval -0.37] } 3... exd5 { [%eval -0.31] } 4. Bb3 { [%eval -0.33] } 4... Nf6 { [%eval -0.35] } 5. d4 { [%eval -0.34] } 5... Be7 { [%eval 0.0] } 6. Nf3 { [%eval 0.0] } 6... O-O { [%eval -0.08] } 7. Bg5 { [%eval -0.19] } 7... h6 { [%eval -0.29] } 8. Bxf6 { [%eval -0.36] } 8... Bxf6 { [%eval -0.37] } 9. O-O { [%eval -0.36] } 9... c6 { [%eval -0.12] } 10. Re1 { [%eval -0.17] } 10... Bf5 { [%eval -0.04] } 11. c4?! { [%eval -0.67] } 11... dxc4 { [%eval -0.5] } 12. Bxc4 { [%eval -0.77] } 12... Nd7?! { [%eval -0.1] } 13. Nc3 { [%eval 0.0] } 13... Nb6 { [%eval 0.0] } 14. b3?! { [%eval -0.76] } 14... Nxc4 { [%eval -0.49] } 15. bxc4 { [%eval -0.65] } 15... Qa5 { [%eval -0.55] } 16. Rc1 { [%eval -0.79] } 16... Rad8 { [%eval -0.78] } 17. d5?? { [%eval -5.41] } 17... Bxc3 { [%eval -5.42] } 18. Re5? { [%eval -7.61] } 18... Bxe5 { [%eval -7.78] } 19. Nxe5 { [%eval -7.72] } 19... cxd5 { [%eval -7.81] } 20. Qe1? { [%eval -9.29] } 20... Be6?? { [%eval 3.71] } 21. Rd1?? { [%eval -12.34] } 21... dxc4 { [%eval -12.71] } 22. Rxd8?! { [%eval #-1] } 22... Rxd8?! { [%eval -13.06] } 23. Qc3?! { [%eval #-2] } 23... Qxc3?! { [%eval #-4] } 24. g3 { [%eval #-3] } 24... Rd1+?! { [%eval #-4] } 25. Kg2 { [%eval #-4] } 25... Qe1?! { [%eval #-4] } 26. Kf3 { [%eval #-3] } 26... Qxe5 { [%eval #-2] } 27. Kg2 { [%eval #-2] } 27... Bd5+?! { [%eval #-2] } 28. Kh3 { [%eval #-1] } 28... Qh5# 0-1

 

And this one has %eval and %clk values:

 

[Event "Rated Standard game"]
[Site "https://lichess.org/tKtuqF34"]
[White "NotReallyNow"]
[Black "Chessares"]
[Result "1-0"]
[UTCDate "2018.02.28"]
[UTCTime "23:00:01"]
[WhiteElo "1670"]
[BlackElo "1702"]
[WhiteRatingDiff "+12"]
[BlackRatingDiff "-11"]
[ECO "C41"]
[Opening "Philidor Defense #2"]
[TimeControl "600+0"]
[Termination "Normal"]
[LichessId "tKtuqF34"]

1. e4 { [%eval 0.03] [%clk 0:10:00] } 1... e5 { [%eval 0.24] [%clk 0:10:00] } 2. Nf3 { [%eval 0.25] [%clk 0:09:58] } 2... d6 { [%eval 0.28] [%clk 0:09:57] } 3. Nc3 { [%eval 0.19] [%clk 0:09:56] } 3... Bg4 { [%eval 0.49] [%clk 0:09:48] } 4. h3 { [%eval 0.42] [%clk 0:09:50] } 4... Bh5 { [%eval 0.89] [%clk 0:09:46] } 5. Bb5+?! { [%eval 0.31] [%clk 0:09:37] } 5... c6 { [%eval 0.33] [%clk 0:09:02] } 6. Be2 { [%eval 0.23] [%clk 0:09:34] } 6... Be7 { [%eval 0.65] [%clk 0:08:49] } 7. O-O?! { [%eval 0.15] [%clk 0:09:12] } 7... Nf6 { [%eval 0.22] [%clk 0:08:45] } 8. d4 { [%eval 0.2] [%clk 0:09:04] } 8... exd4?! { [%eval 0.74] [%clk 0:08:33] } 9. Nxd4 { [%eval 0.4] [%clk 0:09:00] } 9... O-O?? { [%eval 4.44] [%clk 0:08:27] } 10. Bxh5 { [%eval 4.39] [%clk 0:08:58] } 10... c5? { [%eval 5.47] [%clk 0:08:16] } 11. Nb3?! { [%eval 4.76] [%clk 0:08:52] } 11... Nc6 { [%eval 4.87] [%clk 0:07:53] } 12. Bf3 { [%eval 4.66] [%clk 0:08:49] } 12... a6 { [%eval 4.86] [%clk 0:07:44] } 13. Re1 { [%eval 4.75] [%clk 0:06:42] } 13... Ne5 { [%eval 4.95] [%clk 0:07:38] } 14. Bf4 { [%eval 4.82] [%clk 0:06:23] } 14... Nxf3+?! { [%eval 5.52] [%clk 0:06:57] } 15. Qxf3 { [%eval 5.49] [%clk 0:06:21] } 15... b5 { [%eval 5.89] [%clk 0:06:42] } 16. Nd2? { [%eval 4.86] [%clk 0:04:01] } 16... c4?! { [%eval 5.43] [%clk 0:06:28] } 17. e5?! { [%eval 4.64] [%clk 0:03:57] } 17... dxe5 { [%eval 4.51] [%clk 0:06:25] } 18. Bxe5?? { [%eval 0.78] [%clk 0:03:54] } 18... Qxd2 { [%eval 0.75] [%clk 0:06:16] } 19. Re3?! { [%eval 0.01] [%clk 0:03:05] } 19... Qxc2 { [%eval 0.12] [%clk 0:05:22] } 20. Bxf6?! { [%eval -0.73] [%clk 0:02:58] } 20... Bxf6 { [%eval -0.57] [%clk 0:05:21] } 21. Re2 { [%eval -0.75] [%clk 0:02:50] } 21... Qg6 { [%eval -0.66] [%clk 0:05:06] } 22. Ne4?! { [%eval -1.26] [%clk 0:02:46] } 22... Bd4 { [%eval -1.0] [%clk 0:04:29] } 23. Rf1 { [%eval -1.47] [%clk 0:02:34] } 23... Rae8?! { [%eval -0.94] [%clk 0:04:18] } 24. Rfe1 { [%eval -1.4] [%clk 0:02:30] } 24... Ba7? { [%eval 0.7] [%clk 0:03:33] } 25. 
Nf6+ { [%eval 0.62] [%clk 0:02:12] } 25... Qxf6 { [%eval 0.7] [%clk 0:02:56] } 26. Rxe8 { [%eval 0.61] [%clk 0:01:23] } 26... Qxb2?? { [%eval #3] [%clk 0:02:32] } 27. Rxf8+ { [%eval #2] [%clk 0:01:20] } 27... Kxf8 { [%eval #2] [%clk 0:02:30] } 28. Qa8+ { [%eval #1] [%clk 0:01:19] } 28... Bb8 { [%eval #1] [%clk 0:02:28] } 29. Qxb8# { [%clk 0:01:17] } 1-0

 

I'd like to keep the clock info but not the evals.

Thanks a lot, I found this useful and it works nicely.

fabiorzfreitas

For smaller PGNs, one approach would be feeding the file to a code editor and using a regex to match all of these tags and replace them with nothing. I believe the following pattern should do the trick: "\{?\s?\r?\n?\[\r?\n?%\r?\n?[^\]\r\n]+\r?\n?\]\r?\n?\s?\}?\r?\n?\s" (remove the double quotes). I think this takes care of all kinds of line breaks and matches all tags, so it would remove %clk as well as %eval. It also matches the {} brackets only optionally, as other commentary may be contained within that move.

 

BTW, the pattern above was tested with regex101.com using the given sample, but no further testing was done.
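Since the original request was to keep the clock info, a narrower variant that targets only %eval might look like the Python sketch below. This is my own illustration rather than the pattern above: it removes [%eval ...] tags, then drops any comment left with nothing inside it. The names are hypothetical and it has only been tried on shortened versions of the two samples.

```python
import re

# One [%eval ...] tag plus any whitespace after it; [%clk ...] is deliberately not matched.
EVAL_TAG = re.compile(r"\[%eval\s+[^\]]*\]\s*")
# A comment left with nothing but whitespace inside, plus trailing whitespace.
EMPTY_COMMENT = re.compile(r"\{\s*\}\s*")


def remove_evals(movetext: str) -> str:
    """Strip eval tags, then remove comments that became empty as a result."""
    without_evals = EVAL_TAG.sub("", movetext)
    return EMPTY_COMMENT.sub("", without_evals)


sample_eval_only = "1. e4 { [%eval 0.2] } 1... e6 { [%eval 0.13] } 2. Bc4 { [%eval #-1] } 0-1"
sample_with_clock = "1. e4 { [%eval 0.03] [%clk 0:10:00] } 1... e5 { [%eval 0.24] [%clk 0:10:00] } 1-0"

print(remove_evals(sample_eval_only))
print(remove_evals(sample_with_clock))
```

The first sample comes out as plain moves (1. e4 1... e6 2. Bc4 0-1) and the second keeps its { [%clk 0:10:00] } comments. Other text inside a comment is left alone, since only the eval tag itself is matched.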

TryMe1ce
Shock_Me wrote:

Maybe starting from the ground up with Python isn't the most efficient way. Let's go the text editor route, with Microsoft Word (other text editors will have the same capability, but different specifics) (...)

I've removed all {[%cal]} and {[%csl]} annotations from a rather big PGN file, following the steps you've described in your post, Shock_Me, and it really works like a charm!
Cheers mate!