Could you recommend a script/library for scraping in Python or Javascript?

WhiteDrake

I know this question is not about the API; it's about something the API doesn't provide and maybe never will. I also know that chess.com might not like seeing scraping in general. However, I'm not attempting to create any competing service or gain any money from this; on the contrary, I need to get a specific set of data that I could send back to chess.com staff and/or other interested members. Getting this data manually is infeasible; it would take an incredible amount of time.

 

What do I want?

I'm looking for a script/library that would allow me to scrape all comments for a given move from a given finished vote chess game. A recommendation for a script that can scrape all comments from a given chess.com forum would also be very helpful, as I think I could easily modify it to scrape comments from a VC game instead. I haven't written any scraping code before, and I guess some of you have, so I hope you can help me.

edit: I'm interested in scraping the html content rather than just the text, so that I can extract both the lines/continuations suggested as text and also the PGNs.
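
For the "lines/continuations suggested as text" part, a rough sketch of pulling move-like tokens out of a comment could look like the following. It assumes the comment has already been reduced to plain text and that moves are written in English SAN; `SAN_PATTERN` and `extractSanMoves` are names made up for this illustration. It is only a heuristic: it will happily match ordinary text such as "a1", and it will miss Cyrillic notation such as "Кб5".

```javascript
// Heuristic pattern for SAN-looking tokens: castling, piece moves, pawn moves,
// captures, promotions, with an optional check/mate suffix.
// ASSUMPTION: English piece letters (K, Q, R, B, N) and plain-text input.
const SAN_PATTERN = /\b(?:O-O(?:-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?(?!\w)/g;

// Return every SAN-looking token found in the text, in order of appearance.
function extractSanMoves(text) {
    return text.match(SAN_PATTERN) || [];
}

console.log(extractSanMoves("I suggest 5. Nb5 d6 6. c4 instead of 5. Nxc6 bxc6"));
```

Move numbers such as "5." are skipped because they do not end in a square. The PGNs embedded in diagrams would still need to be read from the HTML itself.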

 

Scripting language

I'd prefer to use Python for this project but I'm afraid I might run into trouble with captchas. In that case, using Javascript might be better, as a bunch of JS code could just be pasted into the browser console, even though it wouldn't be comfortable - it'd still save me a lot of work.

 

Stress/load on the server

Clearly, throwing a bunch of requests against a server without thinking could put more stress on it than a standard browser user would. I intend to slow the script down by waiting 1-2 seconds between two subsequent requests, which shouldn't be much of a problem for the servers. I also won't be scraping all the VC games on the server, just a number of selected games.
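
A minimal sketch of that pacing, assuming the requests are made one at a time: `fetchPolitely`, `fetchFn` and `sleepFn` are hypothetical names for this illustration, and the fetch and sleep functions are passed in so the timing logic can be tried without touching any server.

```javascript
// Default delay helper: resolves after the given number of milliseconds.
const realSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch the URLs strictly one after another, pausing a random 1-2 s
// (by default) before every request except the first.
async function fetchPolitely(urls, fetchFn, { minMs = 1000, maxMs = 2000, sleepFn = realSleep } = {}) {
    const results = [];
    for (let i = 0; i < urls.length; i++) {
        if (i > 0) {
            // Random interval in [minMs, maxMs) so requests are not evenly spaced.
            await sleepFn(minMs + Math.random() * (maxMs - minMs));
        }
        results.push(await fetchFn(urls[i]));
    }
    return results;
}
```

In a browser console this could be called as `fetchPolitely(urls, (u) => fetch(u).then((r) => r.text()))`, where `urls` is whatever list of pages is to be downloaded.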

 

Why do I want this?

If you want to know why I want this, send me a PM.

stephen_33

Are you making harder work of this than you need to? My approach would be to copy+paste (scrape if you prefer) the text from the discussion thread.

For example: https://www.chess.com/votechess/game/127962

I picked move 5 (white) and copied all the comments:-

mxgolota
Jun 29, 2019
You wound me) Since everyone recommends Nb5, that's how I'll vote)


mksn_undead
Jun 28, 2019
Unfortunately, the esteemed company has a habit of not reading the comments.


TheBest1234567890
Jun 28, 2019
Esteemed gentlemen! The move 5. Nc6 is weak and unambitious, since after bc Black has an excellent center and a slight advantage. Please join in voting for Nb5


mxgolota
Jun 28, 2019
You have some interesting Sicilians here)

 


nutaras

According to the opening database, the most popular and most successful move for White is 5.Nb5

https://www.365chess.com/opening.php?m=9&n=329&ms=e4.c5.Nf3.Nc6.d4.cxd4.Nxd4.e5&ns=3.3.4.37.45.47.46.329


Isn't that what you want? I use Python for all of my activities on this site but I think trying to code for this task may be over-complicating it?

 

bmacho

Something like this should work:

// Non-blocking delay; a busy-wait loop would freeze the tab and hold back the responses.
function sleep(millisec) {
    return new Promise(resolve => setTimeout(resolve, millisec))
}

// Assumes the movetext is the line at index 14 of the PGN, e.g. "1. e4 c5 2. Nf3 ..."
var pgn = window.chesscom.game.pgn.split('\n')[14].split(" ")

var mvs = [] // moves

// Skip the move numbers (every third token) and the final result token
for (let i = 0; i < pgn.length - 1; i++) {
    if (i % 3 != 0) {
        mvs.push(pgn[i])
    }
}

var comments = []

// Find the line of the page source that sets window.chesscom.commentsBox,
// evaluate it, and return the id of the archived comment thread.
function parse_comment_id(html) {
    const rows = html.split('\n')

    for (const row of rows) {
        if (row.slice(4, 31) == "window.chesscom.commentsBox") {
            eval(row) // runs one line of chess.com's own page script
            return window.chesscom.commentsBox.ids.archive
        }
    }
}

// Download the archive page for one move and return the id of its comment thread.
async function get_comments_id(comm_obj) {
    // The SAN is URL-encoded so that "+" in a check is not read as a space
    const url = "https://" + document.location.host + document.location.pathname + "?activePagination=archive&mv=" + comm_obj.mv + "&san=" + encodeURIComponent(comm_obj.san)

    console.log(url)

    const response = await fetch(url)
    return parse_comment_id(await response.text())
}

// Download the first page of comments for the given thread id.
async function get_comments(id) {
    const url = 'https://www.chess.com/callback/comments?page=1&parentId=' + id + '&forumTopicTypeCode=votechess_forum'

    const response = await fetch(url)
    return (await response.json()).data
}

async function main() {
    for (let i = 0; i < 5; i++) {
        const comm = {}
        comm.mv = i
        comm.san = mvs[i]

        await sleep(700)
        const id = await get_comments_id(comm)

        if (id === undefined) {
            console.warn('no comment thread found for move ' + i)
            continue
        }

        await sleep(1000)
        comm.comments = await get_comments(id)
        comments.push(comm)
    }

    // Reached only after every request has finished
    alert(' all downloaded ')
    console.log(comments)
}

main()

If you paste it into the console at https://www.chess.com/votechess/game/127962 , it should fill the comments variable with the comments for the first 5 moves. (Well, paging is not implemented, so only the first 20.)
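
The missing paging could be sketched roughly as below, reusing the `/callback/comments` URL from the script above. This assumes, without having verified it, that requesting a page past the last one returns an empty `data` array; `fetchAllComments`, `fetchJson`, `maxPages`, `pauseMs` and `sleepFn` are names invented for the sketch.

```javascript
// Collect every page of comments for one thread id.
// ASSUMPTION: a page past the end comes back with an empty `data` array.
async function fetchAllComments(parentId, fetchJson, options = {}) {
    const {
        maxPages = 50,   // safety cap in case the stop condition never triggers
        pauseMs = 1000,  // pause between page requests
        sleepFn = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    } = options;

    const all = [];
    for (let page = 1; page <= maxPages; page++) {
        if (page > 1) {
            await sleepFn(pauseMs);
        }
        const url = 'https://www.chess.com/callback/comments?page=' + page +
            '&parentId=' + parentId + '&forumTopicTypeCode=votechess_forum';
        const body = await fetchJson(url);
        const batch = (body && body.data) || [];
        if (batch.length === 0) {
            break; // nothing on this page, so stop
        }
        all.push(...batch);
    }
    return all;
}
```

In the console, `fetchJson` could be `(url) => fetch(url).then((r) => r.json())`.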

WhiteDrake
stephen_33 wrote:

Are you making harder work of this than you need to? My approach would be to copy+paste (scrape if you prefer) the text from the discussion thread.

For example: https://www.chess.com/votechess/game/127962

[ ... ]

I picked move 5 (white) and copied all the comments:-

Isn't that what you want? I use Python for all of my activities on this site but I think trying to code for this task may be over-complicating it?

I haven't thought of this. It's not bad, actually. Unfortunately, this wouldn't allow me to extract the PGNs from the diagrams posted in the comments, which is something that would be very useful for me.

WhiteDrake
bmacho wrote:

Something like this should work:

[ ... ]

If you paste it into the console at https://www.chess.com/votechess/game/127962 , it should fill the comments variable with the comments for the first 5 moves. (Well, paging is not implemented, so only the first 20.)

Thank you! I can see the comments and I can see the PGNs, too. This looks like a good start for me.

stephen_33
WhiteDrake wrote:

I haven't thought of this. It's not bad, actually. Unfortunately, this wouldn't allow me to extract the PGNs from the diagrams posted in the comments, which is something that would be very useful for me.

O/k but you didn't include that requirement in your 'specification' above?

WhiteDrake
stephen_33 wrote:
WhiteDrake wrote:

I haven't thought of this. It's not bad, actually. Unfortunately, this wouldn't allow me to extract the PGNs from the diagrams posted in the comments, which is something that would be very useful for me.

O/k but you didn't include that requirement in your 'specification' above?

You're right, I didn't. Maybe I should have. It's something I can extract reasonably easily from the html content, so my question only focused on how to get the (html) content. Now that I'm writing this, I see that I should have specified at least that. But then again, if there was a library that would parse the chess.com posts into JSON or some other data structure, that would also be fine - I don't need the content to be in html per se. Anyway, I've edited the original post.

krazeechess

You could use requests and BeautifulSoup. That's what I personally used for web scraping.

WhiteDrake
krazeechess wrote:

You could use requests and BeautifulSoup. That's what I personally used for web scraping.

Looks good, as long as I can get around the captchas. Will give it a try. Thx for the tip.

binomine

You can contact support and get permission for such a bot, as long as the bot does not play games or suggest moves. I would recommend being proactive and just contacting them. 

WhiteDrake
binomine wrote:

You can contact support and get permission for such a bot, as long as the bot does not play games or suggest moves. I would recommend being proactive and just contacting them. 

Good idea! I'll do that.