Strange Match Endpoint Result - Help Appreciated

stephen_33

I'm currently involved in collating the results of the first RR in the TMCL 2018 tournament. The match endpoint API requests I've been making have gone without a hitch & I'm impressed at just how robust that system is. At present I'm not even bothering to trap errors relating to my API server requests & there hasn't been a single failure (out of several hundred), so kudos to the developers!

But I've hit a very small snag regarding the team name for the matches of just one of the groups taking part & it's this one:-

World's Best Chess Players - Лучшие Шахматисты в мире

I'm finding that the string I'm using, copied directly from the group's home page, doesn't correspond to the match endpoint name which is as follows:-

World's Best Chess Players - \u041b\u0443\u0447\u0448\u0438\u0435 \u0428\u0430\u0445\u043c\u0430\u0442\u0438\u0441\u0442\u044b \u0432 \u043c\u0438\u0440\u0435

I recognise the (Russian Cyrillic?) Unicode character codes used to represent the Cyrillic part of the name but I don't understand why this should be giving me a problem. We have a host of teams with names that contain non-Latin characters & they're all fine.

Another puzzle is this - when I do a comparison between the two strings using the command line of my Python (V3) interpreter like this...

"World's Best Chess Players - Лучшие Шахматисты в мире" == "World's Best Chess Players - \u041b\u0443\u0447\u0448\u0438\u0435 \u0428\u0430\u0445\u043c\u0430\u0442\u0438\u0441\u0442\u044b \u0432 \u043c\u0438\u0440\u0435"

..the result is True.

It's odd because my Python script is left treating the two names as if they're not the same but it appears to recognise them as the same on the command line, so what's going on?

* This team name involves the same process as the one above but causes no problems:

Захід

From a typical match endpoint for that group: "name":"\u0417\u0430\u0445\u0456\u0434"

skelos

I don't have an answer, but I'm always wary that in Unicode text (UTF-8 included) there are at least two "normal" forms. I gave up on Google's "Go" (golang) project, to which I was a contributor, because I couldn't rely on comparing strings for equivalence in an application. (They hand-waved about library support, and may have done something since. As the "system programming" language also didn't desire to support all OS system calls, I returned to Perl/Python and C.)

I guess being yelled at/told off/disagreed with by Rob Pike is some sort of career achievement, but I'd rather he'd come to see my side of things and made his new language useful.

 

That's a bit off topic. The point here is: are all Unicode characters and strings always normalised the same way?

stephen_33

I can't answer that last question Giles but I'm guessing there's some incompatibility in one or more of the Unicode characters used. As an afterthought I added an example of another name in Cyrillic that gets treated in much the same way but works perfectly.

The two names share at least two character-codes: \u0430 & \u0445

It's not a huge problem because it doesn't cause my script to crash, just produces duplicated output as if the two names belong to two separate teams. Irritating though because otherwise it's working very well.

skelos

Unicode is not a panacea. My very first job after graduating with a science degree with a major in computer science involved internationalisation (I18N) and localisation (L10N) and I've never quite escaped the experience. (Your sort in German doesn't work ... gosh, and it landed on Giles' desk.)

Good work identifying the two characters. With the multiplicity of languages, standards and I imagine storage that might help figure out what's going wrong, and if that's known a workaround (or even a fix, but don't get your hopes up) is closer.

stephen_33

I've just thought of a perfect fix for the problem - kick that group out of TMCL :-)

Perhaps I'll wait & see if the developers have any ideas first.

bcurtis

I'm not sure I follow — what's the problem?

'{"a":"Л"}' and '{"a":"\u041b"}' are the exact same document. http://sandbox.onlinephpfunctions.com/code/4dd3c6d5e562d4513369fc6573ac87a6d16fccb3

When you decode the JSON, your decoder ought to decode the escaped Unicode sequence. Are you instead parsing the JSON as a string and extracting the components you want?
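A minimal Python sketch of this point, using only the standard json module and the one-key document from the example above: once decoded, the escaped and unescaped spellings give the same string.

```python
import json

# The same JSON document written two ways: raw Cyrillic vs. a \uXXXX escape.
raw = '{"a":"Л"}'
escaped = '{"a":"\\u041b"}'

decoded_raw = json.loads(raw)
decoded_escaped = json.loads(escaped)

# After decoding, both values are the one-character string 'Л' (U+041B).
print(decoded_raw == decoded_escaped)   # True
print(len(decoded_escaped["a"]))        # 1
```

So if a script goes through json.loads, the escapes themselves should not be the reason two names fail to match.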

 

It is possible for us to respond as UTF-8, but since Javascript is UTF-16 by default and different clients may not recognize the HTTP headers when saving the data locally, we thought that delivering the escaped data was more compatible. Thoughts on this?

stephen_33

In fact I've realised that I forgot to mention something quite vital about the way I'm comparing one string (from the match endpoint data) with the other derived from the group name on its home page.

I'm using the team name as a Python dictionary key. The key value is the cumulative score for the number of matches won by the team but due to match aborts (etc.) I'm having to add points from an input file & these are identified by the normal name string.

So it's when my script comes to increment the key-value, it doesn't recognise the key as being the same string & under that circumstance my script generates a new key-value.

I believe Python keys are in unicode format but I save all my input data in UTF8 - might there be some mis-match there?

This is the section of code I'm using for the download & decoding:-

import json
from urllib.request import urlopen

url = 'https://api.chess.com/pub/match/' + match_id  # match_id: the match's ID as a string

with urlopen(url) as response:
    # Read the whole response body and decode the binary data to text.
    text = response.read().decode('utf-8')

data = json.loads(text) # Dictionary holding all data on match

stephen_33

There are a few other teams in TMCL with Cyrillic characters in their names & I've had no similar problem with them. Examples:-

  • The Volga Team - Поволжье
  • ⇚✯УРАЛ✯URAL✯⇛
  • Захід

And they're being handled in much the same way but without generating duplicate keys.

skelos

Speculating slightly, perhaps the input or output to your local storage needs to be tagged Stephen?

Unicode being something of a retrofit (hey, it came along after the languages themselves), there are some hoops to jump through for Perl (which I am using with api.chess.com) and C (which I'm not).

When I write and read back local data I am tagging it as utf-8, that being best for my use.

@bcurtis: I am both pleased and chagrined not to even have known what you were sending. Which means it has been "just working", although I'd be wary about using non-ASCII filenames; for more historical reasons MacOS has quirks in its filesystem naming.

I shall have to try Лучшие Шахматисты в мире later but other things to do first. Good luck!

stephen_33

Well I think I'm probably 'tagging' all input correctly as UTF8 but then I'm not sure what tagging is?

Here's the file open statement for my own text data input:-

my_input_file = open("TMCL_inp.txt", "r", encoding="utf-8-sig")

so my script should be reading all that data as UTF-8, I assume. (The '-sig' deals with the byte order mark, the extra bytes at the start of the file that indicate UTF format.)
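A small sketch of what the '-sig' variant changes, assuming the input file was saved with a byte order mark (the sample bytes below are made up for illustration, not read from TMCL_inp.txt):

```python
import codecs

# A UTF-8 file saved with a byte order mark starts with the bytes EF BB BF.
data = codecs.BOM_UTF8 + "Захід".encode("utf-8")

plain = data.decode("utf-8")      # keeps the BOM as a leading U+FEFF character
sig = data.decode("utf-8-sig")    # strips the BOM if one is present

print(plain == "Захід")   # False: the string starts with an invisible U+FEFF
print(sig == "Захід")     # True
```

A stray BOM would only affect whatever comes first in the file, so it could break at most the first name read.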

And as for the API request:-

line = line.decode('utf-8') ... also stores that data in UTF8 format I think?

I'd expect a string, inputted from my own text file, to match one from the match endpoint, wouldn't you? But as I explained above, I'm using the team name as a dictionary key as well & it's when I reference the World's Best Chess Players - Лучшие Шахматисты в мире key that the problem arises. It's not being recognised properly & then my script generates a new key as it would for a new team.

I think we have around 90 teams in this tournament & it's frustrating to have just one that causes this problem.

skelos

Sounds sensible. I'm still not a Python guru. Got another job today; if I can grab a moment over the next couple of days I'll figure out what Python's doing. The escaped characters look like 16-bit values (four hex digits each), which is enough for these Cyrillic letters. Tricky if you can compare that with a UTF-8 string, but it's sensible not crazy if it works that way.

I always get wary of Unicode even for what we used to call "Latin1" alphabets. "ü" in Unicode can be "u" followed by a combining umlaut, or u-with-umlaut. Semantically the same, but not the same in bits or bytes! Thus normalisation comes into the picture. Which is why Google's "go" (golang) was broken from the start by using UTF-8 for strings but not insisting they be normalised.
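The "ü" case can be reproduced in Python with the standard unicodedata module. This is only a sketch of the general normalisation issue, not a diagnosis of the team-name problem:

```python
import unicodedata

composed = "\u00fc"      # 'ü' as a single code point
decomposed = "u\u0308"   # 'u' followed by COMBINING DIAERESIS

# The two look identical on screen but are different code point sequences.
print(composed == decomposed)                                  # False

# Normalising both sides to the same form (here NFC) makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)                                         # True

# NFD goes the other way, splitting the composed character apart.
nfd = unicodedata.normalize("NFD", composed)
print(nfd == decomposed)                                       # True
```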

 

I'd like to see the answer found (even if I find it :-) so that it can be added to the Python thread.

I'll think too (but not right now) if I should add some UTF-8 material to the Perl thread. Without enabling UTF-8 perl will throw warnings trying to write those "wide" characters.

stephen_33

As a matter of interest, what happens in Perl if you try the same thing? I assume it uses a dictionary  type of data object or has an equivalent one, indexed by keys?

* I think this problem might be even more peculiar because the ranked listing of teams (per sub-div) that you see in my TMCL posts were assembled from the keys themselves. That's to say reading those keys back seems to give identical results but accessing one of the keys gives a mis-match. Go figure!
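One way to investigate a mismatch like this is to print the repr() of both strings and find the first index where they differ. A minimal sketch; first_difference and the two sample names are hypothetical, not taken from the script:

```python
def first_difference(a, b):
    """Return the index of the first position where a and b differ, or None if equal."""
    for i, (ca, cb) in enumerate(zip(a, b)):
        if ca != cb:
            return i
    if len(a) != len(b):
        # One string is a prefix of the other; they differ where the shorter one ends.
        return min(len(a), len(b))
    return None

# Hypothetical names that look the same once a browser has rendered them.
from_file = "Team A"
from_api = "Team  A"

# repr() shows the exact contents, including extra spaces.
print(repr(from_file))
print(repr(from_api))

idx = first_difference(from_file, from_api)
print(idx)   # 5

# ascii() exposes the code point of any non-ASCII character at that position.
if idx is not None and idx < len(from_api):
    print(ascii(from_api[idx]))
```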

skelos

Python dict is equivalent to an associative array in perl (or awk). I'd not like to feed awk Unicode, but I'll give it a whirl with Perl.

Let's see ... with two variables set to the left and right hand strings of your == perl says "not equal" when I use a string comparison (the "cmp" operator is for strings, == for numeric comparisons in perl).

Thus I'd have to look and put those strings into some sort of normal form, and that would have me diving for the documentation.

Right now of course if the strings are different they'd be different keys in an associative array aka "hash" aka dict.

Giles

stephen_33

I think I've discovered the problem & found a fix. I tried downloading a typical match endpoint for the problem club & comparing it in Python with the home-page club name - they don't match.

When I print them out I get the following:-

.

Home-page name:  World's Best Chess Players - Лучшие Шахматисты в мире
API name:        World's Best Chess Players  -  Лучшие Шахматисты в мире


Home-page name:  World's Best Chess Players - Лучшие Шахматисты в мире
API name:        World's Best Chess Players - Лучшие Шахматисты в мире

.

The difference is easy to spot & has nothing to do with the Cyrillic characters - that was just a red herring. The API name has two spaces on each side of the hyphen where the home-page name has one. I'll need the developers to explain why there are double spaces in the stored club name.

Of course browsers collapse runs of spaces when rendering HTML, which is why you don't see the double spaces in any browser window, including when you open the match endpoint!

The fix I'm using is Python's string replace method: team1 = team1.replace("__", " "), where each underscore stands for a space, so two spaces become one.
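A slightly more general variant of that fix, as a sketch: collapse any run of whitespace before using the name as a dictionary key. The names normalise_name, scores and add_points are hypothetical and only stand in for the real script.

```python
def normalise_name(name):
    """Collapse every run of whitespace to a single space and trim both ends."""
    return " ".join(name.split())

scores = {}  # team name -> cumulative score

def add_points(team, points):
    # Normalise before the lookup so both spellings share one key.
    key = normalise_name(team)
    scores[key] = scores.get(key, 0) + points

# Name as stored in the API (double spaces around the hyphen).
add_points("World's Best Chess Players  -  Лучшие Шахматисты в мире", 1)
# Name as copied from the web page (single spaces).
add_points("World's Best Chess Players - Лучшие Шахматисты в мире", 0.5)

print(len(scores))   # 1
```

Unlike a single replace of two spaces with one, this also copes with three or more spaces in a row and with leading or trailing spaces.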

skelos

Stephen, I'm not quite sure what you're comparing. Obviously, you've found the answer, great, but the club "name" I see is the same for the club profile and within a match:

https://api.chess.com/pub/club/worlds-best-chess-players

{"@id":"https://api.chess.com/pub/club/worlds-best-chess-players","name":"World's Best Chess Players  -  \u041b\u0443\u0447\u0448\u0438\u0435 \u0428\u0430\u0445\u043c\u0430\u0442\u0438\u0441\u0442\u044b \u0432 \u043c\u0438\u0440\u0435",...

https://api.chess.com/pub/match/885160

"@id":"https://api.chess.com/pub/match/885160","name":"TMCL 2018 C3 R4: World's Best Chess Players vs \u0417\u0430\u0445\u0456\u0434","url":"https://www.chess.com/club/matches/885160","start_time":1523808294,"status":"in_progress","boards":38,"settings":{"rules":"chess","time_class":"daily","time_control":"1/259200","min_team_players":20,"min_required_games":0,"autostart":false},"teams":{"team1":{"@id":"https://api.chess.com/pub/club/worlds-best-chess-players","name":"World's Best Chess Players  -  \u041b\u0443\u0447\u0448\u0438\u0435 \u0428\u0430\u0445\u043c\u0430\u0442\u0438\u0441\u0442\u044b \u0432 \u043c\u0438\u0440\u0435",

 

skelos

I see two spaces in both those instances. There may be some third place where the whitespace gets "squished"?

stephen_33

I usually copy & paste a club's name from their (web) profile page. Only single spaces are displayed in web content unless special provision is made.

But I think the spelling & spacing of a club's name should be the same in both API endpoints & in browser windows? Display the match endpoint in a browser window & it doesn't match the actual endpoint - that's to say if you copy+paste the name from there, you end up with the wrong string.

skelos

Sounds fair, although it might be tricky to do in this case, since what's stored and supplied to api.chess.com is not quite what is shown on the website. The (my?) rule of thumb is that api.chess.com should match the website unless there is a very good reason not to.

Glad the problem's isolated at least! Often the hardest part.

stephen_33

I've messaged bcurtis about this & I'm very interested to see what he says. At the very least I think the API published data page needs a warning that names shown on the website may not exactly match those in the endpoints.

skelos

I await the answer with interest. That could have caught me too, I'm pretty sure. As it is I did report one bug and a documentation error while working on the report yesterday.

Nothing like using an endpoint in anger to wring out some bugs. 😎🐛🐜🕷🐝🦂