Reading chess player game data into a PySpark DataFrame is not working - Chess Forums

sachinsksm99

May 3, 2021

0

#1

Hi,

The chess.com player games data is not being read properly into the Spark DataFrame. How do I fix this? My code is below:

Code to save the game data in a CSV file format:

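```
import pandas as pd
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Download the games and save them as CSV with pandas
data_url = "https://api.chess.com/pub/player/erik/games"
games = requests.get(data_url).json()
df = pd.json_normalize(games["games"])
df["poi"] = "erik"
df.to_csv("data.csv", index=False)

# Read the CSV back with Spark
df = spark.read.format("csv").load(
    r"/media/disk2/isb_ras/sachin/chess/matches/games_larrygm.csv", header=True)
```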

As you can see in the screenshot below, everything ends up in the first column and all the other columns are null. Any suggestions are appreciated.

Zhongli-kun_Keqing-chan

May 3, 2021

0

#2

i dont understand coding lmao

cooperdetat

May 3, 2021

0

#3

Hi - is it possible that the CSV file you are trying to read has header lines before the CSV portion of the file? Are you sure that the file is separated by commas (",") and not tabs ("\t")?
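
One way to check both points is to look at the raw text before handing the file to Spark. This is only a minimal sketch using the standard library; the helper name `peek_csv` and the sample file content are made up for illustration and are not from your data:

```python
import os
import tempfile
from itertools import islice


def peek_csv(path, n=5):
    """Return the first n raw lines of a file and the likeliest delimiter."""
    with open(path, encoding="utf-8") as f:
        head = list(islice(f, n))
    first = head[0] if head else ""
    # Count each candidate delimiter in the first line; the most frequent wins
    # (a tie, including all zeros, falls back to the comma).
    counts = {d: first.count(d) for d in (",", "\t", ";")}
    return head, max(counts, key=counts.get)


# Demo on a small throwaway file (hypothetical content, not real game data).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("url,pgn,poi\nhttps://example.com/game/1,1. e4 e5,erik\n")
    head, delimiter = peek_csv(sample)
    print(repr(delimiter))
    for line in head:
        print(repr(line))
```

Printing the lines with `repr` makes tabs and stray header lines visible. Pointing `peek_csv` at your own CSV should show whether the first line really is the column header.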

Lego_Yodagaming

May 3, 2021

0

#4

what is this code

sachinsksm99

May 3, 2021

0

#5

Hi, I have added a minimal reproducible example. @cooperdetat, please take a look at it. I believe the file is comma-separated. Let me know if you are able to read the file into a Spark DataFrame properly.

sachinsksm99

May 3, 2021

0

#6

Please share some suggestions!

r2d2bb8

May 6, 2021

0

#7

sachinsksm99 wrote:

Please share some suggestions!

```

import json
import requests
import pandas as pd

with open("eric.jsonl", "w+") as fout:
    for game in requests.get("https://api.chess.com/pub/player/erik/games").json()["games"]:
        json.dump(game, fout)
        print(file=fout)

df = pd.read_json("eric.jsonl", lines=True)
print(df.head())

```

sachinsksm99

May 16, 2021

0

#8

@r2d2bb8 I know it works in pandas, but my requirement is to read the data into a PySpark DataFrame. The reason is that I have data for 2 billion chess games, and pandas is very slow at that scale.

backgroundtasks

May 16, 2021

0

#9

The problem is that when you write the pandas DataFrame to CSV, each row's pgn field contains multiple newlines, and Spark treats every one of those lines as a new row with only one column. If CSV is not a requirement, I would suggest writing to JSON instead, e.g.,

df.to_json('data.json',orient='records')
dff = spark.read.format('json').load('data.json')
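
If CSV does have to stay, the effect can be reproduced without Spark. The sketch below is only an illustration with two made-up games (the URLs and PGN text are hypothetical): it writes them the way pandas does by default, with quoted fields and `""` for embedded quotes, then compares the number of physical lines with the number of CSV records. The last function shows the Spark CSV reader options `multiLine` and `escape`, which I would expect to parse such a file; it needs pyspark and is not called in the demo, so treat it as untested against your data.

```python
import csv
import os
import tempfile

# Hypothetical sample: each pgn value spans several lines, like the API's.
rows = [
    {"url": "https://example.com/game/1",
     "pgn": '[Event "Live Chess"]\n[Site "Chess.com"]\n\n1. e4 e5', "poi": "erik"},
    {"url": "https://example.com/game/2",
     "pgn": '[Event "Live Chess"]\n\n1. d4 d5', "poi": "erik"},
]


def write_sample_csv(path):
    """Write the rows with quoted multi-line fields and "" for embedded quotes."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "pgn", "poi"],
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)


def count_physical_lines(path):
    """Count lines the way a line-based reader sees them."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in f)


def count_csv_records(path):
    """Count records the way a quote-aware CSV parser sees them."""
    with open(path, newline="", encoding="utf-8") as f:
        return sum(1 for _ in csv.DictReader(f))


def read_multiline_csv(spark, path):
    """Read a CSV whose quoted fields contain newlines (requires pyspark)."""
    return (
        spark.read
        .option("header", True)
        .option("multiLine", True)  # let a quoted field span several lines
        .option("escape", '"')      # embedded quotes are written as ""
        .csv(path)
    )


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    write_sample_csv(sample)
    print("physical lines:", count_physical_lines(sample))
    print("csv records:", count_csv_records(sample))
```

The file has far more physical lines than records, which matches the symptom of most columns coming back empty. As far as I know, Spark cannot split a single file across tasks when `multiLine` is on, so for a very large data set JSON is probably still the easier route.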

sachinsksm99

Jun 20, 2021

0

#10

backgroundtasks wrote:

The problem is that when you write the pandas DataFrame to CSV, each row's pgn field contains multiple newlines, and Spark treats every one of those lines as a new row with only one column. If CSV is not a requirement, I would suggest writing to JSON instead, e.g.,

df.to_json('data.json',orient='records')
dff = spark.read.format('json').load('data.json')

Thanks!