Reading chess player game data into a PySpark DataFrame is not working

sachinsksm99

Hi,

The chess.com player games data is not being read properly into a Spark DataFrame. How do I fix this? My code is below:

Code to save the game data as a CSV file and then read it with Spark:

```
import pandas as pd
import requests

# Download erik's games and flatten the JSON into a pandas DataFrame
data_url = "https://api.chess.com/pub/player/erik/games"
games = requests.get(data_url).json()
df = pd.json_normalize(games["games"])
df["poi"] = "erik"
df.to_csv("data.csv", index=False)

# `spark` is an existing SparkSession
df = spark.read.format("csv").load(
    r'/media/disk2/isb_ras/sachin/chess/matches/games_larrygm.csv', header=True)
```

As you can see in the screenshot below, everything ends up in the first column and all the other columns are null. Any suggestions are appreciated.

 

Zhongli-kun_Keqing-chan

i dont understand coding lmao

cooperdetat

Hi - is it possible that the CSV file you are trying to read has header lines before the CSV portion of the file? Are you sure that the file is separated by commas (",") and not tabs ("\t")?
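
One quick way to check both of those is to look at the first few lines of the file and count the commas and tabs in each. This is only a rough sketch: `inspect_head` is a made-up helper name, and the sample text stands in for the real file.

```python
def inspect_head(text, n=3):
    """Return (line, comma_count, tab_count) for the first n lines of text."""
    return [(line, line.count(","), line.count("\t"))
            for line in text.splitlines()[:n]]

# Stand-in for the start of the real file. For an actual file, something like
#   with open("data.csv", encoding="utf-8") as f: text = f.read(2000)
# would give the text to inspect.
sample = 'url,pgn,poi\nhttps://example.com/game/1,"1. e4 e5",erik\n'

report = inspect_head(sample)
for line, commas, tabs in report:
    print(f"commas={commas} tabs={tabs} | {line}")
```

If the first line is not the header, or the counts differ wildly from line to line, that points to extra lines before the data or to fields that span several lines.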

Lego_Yodagaming

what is this code 

sachinsksm99

Hi, I have added a minimal reproducible example. @cooperdetat Please take a look at it. I believe the file is comma-separated. Let me know if you are able to read the file properly into a Spark DataFrame.

sachinsksm99

Please share some suggestions!

r2d2bb8
sachinsksm99 wrote:

Please share some suggestions!

```

import json
import requests
import pandas as pd

with open("eric.jsonl", "w+") as fout:
    for game in requests.get("https://api.chess.com/pub/player/erik/games").json()["games"]:
        json.dump(game, fout)
        print(file=fout)

df = pd.read_json("eric.jsonl", lines=True)
print(df.head())

```

sachinsksm99

@r2d2bb8 I know it works in pandas, but my requirement is to read the data into a PySpark DataFrame. The reason is that I have data for 2 billion chess games, and pandas is very slow at that scale.

 

backgroundtasks

The problem is, when you write the pandas dataframe to csv, each row's pgn field has multiple new lines, resulting in a new row with only one column when loaded in spark. If csv is not a requirement, I would suggest writing to json instead, e.g.,

df.to_json('data.json',orient='records')
dff = spark.read.format('json').load('data.json')
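
To see the effect without Spark, here is a small standard-library sketch. The two records are made-up stand-ins for the API output (the real games have many more fields). It counts how many physical lines each format produces for the same data.

```python
import csv
import io
import json

# Hypothetical sample records with a multi-line pgn field
games = [
    {"url": "https://example.com/game/1",
     "pgn": '[Event "Live Chess"]\n[White "erik"]\n\n1. e4 e5 1-0'},
    {"url": "https://example.com/game/2",
     "pgn": '[Event "Live Chess"]\n[Black "erik"]\n\n1. d4 d5 0-1'},
]

# CSV: the newlines inside pgn are kept verbatim inside a quoted field,
# so one record is spread over several physical lines.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["url", "pgn"])
writer.writeheader()
writer.writerows(games)
csv_lines = csv_buf.getvalue().splitlines()

# JSON Lines: json.dumps escapes each newline as the two characters \n,
# so every record stays on exactly one physical line.
jsonl_lines = [json.dumps(game) for game in games]

print("records:", len(games))
print("physical CSV lines (including header):", len(csv_lines))
print("physical JSON Lines lines:", len(jsonl_lines))
```

A reader that treats every physical line as a record will see far more CSV "rows" than there are games, most of them with a single column. If CSV has to stay, Spark's CSV reader does have a `multiLine` option, and because pandas escapes quotes by doubling them, setting `escape='"'` is probably needed as well. I have not tested that against this file.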

sachinsksm99
backgroundtasks wrote:

The problem is, when you write the pandas dataframe to csv, each row's pgn field has multiple new lines, resulting in a new row with only one column when loaded in spark. If csv is not a requirement, I would suggest writing to json instead, e.g.,

df.to_json('data.json',orient='records')
dff = spark.read.format('json').load('data.json')

Thanks!