Describe the bug
WhoScored.read_schedule() fails with JSONDecodeError in soccerdata 1.9.0. The same workflow worked in my 1.8.8 environment.
This also affects WhoScored.read_missing_players() and WhoScored.read_events() when they need to call read_schedule() internally.
Expected behavior: read_schedule() should return the match schedule DataFrame instead of failing while decoding the response.
Python version: 3.12.13
Affected scrapers
This affects the following scrapers: WhoScored
Code example
A minimal code example that fails. I used no_cache=True to make sure an invalid cached file was not causing the bug.
import soccerdata as sd
ws = sd.WhoScored(leagues="ESP-La Liga", seasons="24-25", no_cache=True)
ws.read_schedule()
Error message
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Additional context
I reproduced this in soccerdata==1.9.0 but not in my soccerdata==1.8.8 environment.
The same underlying failure also affects:
ws.read_missing_players(match_id=...)
ws.read_events(match_id=...)
because both methods call read_schedule() before retrieving match-level data.
From a local comparison of the 1.8.8 and 1.9.0 source code, this may be related to a change in the common Selenium download path. In 1.8.8, requests with var=None returned document.body.innerHTML; in 1.9.0, the response goes through the new page validation path based on page_source.
WhoScored.read_schedule() then calls json.load(reader). If the response is now HTML-wrapped instead of raw JSON, this raises the observed JSONDecodeError.
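To illustrate the suspected mechanism without Selenium, here is a minimal sketch using only the standard library. The payload and the HTML wrapper are made up for illustration (the real response content and the exact markup the browser adds are assumptions), but the decoder behaves the same way for any input that starts with `<`:

```python
import json

# Hypothetical payload; the real WhoScored response is not reproduced here.
raw = '{"tournaments": []}'

# Assumed shape of an HTML-wrapped response, e.g. what a browser's
# page source may look like when it renders a JSON document.
wrapped = f"<html><head></head><body><pre>{raw}</pre></body></html>"

# Raw JSON parses fine.
parsed = json.loads(raw)

# The wrapped text fails on the very first character ("<").
try:
    json.loads(wrapped)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message printed here matches the one reported above, which is consistent with the HTML-wrapping hypothesis but does not by itself confirm it.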
Contributor Action Plan
Reproduction notebook
I also attached the notebook I used while reproducing the issue and checking the behavior in my environment:
Guía SoccerData (1.9.0).ipynb
Local workaround
I also found a local workaround that fixed the issue in my environment.
The patch adds a helper that first tries to parse the response as JSON. If that fails, it checks whether the response is HTML-wrapped and then extracts the text from the <body> before parsing it as JSON.
This fixed the failing WhoScored.read_schedule() call locally. Since read_missing_players() and read_events() call read_schedule() first, it also allowed those workflows to continue.
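For readers who do not want to open the attachment, the idea can be sketched roughly as follows. This is not the attached patch: the names `_BodyText` and `loads_maybe_html_wrapped` are hypothetical, and the sketch assumes the wrapped response contains the JSON as the only text inside `<body>`.

```python
import json
from html.parser import HTMLParser


class _BodyText(HTMLParser):
    """Collect the text found between <body> and </body> (hypothetical helper)."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True unescapes entities like &amp;
        self._in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body:
            self.chunks.append(data)


def loads_maybe_html_wrapped(text):
    """Parse text as JSON, falling back to the <body> text if it is HTML-wrapped."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Only fall back when the response looks like HTML; otherwise re-raise.
        if "<body" not in text.lower():
            raise
    parser = _BodyText()
    parser.feed(text)
    parser.close()
    return json.loads("".join(parser.chunks))
```

A quick check with a made-up payload:

```python
wrapped = '<html><head></head><body><pre>{"a": [1, 2], "b": "x &amp; y"}</pre></body></html>'
print(loads_maybe_html_wrapped(wrapped))
print(loads_maybe_html_wrapped('{"a": 1}'))
```

Keeping the plain `json.loads` attempt first means raw JSON responses, such as those cached by 1.8.8, follow the same path as before.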
I am attaching the modified whoscored.py file for reference. I understand that this may not be the preferred final implementation, and that the maintainers may prefer to fix this in the common Selenium reader instead.
whoscored_issue_940_local_patch.py