For now, I’ve organized the current situation (what’s happening in the code and the proposed fixes).
Short answer:
- The Space broke because GAIA changed how it ships data, but the Space still assumes an old layout.
- GAIA now stores file_path as a repo-relative path and no longer provides a loader script that materializes local files for you. (Hugging Face)
- The Space checks os.path.exists(file_path) directly, so it never finds any file, never fills task_file_paths, and /files/{task_id} always returns 404. (Hugging Face)
- The minimal fix: in load_questions(), replace the os.path.exists logic with a small hf_hub_download(...) call that turns the GAIA file_path into a real local path under /app/.cache.
Below is the explanation step by step, plus a concrete small patch.
1. What the Space is supposed to do
The scoring Space has three public endpoints (as described in the course materials and forum reply).(Hugging Face Forums)
- GET /questions
  - Loads GAIA (gaia-benchmark/GAIA, config 2023_level1, split validation).
  - Filters down to “simple enough” tasks by annotator metadata.
  - Exposes a list of questions: task_id, question, Level, and (optionally) file_name.
- GET /files/{task_id}
  - Uses an internal map task_id → local_file_path (task_file_paths).
  - Serves the file (image/audio/xlsx/py/etc.) with proper MIME type.
- POST /submit
  - Checks your answers against Final answer.
  - Records your score in agents-course/unit4-students-scores.
All of this is wired in main.py of agents-course/Unit4_scoring.(Hugging Face)
The /files/{task_id} endpoint depends completely on task_file_paths being filled correctly at startup.
2. What changed on the GAIA side
Two important changes happened on GAIA’s side:
- Datasets 4.0 removed dataset-loader scripts.
  GAIA used to have a GAIA.py loader and had to be converted to a “script-free” dataset. The maintainers explicitly mention the need to drop the script and move to Parquet. (Hugging Face)
- GAIA now ships Parquet + repo-relative file_path.
  The updated GAIA dataset card (October 2025) says: (Hugging Face)
  - Splits are now Parquet: metadata.level1.parquet etc.
  - Columns remain task_id, Question, Level, Final answer, file_name, file_path, Annotator Metadata.
  - Crucial sentence: “file_path keeps pointing to attachments relative to the repository root (for example, 2023/test/<attachment-id>.pdf).”
  - The recommended loading pattern is:
        from datasets import load_dataset
        from huggingface_hub import snapshot_download
        import os

        data_dir = snapshot_download(repo_id="gaia-benchmark/GAIA", repo_type="dataset")
        dataset = load_dataset(data_dir, "2023_level1", split="test")
        for example in dataset:
            file_path = os.path.join(data_dir, example["file_path"])
So: GAIA never promises that file_path is an absolute local path. It is a relative path inside the dataset repo.
3. How the Space currently loads GAIA and files
Look at the Space’s load_questions() in main.py: (Hugging Face)
    dataset = load_dataset("gaia-benchmark/GAIA",
                           "2023_level1",
                           split="validation",
                           trust_remote_code=True)
    ...
    local_file_path = item.get('file_path')
    file_name = item.get('file_name')
    ...
    # 3. Store the file path mapping if file details exist and are valid
    if local_file_path and file_name:
        # Log if the path from the dataset isn't absolute (might indicate issues)
        if not os.path.isabs(local_file_path):
            logger.warning(
                f"Task {task_id}: Path '{local_file_path}' from dataset is not absolute. "
                "This might cause issues finding the file on the server."
            )
        if os.path.exists(local_file_path) and os.path.isfile(local_file_path):
            task_file_paths[str(task_id)] = local_file_path
            logger.debug(f"Stored file path mapping for task_id {task_id}: {local_file_path}")
        else:
            logger.warning(
                f"File path '{local_file_path}' for task_id {task_id} does NOT exist or is not a file on server. "
                "Mapping skipped."
            )
Key points:
- It reads file_path from GAIA into local_file_path.
- It warns if this path is not absolute (so it expects absolute paths).
- Then it directly calls os.path.exists(local_file_path) and only stores it if that is true.
Later, /files/{task_id} does:
    if task_id not in task_file_paths:
        raise HTTPException(status_code=404,
                            detail=f"No file path associated with task_id {task_id}.")
    ...
    abs_file_path = os.path.abspath(local_file_path)
    if not abs_file_path.startswith(ALLOWED_CACHE_BASE):
        raise HTTPException(status_code=403, detail="File access denied.")
    if not os.path.exists(abs_file_path) or not os.path.isfile(abs_file_path):
        raise HTTPException(status_code=404,
                            detail=f"File associated with task_id {task_id} not found on server disk.")
    return FileResponse(path=abs_file_path, ...)
So /files/{task_id} will only work if:
- task_file_paths[task_id] exists, and
- that path points to a real file already on disk under /app/.cache.
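The request-time checks above can be condensed into a small standalone sketch. This is not the Space's code: file_status is a hypothetical helper that returns the status code the checks would lead to instead of raising HTTPException, and a temporary directory stands in for /app/.cache so it runs without FastAPI.

```python
import os
import tempfile


def file_status(task_id, task_file_paths, allowed_base):
    """Return the HTTP status the /files/{task_id} checks would lead to."""
    # No mapping was stored for this task at startup.
    if task_id not in task_file_paths:
        return 404
    abs_file_path = os.path.abspath(task_file_paths[task_id])
    # Same prefix check as the endpoint: the file must live under the cache base.
    if not abs_file_path.startswith(allowed_base):
        return 403
    # A mapping exists, but the file is not on disk.
    if not os.path.isfile(abs_file_path):
        return 404
    return 200


# Stand-in for /app/.cache (assumption: any writable directory behaves the same).
cache_base = os.path.realpath(tempfile.mkdtemp())
attachment = os.path.join(cache_base, "attachment.png")
with open(attachment, "wb") as fh:
    fh.write(b"demo")

paths = {
    "has-file": attachment,
    "outside": os.path.join(cache_base, "..", "outside.png"),
    "deleted": os.path.join(cache_base, "gone.png"),
}

for task_id in ("has-file", "outside", "deleted", "unmapped"):
    print(task_id, file_status(task_id, paths, cache_base))
```

Only the first task reaches the point where a file would be served. The "unmapped" case is the one the broken Space hits for every attachment, because nothing is ever stored in task_file_paths.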
4. Why this breaks now
Combine the previous sections:
- GAIA now exposes file_path as repo-relative, e.g. "2023/test/abcd1234.png". (Hugging Face)
- The scoring Space neither downloads those files nor joins the path with the dataset root. It simply expects local_file_path to be an absolute path that already exists inside the container. (Hugging Face)
- At startup, for each question with an attachment:
  - local_file_path is something like "2023/test/abcd1234.png".
  - os.path.isabs("2023/test/abcd1234.png") is false, so it logs a warning.
  - os.path.exists("2023/test/abcd1234.png") is also false, because nothing at that relative path exists in the container filesystem.
  - So it skips adding an entry to task_file_paths.
  Result: task_file_paths ends up empty or nearly empty.
- At request time:
  - GET /files/{task_id} looks into task_file_paths.
  - It finds no entry and returns 404 "No file path associated with task_id ...".
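The startup behaviour is easy to reproduce locally. The sketch below is only an illustration: passes_startup_check is a hypothetical name for the Space's exists-and-isfile condition, the attachment name is the made-up example used above, and a temporary directory plays the role of a downloaded dataset snapshot.

```python
import os
import tempfile


def passes_startup_check(path):
    """Mirror the Space's condition for storing a task_id -> path mapping."""
    return os.path.exists(path) and os.path.isfile(path)


# Hypothetical repo-relative value, in the shape GAIA now provides.
relative_path = os.path.join("2023", "test", "abcd1234.png")

# Fake "snapshot root" that actually contains the attachment.
snapshot_root = tempfile.mkdtemp()
joined_path = os.path.join(snapshot_root, relative_path)
os.makedirs(os.path.dirname(joined_path))
with open(joined_path, "wb") as fh:
    fh.write(b"demo")

# The relative path is resolved against the current working directory,
# where no such file is expected to exist, so the mapping would be skipped.
print("absolute?", os.path.isabs(relative_path))
print("relative path passes:", passes_startup_check(relative_path))
# Joined with the snapshot root, the same value points at a real file.
print("joined path passes:", passes_startup_check(joined_path))
```

Assuming the working directory has no 2023/test/abcd1234.png of its own, only the joined path passes, which is the step the Space is missing.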
This matches the forum symptom: even using the correct bare task_id in the URL, users get 404 for tasks with valid file_name.(Hugging Face Forums)
So the Space is broken because its file-path handling is out of date with GAIA’s new Parquet + relative file_path design.
5. Minimal code fix inside the Space
Goal: keep the architecture, change as little as possible.
You already have hf_hub_download imported:
from huggingface_hub import HfApi, hf_hub_download
So the smallest safe fix is to replace the “does this path exist locally?” logic with a call that resolves GAIA’s relative file_path into a real local file.
Patch: only change the mapping block in load_questions()
Current block (lines 224–238): (Hugging Face)
    # 3. Store the file path mapping if file details exist and are valid
    if local_file_path and file_name:
        # Log if the path from the dataset isn't absolute (might indicate issues)
        if not os.path.isabs(local_file_path):
            logger.warning(
                f"Task {task_id}: Path '{local_file_path}' from dataset is not absolute. "
                "This might cause issues finding the file on the server."
            )
        if os.path.exists(local_file_path) and os.path.isfile(local_file_path):
            task_file_paths[str(task_id)] = local_file_path
            logger.debug(f"Stored file path mapping for task_id {task_id}: {local_file_path}")
        else:
            logger.warning(
                f"File path '{local_file_path}' for task_id {task_id} does NOT exist or is not a file on server. "
                "Mapping skipped."
            )
Replace that block with:
    # 3. Store the file path mapping if file details exist and are valid
    if local_file_path and file_name:
        try:
            # GAIA's file_path is relative to the dataset repo root.
            # Download the file into the allowed cache and get its local path.
            resolved_path = hf_hub_download(
                repo_id="gaia-benchmark/GAIA",
                filename=local_file_path,  # e.g. "2023/test/<attachment-id>.pdf"
                repo_type="dataset",
                cache_dir=ALLOWED_CACHE_BASE,
            )
            task_file_paths[str(task_id)] = resolved_path
            logger.debug(
                f"Stored file path mapping for task_id {task_id}: {resolved_path}"
            )
        except Exception as e:
            logger.warning(
                f"Could not download file '{local_file_path}' for task_id {task_id}: {e}. "
                "Mapping skipped."
            )
Optional one-liner near the top of load_questions() (after task_file_paths.clear()):
os.makedirs(ALLOWED_CACHE_BASE, exist_ok=True)
Why this is enough
- Resolves GAIA semantics correctly
  - GAIA promises file_path is a path inside the dataset repo. (Hugging Face)
  - hf_hub_download(repo_id="gaia-benchmark/GAIA", filename=local_file_path, repo_type="dataset", ...) downloads exactly that file from the GAIA repo and returns its local path. (Hugging Face)
- Keeps security checks intact
  - You set cache_dir=ALLOWED_CACHE_BASE so the returned resolved_path will be under /app/.cache/....
  - The /files/{task_id} endpoint already checks abs_file_path.startswith(ALLOWED_CACHE_BASE). That continues to work and blocks path traversal. (Hugging Face)
- No changes to external API
  - /questions response stays the same.
  - /files/{task_id} stays the same URL shape and behavior, but now it can actually find files.
  - /submit is unaffected.
- Minimal surface change
  - You do not touch:
    - The load_dataset("gaia-benchmark/GAIA", ...) call.
    - The FastAPI route definitions.
    - The scoring logic or leaderboard update code.
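If you want to exercise the mapping logic without network access or a GAIA token, the same idea can be pulled into a small function that takes the downloader as a parameter. This is a sketch under assumptions, not part of the Space: build_file_map and fake_download are hypothetical names, the sample items are invented, and "/app/.cache" is assumed to be the value of ALLOWED_CACHE_BASE.

```python
import logging

logger = logging.getLogger(__name__)


def build_file_map(items, download, repo_id="gaia-benchmark/GAIA", cache_dir="/app/.cache"):
    """Map task_id -> local path for every item that has an attachment.

    `download` is expected to accept the same keyword arguments as
    huggingface_hub.hf_hub_download and to return a local file path.
    """
    file_map = {}
    for item in items:
        task_id = item.get("task_id")
        file_path = item.get("file_path")
        file_name = item.get("file_name")
        if not (task_id and file_path and file_name):
            continue  # question without an attachment
        try:
            file_map[str(task_id)] = download(
                repo_id=repo_id,
                filename=file_path,
                repo_type="dataset",
                cache_dir=cache_dir,
            )
        except Exception as exc:  # gating, network, missing file, ...
            logger.warning("Could not download %r for task %s: %s", file_path, task_id, exc)
    return file_map


def fake_download(repo_id, filename, repo_type, cache_dir):
    """Offline stand-in for hf_hub_download, for demonstration only."""
    if filename.endswith("missing.pdf"):
        raise FileNotFoundError(filename)
    return f"{cache_dir}/{filename}"


# Invented sample rows in the shape of GAIA items.
sample_items = [
    {"task_id": "t1", "file_name": "a.png", "file_path": "2023/validation/a.png"},
    {"task_id": "t2", "file_name": "", "file_path": ""},
    {"task_id": "t3", "file_name": "missing.pdf", "file_path": "2023/validation/missing.pdf"},
]

demo_map = build_file_map(sample_items, fake_download)
print(demo_map)
```

In the Space itself, passing hf_hub_download as the download argument should give the same behavior as the patch above: one failed attachment is logged and skipped rather than aborting the whole load.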
6. Sanity checks after patch
After deploying the patched Space, do three quick manual tests:
- Check questions still load
      curl https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space/questions | head
  You should see JSON with task_id, question, Level, and sometimes file_name.
- Pick a known multimodal task
  From /questions, identify a question where file_name is not null (for example an image or mp3).
- Call /files/{task_id}
      curl -I "https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space/files/<that-task-id>"
  Expected:
  - HTTP status 200 OK.
  - A reasonable Content-Type (e.g. image/png, audio/mpeg, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, etc.).
If you still get 404, log output from load_questions() will tell you whether hf_hub_download failed (dataset gating, network, token, etc.).
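The same three checks can be scripted. The sketch below is an assumption-laden convenience, not an official client: tasks_with_files and check_attachments are hypothetical names, the base URL is copied from the curl commands above, and /questions is assumed to return a JSON list of objects. It sends GET requests rather than the HEAD request that curl -I sends, in case the route only answers GET.

```python
import json
import urllib.error
import urllib.request

# Base URL copied from the curl commands above.
BASE_URL = "https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space"


def tasks_with_files(questions):
    """Return the task_ids of questions whose file_name is set."""
    return [q["task_id"] for q in questions if q.get("file_name")]


def check_attachments(base_url=BASE_URL, limit=3, timeout=30):
    """Fetch /questions, then report (status, Content-Type) for a few files."""
    with urllib.request.urlopen(f"{base_url}/questions", timeout=timeout) as resp:
        questions = json.load(resp)
    results = {}
    for task_id in tasks_with_files(questions)[:limit]:
        try:
            with urllib.request.urlopen(f"{base_url}/files/{task_id}", timeout=timeout) as resp:
                results[task_id] = (resp.status, resp.headers.get("Content-Type"))
        except urllib.error.HTTPError as err:
            # 404 here means the mapping is still missing for this task.
            results[task_id] = (err.code, None)
    return results


# Offline demonstration of the filtering step with invented sample data.
sample_questions = [
    {"task_id": "a", "question": "q1", "Level": "1", "file_name": "a.png"},
    {"task_id": "b", "question": "q2", "Level": "1", "file_name": ""},
]
print(tasks_with_files(sample_questions))
```

Calling check_attachments() against the patched Space should report status 200 for each sampled task; a 404 for every task suggests the download step in load_questions() is still failing.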
Final summary
- The Space broke because it assumes GAIA’s file_path is an absolute local path, but GAIA now defines file_path as a relative path inside the dataset repo, with Parquet-backed splits and no loader script. (Hugging Face)
- At startup, the Space never downloads those files or joins them with a dataset root, so os.path.exists(file_path) fails for every attachment, task_file_paths stays empty, and /files/{task_id} returns 404. (Hugging Face)
- The minimal fix is to replace the os.path.exists block in load_questions() with a call to hf_hub_download(repo_id="gaia-benchmark/GAIA", filename=local_file_path, repo_type="dataset", cache_dir=ALLOWED_CACHE_BASE), then store that returned path in task_file_paths.
- This respects GAIA’s new format, keeps security checks and public API unchanged, and restores working attachments for /files/{task_id}.