Here is a summary of the current situation: what is happening in the code, and a proposed fix.
Short answer:
- The Space broke because GAIA changed how it ships data, but the Space still assumes the old layout.
- GAIA now stores file_path as a repo-relative path and no longer provides a loader script that materializes local files for you. (Hugging Face)
- The Space checks os.path.exists(file_path) directly, so it never finds any file, never fills task_file_paths, and /files/{task_id} always returns 404. (Hugging Face)
- The minimal fix: in load_questions(), replace the os.path.exists logic with a small hf_hub_download(...) call that turns the GAIA file_path into a real local path under /app/.cache.
Below is the explanation step by step, plus a concrete small patch.
1. What the Space is supposed to do
The scoring Space has three public endpoints (as described in the course materials and forum reply). (Hugging Face Forums)
- GET /questions
  - Loads GAIA (gaia-benchmark/GAIA, config 2023_level1, split validation).
  - Filters down to “simple enough” tasks by annotator metadata.
  - Exposes a list of questions: task_id, question, Level, and (optionally) file_name.
- GET /files/{task_id}
  - Uses an internal map task_id → local_file_path (task_file_paths).
  - Serves the file (image/audio/xlsx/py/etc.) with the proper MIME type.
- POST /submit
  - Checks your answers against Final answer.
  - Records your score in agents-course/unit4-students-scores.
All of this is wired in main.py of agents-course/Unit4_scoring. (Hugging Face)
The /files/{task_id} endpoint depends completely on task_file_paths being filled correctly at startup.
2. What changed on the GAIA side
Two important changes happened on GAIA’s side:
- Datasets 4.0 removed dataset-loader scripts.
  GAIA used to have a GAIA.py loader and had to be converted to a “script-free” dataset. The maintainers explicitly mention the need to drop the script and move to Parquet. (Hugging Face)
- GAIA now ships Parquet + repo-relative file_path.
  The updated GAIA dataset card (October 2025) says: (Hugging Face)
  - Splits are now Parquet: metadata.level1.parquet etc.
  - Columns remain task_id, Question, Level, Final answer, file_name, file_path, Annotator Metadata.
  - Crucial sentence: “file_path keeps pointing to attachments relative to the repository root (for example, 2023/test/<attachment-id>.pdf).”
  - The recommended loading pattern is:

    from datasets import load_dataset
    from huggingface_hub import snapshot_download
    import os

    data_dir = snapshot_download(repo_id="gaia-benchmark/GAIA", repo_type="dataset")
    dataset = load_dataset(data_dir, "2023_level1", split="test")
    for example in dataset:
        file_path = os.path.join(data_dir, example["file_path"])
So: GAIA never promises that file_path is an absolute local path. It is a relative path inside the dataset repo.
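To make "relative to the repository root" concrete, here is a minimal standard-library sketch. The helper name resolve_attachment, the temporary directory standing in for the snapshot root, and the attachment name are all assumptions for illustration; in real use data_dir would be whatever snapshot_download returns.

```python
import os
import tempfile


def resolve_attachment(data_dir: str, file_path: str) -> str:
    """Join a repo-relative GAIA file_path with the local dataset root."""
    if os.path.isabs(file_path):
        raise ValueError(f"expected a repo-relative path, got {file_path!r}")
    return os.path.join(data_dir, file_path)


# Stand-in for the directory returned by snapshot_download (hypothetical layout).
data_dir = tempfile.mkdtemp()
relative = os.path.join("2023", "validation", "example.png")  # made-up attachment

# Create a dummy file where the snapshot would have placed the attachment.
os.makedirs(os.path.dirname(os.path.join(data_dir, relative)))
with open(os.path.join(data_dir, relative), "wb") as fh:
    fh.write(b"fake image bytes")

resolved = resolve_attachment(data_dir, relative)
print(os.path.isabs(relative))   # the raw dataset value is not an absolute path
print(os.path.isfile(resolved))  # joined with the root, it points at a real file
```

The raw value only becomes usable once it is joined with a root directory that actually contains the downloaded repo contents.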
3. How the Space currently loads GAIA and files
Look at the Space’s load_questions() in main.py: (Hugging Face)
dataset = load_dataset("gaia-benchmark/GAIA",
                       "2023_level1",
                       split="validation",
                       trust_remote_code=True)
...
local_file_path = item.get('file_path')
file_name = item.get('file_name')
...
# 3. Store the file path mapping if file details exist and are valid
if local_file_path and file_name:
    # Log if the path from the dataset isn't absolute (might indicate issues)
    if not os.path.isabs(local_file_path):
        logger.warning(
            f"Task {task_id}: Path '{local_file_path}' from dataset is not absolute. "
            "This might cause issues finding the file on the server."
        )
    if os.path.exists(local_file_path) and os.path.isfile(local_file_path):
        task_file_paths[str(task_id)] = local_file_path
        logger.debug(f"Stored file path mapping for task_id {task_id}: {local_file_path}")
    else:
        logger.warning(
            f"File path '{local_file_path}' for task_id {task_id} does NOT exist or is not a file on server. "
            "Mapping skipped."
        )
Key points:
- It reads file_path from GAIA into local_file_path.
- It warns if this path is not absolute (so it expects absolute paths).
- Then it directly calls os.path.exists(local_file_path) and only stores the path if that is true.
Later, /files/{task_id} does: (Hugging Face)
if task_id not in task_file_paths:
    raise HTTPException(status_code=404,
                        detail=f"No file path associated with task_id {task_id}.")
...
abs_file_path = os.path.abspath(local_file_path)
if not abs_file_path.startswith(ALLOWED_CACHE_BASE):
    raise HTTPException(status_code=403, detail="File access denied.")
if not os.path.exists(abs_file_path) or not os.path.isfile(abs_file_path):
    raise HTTPException(status_code=404,
                        detail=f"File associated with task_id {task_id} not found on server disk.")
return FileResponse(path=abs_file_path, ...)
So /files/{task_id} will only work if:
- task_file_paths[task_id] exists, and
- that path points to a real file already on disk under /app/.cache.
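The two conditions can be sketched as a small standalone function that returns the status code the endpoint would produce. This is a simplified model of the handler, not the Space's code: file_status, the temporary directories, and the task ids are made up for illustration, and the temp directory stands in for /app/.cache.

```python
import os
import tempfile


def file_status(task_id: str, task_file_paths: dict, allowed_base: str) -> int:
    """Return the HTTP status a /files lookup would produce (simplified model)."""
    if task_id not in task_file_paths:
        return 404  # no mapping was stored at startup
    abs_path = os.path.abspath(task_file_paths[task_id])
    if not abs_path.startswith(allowed_base):
        return 403  # path lies outside the allowed cache directory
    if not os.path.isfile(abs_path):
        return 404  # mapping exists but nothing is on disk
    return 200


allowed_base = os.path.abspath(tempfile.mkdtemp())  # stand-in for /app/.cache
other_dir = os.path.abspath(tempfile.mkdtemp())     # some directory outside it

inside = os.path.join(allowed_base, "attachment.png")
with open(inside, "wb") as fh:
    fh.write(b"x")

paths = {"task-a": inside, "task-b": os.path.join(other_dir, "attachment.png")}

for tid in ("task-a", "task-b", "task-c"):
    print(tid, file_status(tid, paths, allowed_base))
```

Only task-a satisfies both conditions; task-b is mapped but outside the allowed base, and task-c has no mapping at all.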
4. Why this breaks now
Combine the previous sections:
- GAIA now exposes file_path as repo-relative, e.g. "2023/test/abcd1234.png". (Hugging Face)
- The scoring Space never downloads those files nor joins the path with the dataset root. It simply expects local_file_path to be an absolute path that already exists inside the container. (Hugging Face)
- At startup, for each question with an attachment:
  - local_file_path is something like "2023/test/abcd1234.png".
  - os.path.isabs("2023/test/abcd1234.png") is false, so it logs a warning.
  - os.path.exists("2023/test/abcd1234.png") is also false, because nothing at that relative path exists in the container filesystem.
  - So it skips adding an entry to task_file_paths.
  Result: task_file_paths ends up empty or nearly empty.
- At request time:
  - GET /files/{task_id} looks into task_file_paths.
  - It finds no entry and returns 404 "No file path associated with task_id ...".
This matches the forum symptom: even when using the correct bare task_id in the URL, users get a 404 for tasks with a valid file_name. (Hugging Face Forums)
So the Space is broken because its file-path handling is out of date with GAIA’s new Parquet + relative file_path design.
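The startup failure can be reproduced offline with a few lines that apply the same existence check to rows shaped like GAIA records. The rows below are invented for illustration (the random file name just guarantees nothing with that name exists in the working directory); only the check itself mirrors the Space.

```python
import os
import uuid

# Made-up rows shaped like GAIA records; file_path is repo-relative.
items = [
    {"task_id": "t1", "file_name": "a.png",
     "file_path": f"2023/validation/{uuid.uuid4().hex}.png"},
    {"task_id": "t2", "file_name": "", "file_path": ""},  # question without attachment
]

task_file_paths = {}
for item in items:
    path, name = item.get("file_path"), item.get("file_name")
    if path and name:
        # Same test as the Space: keep the path only if it already exists locally.
        if os.path.exists(path) and os.path.isfile(path):
            task_file_paths[str(item["task_id"])] = path

print(os.path.isabs(items[0]["file_path"]))  # False
print(task_file_paths)                        # {}
```

With no entry stored for t1, any later lookup for that task can only hit the "no file path associated" branch.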
5. Minimal code fix inside the Space
Goal: keep the architecture, change as little as possible.
You already have hf_hub_download imported: (Hugging Face)
from huggingface_hub import HfApi, hf_hub_download
So the smallest safe fix is to replace the “does this path exist locally?” logic with a call that resolves GAIA’s relative file_path into a real local file.
Patch: only change the mapping block in load_questions()
Current block (lines 224–238): (Hugging Face)
# 3. Store the file path mapping if file details exist and are valid
if local_file_path and file_name:
    # Log if the path from the dataset isn't absolute (might indicate issues)
    if not os.path.isabs(local_file_path):
        logger.warning(
            f"Task {task_id}: Path '{local_file_path}' from dataset is not absolute. "
            "This might cause issues finding the file on the server."
        )
    if os.path.exists(local_file_path) and os.path.isfile(local_file_path):
        task_file_paths[str(task_id)] = local_file_path
        logger.debug(f"Stored file path mapping for task_id {task_id}: {local_file_path}")
    else:
        logger.warning(
            f"File path '{local_file_path}' for task_id {task_id} does NOT exist or is not a file on server. "
            "Mapping skipped."
        )
Replace that block with:
# 3. Store the file path mapping if file details exist and are valid
if local_file_path and file_name:
    try:
        # GAIA's file_path is relative to the dataset repo root.
        # Download the file into the allowed cache and get its local path.
        resolved_path = hf_hub_download(
            repo_id="gaia-benchmark/GAIA",
            filename=local_file_path,  # e.g. "2023/test/<attachment-id>.pdf"
            repo_type="dataset",
            cache_dir=ALLOWED_CACHE_BASE,
        )
        task_file_paths[str(task_id)] = resolved_path
        logger.debug(
            f"Stored file path mapping for task_id {task_id}: {resolved_path}"
        )
    except Exception as e:
        logger.warning(
            f"Could not download file '{local_file_path}' for task_id {task_id}: {e}. "
            "Mapping skipped."
        )
Optional one-liner near the top of load_questions() (after task_file_paths.clear()):
os.makedirs(ALLOWED_CACHE_BASE, exist_ok=True)
Why this is enough
- Resolves GAIA semantics correctly
  - GAIA promises file_path is a path inside the dataset repo. (Hugging Face)
  - hf_hub_download(repo_id="gaia-benchmark/GAIA", filename=local_file_path, repo_type="dataset", ...) downloads exactly that file from the GAIA repo and returns its local path. (Hugging Face)
- Keeps security checks intact
  - You set cache_dir=ALLOWED_CACHE_BASE so the returned resolved_path will be under /app/.cache/....
  - The /files/{task_id} endpoint already checks abs_file_path.startswith(ALLOWED_CACHE_BASE). That continues to work and blocks path traversal. (Hugging Face)
- No changes to external API
  - /questions response stays the same.
  - /files/{task_id} stays the same URL shape and behavior, but now it can actually find files.
  - /submit is unaffected.
- Minimal surface change
  - You do not touch:
    - The load_dataset("gaia-benchmark/GAIA", ...) call.
    - The FastAPI route definitions.
    - The scoring logic or leaderboard update code.
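If you want to exercise the patched mapping logic without network access or a token, one option is to pull the loop into a function that takes the downloader as a parameter. This is only a sketch: build_file_map and fake_download are hypothetical names, the rows are invented, and the fake downloader merely imitates the "return a local path under the cache directory, or raise" behavior that the patch relies on.

```python
import os
import tempfile
from typing import Callable, Dict, Iterable, Mapping


def build_file_map(items: Iterable[Mapping], download: Callable[[str], str]) -> Dict[str, str]:
    """Map task_id -> local path, skipping attachments that fail to download."""
    mapping: Dict[str, str] = {}
    for item in items:
        rel_path, name = item.get("file_path"), item.get("file_name")
        if not (rel_path and name):
            continue  # question has no attachment
        try:
            mapping[str(item["task_id"])] = download(rel_path)
        except Exception as exc:  # mirror the patch: report and skip, do not crash
            print(f"skipped {item['task_id']}: {exc!r}")
    return mapping


cache_base = tempfile.mkdtemp()  # stand-in for ALLOWED_CACHE_BASE


def fake_download(rel_path: str) -> str:
    """Offline stand-in for the real download: writes a dummy file under cache_base."""
    if "missing" in rel_path:
        raise FileNotFoundError(rel_path)
    local = os.path.join(cache_base, rel_path)
    os.makedirs(os.path.dirname(local), exist_ok=True)
    with open(local, "wb") as fh:
        fh.write(b"dummy")
    return local


items = [
    {"task_id": "t1", "file_name": "a.png", "file_path": "2023/validation/a.png"},
    {"task_id": "t2", "file_name": "b.mp3", "file_path": "2023/validation/missing.mp3"},
    {"task_id": "t3", "file_name": "", "file_path": ""},
]
file_map = build_file_map(items, fake_download)
print(sorted(file_map))  # only the task whose download succeeded is mapped
```

In the Space itself, the downloader would presumably be a small wrapper such as `lambda p: hf_hub_download(repo_id="gaia-benchmark/GAIA", filename=p, repo_type="dataset", cache_dir=ALLOWED_CACHE_BASE)`, which keeps the behavior of the patch above while leaving the loop testable.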
6. Sanity checks after patch
After deploying the patched Space, do three quick manual tests:
- Check questions still load
  curl https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space/questions | head
  You should see JSON with task_id, question, Level, and sometimes file_name.
- Pick a known multimodal task
  From /questions, identify a question where file_name is not null (for example an image or mp3).
- Call /files/{task_id}
  curl -I "https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space/files/<that-task-id>"
  Expected:
  - HTTP status 200 OK.
  - A reasonable Content-Type (e.g. image/png, audio/mpeg, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, etc.).
If you still get 404, log output from load_questions() will tell you whether hf_hub_download failed (dataset gating, network, token, etc.).
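For the second check, picking candidate tasks by eye gets tedious, so here is a small sketch that filters a /questions payload for entries that declare an attachment and guesses the Content-Type you might expect back. The payload is a hand-written sample using only the field names mentioned above, not real GAIA data, and the MIME guess comes from Python's mimetypes module rather than from the Space.

```python
import json
import mimetypes

# Hypothetical /questions payload, trimmed to the fields named above.
payload = json.loads("""
[
  {"task_id": "t1", "question": "q?", "Level": "1", "file_name": "chart.png"},
  {"task_id": "t2", "question": "q?", "Level": "1", "file_name": ""},
  {"task_id": "t3", "question": "q?", "Level": "1", "file_name": null}
]
""")


def tasks_with_files(questions):
    """Return (task_id, guessed MIME type) for questions that declare a file_name."""
    result = []
    for q in questions:
        name = q.get("file_name")
        if name:  # skips both empty strings and null
            mime, _ = mimetypes.guess_type(name)
            result.append((q["task_id"], mime))
    return result


candidates = tasks_with_files(payload)
print(candidates)
```

Each task_id in the result is a reasonable target for the /files/{task_id} check in step three.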
Final summary
- The Space broke because it assumes GAIA’s file_path is an absolute local path, but GAIA now defines file_path as a relative path inside the dataset repo, with Parquet-backed splits and no loader script. (Hugging Face)
- At startup, the Space never downloads those files or joins them with a dataset root, so os.path.exists(file_path) fails for every attachment, task_file_paths stays empty, and /files/{task_id} returns 404. (Hugging Face)
- The minimal fix is to replace the os.path.exists block in load_questions() with a call to hf_hub_download(repo_id="gaia-benchmark/GAIA", filename=local_file_path, repo_type="dataset", cache_dir=ALLOWED_CACHE_BASE), then store that returned path in task_file_paths.
- This respects GAIA’s new format, keeps security checks and public API unchanged, and restores working attachments for /files/{task_id}.