Attachments not available on https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space/docs#

Here is a summary of the current situation: what the code is doing and a proposed fix.


Short answer:

  • The Space broke because GAIA changed how it ships data, but the Space still assumes an old layout.
  • GAIA now stores file_path as a repo-relative path and no longer provides a loader script that materializes local files for you.(Hugging Face)
  • The Space checks os.path.exists(file_path) directly, so it never finds any file, never fills task_file_paths, and /files/{task_id} always returns 404.(Hugging Face)
  • The minimal fix: in load_questions(), replace the os.path.exists logic with a small hf_hub_download(...) call that turns the GAIA file_path into a real local path under /app/.cache.

Below is the explanation step by step, plus a concrete small patch.


1. What the Space is supposed to do

The scoring Space has three public endpoints (as described in the course materials and forum reply).(Hugging Face Forums)

  • GET /questions

    • Loads GAIA (gaia-benchmark/GAIA, config 2023_level1, split validation).
    • Filters down to “simple enough” tasks by annotator metadata.
    • Exposes a list of questions: task_id, question, Level, and (optionally) file_name.
  • GET /files/{task_id}

    • Uses an internal map task_id → local_file_path (task_file_paths).
    • Serves the file (image/audio/xlsx/py/etc.) with proper MIME type.
  • POST /submit

    • Checks your answers against Final answer.
    • Records your score in agents-course/unit4-students-scores.

All of this is wired in main.py of agents-course/Unit4_scoring.(Hugging Face)

The /files/{task_id} endpoint depends completely on task_file_paths being filled correctly at startup.


2. What changed on the GAIA side

Two important changes happened on GAIA’s side:

  1. Datasets 4.0 removed dataset-loader scripts.
    GAIA used to have a GAIA.py loader and had to be converted to a “script-free” dataset. The maintainers explicitly mention the need to drop the script and move to Parquet.(Hugging Face)

  2. GAIA now ships Parquet + repo-relative file_path.
    The updated GAIA dataset card (October 2025) says: (Hugging Face)

    • Splits are now Parquet: metadata.level1.parquet etc.

    • Columns remain task_id, Question, Level, Final answer, file_name, file_path, Annotator Metadata.

    • Crucial sentence:

      “file_path keeps pointing to attachments relative to the repository root (for example, 2023/test/<attachment-id>.pdf).”

    • The recommended loading pattern is:

      from datasets import load_dataset
      from huggingface_hub import snapshot_download
      import os
      
      data_dir = snapshot_download(repo_id="gaia-benchmark/GAIA", repo_type="dataset")
      dataset = load_dataset(data_dir, "2023_level1", split="test")
      for example in dataset:
          file_path = os.path.join(data_dir, example["file_path"])
      

    So: GAIA never promises that file_path is an absolute local path. It is a relative path inside the dataset repo.
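
The join in the loading pattern above can be wrapped in a small helper. This is only a sketch: resolve_attachment is a hypothetical name, and the temporary directory stands in for the folder that snapshot_download(...) would return, so the example runs without network access or access to the gated dataset.

```python
import os
import tempfile
from typing import Optional


def resolve_attachment(data_dir: str, file_path: str) -> Optional[str]:
    """Join a repo-relative GAIA file_path onto the local snapshot root.

    Returns the absolute local path if the file exists, otherwise None.
    """
    if not file_path:
        # Questions without an attachment carry an empty file_path.
        return None
    candidate = os.path.abspath(os.path.join(data_dir, file_path))
    return candidate if os.path.isfile(candidate) else None


# Stand-in for the directory returned by snapshot_download(...).
with tempfile.TemporaryDirectory() as data_dir:
    os.makedirs(os.path.join(data_dir, "2023", "validation"))
    sample = os.path.join(data_dir, "2023", "validation", "example.png")
    with open(sample, "wb") as fh:
        fh.write(b"placeholder")

    found = resolve_attachment(data_dir, "2023/validation/example.png")
    missing = resolve_attachment(data_dir, "2023/validation/nope.png")
    empty = resolve_attachment(data_dir, "")
    found_is_abs = found is not None and os.path.isabs(found)
```

The point is that the dataset row alone is not enough: the caller always has to supply the root that the relative path hangs off.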


3. How the Space currently loads GAIA and files

Look at the Space’s load_questions() in main.py: (Hugging Face)

dataset = load_dataset("gaia-benchmark/GAIA",
                       "2023_level1",
                       split="validation",
                       trust_remote_code=True)
...
local_file_path = item.get('file_path')
file_name = item.get('file_name')
...
# 3. Store the file path mapping if file details exist and are valid
if local_file_path and file_name:
    # Log if the path from the dataset isn't absolute (might indicate issues)
    if not os.path.isabs(local_file_path):
        logger.warning(
            f"Task {task_id}: Path '{local_file_path}' from dataset is not absolute. "
            "This might cause issues finding the file on the server."
        )

    if os.path.exists(local_file_path) and os.path.isfile(local_file_path):
        task_file_paths[str(task_id)] = local_file_path
        logger.debug(f"Stored file path mapping for task_id {task_id}: {local_file_path}")
    else:
        logger.warning(
            f"File path '{local_file_path}' for task_id {task_id} does NOT exist or is not a file on server. "
            "Mapping skipped."
        )

Key points:

  • It reads file_path from GAIA into local_file_path.
  • It warns if this path is not absolute (so it expects absolute paths).
  • Then it directly calls os.path.exists(local_file_path) and only stores it if that is true.

Later, /files/{task_id} does:

if task_id not in task_file_paths:
    raise HTTPException(status_code=404,
                        detail=f"No file path associated with task_id {task_id}.")
...
abs_file_path = os.path.abspath(local_file_path)
if not abs_file_path.startswith(ALLOWED_CACHE_BASE):
    raise HTTPException(status_code=403, detail="File access denied.")
if not os.path.exists(abs_file_path) or not os.path.isfile(abs_file_path):
    raise HTTPException(status_code=404,
                        detail=f"File associated with task_id {task_id} not found on server disk.")
return FileResponse(path=abs_file_path, ...)

(Hugging Face)

So /files/{task_id} will only work if:

  • task_file_paths[task_id] exists, and
  • that path points to a real file already on disk under /app/.cache.

4. Why this breaks now

Combine the previous sections:

  1. GAIA now exposes file_path as repo-relative, e.g. "2023/test/abcd1234.png".(Hugging Face)

  2. The scoring Space neither downloads those files nor joins the path with the dataset root. It simply expects local_file_path to be an absolute path that already exists inside the container.(Hugging Face)

  3. At startup, for each question with an attachment:

    • local_file_path is something like "2023/test/abcd1234.png".
    • os.path.isabs("2023/test/abcd1234.png") is false, so it logs a warning.
    • os.path.exists("2023/test/abcd1234.png") is also false, because nothing at that relative path exists in the container filesystem.
    • So it skips adding an entry to task_file_paths.

    Result: task_file_paths ends up empty or nearly empty.

  4. At request time:

    • GET /files/{task_id} looks into task_file_paths.
    • It finds no entry and returns 404 "No file path associated with task_id ...".

This matches the symptom reported on the forum: even when using the correct bare task_id in the URL, users get 404 for tasks with a valid file_name.(Hugging Face Forums)

So the Space is broken because its file-path handling is out of date with GAIA’s new Parquet + relative file_path design.
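
The failure mode can be reproduced offline with a simplified stand-in for the startup loop. This is a sketch, not the Space's actual code: build_mapping, the sample rows, and the file name are made up for illustration. The same rows produce an empty map when the relative path is checked as-is, and a populated map once the path is joined onto a root directory.

```python
import os
import tempfile


def build_mapping(items, root=None):
    """Mimic the startup check; optionally join each path onto root first."""
    mapping = {}
    for item in items:
        path = item.get("file_path")
        if not path or not item.get("file_name"):
            continue  # question has no attachment
        if root is not None:
            path = os.path.join(root, path)
        # A relative path is resolved against the current working directory.
        if os.path.isfile(path):
            mapping[str(item["task_id"])] = path
    return mapping


# Hypothetical rows shaped like the GAIA columns discussed above.
items = [
    {"task_id": "task-a", "file_name": "demo-0001.png",
     "file_path": "2023/validation/demo-0001.png"},
    {"task_id": "task-b", "file_name": "", "file_path": ""},
]

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "2023", "validation"))
    with open(os.path.join(root, "2023", "validation", "demo-0001.png"), "wb") as fh:
        fh.write(b"placeholder")

    old_behavior = build_mapping(items)                # path checked as-is
    joined_behavior = build_mapping(items, root=root)  # path joined onto a root
```

Here old_behavior stays empty even though the file exists on disk, which is the situation the Space is in after GAIA's change.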


5. Minimal code fix inside the Space

Goal: keep the architecture, change as little as possible.

You already have hf_hub_download imported:

from huggingface_hub import HfApi, hf_hub_download

(Hugging Face)

So the smallest safe fix is to replace the “does this path exist locally?” logic with a call that resolves GAIA’s relative file_path into a real local file.

Patch: only change the mapping block in load_questions()

Current block (lines 224–238): (Hugging Face)

        # 3. Store the file path mapping if file details exist and are valid
        if local_file_path and file_name:
            # Log if the path from the dataset isn't absolute (might indicate issues)

            if not os.path.isabs(local_file_path):
                logger.warning(
                    f"Task {task_id}: Path '{local_file_path}' from dataset is not absolute. "
                    "This might cause issues finding the file on the server."
                )

            if os.path.exists(local_file_path) and os.path.isfile(local_file_path):
                task_file_paths[str(task_id)] = local_file_path
                logger.debug(f"Stored file path mapping for task_id {task_id}: {local_file_path}")
            else:
                logger.warning(
                    f"File path '{local_file_path}' for task_id {task_id} does NOT exist or is not a file on server. "
                    "Mapping skipped."
                )

Replace that block with:

        # 3. Store the file path mapping if file details exist and are valid
        if local_file_path and file_name:
            try:
                # GAIA's file_path is relative to the dataset repo root.
                # Download the file into the allowed cache and get its local path.
                resolved_path = hf_hub_download(
                    repo_id="gaia-benchmark/GAIA",
                    filename=local_file_path,  # e.g. "2023/test/<attachment-id>.pdf"
                    repo_type="dataset",
                    cache_dir=ALLOWED_CACHE_BASE,
                )

                task_file_paths[str(task_id)] = resolved_path
                logger.debug(
                    f"Stored file path mapping for task_id {task_id}: {resolved_path}"
                )
            except Exception as e:
                logger.warning(
                    f"Could not download file '{local_file_path}' for task_id {task_id}: {e}. "
                    "Mapping skipped."
                )

Optional one-liner near the top of load_questions() (after task_file_paths.clear()):

    os.makedirs(ALLOWED_CACHE_BASE, exist_ok=True)

Why this is enough

  • Resolves GAIA semantics correctly

    • GAIA promises file_path is a path inside the dataset repo.(Hugging Face)
    • hf_hub_download(repo_id="gaia-benchmark/GAIA", filename=local_file_path, repo_type="dataset", ...) downloads exactly that file from the GAIA repo and returns its local path.(Hugging Face)
  • Keeps security checks intact

    • You set cache_dir=ALLOWED_CACHE_BASE so the returned resolved_path will be under /app/.cache/....
    • The /files/{task_id} endpoint already checks abs_file_path.startswith(ALLOWED_CACHE_BASE). That continues to work and blocks path traversal.(Hugging Face)
  • No changes to external API

    • /questions response stays the same.
    • /files/{task_id} stays the same URL shape and behavior, but now it can actually find files.
    • /submit is unaffected.
  • Minimal surface change

    • You do not touch:

      • The load_dataset("gaia-benchmark/GAIA", ...) call.
      • The FastAPI route definitions.
      • The scoring logic or leaderboard update code.
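
For reference, the containment check that /files/{task_id} relies on can be exercised in isolation. The sketch below is an optional, stricter variant and not what the Space currently does: is_under_base is a hypothetical helper built on os.path.commonpath, and the paths are illustrative only. Unlike a plain string-prefix test, it compares whole path components, so a sibling directory whose name merely starts with the base path is rejected.

```python
import os


def is_under_base(path: str, base: str) -> bool:
    """Return True if path, after normalization, lies inside base."""
    abs_path = os.path.abspath(path)
    abs_base = os.path.abspath(base)
    try:
        return os.path.commonpath([abs_path, abs_base]) == abs_base
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        return False


base = "/app/.cache"  # value of ALLOWED_CACHE_BASE as described in this post

# Illustrative paths only; the real cache layout may differ.
inside = is_under_base("/app/.cache/some-repo/snapshots/abc/file.png", base)
traversal = is_under_base("/app/.cache/../secrets.txt", base)
sibling = is_under_base("/app/.cache-other/file.png", base)

# The plain prefix test accepts the sibling directory.
prefix_check_sibling = os.path.abspath("/app/.cache-other/file.png").startswith(
    os.path.abspath(base)
)
```

Both checks reject the "../" case once the path has gone through os.path.abspath; the difference only shows up for the sibling-directory case.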

6. Sanity checks after patch

After deploying the patched Space, do three quick manual tests:

  1. Check questions still load

    curl https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space/questions | head
    

    You should see JSON with task_id, question, Level, and sometimes file_name.

  2. Pick a known multimodal task

    From /questions, identify a question where file_name is not null (for example an image or mp3).

  3. Call /files/{task_id}

    curl -I "https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space/files/<that-task-id>"
    

    Expected:

    • HTTP status 200 OK.
    • A reasonable Content-Type (e.g. image/png, audio/mpeg, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, etc.).

If you still get 404, the log output from load_questions() will tell you whether hf_hub_download failed (dataset gating, network, token, etc.).
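
The same three checks can be scripted. This is a minimal sketch using only the standard library; first_task_with_file, file_url, and check_attachment are hypothetical helper names, and the sample rows at the bottom are invented so the selection logic can be tried without calling the Space.

```python
import json
import urllib.request


def first_task_with_file(questions):
    """Return the task_id of the first question with a non-empty file_name."""
    for q in questions:
        if q.get("file_name"):
            return str(q["task_id"])
    return None


def file_url(base_url, task_id):
    """Build the /files/{task_id} URL for a given base URL."""
    return f"{base_url.rstrip('/')}/files/{task_id}"


def check_attachment(base_url, timeout=30):
    """Fetch /questions, then request the first attachment.

    Returns (status, content_type), or None if no question has a file.
    urllib raises urllib.error.HTTPError for 4xx/5xx responses such as 404.
    """
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/questions", timeout=timeout) as resp:
        questions = json.load(resp)
    task_id = first_task_with_file(questions)
    if task_id is None:
        return None
    with urllib.request.urlopen(file_url(base_url, task_id), timeout=timeout) as resp:
        return resp.status, resp.headers.get("Content-Type")


# Offline demonstration with made-up rows; no network call is made here.
sample = [
    {"task_id": "t1", "question": "q", "Level": "1", "file_name": ""},
    {"task_id": "t2", "question": "q", "Level": "1", "file_name": "chart.png"},
]
picked = first_task_with_file(sample)
url = file_url("https://example.invalid/", picked)
```

To run it against the live Space, call check_attachment("https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space"); a 404 surfaces as an HTTPError rather than a return value.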


Final summary

  • The Space broke because it assumes GAIA’s file_path is an absolute local path, but GAIA now defines file_path as a relative path inside the dataset repo, with Parquet-backed splits and no loader script.(Hugging Face)
  • At startup, the Space never downloads those files or joins them with a dataset root, so os.path.exists(file_path) fails for every attachment, task_file_paths stays empty, and /files/{task_id} returns 404.(Hugging Face)
  • The minimal fix is to replace the os.path.exists block in load_questions() with a call to hf_hub_download(repo_id="gaia-benchmark/GAIA", filename=local_file_path, repo_type="dataset", cache_dir=ALLOWED_CACHE_BASE), then store that returned path in task_file_paths.
  • This respects GAIA’s new format, keeps security checks and public API unchanged, and restores working attachments for /files/{task_id}.