Attachments not available on https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space/docs#

Hi! :waving_hand:
I’m currently working on the GAIA evaluation agent, and I’ve run into an issue with the task attachments (images, audio files, Python code, Excel sheets, etc.).

According to the documentation, these files should be obtainable by calling the attachments endpoint with the task_id. However, the endpoint consistently returns no file, and it looks like attachments have not been available for quite a while.

Could someone from the Hugging Face team (ping @HuggingFace or the GAIA maintainers) confirm:

Are the task attachments going to be made available again?

If yes, is there an expected timeline for when access will be restored?

If not, should agents be designed to work without attachments for now?

This is blocking full GAIA evaluation support, so any clarification would be much appreciated. Thanks in advance! :folded_hands:

Here is a summary of the current situation: what is happening in the code, and a proposed fix.


Short answer:

  • The Space broke because GAIA changed how it ships data, while the Space still assumes the old layout.
  • GAIA now stores file_path as a repo-relative path and no longer provides a loader script that materializes local files for you.(Hugging Face)
  • The Space checks os.path.exists(file_path) directly, so it never finds any file, never fills task_file_paths, and /files/{task_id} always returns 404.(Hugging Face)
  • The minimal fix: in load_questions(), replace the os.path.exists logic with a small hf_hub_download(...) call that turns the GAIA file_path into a real local path under /app/.cache.

Below is the explanation step by step, plus a concrete small patch.


1. What the Space is supposed to do

The scoring Space has three public endpoints (as described in the course materials and forum reply).(Hugging Face Forums)

  • GET /questions

    • Loads GAIA (gaia-benchmark/GAIA, config 2023_level1, split validation).
    • Filters down to “simple enough” tasks by annotator metadata.
    • Exposes a list of questions: task_id, question, Level, and (optionally) file_name.
  • GET /files/{task_id}

    • Uses an internal map task_id → local_file_path (task_file_paths).
    • Serves the file (image/audio/xlsx/py/etc.) with proper MIME type.
  • POST /submit

    • Checks your answers against Final answer.
    • Records your score in agents-course/unit4-students-scores.

All of this is wired in main.py of agents-course/Unit4_scoring.(Hugging Face)

The /files/{task_id} endpoint depends completely on task_file_paths being filled correctly at startup.


2. What changed on the GAIA side

Two important changes happened on GAIA’s side:

  1. Datasets 4.0 removed dataset-loader scripts.
    GAIA used to have a GAIA.py loader and had to be converted to a “script-free” dataset. The maintainers explicitly mention the need to drop the script and move to Parquet.(Hugging Face)

  2. GAIA now ships Parquet + repo-relative file_path.
    The updated GAIA dataset card (October 2025) says: (Hugging Face)

    • Splits are now Parquet: metadata.level1.parquet etc.

    • Columns remain task_id, Question, Level, Final answer, file_name, file_path, Annotator Metadata.

    • Crucial sentence:

      “file_path keeps pointing to attachments relative to the repository root (for example, 2023/test/<attachment-id>.pdf).”

    • The recommended loading pattern is:

      from datasets import load_dataset
      from huggingface_hub import snapshot_download
      import os
      
      data_dir = snapshot_download(repo_id="gaia-benchmark/GAIA", repo_type="dataset")
      dataset = load_dataset(data_dir, "2023_level1", split="test")
      for example in dataset:
          file_path = os.path.join(data_dir, example["file_path"])
      

    So: GAIA never promises that file_path is an absolute local path. It is a relative path inside the dataset repo.


3. How the Space currently loads GAIA and files

Look at the Space’s load_questions() in main.py: (Hugging Face)

dataset = load_dataset("gaia-benchmark/GAIA",
                       "2023_level1",
                       split="validation",
                       trust_remote_code=True)
...
local_file_path = item.get('file_path')
file_name = item.get('file_name')
...
# 3. Store the file path mapping if file details exist and are valid
if local_file_path and file_name:
    # Log if the path from the dataset isn't absolute (might indicate issues)
    if not os.path.isabs(local_file_path):
        logger.warning(
            f"Task {task_id}: Path '{local_file_path}' from dataset is not absolute. "
            "This might cause issues finding the file on the server."
        )

    if os.path.exists(local_file_path) and os.path.isfile(local_file_path):
        task_file_paths[str(task_id)] = local_file_path
        logger.debug(f"Stored file path mapping for task_id {task_id}: {local_file_path}")
    else:
        logger.warning(
            f"File path '{local_file_path}' for task_id {task_id} does NOT exist or is not a file on server. "
            "Mapping skipped."
        )

Key points:

  • It reads file_path from GAIA into local_file_path.
  • It warns if this path is not absolute (so it expects absolute paths).
  • Then it directly calls os.path.exists(local_file_path) and only stores it if that is true.

Later, /files/{task_id} does:

if task_id not in task_file_paths:
    raise HTTPException(status_code=404,
                        detail=f"No file path associated with task_id {task_id}.")
...
abs_file_path = os.path.abspath(local_file_path)
if not abs_file_path.startswith(ALLOWED_CACHE_BASE):
    raise HTTPException(status_code=403, detail="File access denied.")
if not os.path.exists(abs_file_path) or not os.path.isfile(abs_file_path):
    raise HTTPException(status_code=404,
                        detail=f"File associated with task_id {task_id} not found on server disk.")
return FileResponse(path=abs_file_path, ...)

(Hugging Face)

So /files/{task_id} will only work if:

  • task_file_paths[task_id] exists, and
  • that path points to a real file already on disk under /app/.cache.

4. Why this breaks now

Combine the previous sections:

  1. GAIA now exposes file_path as repo-relative, e.g. "2023/test/abcd1234.png".(Hugging Face)

  2. The scoring Space neither downloads those files nor joins the path with the dataset root. It simply expects local_file_path to be an absolute path that already exists inside the container. (Hugging Face)

  3. At startup, for each question with an attachment:

    • local_file_path is something like "2023/test/abcd1234.png".
    • os.path.isabs("2023/test/abcd1234.png") is false, so it logs a warning.
    • os.path.exists("2023/test/abcd1234.png") is also false, because nothing at that relative path exists in the container filesystem.
    • So it skips adding an entry to task_file_paths.

    Result: task_file_paths ends up empty or nearly empty.

  4. At request time:

    • GET /files/{task_id} looks into task_file_paths.
    • It finds no entry and returns 404 "No file path associated with task_id ...".

This matches the forum symptom: even when using the correct bare task_id in the URL, users get 404 for tasks with a valid file_name. (Hugging Face Forums)

So the Space is broken because its file-path handling is out of date with GAIA’s new Parquet + relative file_path design.
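
The failure is easy to reproduce without the Space. The following is a minimal sketch, not the Space's code: it uses temporary directories as stand-ins for a downloaded copy of the dataset repo and for the container's working directory, and the path is the illustrative value used above, not a real attachment.

```python
import os
import tempfile

# Illustrative repo-relative path, as in the example above (not a real attachment).
rel_path = "2023/test/abcd1234.png"

with tempfile.TemporaryDirectory() as repo_root:
    # Simulate a local copy of the dataset repo that contains the attachment.
    full_path = os.path.join(repo_root, rel_path)
    os.makedirs(os.path.dirname(full_path))
    with open(full_path, "wb") as fh:
        fh.write(b"placeholder")

    with tempfile.TemporaryDirectory() as app_dir:
        # Stand-in for the container's working directory, where nothing
        # exists at the relative path.
        previous_cwd = os.getcwd()
        os.chdir(app_dir)
        try:
            is_absolute = os.path.isabs(rel_path)    # the Space logs a warning here
            found_as_is = os.path.exists(rel_path)   # the check the Space relies on
        finally:
            os.chdir(previous_cwd)

    # Joining with the repo root is what makes the path resolvable.
    found_after_join = os.path.isfile(os.path.join(repo_root, rel_path))

print(is_absolute, found_as_is, found_after_join)
```

The bare relative path is neither absolute nor found, while the same path joined with the repo root points at a real file. That join (or an equivalent download step) is the piece the Space is missing.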


5. Minimal code fix inside the Space

Goal: keep the architecture, change as little as possible.

You already have hf_hub_download imported:

from huggingface_hub import HfApi, hf_hub_download

(Hugging Face)

So the smallest safe fix is to replace the “does this path exist locally?” logic with a call that resolves GAIA’s relative file_path into a real local file.

Patch: only change the mapping block in load_questions()

Current block (lines 224–238): (Hugging Face)

        # 3. Store the file path mapping if file details exist and are valid
        if local_file_path and file_name:
            # Log if the path from the dataset isn't absolute (might indicate issues)

            if not os.path.isabs(local_file_path):
                logger.warning(
                    f"Task {task_id}: Path '{local_file_path}' from dataset is not absolute. "
                    "This might cause issues finding the file on the server."
                )

            if os.path.exists(local_file_path) and os.path.isfile(local_file_path):
                task_file_paths[str(task_id)] = local_file_path
                logger.debug(f"Stored file path mapping for task_id {task_id}: {local_file_path}")
            else:
                logger.warning(
                    f"File path '{local_file_path}' for task_id {task_id} does NOT exist or is not a file on server. "
                    "Mapping skipped."
                )

Replace that block with:

        # 3. Store the file path mapping if file details exist and are valid
        if local_file_path and file_name:
            try:
                # GAIA's file_path is relative to the dataset repo root.
                # Download the file into the allowed cache and get its local path.
                resolved_path = hf_hub_download(
                    repo_id="gaia-benchmark/GAIA",
                    filename=local_file_path,  # e.g. "2023/test/<attachment-id>.pdf"
                    repo_type="dataset",
                    cache_dir=ALLOWED_CACHE_BASE,
                )

                task_file_paths[str(task_id)] = resolved_path
                logger.debug(
                    f"Stored file path mapping for task_id {task_id}: {resolved_path}"
                )
            except Exception as e:
                logger.warning(
                    f"Could not download file '{local_file_path}' for task_id {task_id}: {e}. "
                    "Mapping skipped."
                )

Optional one-liner near the top of load_questions() (after task_file_paths.clear()):

    os.makedirs(ALLOWED_CACHE_BASE, exist_ok=True)

Why this is enough

  • Resolves GAIA semantics correctly

    • GAIA promises file_path is a path inside the dataset repo.(Hugging Face)
    • hf_hub_download(repo_id="gaia-benchmark/GAIA", filename=local_file_path, repo_type="dataset", ...) downloads exactly that file from the GAIA repo and returns its local path.(Hugging Face)
  • Keeps security checks intact

    • You set cache_dir=ALLOWED_CACHE_BASE so the returned resolved_path will be under /app/.cache/....
    • The /files/{task_id} endpoint already checks abs_file_path.startswith(ALLOWED_CACHE_BASE). That continues to work and blocks path traversal.(Hugging Face)
  • No changes to external API

    • /questions response stays the same.
    • /files/{task_id} stays the same URL shape and behavior, but now it can actually find files.
    • /submit is unaffected.
  • Minimal surface change

    • You do not touch:

      • The load_dataset("gaia-benchmark/GAIA", ...) call.
      • The FastAPI route definitions.
      • The scoring logic or leaderboard update code.
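
To exercise the patched mapping logic without network access or a token, the download step can be passed in as a function. This is only a sketch: build_task_file_paths and fake_download are hypothetical names that do not exist in main.py, and in the Space the download argument would wrap the hf_hub_download(...) call shown above.

```python
import logging
from typing import Callable, Dict, Iterable, Mapping

logger = logging.getLogger(__name__)


def build_task_file_paths(
    items: Iterable[Mapping[str, object]],
    download: Callable[[str], str],
) -> Dict[str, str]:
    """Map task_id -> local path for every item that has an attachment.

    `download` receives GAIA's repo-relative file_path and returns a local path.
    """
    mapping: Dict[str, str] = {}
    for item in items:
        task_id = item.get("task_id")
        rel_path = item.get("file_path")
        file_name = item.get("file_name")
        if not (task_id and rel_path and file_name):
            continue  # this task has no attachment
        try:
            mapping[str(task_id)] = download(str(rel_path))
        except Exception as exc:
            # One failed download should not abort loading the other tasks.
            logger.warning(
                "Could not download %r for task_id %s: %s. Mapping skipped.",
                rel_path, task_id, exc,
            )
    return mapping


def fake_download(rel_path: str) -> str:
    # Hypothetical offline stand-in for hf_hub_download.
    if "missing" in rel_path:
        raise FileNotFoundError(rel_path)
    return "/app/.cache/" + rel_path


demo_items = [
    {"task_id": "t1", "file_name": "abcd1234.png", "file_path": "2023/test/abcd1234.png"},
    {"task_id": "t2", "file_name": "", "file_path": ""},  # no attachment
    {"task_id": "t3", "file_name": "x.mp3", "file_path": "2023/test/missing.mp3"},
]
demo_mapping = build_task_file_paths(demo_items, fake_download)
print(demo_mapping)
```

Only t1 ends up in the mapping: t2 has no attachment, and t3's download fails and is skipped with a warning, which mirrors the try/except in the patch.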

6. Sanity checks after patch

After deploying the patched Space, do three quick manual tests:

  1. Check questions still load

    curl https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space/questions | head
    

    You should see JSON with task_id, question, Level, and sometimes file_name.

  2. Pick a known multimodal task

    From /questions, identify a question where file_name is not null (for example an image or mp3).

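  3. Call /files/{task_id}

    curl -s -o /dev/null -D - "https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space/files/<that-task-id>"

    This sends a GET and prints only the response headers. curl -I would send a HEAD request instead, which a FastAPI route declared only for GET may answer with 405.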

    Expected:

    • HTTP status 200 OK.
    • A reasonable Content-Type (e.g. image/png, audio/mpeg, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, etc.).

If you still get 404, log output from load_questions() will tell you whether hf_hub_download failed (dataset gating, network, token, etc.).
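
The same checks can be scripted. This is a rough sketch using only the standard library. first_task_with_file and check_attachment are hypothetical helper names, and the response shape (a JSON list of objects with task_id and file_name) is assumed from the description of /questions above.

```python
import json
import urllib.request
from typing import Optional

# Base URL taken from the curl commands above.
BASE_URL = "https://huggingface.co/proxy/agents-course-unit4-scoring.hf.space"


def first_task_with_file(questions: list) -> Optional[str]:
    """Return the task_id of the first question that names an attachment."""
    for question in questions:
        if question.get("file_name"):
            return str(question["task_id"])
    return None


def check_attachment(base_url: str = BASE_URL) -> int:
    """Fetch /questions, pick a task with a file, and request /files/{task_id}."""
    with urllib.request.urlopen(f"{base_url}/questions", timeout=30) as resp:
        questions = json.load(resp)

    task_id = first_task_with_file(questions)
    if task_id is None:
        raise RuntimeError("No question with a file_name was returned.")

    # A plain GET; urlopen raises urllib.error.HTTPError on 404 or 403.
    with urllib.request.urlopen(f"{base_url}/files/{task_id}", timeout=30) as resp:
        print(task_id, resp.status, resp.headers.get("Content-Type"))
        return resp.status
```

Calling check_attachment() from a Python shell performs the network requests. On a healthy Space it should print a 200 status and a plausible Content-Type, and on a broken one it raises an HTTPError carrying the 404.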


Final summary

  • The Space broke because it assumes GAIA’s file_path is an absolute local path, but GAIA now defines file_path as a relative path inside the dataset repo, with Parquet-backed splits and no loader script.(Hugging Face)
  • At startup, the Space never downloads those files or joins them with a dataset root, so os.path.exists(file_path) fails for every attachment, task_file_paths stays empty, and /files/{task_id} returns 404.(Hugging Face)
  • The minimal fix is to replace the os.path.exists block in load_questions() with a call to hf_hub_download(repo_id="gaia-benchmark/GAIA", filename=local_file_path, repo_type="dataset", cache_dir=ALLOWED_CACHE_BASE), then store that returned path in task_file_paths.
  • This respects GAIA’s new format, keeps security checks and public API unchanged, and restores working attachments for /files/{task_id}.