Dataset Row Counts Missing from Profile View!

Current situation on the Hugging Face Hub

1) Row counts and “Data Studio” are produced by the Dataset Viewer backend.
They are not “just README YAML.” The backend computes splits, schema, and dataset size, and the UI shows those results. (Hugging Face)

2) The Hub auto-converts data to Parquet (the first 5GB per dataset) to power the viewer and Data Studio.
If your data is already Parquet, the Hub can still regenerate it when the original row groups are too large. (Hugging Face) A quick way to inspect what it produced is sketched after this list.

3) For gated datasets, the viewer backend requires authentication.
Without a user token, the viewer API will not return size metadata. This directly impacts row counts and Data Studio visibility in “public” contexts (profile lists, logged-out views, etc.). (Hugging Face)
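
To see what the backend actually produced, the viewer API's /parquet endpoint lists the auto-converted (or re-used) Parquet files. A minimal sketch in Python with requests; the token is only needed for gated or private datasets, and the field names in the comments follow the current API docs:

import os
import requests

# List the Parquet files the viewer backend generated (or re-used) for a dataset.
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}
resp = requests.get(
    "https://huggingface.co/proxy/datasets-server.huggingface.co/parquet",
    params={"dataset": "EhsanShahbazi/digikala-products"},
    headers=headers,
)
resp.raise_for_status()

for f in resp.json().get("parquet_files", []):
    # each entry typically carries the config, split, filename, size, and a download URL
    print(f["config"], f["split"], f["filename"], f["size"])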


Your case, right now

EhsanShahbazi/goodreads-quotes

  • It already shows Data Studio and a visible row count on the dataset page.
  • So the viewer backend is working for it. No waiting. No YAML “fix” required for viewer functionality. (Hugging Face)

EhsanShahbazi/digikala-products

  • It is gated (“agree to share contact info”). (Hugging Face)
  • For gated datasets, viewer endpoints need a token. (Hugging Face)
  • Result: row counts and Data Studio can be missing or inconsistent across UI surfaces, especially where the Hub behaves like an unauthenticated client.

This means your “missing row counts in profile view” is most likely expected behavior for gating, not a Parquet-bot backlog and not a README YAML error.


Solutions and workarounds (pick the goal)

Goal A: Row counts and Data Studio visible to everyone

Solution: Do not gate the dataset.

  • Disable gating in settings (or programmatically, as sketched below). Then the viewer can run publicly and the UI can show counts broadly. Gating is specifically designed to restrict access. (Hugging Face)
  • Keep sensitive parts out of the public dataset if needed.
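
If you prefer not to click through the settings UI, recent versions of huggingface_hub expose the same switch via HfApi.update_repo_settings. A minimal sketch, assuming your installed version supports the gated argument:

from huggingface_hub import HfApi

api = HfApi()  # picks up your token from the local login or the HF_TOKEN env var

# Turn gating off so the viewer and row-count badges work for anonymous visitors.
# In recent releases, gated accepts "auto", "manual", or False.
api.update_repo_settings(
    repo_id="EhsanShahbazi/digikala-products",
    repo_type="dataset",
    gated=False,
)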

Workaround if you must gate the full dataset: publish a separate public “preview” repo.

  • Create digikala-products-preview with:

    • a small sampled subset (e.g., 10k to 100k rows)
    • same schema
    • same column names
  • Keep the full dataset gated.
    This gives you public row counts and Data Studio on the preview, while keeping the complete data gated.
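
A rough sketch of building such a preview repo with the datasets library; the split name and the 50k sample size are illustrative, not taken from your repo:

from datasets import load_dataset

# Load the gated dataset with your own token (as the owner you already have access).
full = load_dataset("EhsanShahbazi/digikala-products", split="train", token=True)

# Take a small random sample that keeps the same schema and column names.
preview = full.shuffle(seed=42).select(range(50_000))

# Push it to a separate, non-gated repo; the viewer and Data Studio run on it publicly.
preview.push_to_hub("EhsanShahbazi/digikala-products-preview", private=False)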


Goal B: Keep gating, but ensure Data Studio works for approved users (including you)

This is the “correct” gated workflow.

Step 1: Confirm viewer health using the viewer API with a token
The Dataset Viewer docs are explicit: gated datasets require a user token in headers. (Hugging Face)

Example checks:

# 1) Is the dataset viewer valid?
curl -s \
  -H "Authorization: Bearer $HF_TOKEN" \
  "https://huggingface.co/proxy/datasets-server.huggingface.co/is-valid?dataset=EhsanShahbazi/digikala-products"
# 2) Can the backend compute size (rows)?
curl -s \
  -H "Authorization: Bearer $HF_TOKEN" \
  "https://huggingface.co/proxy/datasets-server.huggingface.co/size?dataset=EhsanShahbazi/digikala-products"

If these work and return sizes, then the backend is fine and any missing UI elements are almost certainly due to gating + UI context (public list pages). The API is the ground truth. (Hugging Face)
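
If you would rather read the numbers programmatically than eyeball the JSON, here is a small sketch that pulls row counts out of the /size response (field names follow the current API docs; adjust if the payload differs):

import os
import requests

headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}
resp = requests.get(
    "https://huggingface.co/proxy/datasets-server.huggingface.co/size",
    params={"dataset": "EhsanShahbazi/digikala-products"},
    headers=headers,
)
resp.raise_for_status()
size = resp.json().get("size", {})

# Overall row count, then one line per config/split.
print("total rows:", size.get("dataset", {}).get("num_rows"))
for split in size.get("splits", []):
    print(split.get("config"), split.get("split"), split.get("num_rows"))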

Step 2: Use Data Studio alternatives for gated data
If you want “Data Studio-like” exploration locally, use DuckDB/Polars with token auth.

  • Hugging Face has official docs for authenticating to private/gated datasets for DuckDB. (Hugging Face)
  • Data Studio docs also explain how Parquet powering works and how to access Parquet programmatically. (Hugging Face)

This is the best workaround when you want SQL exploration but keep gating.
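
For example, a DuckDB session can read the gated Parquet over hf:// paths once a Hugging Face secret is registered. A sketch; the file glob assumes the repo stores plain Parquet under the default layout:

import os
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs;")
con.sql("LOAD httpfs;")

# Register your token so hf:// reads are authenticated (required for gated data).
con.sql(f"CREATE SECRET hf (TYPE HUGGINGFACE, TOKEN '{os.environ['HF_TOKEN']}');")

# Count rows straight from the hosted Parquet files; the glob is illustrative.
con.sql("""
    SELECT COUNT(*) AS num_rows
    FROM 'hf://datasets/EhsanShahbazi/digikala-products/**/*.parquet'
""").show()

If the originals are not Parquet, the same query can point at the auto-converted copy on the ~parquet revision (hf://datasets/<repo>@~parquet/...), as described in the Data Studio docs.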


Goal C: Keep gating, but still show some “counts” publicly

You cannot force the Hub to show viewer-derived row counts publicly while the viewer endpoints require authentication for that dataset. (Hugging Face)

Workarounds:

  1. Put the total row count in README text (“This dataset contains N rows”); a sketch for computing N follows this list.
  2. Put a small public preview dataset (recommended).
  3. If you only care about discoverability, keep size_categories correct in the dataset card metadata. That helps search/filtering even when row counts are hidden. (It will not reliably recreate the exact numeric badge everywhere.)
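
For workaround 1, a quick way to compute the number you would paste into the README (a sketch; assumes your token can load the gated dataset):

from datasets import load_dataset

# Load every split with your own token and sum the rows.
ds = load_dataset("EhsanShahbazi/digikala-products", token=True)
total = sum(split.num_rows for split in ds.values())

# Paste the resulting sentence into the dataset card.
print(f"This dataset contains {total:,} rows.")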

If it is NOT gating: the other real failure modes and fixes

Sometimes gated datasets also have genuine viewer processing issues. Here are the common ones, and what to do.

1) “Full dataset viewer not available, only preview rows”

Usually indicates that the viewer backend job failed or is blocked. A real-world example shows a CreateCommitError during viewer processing. (Hugging Face Forums)

Fixes:

  • Re-upload clean Parquet.
  • Reduce complexity (fewer exotic nested types).
  • Open a discussion with the exact error text.

2) Parquet row group too large

Hugging Face explicitly says it may regenerate Parquet if row groups are too big, for performance reasons. (Hugging Face)
There are also issues where too-large row groups break viewer performance. (GitHub)

Fix:

  • Re-write Parquet with smaller row groups (common practical approach).
  • Re-upload and let viewer re-index.
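
A common way to rewrite Parquet with smaller row groups is pyarrow; the 100k row-group size and the file names below are just examples, pick whatever keeps each row group a reasonable size:

import pyarrow.parquet as pq

# Read the existing file and rewrite it with smaller row groups.
table = pq.read_table("products.parquet")
pq.write_table(table, "products_small_rg.parquet", row_group_size=100_000)

# Sanity-check the new layout before re-uploading to the Hub.
meta = pq.ParquetFile("products_small_rg.parquet").metadata
print(meta.num_row_groups, "row groups,", meta.num_rows, "rows")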

3) Dataset repo has a loading script

The viewer is disabled for repos that require arbitrary Python execution (a loading script); the docs recommend relying on automated data support or pre-converted Parquet instead. (Hugging Face)

Fix:

  • Remove the script, or move it off main and keep the Parquet files on main. Forum discussions cover coexistence patterns and limitations. (Hugging Face Forums)

4) Manual configuration mismatch

If your structure is nonstandard, the viewer can show only preview until splits/subsets are configured correctly. Use the “Configure the Dataset Viewer” doc and YAML manual configuration patterns. (Hugging Face)


Practical “what I would do” for you

  1. Decide: Do you want the Digikala dataset gated or not?

    • If you want public row-count badges and Data Studio for everyone, remove gating. (Hugging Face)
    • If you must keep it gated, accept that some public UI surfaces will not show counts.
  2. Run the two viewer API checks with a token (/is-valid, /size).

    • If they succeed, stop chasing YAML and stop waiting for the bot. The backend is fine. (Hugging Face)
    • If they fail, you have a real viewer pipeline problem. Then focus on Parquet quality, schema consistency, and errors.
  3. If you keep gating but want public discoverability, publish a preview dataset.


Summary

  • Current state: Data Studio and row counts come from the Dataset Viewer backend and Parquet. (Hugging Face)
  • Why your Digikala counts are missing: it is gated, and gated viewer endpoints require authentication. (Hugging Face)
  • Solutions: un-gate for public counts, or keep gating and use token-based viewer API and DuckDB/Polars auth workflows. (Hugging Face)
  • Workarounds: publish a public preview dataset, or document counts in README.