Troubleshooting Interconnect: Share Your Experience
#1, opened by nouamanetazi
Hi everyone!
The Troubleshooting Interconnect section has some initial findings on common NCCL performance issues (CPU affinity, network topology, environment variables, container configs).
Read the full section here: https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#troubleshooting-interconnect
We'd love to make this a living community resource! Have you run into:
- NCCL performance bottlenecks that were tricky to debug?
- Cloud-specific networking issues (AWS EFA, GCP, Azure)?
- Container configuration gotchas?
- Effective debugging workflows or tools?
Share your troubleshooting stories, solutions, or questions below.
Your experience could help others avoid hours of debugging!