Getting the following error when trying to run 24B model

#3
by Gemneye - opened
  1. It looks like the matplotlib and rich modules are not included in requirements.txt
  2. I am getting the error below. Running on RunPod with an RTX 6000 Ada, in a conda environment created with the commands from the documentation, on Ubuntu 22.04
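A quick workaround for issue 1 (a sketch, assuming the conda environment from the documentation is active) is to install the two missing modules manually:

```shell
# matplotlib and rich are imported by the inference code but are
# missing from requirements.txt; install them into the active env
pip install matplotlib rich
```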

(magi) root@a86f02cd24e3:/workspace/MAGI-1# bash example/24B/run.sh
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {name} is deprecated, please import via timm.layers", FutureWarning)
Traceback (most recent call last):
File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in &lt;module&gt;
main()
File "/workspace/MAGI-1/inference/pipeline/entry.py", line 37, in main
pipeline = MagiPipeline(args.config_file)
File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 29, in __init__
self.config = MagiConfig.from_json(config_path)
File "/workspace/MAGI-1/inference/common/config.py", line 159, in from_json
post_validation(magi_config)
File "/workspace/MAGI-1/inference/common/config.py", line 154, in post_validation
magi_config.runtime_config.cfg_number == 1
AssertionError: Please set cfg_number: 1 in config.json for distill or quant model
E0423 01:33:36.885000 124462029551424 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 5233) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in &lt;module&gt;
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

inference/pipeline/entry.py FAILED

Failures:
&lt;NO_OTHER_FAILURES&gt;
Root Cause (first observed failure):
[0]:
time : 2025-04-23_01:33:36
host : a86f02cd24e3
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5233)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Sand AI org

It looks like you’re using a distill or quant model. Please make sure that cfg_number is set to 1 in your config.json file.
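For reference, the relevant field would look like this (a sketch of just the fragment in question; `runtime_config.cfg_number` is the path the assertion in inference/common/config.py checks, and the rest of your config.json stays as-is):

```json
{
  "runtime_config": {
    "cfg_number": 1
  }
}
```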

@Gemneye did you ever get this to work on runpod?
If so can you share a deployment template or a guide?

@jameshuntercarter Unfortunately, I gave up; too many other things to play with. After the initial hype I haven't heard anything more about it, so it seems to have gone to the AI graveyard.
