Getting the following error when trying to run 24B model

#3
by Gemneye - opened
  1. It looks like the matplotlib and rich modules are not included in requirements.txt
  2. I am getting the error below. Running on RunPod with an RTX 6000 Ada, in a conda environment created with the commands from the documentation, on Ubuntu 22.04
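A quick workaround for issue 1 (a sketch, assuming the conda environment from the documentation is active) is to install the two missing modules manually:

```shell
# matplotlib and rich are imported by the inference code but are
# missing from requirements.txt; install them into the active env
pip install matplotlib rich
```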

(magi) root@a86f02cd24e3:/workspace/MAGI-1# bash example/24B/run.sh
/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {name} is deprecated, please import via timm.layers", FutureWarning)
Traceback (most recent call last):
File "/workspace/MAGI-1/inference/pipeline/entry.py", line 54, in &lt;module&gt;
main()
File "/workspace/MAGI-1/inference/pipeline/entry.py", line 37, in main
pipeline = MagiPipeline(args.config_file)
File "/workspace/MAGI-1/inference/pipeline/pipeline.py", line 29, in __init__
self.config = MagiConfig.from_json(config_path)
File "/workspace/MAGI-1/inference/common/config.py", line 159, in from_json
post_validation(magi_config)
File "/workspace/MAGI-1/inference/common/config.py", line 154, in post_validation
magi_config.runtime_config.cfg_number == 1
AssertionError: Please set cfg_number: 1 in config.json for distill or quant model
E0423 01:33:36.885000 124462029551424 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 5233) of binary: /workspace/miniconda3/envs/magi/bin/python
Traceback (most recent call last):
File "/workspace/miniconda3/envs/magi/bin/torchrun", line 33, in &lt;module&gt;
sys.exit(load_entry_point('torch==2.4.0', 'console_scripts', 'torchrun')())
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/miniconda3/envs/magi/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

inference/pipeline/entry.py FAILED

Failures:
&lt;NO_OTHER_FAILURES&gt;
Root Cause (first observed failure):
[0]:
time : 2025-04-23_01:33:36
host : a86f02cd24e3
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5233)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Sand AI org

It looks like you’re using a distill or quant model. Please make sure that cfg_number is set to 1 in your config.json file.
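For reference, the relevant field would look like this (a sketch of just the fragment in question; `runtime_config.cfg_number` is the path the assertion in inference/common/config.py checks, and the rest of your config.json stays as-is):

```json
{
  "runtime_config": {
    "cfg_number": 1
  }
}
```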

@Gemneye did you ever get this to work on runpod?
If so can you share a deployment template or a guide?

@jameshuntercarter Unfortunately, I gave up; too many other things to play with. After the initial hype I haven't heard anything more about it, so it seems to have gone to the AI graveyard.
