Fast tokenizer, general vLLM support
Is this being worked on? I did not see any PRs for transformers.
Yes, transformers support is in progress; we hope to submit the PR within the next week.
vLLM support is in this PR: https://github.com/vllm-project/vllm/pull/20114/files
About the fast tokenizer, could you give an example of what you mean (perhaps a link to something inside the transformers library)?
Tokenizer mode:
"auto" will use the fast tokenizer if available.
"slow" will always use the slow tokenizer.
"mistral" will always use the tokenizer from mistral_common.
"custom" will use --tokenizer to select the preregistered tokenizer.
This depends on transformers support, so it should work once the PR has landed in transformers and/or vLLM. I think the tokenizer is being run by a Python script inside the vLLM Docker image.
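For anyone who wants to try the modes quoted above once support has landed: they correspond to vLLM's `--tokenizer-mode` option. Below is a minimal sketch that only builds the `vllm serve` command line with an explicit mode; it does not import or start vLLM. The helper name `build_serve_command` is hypothetical, and whether this model needs `--trust-remote-code` under vLLM is an assumption, so it is off by default.

```python
import shlex

# Modes as quoted from the vLLM help text in this thread.
TOKENIZER_MODES = ("auto", "slow", "mistral", "custom")


def build_serve_command(model, tokenizer_mode="auto", trust_remote_code=False):
    """Return the argv list for `vllm serve` with an explicit tokenizer mode.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    if tokenizer_mode not in TOKENIZER_MODES:
        raise ValueError(f"unknown tokenizer mode: {tokenizer_mode!r}")
    cmd = ["vllm", "serve", model, "--tokenizer-mode", tokenizer_mode]
    if trust_remote_code:
        # Assumption: only needed if the model ships custom code that vLLM must load.
        cmd.append("--trust-remote-code")
    return cmd


if __name__ == "__main__":
    # Force the slow tokenizer, e.g. to compare behaviour against "auto".
    cmd = build_serve_command("tencent/Hunyuan-A13B-Instruct-FP8", "slow")
    print(shlex.join(cmd))
```

Running the printed command with "slow" and then with "auto" would be one way to check whether the fast tokenizer is actually being picked up.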