Instructions for using dranger003/dbrx-instruct-iMat.GGUF with libraries and local apps.

Libraries

How to use dranger003/dbrx-instruct-iMat.GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="dranger003/dbrx-instruct-iMat.GGUF",
	filename="ggml-dbrx-instruct-16x12b-f16-00001-of-00006.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use dranger003/dbrx-instruct-iMat.GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf dranger003/dbrx-instruct-iMat.GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf dranger003/dbrx-instruct-iMat.GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf dranger003/dbrx-instruct-iMat.GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf dranger003/dbrx-instruct-iMat.GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf dranger003/dbrx-instruct-iMat.GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf dranger003/dbrx-instruct-iMat.GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf dranger003/dbrx-instruct-iMat.GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf dranger003/dbrx-instruct-iMat.GGUF:F16

Use Docker

docker model run hf.co/dranger003/dbrx-instruct-iMat.GGUF:F16

vLLM

How to use dranger003/dbrx-instruct-iMat.GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "dranger003/dbrx-instruct-iMat.GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dranger003/dbrx-instruct-iMat.GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use dranger003/dbrx-instruct-iMat.GGUF with Ollama:
```
ollama run hf.co/dranger003/dbrx-instruct-iMat.GGUF:F16
```

Unsloth Studio

How to use dranger003/dbrx-instruct-iMat.GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for dranger003/dbrx-instruct-iMat.GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for dranger003/dbrx-instruct-iMat.GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for dranger003/dbrx-instruct-iMat.GGUF to start chatting

Docker Model Runner
How to use dranger003/dbrx-instruct-iMat.GGUF with Docker Model Runner:
```
docker model run hf.co/dranger003/dbrx-instruct-iMat.GGUF:F16
```

Lemonade

How to use dranger003/dbrx-instruct-iMat.GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull dranger003/dbrx-instruct-iMat.GGUF:F16

Run and chat with the model

lemonade run user.dbrx-instruct-iMat.GGUF-F16

List all available models

lemonade list

Quants from @phymbert (who authored support for this model in llama.cpp) are posted here.
The quants in this repo are meant for testing imatrix-quantized weights.
If you run on Metal, you may need this PR.

Added ggml-dbrx-instruct-16x12b-f16_imatrix-wiki.dat, an importance matrix computed over 2K batches (1M tokens) on the FP16 weights using wiki.train.

Quant	IMatrix Quant/Dataset/Chunks	Size (GiB)	PPL (wiki.test)
IQ4_XS	Q8_0/wiki.train/200	65.29	5.2260 +/- 0.03558
IQ4_XS	FP16/wiki.train/2000	65.29	5.2241 +/- 0.03559
IQ4_XS	-	66.05	5.2546 +/- 0.03570
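
To put the table in perspective, the short sketch below compares each imatrix quant against the IQ4_XS quant made without an imatrix. The numbers are copied from the table; the dictionary keys and the function name are made up for illustration, and comparing the drop against the reported +/- value is only a rough sanity check, not a proper significance test.

```python
# Perplexity figures copied from the table above: (PPL, reported +/- error).
results = {
    "imatrix Q8_0/wiki.train/200": (5.2260, 0.03558),
    "imatrix FP16/wiki.train/2000": (5.2241, 0.03559),
}

# IQ4_XS quantized without an imatrix (the "-" row).
BASELINE_PPL = 5.2546


def improvement_over_baseline(ppl):
    """Return how much lower the perplexity is than the no-imatrix quant."""
    return BASELINE_PPL - ppl


for name, (ppl, err) in results.items():
    drop = improvement_over_baseline(ppl)
    # Flag whether the drop is smaller than the error reported for that run.
    print(f"{name}: drop {drop:.4f}, smaller than reported error: {drop < err}")
```

Both drops are smaller than the error reported for each run, so at IQ4_XS the gain from the imatrix is modest on this benchmark.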

2024-04-13: Support for this model has just been merged (PR #6515).
You will need llama.cpp commit 4bd0f93e to run this model.

Quants below IQ3 are very sensitive and unreliable so far: the imatrix may need to be trained on FP16 weights rather than Q8_0, and for longer than 200 chunks. Quants in this repo are tested by running the following command:

./build/bin/main -ngl 41 -c 4096 -s 0 -e -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite an essay about AI.<|im_end|>\n<|im_start|>assistant\n" -m ggml-dbrx-instruct-16x12b-<<quant-to-test>>.gguf

GGUF importance matrix (imatrix) quants for https://huggingface.co/databricks/dbrx-instruct
The importance matrix is trained on ~100K tokens (200 batches of 512 tokens) using wiki.train.raw.
Which GGUF is right for me? (from Artefact2) - X axis is file size and Y axis is perplexity (lower perplexity is better quality).
The imatrix is also used on the K-quants (only those below Q6_K).
You can merge GGUFs with gguf-split --merge <first-chunk> <output-file>, although this is not required since f482bb2e.
What is an importance matrix (imatrix)? You can read more about it from the author here.
How do I use imatrix quants? Just like any other GGUF. The .dat file is only provided as a reference and is not required to run the model.
If you need to use IQ1, use IQ1_M, as IQ1_S is very unstable.
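
When checking that a split download is complete, it can help to list the shard names you expect. The sketch below assumes the naming pattern seen in the f16 file name used earlier in this card (`...-00001-of-00006.gguf`); the helper name is hypothetical, and the shard count differs per quant.

```python
def shard_names(prefix, total):
    """Build the expected shard file names for a split GGUF.

    Assumes the '<prefix>-NNNNN-of-NNNNN.gguf' pattern seen in this repo's
    f16 file name; shard indices start at 1 and are zero-padded to 5 digits.
    """
    return [f"{prefix}-{i:05d}-of-{total:05d}.gguf" for i in range(1, total + 1)]


# The f16 weights referenced above appear to be split into six shards.
for name in shard_names("ggml-dbrx-instruct-16x12b-f16", 6):
    print(name)
```

The first name in the list is the `<first-chunk>` you would pass to gguf-split --merge.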

DBRX is a transformer-based decoder-only large language model (LLM) that was trained using next-token prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input. It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments.
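
The "65x more possible combinations" figure follows from counting the ways to choose the active experts. A quick check, using only the expert counts stated above:

```python
from math import comb

# DBRX routes each token to 4 of its 16 experts.
dbrx_combinations = comb(16, 4)      # 1820

# Mixtral-8x7B and Grok-1 route each token to 2 of 8 experts.
mixtral_combinations = comb(8, 2)    # 28

ratio = dbrx_combinations // mixtral_combinations
print(dbrx_combinations, mixtral_combinations, ratio)  # 1820 28 65
```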

Layers	Context	Template
40	32768	<|im_start|>system {system}<|im_end|> <|im_start|>user {prompt}<|im_end|> <|im_start|>assistant
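
The table shows the template on one line; the test command earlier in this card separates the role tag, the message, and each turn with newlines. A minimal sketch of building such a prompt string follows; the function name is hypothetical and the newline placement is taken from that command.

```python
def build_prompt(system, prompt):
    """Format a single-turn prompt in the template listed above.

    Newline placement follows the test command in this card; the string ends
    with the assistant tag so the model continues with its reply.
    """
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )


print(build_prompt("You are a helpful assistant.", "Write an essay about AI."))
```

Chat-completion APIs normally apply the model's template for you, so a helper like this is only needed when passing a raw prompt.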

16x12B MoE
16 experts (12B params per single expert; top_k=4 routing)
36B active params (132B total params)
Trained on 12T tokens
32k sequence length training
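
As a rough guide to what these sizes mean, the sketch below estimates the effective bits per weight of the IQ4_XS file and the approximate size of unquantized 16-bit weights. It assumes the rounded 132B parameter count and the 65.29 GiB figure from this card; real files also carry metadata and may keep some tensors at higher precision, so treat the results as approximations.

```python
TOTAL_PARAMS = 132e9   # rounded total parameter count from this card
GIB = 2 ** 30          # bytes per GiB


def bits_per_weight(size_gib, params=TOTAL_PARAMS):
    """Average number of bits stored per parameter for a file of size_gib."""
    return size_gib * GIB * 8 / params


def fp16_size_gib(params=TOTAL_PARAMS):
    """Approximate size of the weights at 2 bytes per parameter."""
    return params * 2 / GIB


print(f"IQ4_XS: about {bits_per_weight(65.29):.2f} bits per weight")
print(f"16-bit weights: about {fp16_size_gib():.0f} GiB")
```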
