Hi everyone,
I’m exploring the feasibility of fine-tuning a 7B–9B model (like Mistral or DeepSeek) on consumer hardware using 4-bit quantization (bitsandbytes). My current setup:
Specs:
- GPU: Tesla V100 16GB
- CPU: Xeon E5-2690v3
- RAM: 64GB DDR4
- OS: Ubuntu 20.04
- Stack: Transformers + bitsandbytes + possibly Unsloth
Use case:
I’m building a system that generates short, contextualized outputs based on external content. The goal is to make the model more domain-aware by first fine-tuning it on a corpus of ~9k domain-specific text entries (raw text only, no target outputs), and then pairing it at inference time with small per-user or per-use-case adapters (LoRAs), each trained on around 200–300 examples.
Pipeline idea:
- Continue pre-training (or fine-tune) the base model on the raw input texts to improve domain understanding
- Use lightweight LoRAs for personalization (dynamically loaded)
- Run inference with a combination of both (domain-tuned base + the active LoRA); a rough sketch of the setup I have in mind is below
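To make that concrete, here’s a minimal sketch of what I mean for the first step (and what I’d reuse, with smaller datasets, for the per-user LoRAs), assuming the Transformers + PEFT + bitsandbytes stack. The model name, target modules, and LoRA hyperparameters are placeholders, not a tested recipe:

```python
# Rough QLoRA-style setup: load the base model in 4-bit and attach a LoRA adapter.
# Model name, target modules, and hyperparameters are placeholders, not a tested recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "mistralai/Mistral-7B-v0.3"  # or a DeepSeek checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # V100 has no bf16, so fp16 compute
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Make the quantized model trainable and wrap it with a LoRA adapter.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ...then train on the ~9k raw domain texts with a causal LM objective,
# and repeat with a fresh LoraConfig per user/use-case for the small adapters.
```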
My questions:
- Can Mistral 7B or DeepSeek 9B (bnb 4-bit) be fine-tuned efficiently on a V100 16GB using tools like Unsloth, with a setup roughly like the one sketched above?
- If I add a second GPU (e.g. another V100, a P100 16GB, or an RTX 3060 12GB), is it possible to:
  - fine-tune larger models (like Mistral 24B in 4-bit)?
  - split layers or memory effectively between the GPUs? (The first sketch below shows what I mean.)
- What’s the recommended approach for managing 10+ LoRAs for runtime personalization? (My current idea is the second sketch below.)
- Which models are generally best suited for this kind of task (short, domain-aware output generation + user-specific fine-tuning)? I’m currently looking at Mistral, DeepSeek, Yi, and Llama 3, but I’m open to suggestions for 4-bit setups on limited VRAM.
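On the multi-GPU question, this is roughly what I was picturing: letting Accelerate place layers across the two cards via device_map and max_memory. The model name and memory caps below are placeholders/guesses, and I don’t know yet whether this is sane on a mismatched V100 + 3060 pair:

```python
# Sketch of splitting a larger 4-bit model across two GPUs by letting
# Accelerate assign layers via device_map, with per-device memory caps.
# Model name and memory values are placeholders, not tuned numbers.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Base-2501",   # placeholder 24B checkpoint
    quantization_config=bnb_config,
    device_map="auto",                          # shard layers across visible GPUs
    max_memory={0: "15GiB", 1: "11GiB", "cpu": "48GiB"},  # leave some headroom
)
```

If there’s a better-supported way to spread a 4-bit 24B model over GPUs like these (or if it’s simply not worth it for training), I’d love to hear it.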
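And for the LoRA-management question, my current idea is to keep one 4-bit base model resident and hot-swap per-user adapters with PEFT. A minimal sketch; the adapter paths and names are made up, and `model`/`tokenizer` are the plain 4-bit base model and tokenizer loaded as in the first sketch (before get_peft_model):

```python
# Sketch of hot-swapping per-user LoRA adapters on a single resident base model.
# `model` is the plain 4-bit base model (before get_peft_model) and `tokenizer`
# its tokenizer, loaded as in the first sketch; adapter paths/names are made up.
from peft import PeftModel

model = PeftModel.from_pretrained(model, "adapters/user_a", adapter_name="user_a")
model.load_adapter("adapters/user_b", adapter_name="user_b")  # register more as needed

def generate_for_user(user_id: str, prompt: str) -> str:
    model.set_adapter(user_id)  # activate that user's adapter for this request
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

I’m not sure how well this scales to 10+ adapters or concurrent users, so alternative approaches are very welcome.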
Any practical insights, configs, or success stories would be super appreciated!
Thanks a lot.