Fine-tune Mistral 7B–9B or 24B (bnb 4-bit)

Hi everyone,

I’m exploring the feasibility of fine-tuning a 7B–9B model (like Mistral or DeepSeek) on consumer hardware using 4-bit quantization (bnb). My current setup:

Specs:

  • GPU: Tesla V100 16GB
  • CPU: Xeon E5-2690v3
  • RAM: 64GB DDR4
  • OS: Ubuntu 20.04
  • Stack: Transformers + bitsandbytes + possibly Unsloth

Use case:
I’m building a system that generates short, contextualized outputs based on external content. The goal is to make the model more domain-aware by training it on a corpus of ~9k domain-specific text entries (raw inputs only, no target outputs), and then fine-tuning per-user or per-use-case behavior with smaller LoRA adapters (each around 200–300 examples).


Pipeline idea:

  1. Pre-train or fine-tune the base model using the raw input texts (to improve domain understanding)
  2. Use lightweight LoRAs for personalization (dynamically loaded)
  3. Run inference with a combination of both (input + LoRA); a rough sketch follows this list
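
Roughly what I have in mind for stages 1–2, as an untested sketch (the model name, hyperparameters, and the Transformers + PEFT + bitsandbytes choices are placeholders on my side, not a verified config):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # V100 has no bf16, so fp16 compute
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

# Stage 1: a "domain" LoRA trained on the ~9k raw texts with a plain
# causal-LM objective (continued pre-training rather than instruction tuning).
domain_lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, domain_lora)
model.print_trainable_parameters()
# ... train with transformers.Trainer / trl's SFTTrainer on the raw corpus ...

# Stage 2: per-user LoRAs (200–300 examples each) trained the same way,
# saved separately, and swapped in at inference (see the sketch further down).
```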

My questions:

  • Can Mistral 7B or DeepSeek 9B (bnb 4-bit) be fine-tuned efficiently on a V100 16GB using tools like Unsloth?

  • If I add a second GPU (e.g. another V100, P100 16GB, or RTX 3060 12GB), is it possible to:

    • fine-tune larger models (like Mistral 24B in 4-bit)?
    • split layers or memory effectively between GPUs?
  • What’s the recommended approach for managing 10+ LoRAs for runtime personalization? (See the adapter-switching sketch after these questions.)

  • Which models are generally best suited for this kind of task (short domain-aware output generation + user-specific fine-tuning)?
    I’m currently looking at Mistral, DeepSeek, Yi, and Llama 3, but I’m open to suggestions for 4-bit setups on limited VRAM.
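
For the multi-LoRA question, this is the kind of thing I’m imagining, untested (all model names and adapter paths are placeholders), using PEFT’s `load_adapter`/`set_adapter` to hot-swap adapters on one resident 4-bit base model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(name)
base = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",
)

# Load each adapter once at startup; LoRA weights are only a few MB each.
model = PeftModel.from_pretrained(base, "adapters/user_0001", adapter_name="user_0001")
model.load_adapter("adapters/user_0002", adapter_name="user_0002")

def generate_for(user_id: str, prompt: str) -> str:
    model.set_adapter(user_id)  # activate that user's LoRA; base stays in VRAM
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```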

Any practical insights, configs, or success stories would be super appreciated!

Thanks a lot.

For now, 24B seems difficult with just one card, but 7B should be doable.
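
Something along these lines should work for the 7B case with Unsloth (untested on a V100 specifically; the pre-quantized checkpoint name and the settings below are just illustrative):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# ... then train with trl's SFTTrainer as in the Unsloth notebooks ...
```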

What if I use two GPUs, like two V100s with 16GB, a V100 + P100 16GB, or an RTX 3060 12GB + V100?
Most likely just for inference; for full fine-tuning I’d rent a server for 2–3 days and then use the result.
Would that work?
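
For the two-GPU inference part, I was thinking of something like this (untested; the `max_memory` caps are placeholders for a 16GB V100 as GPU 0 and a 12GB 3060 as GPU 1):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16
    ),
    device_map="auto",                    # let Accelerate place layers per GPU
    max_memory={0: "15GiB", 1: "11GiB"},  # leave headroom for activations
)
```

As I understand it, this splits the layers sequentially across the cards (only one GPU is busy at a time), so it buys capacity for bigger models rather than speed.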
