Hi everyone,
I’m exploring the feasibility of fine-tuning a 7B–9B model (like Mistral or DeepSeek) on consumer hardware using 4-bit quantization (bitsandbytes). My current setup:
Specs:
- GPU: Tesla V100 16GB
- CPU: Xeon E5-2690v3
- RAM: 64GB DDR4
- OS: Ubuntu 20.04
- Stack: Transformers + bitsandbytes + possibly Unsloth
Use case:
I’m building a system that generates short, contextualized outputs based on external content. The goal is to make the model more domain-aware by first fine-tuning it on a corpus of ~9k domain-specific text entries (raw text only, no target outputs), and then pairing it at inference time with small per-user or per-use-case adapters (LoRAs), each trained on around 200–300 examples.
Pipeline idea:
- Continue pre-training (or fine-tune) the base model on the raw input texts to improve domain understanding
- Use lightweight LoRAs for personalization (dynamically loaded)
- Run inference with a combination of both (domain-tuned base + the active LoRA); a rough sketch of the setup I have in mind is below
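To make that concrete, here’s a minimal sketch of what I mean for the first step (and what I’d reuse, with smaller datasets, for the per-user LoRAs), assuming the Transformers + PEFT + bitsandbytes stack. The model name, target modules, and LoRA hyperparameters are placeholders, not a tested recipe:

```python
# Rough QLoRA-style setup: load the base model in 4-bit and attach a LoRA adapter.
# Model name, target modules, and hyperparameters are placeholders, not a tested recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "mistralai/Mistral-7B-v0.3"  # or a DeepSeek checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # V100 has no bf16, so fp16 compute
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Make the quantized model trainable and wrap it with a LoRA adapter.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# ...then train on the ~9k raw domain texts with a causal LM objective,
# and repeat with a fresh LoraConfig per user/use-case for the small adapters.
```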
My questions:
- Can Mistral 7B or DeepSeek 9B (bnb 4-bit) be fine-tuned efficiently on a V100 16GB using tools like Unsloth, with a setup roughly like the one sketched above?
- If I add a second GPU (e.g. another V100, a P100 16GB, or an RTX 3060 12GB), is it possible to:
  - fine-tune larger models (like Mistral 24B in 4-bit)?
  - split layers or memory effectively between the GPUs? (The first sketch below shows what I mean.)
- What’s the recommended approach for managing 10+ LoRAs for runtime personalization? (My current idea is the second sketch below.)
- Which models are generally best suited for this kind of task (short, domain-aware output generation + user-specific fine-tuning)? I’m currently looking at Mistral, DeepSeek, Yi, and Llama 3, but I’m open to suggestions for 4-bit setups on limited VRAM.
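On the multi-GPU question, this is roughly what I was picturing: letting Accelerate place layers across the two cards via device_map and max_memory. The model name and memory caps below are placeholders/guesses, and I don’t know yet whether this is sane on a mismatched V100 + 3060 pair:

```python
# Sketch of splitting a larger 4-bit model across two GPUs by letting
# Accelerate assign layers via device_map, with per-device memory caps.
# Model name and memory values are placeholders, not tuned numbers.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Small-24B-Base-2501",   # placeholder 24B checkpoint
    quantization_config=bnb_config,
    device_map="auto",                          # shard layers across visible GPUs
    max_memory={0: "15GiB", 1: "11GiB", "cpu": "48GiB"},  # leave some headroom
)
```

If there’s a better-supported way to spread a 4-bit 24B model over GPUs like these (or if it’s simply not worth it for training), I’d love to hear it.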
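And for the LoRA-management question, my current idea is to keep one 4-bit base model resident and hot-swap per-user adapters with PEFT. A minimal sketch; the adapter paths and names are made up, and `model`/`tokenizer` are the plain 4-bit base model and tokenizer loaded as in the first sketch (before get_peft_model):

```python
# Sketch of hot-swapping per-user LoRA adapters on a single resident base model.
# `model` is the plain 4-bit base model (before get_peft_model) and `tokenizer`
# its tokenizer, loaded as in the first sketch; adapter paths/names are made up.
from peft import PeftModel

model = PeftModel.from_pretrained(model, "adapters/user_a", adapter_name="user_a")
model.load_adapter("adapters/user_b", adapter_name="user_b")  # register more as needed

def generate_for_user(user_id: str, prompt: str) -> str:
    model.set_adapter(user_id)  # activate that user's adapter for this request
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

I’m not sure how well this scales to 10+ adapters or concurrent users, so alternative approaches are very welcome.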
Any practical insights, configs, or success stories would be super appreciated!
Thanks a lot.