---
library_name: transformers
tags:
- quantization
- bitsandbytes
- 4-bit
- nf4
- double-quant
- mcqa
---

# Model Card for `Kikinoking/MNLP_M3_quantized_model`

A 4-bit, double-quantized (NF4 + nested quantization) version of `MNLP_M3_mcqa_model`, compressed with bitsandbytes. The model answers multiple-choice questions (MCQA) with minimal GPU memory usage.

## Model Details

- **Model ID:** `Kikinoking/MNLP_M3_quantized_model`
- **Quantization:** 4-bit NF4 + nested quantization (`bnb_4bit_use_double_quant=True`)
- **Base model:** `aidasvenc/MNLP_M3_mcqa_model`
- **Library:** Transformers + bitsandbytes
- **Task:** Multiple-choice question answering (MCQA)

## Usage

Load and run inference in just a few lines:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Kikinoking/MNLP_M3_quantized_model"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    low_cpu_mem_usage=True,
).eval()

# Single-token generation is enough to read off the chosen answer letter.
prompt = "What is the capital of France?\nA) Lyon B) Marseille C) Paris D) Toulouse\nAnswer: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=1)

print("Answer:", tokenizer.decode(output[0], skip_special_tokens=True))
```

## How It Was Built

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

base_id = "aidasvenc/MNLP_M3_mcqa_model"

# 4-bit NF4 quantization with nested (double) quantization and bf16 compute
qcfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=qcfg,
    device_map="auto",
    torch_dtype="auto",
)

# Push to Hugging Face Hub
model.push_to_hub("Kikinoking/MNLP_M3_quantized_model", private=True)
tokenizer.push_to_hub("Kikinoking/MNLP_M3_quantized_model")

print("VRAM used (MiB):", torch.cuda.memory_reserved() / 1024**2)
```
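
## Verifying the Quantized Checkpoint

A minimal sanity-check sketch (not part of the original build script): after loading the published model, you can inspect its in-memory size and the serialized quantization settings using the standard Transformers APIs `get_memory_footprint()` and the config's `quantization_config` attribute.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Kikinoking/MNLP_M3_quantized_model",
    device_map="auto",
)

# Approximate size of the loaded weights; should be well below the fp16 footprint.
print("Memory footprint (MiB):", model.get_memory_footprint() / 1024**2)

# Quantization settings stored in the pushed config, if present.
print("Quantization config:", getattr(model.config, "quantization_config", None))
```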