---
language:
- zh
- en
license: apache-2.0
library_name: transformers
tags:
- code
- qwen
- lora
- repository-understanding
- code-assistant
- fine-tuning
- multi-agent-systems
base_model: Qwen/Qwen3-8B
datasets:
- custom
metrics:
- accuracy
- code_understanding
pipeline_tag: text-generation
model-index:
- name: code_repo_finetuning
  results:
  - task:
      type: text-generation
      name: Code Repository Understanding
    metrics:
    - type: accuracy
      value: 71.5
      name: Overall Score
    - type: improvement
      value: 22.1
      name: Improvement over Base Model
---

# Finetune any base model (e.g. Qwen3-8B) on any given code repository

## Model Description

This model is a fine-tuned version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), trained to understand and answer questions about a given code repository, including private or newly created projects. The example repository used here is [Laddr](https://github.com/AgnetLabs/Laddr), a framework for building scalable multi-agent systems.

The fine-tuning was performed using **LoRA (Low-Rank Adaptation)** with an innovative training data generation approach that **does not rely on LLM-generated synthetic data**, avoiding circular dependencies and hallucination issues.

### Key Features

- **Project-Specific Knowledge**: Deep understanding of Laddr's architecture, codebase, and APIs
- **Code Location**: Accurately locates functions, classes, and modules (+30% improvement)
- **Code Understanding**: Explains code functionality with detailed context (+19.3% improvement)
- **Maintains General Abilities**: Retains the base model's general knowledge capabilities
- **Zero Hallucination Training Data**: Generated from real code via AST parsing, not LLM synthesis

## Model Details

### Base Model
- **Model**: Qwen/Qwen3-8B
- **Parameters**: 8 Billion
- **Architecture**: Transformer-based causal language model

### Fine-tuning Specifications
- **Method**: LoRA (Low-Rank Adaptation)
- **LoRA Rank**: 64
- **LoRA Alpha**: 128
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Training Framework**: DeepSpeed ZeRO-3
- **Precision**: BF16
- **Epochs**: 3
- **Training Samples**: 650+
- **Training Time**: ~2-3 hours on 2x GPUs (48GB each)
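
A minimal sketch of this configuration using the PEFT library is shown below. It is illustrative only: the dropout value is an assumption not stated in this card, and the project's actual training script (with DeepSpeed ZeRO-3 integration) is more involved.

```python
# Illustrative LoRA setup matching the specifications above (PEFT library).
# lora_dropout is an assumed value; the real training script may differ.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=64,                 # LoRA rank
    lora_alpha=128,       # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,    # assumption, not documented in this card
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```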

### Training Data

The training dataset was **automatically generated** from the Laddr repository using:
- **Python AST parsing** for code structure extraction
- **Real docstrings** and code comments
- **Function signatures** and parameter information
- **Call graph relationships**
- **Project statistics** and module structure
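
The sketch below illustrates the core idea with Python's built-in `ast` module: extracting function names, argument lists, docstrings, and line numbers from a source file. The helper name and record fields are illustrative; the project's analyzer is considerably more extensive (call graphs, pattern mining, module statistics).

```python
# Simplified illustration of AST-based extraction of code facts.
# The actual repository analyzer in this project does much more.
import ast
from pathlib import Path

def extract_functions(path: str) -> list[dict]:
    """Return name, arguments, docstring, and line number for each function in a file."""
    tree = ast.parse(Path(path).read_text(encoding="utf-8"))
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({
                "name": node.name,
                "args": [a.arg for a in node.args.args],
                "docstring": ast.get_docstring(node) or "",
                "lineno": node.lineno,
                "file": path,
            })
    return records

# Example usage (path shown for illustration):
# for rec in extract_functions("lib/laddr/src/laddr/core/system_tools.py"):
#     print(rec["name"], rec["lineno"])
```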

**Data Composition**:
- Code Explanation: 300+ samples (46%)
- API Usage: 150+ samples (23%)
- Code Location: 100+ samples (15%)
- Project Overview: 50+ samples (8%)
- Design Proposals: 50+ samples (8%)

**Data Split**:
- Training: 80% (520+ samples)
- Validation: 10% (65+ samples)
- Test: 10% (65+ samples)

## Performance

### Overall Results

| Metric | Base Model | Fine-tuned | Improvement |
|--------|-----------|-----------|-------------|
| **Overall Score** | 49.4% | 71.5% | **+22.1%** ✅ |
| Code Location | 60.0% | 90.0% | **+30.0%** ⭐ |
| Code Understanding | 59.3% | 78.6% | +19.3% |
| Project Overview | 35.0% | 51.7% | +16.7% |
| General Knowledge | 10.0% | 30.0% | +20.0% |

### Detailed Performance by Task Type

**Code Location Tasks** (+30.0%):
- Accurately identifies file locations of functions/classes
- Provides complete file paths with line numbers
- Eliminates uncertainty in location queries

**Code Understanding Tasks** (+19.3%):
- Explains code functionality with context
- Includes function signatures and parameters
- Extracts and presents real docstrings

**Project Overview Tasks** (+16.7%):
- Describes project purpose and architecture
- Lists technology stack and dependencies
- Provides project statistics and structure

**General Knowledge** (+20.0%):
- Maintains ability to answer general programming questions
- Improves conciseness (from verbose to clear)
- Preserves base model capabilities

## Use Cases

### 1. Code Assistant for Laddr Project
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tensense/code_repo_finetuning"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # load in the checkpoint's native precision (BF16)
    device_map="auto",    # requires `accelerate`; remove to load on CPU
    trust_remote_code=True,
)

# Build a chat-formatted prompt and move it to the model's device
question = "Where is the create_system_tools function located in Laddr?"
messages = [{"role": "user", "content": question}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
print(response)
```

### 2. API Documentation Assistant
Query about function usage, parameters, and return values.

### 3. Codebase Onboarding
Help new developers quickly understand the Laddr project structure and key components.

### 4. Code Review Support
Provide context and explanations for code review processes.

## Example Queries

### Query 1: Code Location
**Question**: "Where is `create_system_tools` located in Laddr?"

**Response**:
> `create_system_tools` is located in `lib/laddr/src/laddr/core/system_tools.py`.
>
> **Function Description**:
> Create system tools for task delegation and artifact management. This function checks for user-provided overrides before registering the default implementations.
>
> **Function Signature**:
> ```python
> def create_system_tools(message_bus, storage_backend=None, agent=None) -> dict[str, tuple[Any, list[str]]]:
> ```
>
> **Parameters**:
> - `message_bus` (Any): Message queue backend
> - `storage_backend` (Any): Optional storage backend
> - `agent` (Any): Optional agent instance
>
> **Returns**: `dict[str, tuple[Any, list[str]]]`

### Query 2: Project Overview
**Question**: "What is Laddr's main functionality?"

**Response**:
> Laddr is a Python project that serves as a framework for building scalable multi-agent systems.
>
> **Main Technologies**: bandit, black, boto3-stubs, build, click, etc.
>
> The project contains 279 code elements, consisting of 67 classes and 212 functions/methods.
>
> **Core Modules**:
> - `core` (279 elements)
> - `cli` (52 elements)
> - `llms` (39 elements)

## Limitations

- **Project-Specific**: Optimized for Laddr project; may not perform as well on other codebases
- **Knowledge Cutoff**: Based on the Laddr repository as of training time (2025-01)
- **Language Focus**: Primarily trained on Python code and English/Chinese documentation
- **Limited General Coding**: While it maintains general knowledge, it's optimized for Laddr-specific queries

## Training Methodology

### Innovation: LLM-Free Training Data Generation

Unlike traditional approaches that use LLMs to generate synthetic training data, this project employs a novel methodology:

1. **AST-Based Code Parsing**: Python Abstract Syntax Tree analysis extracts accurate code structure
2. **Real Documentation**: Utilizes actual docstrings, comments, and code signatures
3. **Call Graph Analysis**: Builds function dependency relationships
4. **Pattern Extraction**: Identifies code patterns (implementation, usage, interaction)
5. **Template-Based QA**: Generates question-answer pairs using templates with real code context
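
As a rough illustration of step 5, the sketch below turns one extracted record into a chat-style QA pair. The template wording, field names, and the placeholder line number are assumptions for illustration, not the project's actual templates.

```python
# Illustrative template-based QA generation from extracted code facts.
# Template text and record fields are assumptions, not the project's exact ones.
def make_location_sample(record: dict) -> dict:
    question = f"Where is the `{record['name']}` function located?"
    answer = (
        f"`{record['name']}` is located in `{record['file']}` "
        f"(line {record['lineno']}).\n\n{record['docstring']}"
    )
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

sample = make_location_sample({
    "name": "create_system_tools",
    "file": "lib/laddr/src/laddr/core/system_tools.py",
    "lineno": 1,  # placeholder value for illustration
    "docstring": "Create system tools for task delegation and artifact management.",
})
```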

**Benefits**:
- ✅ Avoids circular dependency (using LLM data to train LLM)
- ✅ Eliminates hallucination in training data
- ✅ Ensures factual accuracy
- ✅ Provides complete reasoning traces

### Training Pipeline

```
GitHub Repository
    ↓
[1. Repository Analyzer]
    → Extracts code elements, patterns, call graph
    ↓
[2. Data Generator]
    → Creates QA pairs with code context
    ↓
[3. Model Fine-tuner]
    → LoRA + DeepSpeed ZeRO-3 training
    ↓
[4. LoRA Merger]
    → Merges adapter into base model
    ↓
[5. Model Evaluator]
    → Compares base vs fine-tuned
    ↓
Fine-tuned Model
```
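
For step 4, folding the LoRA adapter back into the base weights can be done with PEFT's `merge_and_unload`. The sketch below is illustrative; the adapter and output paths are placeholders rather than the project's actual checkpoint locations.

```python
# Minimal sketch of merging a LoRA adapter into the base model with PEFT.
# Paths are placeholders; adjust to your checkpoint locations.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "output/lora_adapter").merge_and_unload()

merged.save_pretrained("output/merged_model")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("output/merged_model")
```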

## Extensibility

The training methodology is **repository-agnostic** and can be applied to any codebase:

### Adapt to Your Repository

```bash
# 1. Update configuration
python utils/config_manager.py https://github.com/your-org/your-repo

# 2. Analyze repository
python scripts/01_analyze_repo.py

# 3. Generate training data
python scripts/02_generate_data.py

# 4. Fine-tune model
deepspeed --num_gpus=2 scripts/03_train_model.py

# 5. Merge LoRA weights
python scripts/04_merge_weights.py

# 6. Evaluate
python scripts/05_evaluate.py
```

**Supported Languages** (currently):
- Python (primary)
- Markdown (documentation)

**Extensible to**:
- JavaScript/TypeScript
- Java
- Go
- Rust

## Ethical Considerations

- **Code Attribution**: All training data comes from the open-source Laddr repository
- **License Compliance**: Respects Apache 2.0 license of both base model and Laddr project
- **No Private Data**: Only uses publicly available code
- **Reproducibility**: Complete methodology documented for transparency

## Citation

If you use this model or methodology in your research, please cite:

```bibtex
@misc{qwen3-code-repo-finetuned-2025,
  title={Finetune any base model (e.g. Qwen3-8B) on any given code repository},
  author={Tensense},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/tensense/code_repo_finetuning}
}
```

## Acknowledgments

- **Base Model**: [Qwen Team](https://huggingface.co/Qwen) for Qwen3-8B
- **Laddr Project**: [AgnetLabs](https://github.com/AgnetLabs/Laddr) for the multi-agent framework
- **Training Framework**: HuggingFace Transformers, DeepSpeed, PEFT (LoRA)

## License

This model is released under the **Apache 2.0 License**, consistent with:
- Qwen3-8B base model license
- Laddr project license

## Model Card Authors

Tensense

## Model Card Contact

For questions or issues, please contact:
- Email: [email protected]
- GitHub: [TopologyApplied](https://github.com/TopologyApplied)
- HuggingFace: [tensense](https://huggingface.co/tensense)

---

## Additional Resources

- **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
- **Training Code**: [GitHub Repository](https://github.com/TopologyApplied/code_repo_finetuning)
- **Checkpoint & Finetuned Model**: [Huggingface](https://huggingface.co/tensense/code_repo_finetuning)
- **Laddr Project**: [GitHub](https://github.com/AgnetLabs/Laddr)
- **Evaluation Report**: [comparison_report_Laddr.json](https://github.com/TopologyApplied/code_repo_finetuning/blob/main/output/comparison_report_Laddr.json)
- **Design Documentation**: [Design docs (Chinese)](https://github.com/TopologyApplied/code_repo_finetuning/blob/main/代码仓库智能训练数据生成系统_设计文档.md)

## Version History

### v1.0 (2025-11-15)
- Initial release
- Fine-tuned on Laddr repository
- 650+ training samples
- LoRA rank 64, alpha 128
- 3 epochs training
- Overall improvement: +22.1%

---

**Note**: This is a demonstration of repository-specific fine-tuning methodology. The approach can be adapted to any codebase for creating custom code assistants.