---
language:
- zh
- en
license: apache-2.0
library_name: transformers
tags:
- code
- qwen
- lora
- repository-understanding
- code-assistant
- fine-tuning
- multi-agent-systems
base_model: Qwen/Qwen3-8B
datasets:
- custom
metrics:
- accuracy
- code_understanding
pipeline_tag: text-generation
model-index:
- name: code_repo_finetuning
  results:
  - task:
      type: text-generation
      name: Code Repository Understanding
    metrics:
    - type: accuracy
      value: 71.5
      name: Overall Score
    - type: improvement
      value: 22.1
      name: Improvement over Base Model
---

# Finetune any base model (e.g. Qwen3-8B) on any given code repository

## Model Description

This model is a fine-tuned version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B), trained to understand and answer questions about a given code repository, including private or newly created projects. The example repository used here is [Laddr](https://github.com/AgnetLabs/Laddr), a framework for building scalable multi-agent systems.

The fine-tuning was performed using **LoRA (Low-Rank Adaptation)** with an innovative training data generation approach that **does not rely on LLM-generated synthetic data**, avoiding circular dependencies and hallucination issues.

### Key Features

- **Project-Specific Knowledge**: Deep understanding of Laddr's architecture, codebase, and APIs
- **Code Location**: Accurately locates functions, classes, and modules (+30% improvement)
- **Code Understanding**: Explains code functionality with detailed context (+19.3% improvement)
- **Maintains General Abilities**: Retains the base model's general knowledge capabilities
- **Zero Hallucination Training Data**: Generated from real code via AST parsing, not LLM synthesis

## Model Details

### Base Model
- **Model**: Qwen/Qwen3-8B
- **Parameters**: 8 Billion
- **Architecture**: Transformer-based causal language model

### Fine-tuning Specifications
- **Method**: LoRA (Low-Rank Adaptation)
- **LoRA Rank**: 64
- **LoRA Alpha**: 128
- **Target Modules**: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Training Framework**: DeepSpeed ZeRO-3
- **Precision**: BF16
- **Epochs**: 3
- **Training Samples**: 650+
- **Training Time**: ~2-3 hours on 2x GPUs (48GB each)
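
A minimal sketch of this configuration using the PEFT library is shown below. It is illustrative only: the dropout value is an assumption not stated in this card, and the project's actual training script (with DeepSpeed ZeRO-3 integration) is more involved.

```python
# Illustrative LoRA setup matching the specifications above (PEFT library).
# lora_dropout is an assumed value; the real training script may differ.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=64,                 # LoRA rank
    lora_alpha=128,       # LoRA alpha
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,    # assumption, not documented in this card
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```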

### Training Data

The training dataset was **automatically generated** from the Laddr repository using:
- **Python AST parsing** for code structure extraction
- **Real docstrings** and code comments
- **Function signatures** and parameter information
- **Call graph relationships**
- **Project statistics** and module structure
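
The sketch below illustrates the core idea with Python's built-in `ast` module: extracting function names, argument lists, docstrings, and line numbers from a source file. The helper name and record fields are illustrative; the project's analyzer is considerably more extensive (call graphs, pattern mining, module statistics).

```python
# Simplified illustration of AST-based extraction of code facts.
# The actual repository analyzer in this project does much more.
import ast
from pathlib import Path

def extract_functions(path: str) -> list[dict]:
    """Return name, arguments, docstring, and line number for each function in a file."""
    tree = ast.parse(Path(path).read_text(encoding="utf-8"))
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            records.append({
                "name": node.name,
                "args": [a.arg for a in node.args.args],
                "docstring": ast.get_docstring(node) or "",
                "lineno": node.lineno,
                "file": path,
            })
    return records

# Example usage (path shown for illustration):
# for rec in extract_functions("lib/laddr/src/laddr/core/system_tools.py"):
#     print(rec["name"], rec["lineno"])
```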

**Data Composition**:
- Code Explanation: 300+ samples (46%)
- API Usage: 150+ samples (23%)
- Code Location: 100+ samples (15%)
- Project Overview: 50+ samples (8%)
- Design Proposals: 50+ samples (8%)

**Data Split**:
- Training: 80% (520+ samples)
- Validation: 10% (65+ samples)
- Test: 10% (65+ samples)

## Performance

### Overall Results

| Metric | Base Model | Fine-tuned | Improvement |
|--------|-----------|-----------|-------------|
| **Overall Score** | 49.4% | 71.5% | **+22.1%** ✅ |
| Code Location | 60.0% | 90.0% | **+30.0%** ⭐ |
| Code Understanding | 59.3% | 78.6% | +19.3% |
| Project Overview | 35.0% | 51.7% | +16.7% |
| General Knowledge | 10.0% | 30.0% | +20.0% |

### Detailed Performance by Task Type

**Code Location Tasks** (+30.0%):
- Accurately identifies file locations of functions/classes
- Provides complete file paths with line numbers
- Eliminates uncertainty in location queries

**Code Understanding Tasks** (+19.3%):
- Explains code functionality with context
- Includes function signatures and parameters
- Extracts and presents real docstrings

**Project Overview Tasks** (+16.7%):
- Describes project purpose and architecture
- Lists technology stack and dependencies
- Provides project statistics and structure

**General Knowledge** (+20.0%):
- Maintains ability to answer general programming questions
- Improves conciseness (from verbose to clear)
- Preserves base model capabilities

## Use Cases

### 1. Code Assistant for Laddr Project
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "tensense/code_repo_finetuning"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",   # load in the checkpoint's native precision (BF16)
    device_map="auto",    # requires `accelerate`; remove to load on CPU
    trust_remote_code=True,
)

# Build a chat-formatted prompt and move it to the model's device
question = "Where is the create_system_tools function located in Laddr?"
messages = [{"role": "user", "content": question}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens
outputs = model.generate(**inputs, max_new_tokens=512)
response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
print(response)
```

### 2. API Documentation Assistant
Query about function usage, parameters, and return values.

### 3. Codebase Onboarding
Help new developers quickly understand the Laddr project structure and key components.

### 4. Code Review Support
Provide context and explanations for code review processes.

## Example Queries

### Query 1: Code Location
**Question**: "Where is `create_system_tools` located in Laddr?"

**Response**:
> `create_system_tools` is located in `lib/laddr/src/laddr/core/system_tools.py`.
>
> **Function Description**:
> Create system tools for task delegation and artifact management. This function checks for user-provided overrides before registering the default implementations.
>
> **Function Signature**:
> ```python
> def create_system_tools(message_bus, storage_backend=None, agent=None) -> dict[str, tuple[Any, list[str]]]:
> ```
>
> **Parameters**:
> - `message_bus` (Any): Message queue backend
> - `storage_backend` (Any): Optional storage backend
> - `agent` (Any): Optional agent instance
>
> **Returns**: `dict[str, tuple[Any, list[str]]]`

### Query 2: Project Overview
**Question**: "What is Laddr's main functionality?"

**Response**:
> Laddr is a Python project that serves as a framework for building scalable multi-agent systems.
>
> **Main Technologies**: bandit, black, boto3-stubs, build, click, etc.
>
> The project contains 279 code elements, consisting of 67 classes and 212 functions/methods.
>
> **Core Modules**:
> - `core` (279 elements)
> - `cli` (52 elements)
> - `llms` (39 elements)

## Limitations

- **Project-Specific**: Optimized for Laddr project; may not perform as well on other codebases
- **Knowledge Cutoff**: Based on the Laddr repository as of training time (2025-01)
- **Language Focus**: Primarily trained on Python code and English/Chinese documentation
- **Limited General Coding**: While it maintains general knowledge, it's optimized for Laddr-specific queries

## Training Methodology

### Innovation: LLM-Free Training Data Generation

Unlike traditional approaches that use LLMs to generate synthetic training data, this project employs a novel methodology:

1. **AST-Based Code Parsing**: Python Abstract Syntax Tree analysis extracts accurate code structure
2. **Real Documentation**: Utilizes actual docstrings, comments, and code signatures
3. **Call Graph Analysis**: Builds function dependency relationships
4. **Pattern Extraction**: Identifies code patterns (implementation, usage, interaction)
5. **Template-Based QA**: Generates question-answer pairs using templates with real code context
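
As a rough illustration of step 5, the sketch below turns one extracted record into a chat-style QA pair. The template wording, field names, and the placeholder line number are assumptions for illustration, not the project's actual templates.

```python
# Illustrative template-based QA generation from extracted code facts.
# Template text and record fields are assumptions, not the project's exact ones.
def make_location_sample(record: dict) -> dict:
    question = f"Where is the `{record['name']}` function located?"
    answer = (
        f"`{record['name']}` is located in `{record['file']}` "
        f"(line {record['lineno']}).\n\n{record['docstring']}"
    )
    return {"messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]}

sample = make_location_sample({
    "name": "create_system_tools",
    "file": "lib/laddr/src/laddr/core/system_tools.py",
    "lineno": 1,  # placeholder value for illustration
    "docstring": "Create system tools for task delegation and artifact management.",
})
```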

**Benefits**:
- ✅ Avoids circular dependency (using LLM data to train LLM)
- ✅ Eliminates hallucination in training data
- ✅ Ensures factual accuracy
- ✅ Provides complete reasoning traces

### Training Pipeline

```
GitHub Repository
    ↓
[1. Repository Analyzer]
    → Extracts code elements, patterns, call graph
    ↓
[2. Data Generator]
    → Creates QA pairs with code context
    ↓
[3. Model Fine-tuner]
    → LoRA + DeepSpeed ZeRO-3 training
    ↓
[4. LoRA Merger]
    → Merges adapter into base model
    ↓
[5. Model Evaluator]
    → Compares base vs fine-tuned
    ↓
Fine-tuned Model
```
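
For step 4, folding the LoRA adapter back into the base weights can be done with PEFT's `merge_and_unload`. The sketch below is illustrative; the adapter and output paths are placeholders rather than the project's actual checkpoint locations.

```python
# Minimal sketch of merging a LoRA adapter into the base model with PEFT.
# Paths are placeholders; adjust to your checkpoint locations.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "output/lora_adapter").merge_and_unload()

merged.save_pretrained("output/merged_model")
AutoTokenizer.from_pretrained("Qwen/Qwen3-8B").save_pretrained("output/merged_model")
```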

## Extensibility

The training methodology is **repository-agnostic** and can be applied to any codebase:

### Adapt to Your Repository

```bash
# 1. Update configuration
python utils/config_manager.py https://github.com/your-org/your-repo

# 2. Analyze repository
python scripts/01_analyze_repo.py

# 3. Generate training data
python scripts/02_generate_data.py

# 4. Fine-tune model
deepspeed --num_gpus=2 scripts/03_train_model.py

# 5. Merge LoRA weights
python scripts/04_merge_weights.py

# 6. Evaluate
python scripts/05_evaluate.py
```

**Supported Languages** (currently):
- Python (primary)
- Markdown (documentation)

**Extensible to**:
- JavaScript/TypeScript
- Java
- Go
- Rust

## Ethical Considerations

- **Code Attribution**: All training data comes from the open-source Laddr repository
- **License Compliance**: Respects Apache 2.0 license of both base model and Laddr project
- **No Private Data**: Only uses publicly available code
- **Reproducibility**: Complete methodology documented for transparency

## Citation

If you use this model or methodology in your research, please cite:

```bibtex
@misc{qwen3-code-repo-finetuned-2025,
  title={Finetune any base model (e.g. Qwen3-8B) on any given code repository},
  author={Tensense},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/tensense/code_repo_finetuning}
}
```

## Acknowledgments

- **Base Model**: [Qwen Team](https://huggingface.co/Qwen) for Qwen3-8B
- **Laddr Project**: [AgnetLabs](https://github.com/AgnetLabs/Laddr) for the multi-agent framework
- **Training Framework**: HuggingFace Transformers, DeepSpeed, PEFT (LoRA)

## License

This model is released under the **Apache 2.0 License**, consistent with:
- Qwen3-8B base model license
- Laddr project license

## Model Card Authors

Tensense

## Model Card Contact

For questions or issues, please contact:
- Email: [email protected]
- GitHub: [TopologyApplied](https://github.com/TopologyApplied)
- HuggingFace: [tensense](https://huggingface.co/tensense)

---

## Additional Resources

- **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
- **Training Code**: [GitHub Repository](https://github.com/TopologyApplied/code_repo_finetuning)
- **Checkpoint & Finetuned Model**: [Huggingface](https://huggingface.co/tensense/code_repo_finetuning)
- **Laddr Project**: [GitHub](https://github.com/AgnetLabs/Laddr)
- **Evaluation Report**: [comparison_report_Laddr.json](https://github.com/TopologyApplied/code_repo_finetuning/blob/main/output/comparison_report_Laddr.json)
- **Design Documentation**: [Design docs (Chinese)](https://github.com/TopologyApplied/code_repo_finetuning/blob/main/代码仓库智能训练数据生成系统_设计文档.md)

## Version History

### v1.0 (2025-11-15)
- Initial release
- Fine-tuned on Laddr repository
- 650+ training samples
- LoRA rank 64, alpha 128
- 3 epochs training
- Overall improvement: +22.1%

---

**Note**: This is a demonstration of repository-specific fine-tuning methodology. The approach can be adapted to any codebase for creating custom code assistants.