DistilBERT for Host-based Intrusion Detection System (HIDS)
This model is a DistilBERT model fine-tuned for binary classification of system call sequences, detecting intrusions in the ADFA-LD dataset. Hyperparameters were tuned to maximize validation performance for host-based intrusion detection.
Model Details
Base Model
- Architecture: DistilBERT (DistilBertForSequenceClassification)
- Base Model: distilbert-base-uncased
- Task: Binary Sequence Classification (Normal vs. Attack)
- Number of Labels: 2
Training Configuration
- Training Epochs: 8
- Batch Size: 32
- Learning Rate: 2e-05
- Weight Decay: 0.0
- Warmup Ratio: 0.1
- Optimizer: AdamW
- Scheduler: LinearLR
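For reference, the configuration above maps onto Hugging Face TrainingArguments roughly as follows. This is a minimal sketch assuming the standard Trainer API was used; the actual training script is not published.

from transformers import TrainingArguments

# Sketch of the training configuration listed above (assumed Trainer setup)
training_args = TrainingArguments(
    output_dir="distilbert-hids-adfa",
    num_train_epochs=8,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.0,
    warmup_ratio=0.1,            # linear warmup over the first 10% of steps
    optim="adamw_torch",         # AdamW optimizer
    lr_scheduler_type="linear",  # linear decay after warmup
)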
Dataset
- Dataset: ADFA-LD (Australian Defence Force Academy Linux Dataset)
- Preprocessing: 18-gram sequences
Performance
Validation Metrics
- Accuracy: 94.03%
- F1 Score: 94.50%
- Precision: 92.45%
- Recall: 96.64%
- AUC-ROC: 96.30%
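Metrics of this kind can be computed from model outputs with scikit-learn. The snippet below is an illustrative sketch only; y_true and y_score are placeholder arrays, not the actual validation data.

import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Placeholder arrays; substitute real validation labels and model outputs
y_true = np.array([0, 1, 1, 0])            # ground truth (0=Normal, 1=Attack)
y_score = np.array([0.1, 0.9, 0.7, 0.4])   # predicted probability of Attack
y_pred = (y_score >= 0.5).astype(int)      # hard predictions at 0.5 threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))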
Usage
You can use this model directly with a pipeline for text classification. Pass top_k=None to get scores for both labels (by default the pipeline returns only the top label):
>>> from transformers import pipeline
>>> classifier = pipeline('text-classification', model='salsazufar/distilbert-base-hids-adfa')
>>> classifier("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18", top_k=None)
[{'label': 'LABEL_0', 'score': 0.9876},
 {'label': 'LABEL_1', 'score': 0.0124}]
Here LABEL_0 corresponds to "Normal" and LABEL_1 to "Attack" (the class names used in the PyTorch example below).
Here is how to use this model to classify a system call sequence directly in PyTorch:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model = AutoModelForSequenceClassification.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model.eval()

# Prepare input (18-gram system call sequence)
text = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
# max_length=20 leaves room for the [CLS] and [SEP] special tokens;
# multi-digit system call numbers may be split into sub-tokens, in which
# case a larger max_length may be needed
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=20)

# Forward pass
with torch.no_grad():
    output = model(**encoded_input)
    logits = output.logits

probabilities = torch.softmax(logits, dim=-1)
predicted_class = torch.argmax(logits, dim=-1).item()

# Interpret results
class_names = ["Normal", "Attack"]
print(f"Predicted class: {class_names[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class].item():.4f}")
print(f"Probabilities: Normal={probabilities[0][0].item():.4f}, Attack={probabilities[0][1].item():.4f}")
Data Preprocessing
This model expects input in 18-gram format. If you have raw system call traces, you need to:
- Extract system calls from the trace files
- Convert them to n-grams (n=18)
- Format each n-gram as a space-separated string
- Ensure sequences are exactly 18 calls long (pad or truncate if necessary; a padding helper is sketched after the example below)
Example preprocessing pipeline:
def create_ngrams(trace, n=18):
    """Convert a system call trace to overlapping n-grams."""
    ngrams = []
    for i in range(len(trace) - n + 1):
        ngram = trace[i:i+n]
        ngrams.append(" ".join(map(str, ngram)))
    return ngrams
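create_ngrams yields nothing for traces shorter than 18 calls, so short traces must be padded first. The helper below is a hedged sketch: the pad value of 0 and the "flag the trace if any window is flagged" aggregation rule are illustrative assumptions, not part of the published training setup. It reuses the classifier pipeline from the Usage section.

def pad_trace(trace, n=18, pad_value=0):
    """Pad a short trace to length n (pad_value=0 is an assumed convention)."""
    return list(trace) + [pad_value] * max(0, n - len(trace))

# Illustrative trace of raw system call numbers (placeholder values)
trace = [6, 6, 63, 6, 42, 120, 6, 195, 120, 6, 6, 114, 114, 1, 1, 252, 252, 252, 5, 3]

# Score every 18-gram window; flag the trace if any window is predicted Attack
windows = create_ngrams(pad_trace(trace), n=18)
predictions = classifier(windows)  # classifier from the Usage section above
is_attack = any(p['label'] == 'LABEL_1' for p in predictions)
print("Trace verdict:", "Attack" if is_attack else "Normal")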
Limitations and Considerations
Domain Specific: This model is trained specifically on the ADFA-LD dataset and may not generalize well to other system call datasets without retraining.
Input Format: The model expects 18-gram sequences. Raw system calls must be preprocessed accordingly.
Binary Classification: The model only distinguishes between "Normal" and "Attack" classes. It does not classify specific attack types.
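If the generic LABEL_0/LABEL_1 names are inconvenient, human-readable names can be attached at load time. A small sketch, assuming the label order Normal=0, Attack=1 used in the examples above:

from transformers import AutoModelForSequenceClassification

# Override the config's label names (assumes Normal=0, Attack=1)
model = AutoModelForSequenceClassification.from_pretrained(
    'salsazufar/distilbert-base-hids-adfa',
    id2label={0: "Normal", 1: "Attack"},
    label2id={"Normal": 0, "Attack": 1},
)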
BibTeX entry and citation info
@misc{distilbert-hids-adfa,
  title={DistilBERT for Host-based Intrusion Detection on ADFA-LD Dataset},
  author={salsazufar},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/salsazufar/distilbert-base-hids-adfa}}
}
References
- ADFA-LD: An Anomaly Detection Dataset for Linux-based Host Intrusion Detection Systems
- DistilBERT: Sanh et al. (2019), DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
License
This model is licensed under the Apache 2.0 license.