DistilBERT for Host-based Intrusion Detection System (HIDS)

This is a DistilBERT model fine-tuned for binary classification of system call sequences, used to detect intrusions in the ADFA-LD dataset. The training configuration listed below was selected through hyperparameter tuning.

Model Details

Base Model

  • Architecture: DistilBERT (DistilBertForSequenceClassification)
  • Base Model: distilbert-base-uncased
  • Task: Binary Sequence Classification (Normal vs Attack)
  • Number of Labels: 2

Training Configuration

  • Training Epochs: 8
  • Batch Size: 32
  • Learning Rate: 2e-05
  • Weight Decay: 0.0
  • Warmup Ratio: 0.1
  • Optimizer: AdamW
  • Scheduler: LinearLR
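
To make the warmup ratio and scheduler settings concrete, here is a minimal sketch of the learning rate at a given optimizer step. It assumes "LinearLR" means linear warmup followed by linear decay to zero, which this card does not confirm. The function name `linear_lr` and the step counts in the usage lines are hypothetical; the real number of steps depends on the training set size, which is not stated here.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay.

    Assumption: the card's "LinearLR" scheduler is a warmup-then-decay
    schedule; `total_steps` is hypothetical and depends on dataset size.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1000 optimizer steps: the first 100 are warmup.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

Under these assumptions the rate peaks at 2e-05 when warmup ends and reaches zero at the final step.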

Dataset

  • Dataset: ADFA-LD (Australian Defence Force Academy Linux Dataset)
  • Preprocessing: 18-gram sequences

Performance

Validation Metrics

  • Accuracy: 94.03%
  • F1 Score: 94.50%
  • Precision: 92.45%
  • Recall: 96.64%
  • AUC-ROC: 96.30%
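
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_from_precision_recall` is illustrative only.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above, expressed as fractions.
f1 = f1_from_precision_recall(0.9245, 0.9664)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.9450"
```

The result agrees with the reported F1 score of 94.50%.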

Usage

You can use this model directly with a pipeline for text classification:

>>> from transformers import pipeline

>>> classifier = pipeline('text-classification', model='salsazufar/distilbert-base-hids-adfa')
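>>> # top_k=None returns the score for every label; by default only the top label is returned
>>> classifier("1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18", top_k=None)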

[{'label': 'LABEL_0',
  'score': 0.9876},
 {'label': 'LABEL_1',
  'score': 0.0124}]
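
The pipeline reports generic label names. The sketch below maps them to readable class names, assuming index 0 is "Normal" and index 1 is "Attack", the same order as the `class_names` list in this card's PyTorch example. `LABEL_NAMES` and `summarize` are hypothetical helpers, not part of the model or of transformers.

```python
# Assumed mapping: index 0 = Normal, index 1 = Attack.
LABEL_NAMES = {"LABEL_0": "Normal", "LABEL_1": "Attack"}


def summarize(scores):
    """Return (class name, score) for the highest-scoring pipeline entry."""
    best = max(scores, key=lambda item: item["score"])
    # Fall back to the raw label if it is not in the assumed mapping.
    return LABEL_NAMES.get(best["label"], best["label"]), best["score"]


# Scores copied from the example output above.
example = [{"label": "LABEL_0", "score": 0.9876},
           {"label": "LABEL_1", "score": 0.0124}]
print(summarize(example))  # prints "('Normal', 0.9876)"
```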

Here is how to run the model directly in PyTorch and read the class probabilities for a system call sequence:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('salsazufar/distilbert-base-hids-adfa')
model = AutoModelForSequenceClassification.from_pretrained('salsazufar/distilbert-base-hids-adfa')

# Prepare input (18-gram system call sequence)
text = "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
encoded_input = tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=20)

# Forward pass
with torch.no_grad():
    output = model(**encoded_input)
    logits = output.logits
    probabilities = torch.softmax(logits, dim=-1)
    predicted_class = torch.argmax(logits, dim=-1).item()

# Interpret results
class_names = ["Normal", "Attack"]
print(f"Predicted class: {class_names[predicted_class]}")
print(f"Confidence: {probabilities[0][predicted_class].item():.4f}")
print(f"Probabilities: Normal={probabilities[0][0].item():.4f}, Attack={probabilities[0][1].item():.4f}")

Data Preprocessing

This model expects input in 18-gram format. If you have raw system call traces, you need to:

  1. Extract system calls from trace files
  2. Convert to n-grams (n=18)
  3. Format as space-separated string
  4. Ensure sequences are exactly 18 tokens (pad or truncate if necessary)

Example preprocessing pipeline:

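def create_ngrams(trace, n=18):
    """Convert a system call trace (a list of call numbers) into n-gram strings.

    Returns one space-separated string per sliding window of length n.
    A trace shorter than n yields an empty list.
    """
    ngrams = []
    # Slide a window of n calls over the trace, one call at a time
    for i in range(len(trace) - n + 1):
        ngram = trace[i:i+n]
        ngrams.append(" ".join(map(str, ngram)))
    return ngrams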
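
The four preprocessing steps can be sketched end to end as follows. This assumes each trace is a string of whitespace-separated system call numbers. `parse_trace` and `trace_to_inputs` are hypothetical helper names and the sample trace is made up. Returning a trace shorter than 18 calls as one short string, and leaving the padding to the tokenizer, is also an assumption rather than the author's documented procedure.

```python
def parse_trace(text):
    """Parse a whitespace-separated trace of system call numbers into ints."""
    return [int(tok) for tok in text.split()]


def trace_to_inputs(trace, n=18):
    """Turn a list of system call numbers into space-separated n-gram strings.

    Assumption: a trace shorter than n is returned as a single short string,
    leaving padding to the tokenizer (padding='max_length').
    """
    if not trace:
        return []
    if len(trace) < n:
        return [" ".join(map(str, trace))]
    # One string per sliding window of n consecutive calls.
    return [" ".join(map(str, trace[i:i + n]))
            for i in range(len(trace) - n + 1)]


# Made-up trace of 20 system call numbers, for illustration only.
raw = "6 6 63 6 42 120 6 195 120 6 6 114 114 1 1 252 252 252 1 1"
windows = trace_to_inputs(parse_trace(raw))
print(len(windows))  # prints 3: a 20-call trace yields 20 - 18 + 1 windows
print(windows[0])
```

Each resulting string can be passed to the tokenizer or pipeline in the same way as the example sequence in the Usage section.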

Limitations and Considerations

  1. Domain Specific: This model is trained specifically on the ADFA-LD dataset and may not generalize well to other system call datasets without retraining.

  2. Input Format: The model expects 18-gram sequences. Raw system calls must be preprocessed accordingly.

  3. Binary Classification: The model only distinguishes between "Normal" and "Attack" classes. It does not classify specific attack types.

BibTeX entry and citation info

@misc{distilbert-hids-adfa,
  title={DistilBERT for Host-based Intrusion Detection on ADFA-LD Dataset},
  author={salsazufar},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/salsazufar/distilbert-base-hids-adfa}}
}

License

This model is licensed under the Apache 2.0 license.
