ThreatExtract-IOC-NER

A Named Entity Recognition (NER) model fine-tuned to extract Indicators of Compromise (IOCs) from cybersecurity threat intelligence text.

Model Description

ThreatExtract-IOC-NER is a transformer-based token classification model trained to identify and extract various types of IOCs from security reports, incident summaries, malware analyses, and other threat intelligence documents.

Supported IOC Types

Category	Entity Types	Description
Network	`IPV4`, `IPV6`, `DOMAIN`, `URL`, `EMAIL`	Network-based indicators
File Hashes	`MD5`, `SHA1`, `SHA256`	Cryptographic file hashes
Vulnerabilities	`CVE`	Common Vulnerabilities and Exposures
Threat Intel	`MALWARE`, `THREAT_ACTOR`, `CAMPAIGN`, `TOOL`, `TECHNIQUE`	Threat intelligence entities
System	`REGISTRY_KEY`, `FILE_PATH`, `FILE_NAME`	System-level indicators

Usage

With Transformers Pipeline

from transformers import pipeline

# Load the model
ner = pipeline("ner", model="fmt0816/ThreatExtract-IOC-NER", aggregation_strategy="simple")

# Extract IOCs
text = "APT29 exploited CVE-2021-44228 to deploy Cobalt Strike, connecting to 185.220.101.1"
results = ner(text)

for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} (score: {entity['score']:.2f})")
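
The aggregated pipeline output is a list of dictionaries with `entity_group`, `word`, and `score` keys, as used in the loop above. A minimal sketch of grouping those results by IOC type follows. The helper name `group_by_type`, the 0.5 threshold, and the sample entries are illustrative assumptions, not real model output.

```python
from collections import defaultdict

def group_by_type(entities, min_score=0.5):
    """Group aggregated NER results by entity type, dropping low-confidence hits."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] < min_score:
            continue
        value = entity["word"].strip()
        # Keep each value once per type, in first-seen order
        if value not in grouped[entity["entity_group"]]:
            grouped[entity["entity_group"]].append(value)
    return dict(grouped)

# Hand-written sample in the pipeline's output shape (scores are made up)
sample = [
    {"entity_group": "THREAT_ACTOR", "word": "APT29", "score": 0.97},
    {"entity_group": "CVE", "word": "CVE-2021-44228", "score": 0.99},
    {"entity_group": "IPV4", "word": "185.220.101.1", "score": 0.98},
    {"entity_group": "IPV4", "word": "185.220.101.1", "score": 0.95},
    {"entity_group": "TOOL", "word": "Strike", "score": 0.31},
]
grouped_sample = group_by_type(sample)
print(grouped_sample)
```

In practice, `sample` would be replaced by the `results` list returned by `ner(text)`.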

With ThreatExtract Library

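# Assumes the ThreatExtract repository root is on the Python path
from src.threatextract import IOCExtractionPipeline

# Load the pipeline (named `extractor` to avoid shadowing transformers.pipeline)
extractor = IOCExtractionPipeline.from_pretrained("fmt0816/ThreatExtract-IOC-NER")

# Extract IOCs with validation
iocs = extractor.extract(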
    "Lazarus Group used Mimikatz to dump credentials from 192.168.1.100",
    min_confidence=0.5
)

for ioc in iocs:
    print(f"{ioc.entity_type}: {ioc.value} (confidence: {ioc.confidence:.2%})")
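
This card does not describe how the library validates extracted IOCs. As a rough illustration only, the sketch below shows one way to check that a predicted value has a plausible surface form for its type. The function `is_plausible_ioc` and its patterns are hypothetical and are not part of the ThreatExtract API.

```python
import ipaddress
import re

# Patterns for IOC types that have a fixed surface form (illustrative, not exhaustive)
_PATTERNS = {
    "MD5": re.compile(r"[0-9a-fA-F]{32}"),
    "SHA1": re.compile(r"[0-9a-fA-F]{40}"),
    "SHA256": re.compile(r"[0-9a-fA-F]{64}"),
    "CVE": re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE),
}

def is_plausible_ioc(entity_type, value):
    """Return True if `value` looks like a well-formed IOC of `entity_type`."""
    value = value.strip()
    if entity_type in ("IPV4", "IPV6"):
        try:
            parsed = ipaddress.ip_address(value)
        except ValueError:
            return False
        return parsed.version == (4 if entity_type == "IPV4" else 6)
    pattern = _PATTERNS.get(entity_type)
    if pattern is None:
        # Free-form types (MALWARE, THREAT_ACTOR, ...) have no fixed format to check
        return True
    return pattern.fullmatch(value) is not None

print(is_plausible_ioc("IPV4", "192.168.1.100"))
print(is_plausible_ioc("IPV4", "999.1.1.1"))
```

A check like this can filter out predictions where the model tagged a malformed string, which is useful because a token classifier has no built-in notion of format validity.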

Direct Model Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("fmt0816/ThreatExtract-IOC-NER")
model = AutoModelForTokenClassification.from_pretrained("fmt0816/ThreatExtract-IOC-NER")

# Tokenize input
text = "The malware Emotet connected to evil-domain.com"
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

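# Decode predictions, skipping special tokens such as [CLS] and [SEP]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Output is per subword token, so one entity may span several lines
for token, label in zip(tokens, labels):
    if token in tokenizer.all_special_tokens:
        continue
    if label != "O":
        print(f"{token}: {label}")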
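
The loop above prints one label per subword token, so a single IOC can be spread over several tokens. A minimal sketch of merging BIO-tagged tokens back into whole entities is shown below. It assumes character offsets are available for each token (fast tokenizers can return them with `return_offsets_mapping=True`). The helper `merge_bio_spans` and the hand-written labels and offsets are illustrative, not real model output.

```python
def merge_bio_spans(text, labels, offsets):
    """Merge per-token BIO labels into (entity_type, surface_text) spans.

    labels  -- one BIO label per token, e.g. "B-MALWARE", "I-MALWARE", "O"
    offsets -- one (start, end) character span per token; (0, 0) marks special tokens
    """
    spans = []
    current = None  # [entity_type, start, end] of the span being built
    for label, (start, end) in zip(labels, offsets):
        if start == end:  # special token such as [CLS] or [SEP]
            continue
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and current and current[0] == entity_type:
            current[2] = end  # extend the open span
            continue
        if current:
            spans.append((current[0], text[current[1]:current[2]]))
            current = None
        if prefix in ("B", "I"):  # a stray I- tag is treated as a span start
            current = [entity_type, start, end]
    if current:
        spans.append((current[0], text[current[1]:current[2]]))
    return spans

text = "The malware Emotet connected to evil-domain.com"
# Hypothetical subword split: "Emo" + "tet", and "evil" + "-" + "domain" + ".com"
offsets = [(0, 0), (0, 3), (4, 11), (12, 15), (15, 18), (19, 28), (29, 31),
           (32, 36), (36, 37), (37, 43), (43, 47), (0, 0)]
labels = ["O", "O", "O", "B-MALWARE", "I-MALWARE", "O", "O",
          "B-DOMAIN", "I-DOMAIN", "I-DOMAIN", "I-DOMAIN", "O"]
merged = merge_bio_spans(text, labels, offsets)
print(merged)
```

Slicing the original text by character offsets avoids having to undo the tokenizer's subword markers by hand.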

Training Details

Base Model

Architecture: DeBERTa-v3-base (or specified base model)
Parameters: ~86M backbone (base) / ~304M backbone (large), excluding the embedding layer

Training Data

Synthetic threat intelligence text with labeled IOCs
Templates based on real-world security reports
BIO tagging scheme (Beginning-Inside-Outside)
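
To make the BIO scheme concrete, the sketch below builds a label set from the entity types listed under Supported IOC Types and tags one sentence by hand. The label order and ids here are assumptions for illustration; the actual mapping is whatever `model.config.id2label` contains. The example tags are hand-written, not model output.

```python
ENTITY_TYPES = [
    "IPV4", "IPV6", "DOMAIN", "URL", "EMAIL",
    "MD5", "SHA1", "SHA256",
    "CVE",
    "MALWARE", "THREAT_ACTOR", "CAMPAIGN", "TOOL", "TECHNIQUE",
    "REGISTRY_KEY", "FILE_PATH", "FILE_NAME",
]

# One "O" label plus a B- (beginning) and an I- (inside) label per entity type
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: idx for idx, label in enumerate(LABELS)}

# Word-level tagging of one sentence: multi-word entities start with B- and continue with I-
words = ["Lazarus", "Group", "used", "Mimikatz", "on", "192.168.1.100"]
tags = ["B-THREAT_ACTOR", "I-THREAT_ACTOR", "O", "B-TOOL", "O", "B-IPV4"]

for word, tag in zip(words, tags):
    print(f"{word}\t{tag}\t{label2id[tag]}")
```

With 17 entity types this yields 35 labels in total.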

Hyperparameters

Learning Rate: 2e-5
Batch Size: 16
Epochs: 10
Max Sequence Length: 512
Optimizer: AdamW
Scheduler: Cosine with warmup
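
The "cosine with warmup" schedule can be sketched as a plain function of the training step: the learning rate rises linearly from 0 to the base rate during warmup, then decays along half a cosine wave to 0. The base rate of 2e-5 comes from the list above. The warmup length and total step count are not stated in this card, so the values used below are assumptions.

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Assumed schedule: 1000 total steps, 100 warmup steps
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_lr_with_warmup(step, total_steps=1000, warmup_steps=100))
```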

Evaluation Results

Performance metrics on synthetic threat intelligence test data (results may vary on real-world data):

Metric	Score
F1 Score	0.92
Precision	0.91
Recall	0.93
Accuracy	0.96
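
Assuming the reported F1 is the usual harmonic mean of precision and recall, it can be reproduced from the other two rows. The card does not say whether scores are computed per entity or per token, so this only checks the arithmetic.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

overall_f1 = f1_score(0.91, 0.93)
print(round(overall_f1, 2))  # 0.92
```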

Per-Entity Performance

Entity Type	Precision	Recall	F1
IPV4	0.98	0.97	0.97
DOMAIN	0.94	0.92	0.93
MALWARE	0.89	0.91	0.90
THREAT_ACTOR	0.91	0.90	0.90
CVE	0.99	0.99	0.99
SHA256	0.97	0.98	0.97
MD5	0.96	0.95	0.96
TOOL	0.88	0.86	0.87
URL	0.93	0.91	0.92

Note: These metrics were measured on synthetic test data and may not carry over to real-world reports. For production use, fine-tune on your own labeled threat intelligence corpus.

Limitations

Domain Specificity: Optimized for threat intelligence text; may not perform well on general text
Language: Currently only supports English
Context Length: Limited to 512 tokens
Novel Entities: May not recognize newly emerged malware or threat actor names
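
One common way to work around the 512-token limit is to split long documents into overlapping windows and run the model on each one. The sketch below shows the windowing arithmetic on a plain list of token ids. The helper `sliding_windows`, the overlap of 128 tokens, and the allowance of 2 special tokens per window are assumptions. Fast tokenizers in Transformers can produce similar windows directly via `return_overflowing_tokens=True` together with `stride`.

```python
def sliding_windows(token_ids, max_len=512, stride=128, num_special=2):
    """Split token ids into overlapping windows that fit the model.

    stride      -- number of tokens shared between consecutive windows
    num_special -- slots reserved per window for special tokens ([CLS], [SEP])
    Returns a list of (start_index, window_token_ids) pairs.
    """
    body = max_len - num_special  # usable tokens per window
    if stride >= body:
        raise ValueError("stride must be smaller than the usable window size")
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + body]))
        if start + body >= len(token_ids):
            break
        start += body - stride
    return windows

# Dummy ids standing in for a 1200-token document
doc = list(range(1200))
chunks = sliding_windows(doc)
print([(start, len(ids)) for start, ids in chunks])
```

Entities that fall inside an overlap region can be predicted twice, so results from adjacent windows should be deduplicated, for example by character offset.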

Ethical Considerations

This model is intended for defensive security purposes only:

Threat intelligence analysis
Security monitoring and alerting
Incident response
Malware research

Do not use this model for malicious purposes.

Citation

@misc{threatextract-ioc-ner,
  title={ThreatExtract-IOC-NER: Named Entity Recognition for Threat Intelligence},
  author={ThreatExtract Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/fmt0816/ThreatExtract-IOC-NER}
}

License

This model is released under the MIT License.

Acknowledgments

Built with Hugging Face Transformers
Inspired by the cybersecurity community's need for automated IOC extraction
Thanks to the open-source security research community

Model size: 0.2B params (Safetensors)
Tensor type: F32
