LLM Guardrail Models are Less Robust Against Text Mutation Attacks

Blog post - https://huggingface.co/blog/kalyan-ks/llm-guardrail-models-less-robust

Evaluated the robustness of three LLM guardrail models (GLiGuard, LlamaGuard3 and MiniGuard).
Evaluation is done using 16 text mutation attacks over three datasets (AEGIS 2.0, WildGuard and ExpGuard).
Achieved average Unsafe ASR score of up to 33% and average Safe ASR score of up to 25% against GLiGuard model.
Achieved average Unsafe ASR score of up to 35% and average Safe ASR score of up to 17% against LlamaGuard3-8B model.
Achieved average Unsafe ASR score of up to 45% and average Safe ASR score of up to 15% against MiniGuard v0.1 model.
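The post does not spell out how ASR is computed. As a rough sketch only, assuming Unsafe ASR is the share of unsafe prompts that the guardrail classified correctly before mutation and labels as safe after mutation (with Safe ASR as the mirror case for safe prompts), the metric could be computed as below. The `attack_success_rate` helper and the toy labels are hypothetical and are not taken from the blog post.

```python
from typing import Sequence


def attack_success_rate(
    gold: Sequence[str],
    before: Sequence[str],
    after: Sequence[str],
    target: str,
) -> float:
    """Assumed ASR definition: among prompts whose gold label is `target`
    and which the guardrail classified correctly before mutation, return
    the fraction whose prediction flips after mutation."""
    eligible = [
        a for g, b, a in zip(gold, before, after)
        if g == target and b == target
    ]
    if not eligible:
        return 0.0
    flipped = sum(1 for a in eligible if a != target)
    return flipped / len(eligible)


# Hypothetical toy data: gold labels, predictions on the original prompts,
# and predictions on the mutated prompts.
gold   = ["unsafe", "unsafe", "unsafe", "unsafe", "safe", "safe", "safe", "safe"]
before = ["unsafe", "unsafe", "unsafe", "safe", "safe", "safe", "safe", "unsafe"]
after  = ["safe", "unsafe", "unsafe", "safe", "unsafe", "unsafe", "safe", "unsafe"]

unsafe_asr = attack_success_rate(gold, before, after, "unsafe")
safe_asr = attack_success_rate(gold, before, after, "safe")
print(f"Unsafe ASR: {unsafe_asr:.2%}, Safe ASR: {safe_asr:.2%}")
```

Restricting the denominator to prompts the model got right before mutation keeps ordinary misclassifications from being counted as attack successes. Other definitions exist, so the blog post remains the authority on the exact formula used.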
ExpGuard (Collection): ExpGuard safety classifier models for detecting unsafe content. Available in multiple sizes: 1.5B, 3B, and 7B parameters. • 3 items • Updated Jan 28
Article: LLM Guardrail Models are Less Robust Against Text Mutation Attacks • kalyan-ks • 5 days ago