Instructions to use DJLougen/Ornstein-27B-SABER-RYS-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use DJLougen/Ornstein-27B-SABER-RYS-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="DJLougen/Ornstein-27B-SABER-RYS-GGUF",
	filename="Ornstein-27B-SABER-RYS-L-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use DJLougen/Ornstein-27B-SABER-RYS-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M

Use Docker

docker model run hf.co/DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M


vLLM

How to use DJLougen/Ornstein-27B-SABER-RYS-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "DJLougen/Ornstein-27B-SABER-RYS-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DJLougen/Ornstein-27B-SABER-RYS-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M

Ollama
How to use DJLougen/Ornstein-27B-SABER-RYS-GGUF with Ollama:
```
ollama run hf.co/DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M
```

Unsloth Studio

How to use DJLougen/Ornstein-27B-SABER-RYS-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for DJLougen/Ornstein-27B-SABER-RYS-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for DJLougen/Ornstein-27B-SABER-RYS-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for DJLougen/Ornstein-27B-SABER-RYS-GGUF to start chatting

Pi

How to use DJLougen/Ornstein-27B-SABER-RYS-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use DJLougen/Ornstein-27B-SABER-RYS-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use DJLougen/Ornstein-27B-SABER-RYS-GGUF with Docker Model Runner:
```
docker model run hf.co/DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M
```

Lemonade

How to use DJLougen/Ornstein-27B-SABER-RYS-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull DJLougen/Ornstein-27B-SABER-RYS-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Ornstein-27B-SABER-RYS-GGUF-Q4_K_M

List all available models

lemonade list

DJLougen/Ornstein-27B-SABER-RYS

0% refusal. Near-zero perplexity change (+0.6%). Layer-duplicated reasoning boost.

This model combines two complementary, training-free surgical techniques applied to DJLougen/Ornstein-27B:

SABER (Spectral Analysis-Based Entanglement Resolution) — removes safety refusal behavior while preserving capability
RYS (Repeat Your Self) — duplicates reasoning-circuit layers to improve reasoning and emotional intelligence

Both techniques are training-free, but they act differently: SABER edits weights through targeted direction ablation (orthogonal projections), while RYS changes only the model structure by duplicating layers and leaves every weight untouched.

SABER: Refusal Ablation

Key Results

| Metric | Baseline | SABER-Refined | Delta |
|---|---|---|---|
| Refusal Rate | 100% | 0% | -100% |
| Perplexity | 3.5 | 3.5 | +0.6% |
| Directions Ablated | — | 125 (across 25 layers) | — |

The refusal circuit is cleanly separated from capability: removing it shifts perplexity by only +0.6%, a change that is invisible at the one-decimal precision reported above (3.5 before and after).

How SABER Works

SABER identifies and ablates the refusal circuit through a five-stage pipeline:

Stage 1 — Probing: Extract activation profiles from both harmful and harmless inputs across all transformer layers.

Stage 2 — Spectral Analysis: Decompose activation differences into individual refusal directions, each scored by how strongly they separate harmful from harmless representations.

Stage 3 — Entanglement Quantification: Measure the overlap between each refusal direction and the model's capability subspace (reasoning, knowledge, code, etc.) to avoid collateral damage.

Stage 4 — Targeted Ablation: Remove only the pure-refusal components, with strength proportional to their purity (how little they overlap with capability).

Stage 5 — Iterative Refinement: Re-probe after each ablation pass to catch hydra effects (dormant refusal features that activate when primary ones are removed).

Key differentiator from prior work: SABER explicitly measures and respects the entanglement between refusal and capability representations. Directions that are heavily entangled with capability are either skipped or ablated at reduced strength.
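
The ablation step in Stages 2 to 4 can be illustrated with a small, dependency-free sketch. It is not the SABER implementation: the card does not publish the code, so every function name here is hypothetical, a plain difference of means stands in for the spectral decomposition, and the purity formula (one minus the squared overlap with an orthonormal capability basis) is an assumption chosen only to show how strength can scale with purity.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean to the harmful mean.

    Assumption: a difference of means replaces SABER's spectral analysis.
    """
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return normalize([a - b for a, b in zip(mu_harmful, mu_harmless)])


def purity(direction, capability_basis):
    """Hypothetical purity score in [0, 1].

    capability_basis must be orthonormal; the score is 1 minus the squared
    norm of the direction's projection onto that subspace.
    """
    overlap = sum(dot(direction, c) ** 2 for c in capability_basis)
    return max(0.0, 1.0 - overlap)


def ablate(weight, direction, strength):
    """Return W - strength * d d^T W for a row-major matrix W.

    Each row of W writes to one output coordinate, so len(direction)
    must equal the number of rows.
    """
    coeffs = [dot(direction, col) for col in zip(*weight)]  # d^T W
    return [
        [w - strength * d_i * c for w, c in zip(row, coeffs)]
        for d_i, row in zip(direction, weight)
    ]


def saber_like_step(weight, direction, capability_basis, alpha_base=1.0):
    """Ablate one direction with strength alpha_base * purity."""
    return ablate(weight, direction, alpha_base * purity(direction, capability_basis))
```

With `alpha_base=1.0` and a direction that has no overlap with the capability basis, the projection is removed completely, so the edited matrix can no longer write along that direction. A heavily entangled direction gets a strength close to zero and is left almost untouched.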

Sweep Results

Configuration search over global_top_k (number of top directions selected globally) and alpha_base (base ablation strength):

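| Top-K | Alpha | Refusal | PPL | PPL Delta | Layers | Dirs Ablated |
|---|---|---|---|---|---|---|
| 25 | 0.85 | 5% | 3.5 | +0.4% | 25 | 125 |
| 25 | 1.00 | 0% | 3.5 | +0.6% | 25 | 125 |
| 50 | 0.85 | 0% | 3.5 | +0.8% | 36 | 250 |
| 50 | 1.00 | 0% | 3.5 | +0.7% | 36 | 250 |
| 75 | 0.85 | 0% | 3.5 | +0.9% | 37 | 375 |
| 75 | 1.00 | 0% | 3.5 | +0.9% | 37 | 375 |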

Best config: top_k=25, alpha=1.0. It reaches 0% refusal with a PPL delta of only +0.6% while ablating the fewest directions (125).
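
The choice of best config can be reproduced from the sweep table with a simple ordering rule. The rule below (lowest refusal first, then fewest directions ablated, then smallest PPL delta) is one reading of the sentence above, not a documented selection procedure, and the `SweepRow` type is introduced here only for illustration.

```python
from typing import NamedTuple


class SweepRow(NamedTuple):
    top_k: int
    alpha: float
    refusal_pct: float
    ppl_delta_pct: float
    layers: int
    dirs_ablated: int


# Values copied from the sweep table.
SWEEP = [
    SweepRow(25, 0.85, 5.0, 0.4, 25, 125),
    SweepRow(25, 1.00, 0.0, 0.6, 25, 125),
    SweepRow(50, 0.85, 0.0, 0.8, 36, 250),
    SweepRow(50, 1.00, 0.0, 0.7, 36, 250),
    SweepRow(75, 0.85, 0.0, 0.9, 37, 375),
    SweepRow(75, 1.00, 0.0, 0.9, 37, 375),
]


def pick_best(rows):
    """Assumed rule: minimize refusal, then directions ablated, then PPL delta."""
    return min(rows, key=lambda r: (r.refusal_pct, r.dirs_ablated, r.ppl_delta_pct))


best = pick_best(SWEEP)
print(best.top_k, best.alpha)  # 25 1.0
```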

Ablation Convergence (Best Config)

Capability degradation remains at 0.00% across all 5 iterations — the refusal directions are surgically removed with zero collateral damage.

RYS: Reasoning Layer Duplication

Method

RYS (Repeat Your Self) is a layer-duplication technique discovered by David Noel Ng that duplicates contiguous blocks of middle transformer layers so they execute twice per forward pass. No weights are modified — the model simply traverses some layers a second time, giving it "another pass" through its core reasoning circuit.

For a model with N layers, a configuration (i, j) produces:

Layers 0 through j−1 run normally
Then layers i through j−1 are re-executed (looped back)
Remaining layers j through N−1 run normally
Layers i through j−1 execute twice per inference pass
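
The execution order described above can be written out directly. This is a minimal sketch of the indexing only; the function name `rys_layer_order` is hypothetical and it does not touch any model file.

```python
def rys_layer_order(n_layers, i, j):
    """Return the source-layer index executed at each position for config (i, j).

    Layers 0..j-1 run once, layers i..j-1 run a second time, then layers
    j..n_layers-1 finish the pass.
    """
    if not 0 <= i < j <= n_layers:
        raise ValueError("expected 0 <= i < j <= n_layers")
    return list(range(0, j)) + list(range(i, j)) + list(range(j, n_layers))


order = rys_layer_order(64, 30, 34)
print(len(order))    # 68
print(order[28:40])  # [28, 29, 30, 31, 32, 33, 30, 31, 32, 33, 34, 35]
```

For a 64-layer model and config (30, 34), the pass grows from 64 to 68 layer executions, with layers 30 to 33 appearing twice in a row as a block.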

This exploits the functional neuroanatomy of transformers:

Early layers (0–5): Input encoding — duplication hurts
Middle layers (~10–50): Reasoning circuits in format-agnostic space — duplication helps
Late layers (~55–64): Output decoding — duplication degrades

Pareto-Optimal Configs for Qwen3.5-27B

Based on the full sweep of Qwen3.5-27B — 4,643 measured configurations, XGBoost surrogate over 430K+ candidates, and final validation on Math120 + EQ140 — the Pareto frontier lies in layers 26–34 of the reasoning circuit.

Important for GGUF/llama.cpp: Qwen3.5-27B is a hybrid Mamba/SSM + Attention architecture with a strict 4-layer repeating pattern (3 SSM + 1 ATTN). Layer duplication blocks must be a multiple of 4 layers to preserve this pattern, otherwise llama.cpp fails to load the model. The original Pareto configs from the blog (which used ExLlamaV3) have been adapted to the nearest valid 4-aligned configs:

| Variant | Config | Duplicated Layers | Extra Layers | Overhead | Nearest Pareto Config |
|---|---|---|---|---|---|
| S | (28,32) | 28–31 | +4 | +6.25% | ≈ (30,34) |
| M | (31,35) | 31–34 | +4 | +6.25% | ≈ (31,34) |
| L | (30,34) | 30–33 | +4 | +6.25% | ≈ (30,35) |
| XL | (26,34) | 26–33 | +8 | +12.50% | = (26,34) ✓ |

Critical finding: the (26,34) XL config is the only original Pareto point that is natively 4-aligned. The S/M/L variants use the nearest valid 4-layer blocks that cover the same reasoning region. The EQ delta barely moves across all sizes (+0.095 to +0.101), so even the smallest valid config delivers most of the benefit.
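
The 4-layer constraint and the overhead column can be checked with a short sketch. It assumes the loader infers each layer's type from its position in the stack (position modulo 4), which is a simplification for illustration; the function names are hypothetical and llama.cpp's real loading logic is not reproduced here.

```python
def rys_layer_order(n_layers, i, j):
    """Source-layer index at each position after duplicating layers i..j-1."""
    return list(range(0, j)) + list(range(i, j)) + list(range(j, n_layers))


def preserves_pattern(order, period=4):
    """True if every position holds a layer from the same slot of the repeating pattern.

    Assumption: layer type depends only on (index % period).
    """
    return all(src % period == pos % period for pos, src in enumerate(order))


def overhead_pct(n_layers, i, j):
    """Extra layer executions as a percentage of the base depth."""
    return 100.0 * (j - i) / n_layers


for name, (i, j) in {"S": (28, 32), "M": (31, 35), "L": (30, 34), "XL": (26, 34)}.items():
    ok = preserves_pattern(rys_layer_order(64, i, j))
    print(name, (i, j), ok, overhead_pct(64, i, j))
```

Under that assumption, all four shipped configs keep the pattern intact, while a 3-layer block such as (31,34) or a 5-layer block such as (30,35) shifts every later layer into the wrong slot.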

Reference: RYS Scores on Qwen3.5-27B

Probe scores from XpressAI/Qwen3.5-27B-RYS-UD-Q4_K_XL-GGUF (RYS-30-34 config, identical base architecture):

| Probe | Base (64 layers) | RYS 30-33 (68 layers) | RYS 34-37 (68 layers) |
|---|---|---|---|
| Math | 0.375 | 0.438 | 0.375 |
| EQ | 11.5 | 29.5 | 39.4 |
| Reasoning | 0.000 | 0.353 | 0.000 |
| Logic | 0.00 | 1.00 | 0.00 |

Reference: BFCLv4 Function Calling (RYS vs Baseline vs Frontier Models)

From the XpressAI RYS-30-34 evaluation on BFCLv4:

| Task | RYS-30-34 | Qwen3.5-27B Base | Δ |
|---|---|---|---|
| parallel | 95.00% | 93.00% | +2.00% |
| parallel_multiple | 91.50% | 76.00% | +15.50% |
| simple_javascript | 72.00% | 66.00% | +6.00% |
| live_relevance | 81.25% | 68.75% | +12.50% |
| multi_turn_base | 74.50% | 70.50% | +4.00% |
| multi_turn_long_context | 67.50% | 59.00% | +8.50% |

7 of 13 benchmarks improved, with large gains on parallel function calling and live relevance.

Available Variants

| File | RYS Config | Layers | Size |
|---|---|---|---|
| `Ornstein-27B-SABER-Q4_K_M.gguf` | — (SABER only) | 64 | 16.5 GB |
| `Ornstein-27B-SABER-RYS-S-Q4_K_M.gguf` | (28,32) | 68 | ~17.5 GB |
| `Ornstein-27B-SABER-RYS-M-Q4_K_M.gguf` | (31,35) | 68 | ~17.5 GB |
| `Ornstein-27B-SABER-RYS-L-Q4_K_M.gguf` | (30,34) | 68 | ~17.5 GB |
| `Ornstein-27B-SABER-RYS-XL-Q4_K_M.gguf` | (26,34) | 72 | ~18.6 GB |
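
The listed sizes are consistent with a rough proportional estimate. The sketch below scales the whole 64-layer file by the layer count, which ignores the embedding and output tensors that are not duplicated, so treat it as a back-of-the-envelope check rather than a size formula. The function name is hypothetical.

```python
def estimated_size_gb(base_size_gb, base_layers, total_layers):
    """Scale the base file size linearly with layer count (rough upper estimate)."""
    return base_size_gb * total_layers / base_layers


print(round(estimated_size_gb(16.5, 64, 68), 1))  # 17.5
print(round(estimated_size_gb(16.5, 64, 72), 1))  # 18.6
```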

Usage

# With llama.cpp (recommended: RYS-L for best balance)
./llama-server -m Ornstein-27B-SABER-RYS-L-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 --n-gpu-layers 99 \
  --ctx-size 131072 --flash-attn on --jinja \
  -ctk q4_0 -ctv q4_0

Recommended: Start with RYS-L (config (30,34), which duplicates layers 30–33) for the best balance of reasoning improvement and overhead. Use RYS-S if you're VRAM-constrained.
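
Once the server from the command above is running, it can be queried over its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server is reachable at `http://localhost:8080/v1`, matching the `--port 8080` flag. The helper names and the `model` value are illustrative; a single-model llama-server does not need a specific model name in the request.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080/v1",
                       model="Ornstein-27B-SABER-RYS-L-Q4_K_M.gguf"):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,  # illustrative label, see the note above
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with llama-server already running:
#   print(chat("What is the capital of France?"))
```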

Complementary Design

SABER and RYS target fundamentally different aspects of the model:

| | SABER | RYS |
|---|---|---|
| Target | Refusal circuit | Reasoning circuit |
| Mechanism | Direction ablation | Layer duplication |
| Modifies weights | Yes (orthogonal projections) | No (virtual copies) |
| VRAM cost | Negligible | Extra KV cache + compute |
| Effect | Removes refusals | Improves reasoning/EQ |
| Risk | Capability entanglement | Junction discontinuity |

Both are applied to the same base architecture (Qwen3.5-27B) and are architecturally compatible — SABER cleans the refusal subspace, RYS amplifies the reasoning subspace.

Capability Evaluation

Perplexity was evaluated on a diverse 100-prompt battery spanning five categories:

Arithmetic (20): multi-step calculation, algebra, word problems
Logic (20): syllogisms, conditional reasoning, puzzle solving
Code (20): function implementation, debugging, execution tracing
Instruction Following (20): constrained formatting, multi-step instructions
Factual Recall (20): geography, history, science, general knowledge

This diverse evaluation ensures the entanglement analysis captures capability across all reasoning modalities, not just a narrow slice.
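
The card does not state how perplexity was computed, so the sketch below uses the standard definition (the exponential of the mean negative token log-likelihood) as an assumption. It also shows why a PPL delta of +0.6% still prints as 3.5 at one decimal place. The function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def apply_delta(baseline, delta_pct):
    """Value obtained by shifting a baseline by delta_pct percent."""
    return baseline * (1.0 + delta_pct / 100.0)


# A model that assigns probability 0.5 to every token has perplexity 2.
print(perplexity([math.log(0.5)] * 4))

# A +0.6% shift on a baseline of 3.5 still rounds to 3.5.
print(round(apply_delta(3.5, 0.6), 1))  # 3.5
```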

Intended Use

This model is released for research purposes. It demonstrates that safety refusal can be surgically removed from a 27B multimodal model without degrading its capabilities, and that reasoning can be further enhanced through layer duplication — a finding with implications for both AI safety research and alignment.

Warning

⚠️ This model will comply with any request, including harmful ones. It is intended solely for research into alignment, safety, and model behavior.

Acknowledgments

The RYS (Repeat Your Self) layer-duplication method was discovered and developed by David Noel Ng (@dnhkng). The Pareto-optimal configurations for Qwen3.5-27B, the Math/EQ probes, the XGBoost surrogate pipeline, and the beam search methodology are all from his work. The GGUF surgery tools used to create these models are from alainnothere/llm-circuit-finder, an open-source (MIT) implementation of the RYS technique for llama.cpp.

If you use these models, please cite David Noel Ng's work:

Ng, David Noel. "LLM Neuroanatomy: How I Topped the Leaderboard Without Changing a Single Weight." dnhkng.github.io/posts/rys

Ng, David Noel. "LLM Neuroanatomy II: Modern LLM Hacking and Hints of a Universal Language." dnhkng.github.io/posts/rys-ii

The SABER refusal-ablation method is original to this model.