Open-Weight Models

We publish Guard models on HuggingFace that you can run on your own infrastructure. No API calls, no data leaving your environment. 50-100ms latency on CPU or GPU.

Available Models

Model	Parameters	Format	Use Case
superagent-guard-0.6b	0.6B	Safetensors	Fast inference, edge deployment
superagent-guard-1.7b	1.7B	Safetensors	Balanced speed and accuracy
superagent-guard-4b	4B	Safetensors	Maximum accuracy

GGUF (CPU Inference)

Quantized versions optimized for CPU inference with llama.cpp:

Model	Parameters	Download
superagent-guard-0.6b-gguf	0.6B	HuggingFace
superagent-guard-1.7b-gguf	1.7B	HuggingFace
superagent-guard-4b-gguf	4B	HuggingFace

When to Self-Host

Self-host the models when you need:

Data residency: Keep all data on your infrastructure
Air-gapped environments: No external API calls
Cost control: No per-request pricing at scale
Latency guarantees: Predictable inference times

Getting Started

Download a model from HuggingFace and run it with your preferred inference framework (vLLM, llama.cpp, Transformers, etc.).

# Clone a model
git lfs install
git clone https://huggingface.co/superagent-ai/superagent-guard-1.7b

# Or download GGUF for CPU
huggingface-cli download superagent-ai/superagent-guard-1.7b-gguf

Browse all models: huggingface.co/superagent-ai

Open-Weight Models

Open-Weight Models

Available Models

GGUF (CPU Inference)

When to Self-Host

Getting Started

On this page