Quantization
Reducing a model's numerical precision to decrease size, cost, and inference time.
Why it matters
Quantization makes large models runnable on smaller hardware: a 70B-parameter model needs roughly 140 GB at 16-bit precision (2 bytes per weight) but only about 35 GB at 4-bit, bringing it within reach of consumer GPUs.
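The size reduction comes from storing each weight in fewer bits and keeping a scale factor to map back to floats. A minimal sketch of per-tensor symmetric int8 quantization (illustrative only; production 4-bit formats such as those used by llama.cpp/GGUF use per-block scales and more elaborate schemes):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus one per-tensor scale factor."""
    scale = np.abs(weights).max() / 127.0  # largest magnitude maps to +/-127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# int8 storage is 4x smaller than float32; per-weight error is at most scale/2
print(np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6)  # → True
```

The trade-off: each stored weight loses at most half a quantization step of precision, which in practice costs a small amount of model quality in exchange for a 4x (int8) or 8x (4-bit) memory reduction versus float32.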
In practice
The llama3.2 model via Ollama uses quantization to fit on our Hetzner server without a dedicated GPU.
Related terms
Ollama
A tool for running AI models locally. Free, private, fast.
Local Model
An AI model running on your own server (e.g., via Ollama), keeping data private and costs near zero.
Inference
The process of an AI model generating a response or prediction from input data.
Knowledge Distillation
Compressing a larger model's behavior into a smaller model to reduce cost and latency.
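The core mechanism behind distillation can be illustrated with Hinton-style soft targets: the student is trained to match the teacher's temperature-softened output distribution via a KL-divergence loss. A minimal sketch (function names are illustrative, not from any specific library):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax; higher T flattens the distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# A student that exactly matches the teacher incurs zero loss
print(distill_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # → 0.0
```

Minimizing this loss transfers the teacher's "dark knowledge" (relative probabilities of wrong answers) to the smaller student, which is what lets the student approximate the teacher's behavior at a fraction of the inference cost.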