Quantization
Reducing a model's numerical precision to decrease size, cost, and inference time.
Why it matters
Quantization makes large models runnable on smaller hardware: a 70B-parameter model needs roughly 140 GB at 16-bit precision (2 bytes per weight) but only about 35 GB at 4-bit, bringing it within reach of consumer GPUs.
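The size reduction comes from storing each weight in fewer bits and keeping a scale factor to map back to floats. A minimal sketch of per-tensor symmetric int8 quantization (illustrative only; production 4-bit formats such as those used by llama.cpp/GGUF use per-block scales and more elaborate schemes):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights to int8 plus one per-tensor scale factor."""
    scale = np.abs(weights).max() / 127.0  # largest magnitude maps to +/-127
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# int8 storage is 4x smaller than float32; per-weight error is at most scale/2
print(np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6)  # → True
```

The trade-off: each stored weight loses at most half a quantization step of precision, which in practice costs a small amount of model quality in exchange for a 4x (int8) or 8x (4-bit) memory reduction versus float32.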
In practice
The llama3.2 model via Ollama uses quantization to fit on our Hetzner server without a dedicated GPU.
Related terms
Ollama
A tool for running AI models locally. Free, private, fast.
Local Model
An AI model running on your own server (e.g., via Ollama), keeping data private and costs near zero.
Inference
The process of an AI model generating a response or prediction from input data.
Knowledge Distillation
Compressing a larger model's behavior into a smaller model to reduce cost and latency.
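The core mechanism behind distillation can be illustrated with Hinton-style soft targets: the student is trained to match the teacher's temperature-softened output distribution via a KL-divergence loss. A minimal sketch (function names are illustrative, not from any specific library):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax; higher T flattens the distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))))

# A student that exactly matches the teacher incurs zero loss
print(distill_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # → 0.0
```

Minimizing this loss transfers the teacher's "dark knowledge" (relative probabilities of wrong answers) to the smaller student, which is what lets the student approximate the teacher's behavior at a fraction of the inference cost.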