Inference
The process by which a trained AI model generates a response or prediction from input data.
Why it matters
Inference is where the cost happens. Understanding it helps you optimize: batch queries, cache repeated results, and route each task to the cheapest model that can handle it.
In practice
We route inference strategically: FAQ matching avoids it entirely, Ollama handles simple tasks locally (free), Claude API is reserved for complex reasoning.
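The routing described above can be sketched as a small tiered dispatcher. The FAQ store, the fuzzy-match threshold, and the word-count complexity heuristic are all illustrative assumptions, not the actual production logic:

```python
from difflib import SequenceMatcher

# Hypothetical FAQ store; real systems might use embedding
# similarity rather than fuzzy string matching.
FAQ = {
    "what are your hours": "We are open 9am-5pm, Monday to Friday.",
}

def route(query: str) -> tuple[str, str]:
    """Return (tier, answer-or-placeholder) for a query.
    Tiers: 'faq' (no inference), 'local' (e.g. Ollama), 'api' (e.g. Claude)."""
    q = query.lower().strip("?! .")
    # Tier 1: near-exact FAQ match avoids inference entirely.
    for question, answer in FAQ.items():
        if SequenceMatcher(None, q, question).ratio() > 0.9:
            return ("faq", answer)
    # Tier 2: crude complexity heuristic -- short queries stay local.
    if len(query.split()) <= 12:
        return ("local", "handled by local model")
    # Tier 3: everything else goes to the paid API.
    return ("api", "handled by hosted API")
```

The key design point is that the cheapest tier is tried first, and each tier only sees traffic the tier above it could not absorb.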
Related terms
LLM (Large Language Model)
An AI model trained on vast amounts of text, such as Claude, GPT, or Gemini. The "brain" that understands and generates language.
Latency
The time it takes for an agent to respond or act.
Cost Tracking
Monitoring every AI call: model used, tokens in/out, cost, cache status. Essential for profitable operations.
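A tracking record like this can be sketched in a few lines. The per-million-token prices below are illustrative placeholders, not current rates, and the model names are assumptions for the example:

```python
from dataclasses import dataclass

# Illustrative prices per million tokens -- real rates vary by model
# and change over time; check the provider's pricing page.
PRICE_PER_MTOK = {
    "claude": {"in": 3.00, "out": 15.00},
    "ollama-local": {"in": 0.0, "out": 0.0},
}

@dataclass
class CallRecord:
    model: str
    tokens_in: int
    tokens_out: int
    cache_hit: bool

    @property
    def cost(self) -> float:
        if self.cache_hit:  # cached responses cost nothing
            return 0.0
        p = PRICE_PER_MTOK[self.model]
        return (self.tokens_in * p["in"] + self.tokens_out * p["out"]) / 1_000_000

log: list[CallRecord] = []

def track(record: CallRecord) -> None:
    log.append(record)

def total_cost() -> float:
    return sum(r.cost for r in log)
```

Logging cache status alongside cost is what makes the earlier optimizations measurable: you can see exactly how much each cache hit or local routing decision saved.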
Ollama
A tool for running AI models locally. Free, private, fast.