Automation & Dev Stack

Building Production AI Agents with External Plugins and Skills

By Leap Laboratory · 5 min read


This article argues that production-grade agents require moving beyond single-prompt chains: structured, callable external skills and plugins are what let an agent manage state, interact with proprietary systems, and execute multi-step logic reliably. For a Finnish SMB that needs an agent to process complex, multi-source data (say, reconciling invoices from a local accounting system against emails), relying solely on the LLM's context window fails. You need an explicit Agentic Workflow that routes tasks to specialized tools. This architecture moves the complexity out of the prompt and into verifiable, testable code modules.

The Architecture of Tool Use: Beyond Function Calling

When you start building agents, the initial hurdle is often getting the model to *know* it can use a tool. Modern frontier models handle this via function calling or tool definitions. This is the entry point. However, function calling only solves the *intent* problem: "I need to check inventory." It does not solve the *execution* problem: "How do I securely connect to the local ERP endpoint, handle OAuth, and parse the resulting JSON?"

In practice, you build an abstraction layer. This layer interprets the model's request, selects the correct underlying tool, executes the code (e.g., calling an internal API or running a Python script), and then feeds the *result* back to the LLM for final synthesis. This pattern is crucial for creating a true Agentic System. Think of it as the agent's nervous system. If the tool fails—the API times out, or the data format is unexpected—the agent must fail gracefully, reporting the specific error back to the planning loop, not just hallucinating a successful outcome. Understanding the difference between model-level tool calling and a robust, stateful execution framework is the biggest operational gap for new builders.
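
As a rough illustration, here is a minimal sketch of that abstraction layer in Python. The registry, the check_inventory stand-in, and the status/error result shape are assumptions for the example, not a prescribed interface.

```python
import json

# Hypothetical tool registry: maps the names the model may request to callables.
# check_inventory is a stand-in for a real ERP lookup.
TOOLS = {
    "check_inventory": lambda sku: {"sku": sku, "on_hand": 42},
}

def execute_tool_call(tool_name: str, arguments: dict) -> dict:
    """Run the requested tool and return a structured result for the LLM.

    Failures are reported explicitly so the planning loop sees the real
    outcome instead of a hallucinated success."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        return {"status": "error", "error": f"unknown tool: {tool_name}"}
    try:
        return {"status": "ok", "result": tool(**arguments)}
    except Exception as exc:  # timeouts, auth failures, unexpected payloads, ...
        return {"status": "error", "error": str(exc)}

# The orchestrator serialises this dict and appends it to the conversation
# as a tool message before asking the model to synthesise a final answer.
print(json.dumps(execute_tool_call("check_inventory", {"sku": "A-100"})))
```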

Implementing Skills: From Concept to Production Code

A "skill" in this context is not just a function; it’s a self-contained, tested unit of business capability. For a Finnish payroll SaaS, skills might include calculate_vat_liability(amount, region) or fetch_employee_details(employee_id).

When designing these, focus on inputs and outputs. The model only sees the schema definition—the expected inputs and the guaranteed output structure. This forces discipline. We often see teams struggle here: they write a skill that works perfectly in a Jupyter notebook but fails in the production execution environment because of missing environment variables or credentials.
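
To make the schema point concrete, here is one hedged sketch of the `calculate_vat_liability` skill as a typed input/output contract. The dataclass fields and the rate table are illustrative placeholders; in production the rates would live in configuration, not code.

```python
from dataclasses import dataclass

# The model sees only this contract: the expected inputs and the guaranteed output.
@dataclass
class VatRequest:
    amount: float  # net amount in EUR
    region: str    # e.g. "FI"

@dataclass
class VatResult:
    vat: float
    rate: float

# Placeholder rate table for the example; real rates belong in configuration.
VAT_RATES = {"FI": 0.255}

def calculate_vat_liability(req: VatRequest) -> VatResult:
    """Pure business logic: no credentials, no environment lookups."""
    rate = VAT_RATES[req.region]
    return VatResult(vat=round(req.amount * rate, 2), rate=rate)

print(calculate_vat_liability(VatRequest(amount=1000.0, region="FI")))
```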

To manage this, you need a clear separation of concerns. The orchestration layer (the framework managing the calls) handles credentials and state persistence, while the skill module handles the pure business logic. Frameworks like LangGraph are excellent for modeling these state transitions, allowing you to define explicit paths for success, failure, and retry. For complex, multi-step processes, modeling this as an explicit Agent Team where one agent coordinates calls to specialized tool-using agents is often cleaner than one monolithic agent.
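
As a sketch of what those explicit paths can look like, the following LangGraph-style graph wires a single fetch node to success, retry, and failure routes. The state fields, node name, stand-in lookup, and retry limit are assumptions for illustration, not a recommended production layout.

```python
from typing import Optional, TypedDict
from langgraph.graph import StateGraph, END

class SkillState(TypedDict):
    employee_id: str
    result: Optional[dict]
    error: Optional[str]
    attempts: int

def fetch_employee_details(state: SkillState) -> SkillState:
    try:
        # Stand-in for the real lookup; credentials live in the orchestration layer.
        details = {"id": state["employee_id"], "name": "..."}
        return {**state, "result": details, "error": None}
    except Exception as exc:
        return {**state, "error": str(exc), "attempts": state["attempts"] + 1}

def route(state: SkillState) -> str:
    if state["error"] is None:
        return "success"
    return "retry" if state["attempts"] < 3 else "failure"

graph = StateGraph(SkillState)
graph.add_node("fetch", fetch_employee_details)
graph.set_entry_point("fetch")
graph.add_conditional_edges("fetch", route, {"success": END, "retry": "fetch", "failure": END})
app = graph.compile()
print(app.invoke({"employee_id": "E-123", "result": None, "error": None, "attempts": 0}))
```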

State Management and Reliability in Agent Workflows

The biggest failure point in any production agent is state management. An agent's "memory" is not the LLM's context window; it is the persistent state store you manage. If an agent needs to iterate—for example, checking five different customer accounts for an issue—it must save the results of the first four checks and pass that structured data to the fifth check, without losing context or re-running initial steps.

This requires integrating external, reliable data stores. You need a system to track which steps ran, what the inputs were, and what the outputs were. A combination of a relational database like PostgreSQL for structured records and a fast key-value store like Redis for transient session state is common. The ability to inspect this history—the Audit Trail—is non-negotiable for debugging and compliance. When building these systems, treat the state store as the source of truth, not the LLM's output.
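
A minimal sketch of such an audit trail is shown below, using SQLite from the standard library as a stand-in for PostgreSQL so it runs without a server; the table name and columns are assumptions for the example.

```python
import json
import sqlite3
from datetime import datetime, timezone

# SQLite stands in for PostgreSQL here so the sketch runs anywhere.
conn = sqlite3.connect("agent_audit.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS agent_steps (
        run_id     TEXT NOT NULL,
        step       INTEGER NOT NULL,
        tool_name  TEXT NOT NULL,
        inputs     TEXT NOT NULL,   -- JSON
        outputs    TEXT NOT NULL,   -- JSON
        created_at TEXT NOT NULL
    )
""")

def record_step(run_id: str, step: int, tool_name: str, inputs: dict, outputs: dict) -> None:
    """Append one tool invocation to the audit trail; this store, not the LLM, is the source of truth."""
    conn.execute(
        "INSERT INTO agent_steps VALUES (?, ?, ?, ?, ?, ?)",
        (run_id, step, tool_name, json.dumps(inputs), json.dumps(outputs),
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

record_step("run-001", 1, "fetch_employee_details", {"employee_id": "E-123"}, {"status": "ok"})
```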

Evaluating Local vs. Cloud Skills Execution

A common decision point is where the skill executes. Should it run via an API call to a hosted service, or inside your own local environment?

If the skill requires access to highly sensitive, proprietary data (e.g., internal HR records), running the execution logic locally via a capable local model or a local runtime environment is necessary. This keeps the data boundary closed. However, local inference adds latency and complexity to the deployment pipeline.

Conversely, if the skill is purely mathematical or involves calling a well-documented, public API (like fetching weather data), using a hosted service via its dedicated SDK is simpler and faster to deploy. The tradeoff is always latency vs. data sovereignty. For European SMBs dealing with GDPR, the data sovereignty argument often outweighs the minor latency gain of a cloud call.
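
One way to encode that decision is a small dispatcher that routes skills marked as sensitive to a local runner and everything else to a hosted one. This is a sketch under assumed metadata: the sensitive flag and both runner functions are placeholders for your actual execution paths.

```python
from typing import Callable, Dict

# Illustrative skill metadata: which skills touch sensitive data and must stay local.
SKILLS: Dict[str, dict] = {
    "fetch_employee_details": {"sensitive": True},
    "get_weather": {"sensitive": False},
}

def run_locally(skill: str, args: dict) -> dict:
    # Stand-in for execution inside your own data boundary (local model or runtime).
    return {"executed": "local", "skill": skill, "args": args}

def run_hosted(skill: str, args: dict) -> dict:
    # Stand-in for a call through a hosted provider's SDK.
    return {"executed": "hosted", "skill": skill, "args": args}

def dispatch(skill: str, args: dict) -> dict:
    """Route sensitive skills to the local path, everything else to the hosted path."""
    runner: Callable[[str, dict], dict] = run_locally if SKILLS[skill]["sensitive"] else run_hosted
    return runner(skill, args)

print(dispatch("fetch_employee_details", {"employee_id": "E-123"}))
print(dispatch("get_weather", {"city": "Helsinki"}))
```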

FAQ

Q: What is the minimum number of tools I need to build a basic agent? A: You need at least one tool that interacts with external state or data. A simple "calculator" skill is often enough to prove the loop, but for real value, aim for a tool that hits a non-trivial endpoint, like querying a specific internal database table.

Q: How do I handle sequential, multi-step reasoning that requires external data? A: You must implement a structured "Tool Calling" or "Function Calling" pattern. The LLM doesn't *know* the data; it outputs a structured JSON request telling your orchestrator *which* tool to run and *what* arguments to use. Your code runs the tool, gets the result, and feeds that result back to the LLM for the final answer synthesis.
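
Illustratively, the structured request and the result you feed back might look like the following; the field names and the fetch_invoice tool are a sketch and vary by provider and framework.

```python
# Illustrative shape only; field names vary by provider and framework.
tool_call = {
    "tool": "fetch_invoice",                      # which tool the orchestrator should run
    "arguments": {"invoice_id": "FI-2024-0042"},  # what arguments to pass
}

# After your code executes the tool, the result goes back to the LLM as context:
tool_result = {"tool": "fetch_invoice", "result": {"status": "paid", "amount_eur": 1230.0}}
```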

Q: What is the difference between using a vector database and a traditional API call? A: A traditional API call is for structured, known endpoints (e.g., "get customer by ID"). A vector database (RAG) is for unstructured knowledge retrieval—it finds the *most semantically relevant* documents from a large corpus, which you then feed into the LLM as context.

This article was produced by Leap Laboratory’s AI-assisted content pipeline from curated industry RSS sources. Content was reviewed for accuracy and quality before publication.