You can read on any given day that AI is unreliable slop, unfit for serious work. You can also pull up Klarna’s, Anthropic’s or Cursor’s published production metrics from the last two years and see something very different. Both pictures are real. The distance between them is mostly engineering, not model choice.
The disappointment usually starts with treating a chat box like a system. You type a question into ChatGPT or Claude, get a thin answer, conclude the technology is not ready. The teams shipping AI work that actually clears queues, processes claims or moves money are not using a different model. They built around it: clear objectives, scoped tools, evaluation harnesses, approval gates, audit logs. The agent is the whole workflow, not the chat surface.
By mid-2026 the gap between organisations with AI in production and organisations with AI in slide decks is mostly explained by this. Models keep improving roughly every quarter. Process discipline is what compounds.
What Are AI Agents and How Do They Differ from Basic Chatbots

An AI agent is software that observes its environment, decides what to do, then does it, without waiting for a human to drive each step. Inputs are anything you can pipe in: emails, system alerts, database changes, API responses, document uploads. The autonomy is deliberately bounded; agents act inside parameters someone set on purpose.
The line between agent and chatbot becomes obvious when you look at scope. A chatbot processes one input and returns one response inside a single conversation. An agent holds memory across sessions, integrates with the systems behind the conversation (CRM, ERP, ticketing, payments) and runs multi-step workflows. When Klarna said in February 2024 that its AI was doing the work of 700 full-time agents (source), that was not a chatbot. It authenticated customers, read order histories, applied refund policies, processed payments and escalated edge cases.
This is where the common misconception sits: ChatGPT is not an agent. The chat surface is a conversational tool on top of a model. When that model gets tools, memory and an orchestration layer (OpenAI’s Responses API (source), Anthropic’s Agent SDK, LangGraph, Semantic Kernel), it can participate inside an agent system. The agent is the wiring around it. Not the model, not the chat surface; the system.
Stanford’s annual AI Index (source) has shown for years now that task-specific performance varies far more by retrieval quality, tool design and feedback loops than by which top model you pick. A well-scoped open-source setup regularly beats an expensive proprietary call dropped behind a thin prompt.
How AI Agents Work: Architecture and Core Components

Modern agents run a perceive, decide, act loop borrowed from robotics. Perception pulls events from APIs, webhooks, file uploads, inboxes or sensor feeds and normalises them into structured records. Agents also actively query: REST endpoints, operational databases, knowledge bases.
Event-driven architectures (Kafka, NATS, Pub/Sub) let agents process streams without blocking on individual requests. Most production setups also preprocess at this layer: schema validation, PII redaction, deduplication, contextual enrichment. Vector stores (pgvector, Weaviate, LanceDB, Pinecone) underpin retrieval-augmented generation so the agent’s decisions sit on current documents instead of stale training data.
The decision layer is where a model plans a sequence of actions and selects the tools to run them. Patterns like ReAct, Tree-of-Thought and Reflexion give the agent a structured way to reason rather than blurt out the first thing it thinks of. Policy layers constrain what it can actually do: JSON schemas on tool arguments, Pydantic validators, explicit allow and deny lists. Probabilistic reasoning sits inside deterministic guardrails.
Action execution happens through tool interfaces wired up via function calling, REST or gRPC. The patterns that hold under load are the boring ones. Idempotency keys so a retried request does not double-charge a customer. Exponential backoff. Circuit breakers around fragile dependencies. Secrets in KMS or Vault. OAuth tokens short-lived and scoped. IAM least-privilege so a compromised agent has limited reach.
Memory is split between short-term context inside the model’s window and long-term storage of episodic and semantic facts. Rolling summaries, memory compaction and task-specific retrieval keep the prompt small while preserving the history that matters. Learning happens through offline fine-tuning on successful runs, preference optimisation against human feedback and online adaptation through evaluators.
Essential Tools and Frameworks for Agent Development
The toolchain shifts faster than any list keeps up with, but as of mid-2026 the building blocks settle into a few categories. Orchestration frameworks: LangGraph, Semantic Kernel, LlamaIndex, CrewAI. Managed agent runtimes: OpenAI’s Responses API (which replaced the older Assistants surface in 2025), Anthropic’s Agent SDK. Durable workflow engines: Temporal, Inngest, Vercel Workflow, Airflow.
Vector retrieval needs tuning to be useful. Chunk size, embedding model, similarity threshold and reranking strategy all push results around. FAISS or Chroma are fine for prototyping. Production usually moves to a managed service so backups, scale and security updates are someone else’s problem.
Evaluation tooling has caught up with the rest of the stack. LangSmith, Braintrust, Promptfoo, Ragas and TruLens give structured ways to run regression tests against golden datasets, score retrieval accuracy and probe for prompt injection. Once eval is wired up, agent changes can ride the same CI pipeline as any other code change.
The Role of Human Oversight in Agent Architecture
Reliable agents are built with humans in the loop, not humans cut out of it. Approval gates make sense for high-impact actions: financial transactions over a threshold, configuration changes on shared infrastructure, customer communications that depart from the template. Calibrate where the gate sits so it adds control without grinding the workflow to a halt.
Monitoring is what tells you the agent is still doing what you signed off on. Token usage, tool call latency, error rate, approval rate and the business metric the agent is meant to move all belong on a dashboard a human reads. Alerts catch the moments when the agent hits something novel or burns through its error budget.
When agents make mistakes, post-incident review is what turns those mistakes into improvements. Pull the prompt transcript, the tool call sequence, the inputs that led to the bad call. Updates can be a prompt edit, a new tool constraint, an extra evaluator, a tighter escalation rule. The same failure should not happen twice.
Types of AI Agents and Their Specific Capabilities

Reactive agents map the current input straight to an action, no internal model, no memory. They earn their keep when the response needs to be fast and predictable: paging on-call when a metric crosses a threshold, routing tickets by keyword, kicking off a build on commit. They struggle the moment the problem requires holding state.
Model-based agents keep an internal picture of the environment and update it as new information lands. A monitoring agent might keep a running model of normal versus anomalous behaviour, blending logs, metrics and alerts into something with situational awareness. The risk is the agent’s model and the actual system drifting apart.
Goal-based agents work backwards from a desired outcome and plan a path to get there. A procurement agent told to minimise lead time within a budget can research suppliers, request quotes, negotiate terms and place orders as one connected workflow. Planning adds latency, so the pattern fits problems where getting the right answer matters more than getting an instant one.
Utility-based agents balance multiple objectives through an explicit utility function. A bidding agent might maximise expected conversions while keeping CPC below a ceiling and respecting frequency caps. The interesting failure mode is metric gaming: optimising the number you can measure at the expense of the outcome you actually care about. Counter-metrics catch this.
Learning agents adjust behaviour from feedback. Coding agents learn from which patches passed tests; support agents learn from CSAT scores. The work is in the guardrails: data governance for what goes into training, drift detection for when learned behaviour diverges from policy, and the ability to roll back when the learning makes things worse.
Multi-agent systems coordinate specialists through a shared protocol. Microsoft’s AutoGen, CrewAI and OpenAI’s Swarm all provide shapes for this: defined roles, structured handoffs, a supervisor that decides when one agent is done. The hazards are deadlocks, runaway loops and chatty agents that burn tokens without producing anything.
AI Agents vs Traditional AI, RPA, and Classic ML Pipelines

Comparing agents to the automation tools already in your stack is the cleanest way to see what is genuinely new and what is repackaging.
Traditional ML pipelines run in batch on structured inputs and produce predictions or classifications. They do not act. A fraud model flags suspicious transactions; an agent flags them, opens the account, checks recent activity and freezes the card pending review. ML is a component an agent can call. It is not the agent.
RPA scripts deterministic clicks against deterministic UIs. It works where the inputs are stable and breaks the moment the form changes or a field moves. An agent can read varied PDF layouts, ask a clarifying question when data is ambiguous and make a judgement call on edge cases. RPA is for the parts of a process that do not change. Agents are for the parts that do.
Classic chatbots respond to messages within a session. They do not hold goals across interactions or execute downstream actions. A chatbot answers “where is my order?”. An agent notices the order is delayed, rebooks the shipment, applies the policy-allowed discount and sends the customer the new ETA before they have to ask.
Agents combine language understanding, persistent memory, tool execution and goal-oriented planning. They run multi-step workflows toward a defined outcome with audit trails and human oversight along the way. That combination is what lets them handle work that previously required a human at the keyboard.
AI Agents Examples: Enterprise Use Cases and Real-World Implementations

Customer service is the most documented application, and Klarna is the most documented example. The February 2024 announcement: an AI assistant doing the work equivalent of 700 full-time agents, two-thirds of chats handled, resolution time down 25% (source). By May 2025 the company walked part of that back. CEO Sebastian Siemiatkowski told Bloomberg the cost-cutting had gone too far and that human staff were being brought back for quality reasons (source). Both halves matter. The agent absorbed the volume. Complex, emotional and high-stakes cases are still where humans win. Treating an agent rollout as a headcount programme rather than a workflow programme is itself a case study now.
Financial markets use agents for monitoring, analysis and execution under heavy risk controls. Firms blend deterministic algorithms with model components for data extraction and scenario assessment. The UK FCA’s Market Watch, NY DFS bulletins and the EU AI Act all push the same operational shape: kill switches, pre-trade risk checks, simulation environments, audit trails. Objective design has to actively avoid encouraging excessive risk-taking, because a poorly framed reward function is how an autonomous trader gets you into the news.
Supply chain agents coordinate forecasting, inventory and logistics across systems. McKinsey’s operations research consistently puts inventory reductions at 10–20% and service-level improvements at 5–10% when AI planning replaces manual processes. Agents make those numbers operational: allocating stock against live demand, rebooking shipments after delays, notifying customers before they ask. The dependency is clean data pipelines, which is a data engineering problem more than an agent one.
Software engineering agents have moved from demo to default in roughly two years. GitHub’s 2022 Copilot studies reported up to 55% faster task completion (source). The more interesting numbers since 2025 come from full agentic coding tools (Cursor, Claude Code, Devin, Codex) measuring task throughput rather than autocomplete speed. The gains compound when agents understand the repository, follow the project’s conventions and integrate with CI rather than dropping isolated suggestions.
Cybersecurity agents detect threats, correlate alerts and execute response playbooks faster than a human team. MITRE Caldera provides automated adversary emulation. DARPA’s AI Cyber Challenge concluded at DEF CON 33 in August 2025 with autonomous systems doing end-to-end vulnerability discovery, patching and verification on real codebases (source). The promise is faster mean time to response. The risk is autonomous responses taking down systems that were not actually compromised, which is why approval gates and runbooks are non-negotiable here.
Where AI Agents Are Spreading Beyond Customer Service
Insurance claims processing is one of the more mature non-customer-service deployments. Agents validate documentation, cross-reference policy terms, run fraud signals and calculate settlements, with humans on high-value or unusual cases. Published work from Lemonade, Zurich and Tractable points in the same direction: claims cycle time down by a third or more, fraud detection improved, junior adjuster time freed up for the cases that need judgement.
Healthcare scheduling is a quieter but useful category. Agents handle bookings, cancellations and rescheduling against provider utilisation and patient wait times. NHS trust pilots have reported sizeable drops in no-show rates and gains in slot utilisation through proactive reminders and automated rebooking. The integration burden is real (EHR systems, travel times, patient preferences, multi-provider constraints), but the return shows up.
Manufacturing quality inspection combines computer vision with contextual reasoning. Automotive plants running agentic inspection report meaningful drops in defect escape rates, mostly because the agent correlates inspection data with process and supplier metrics that no human inspector would join up by hand.
Legal document review is the canonical “agent reads a lot of documents fast” use case. Law firms running these systems consistently report substantial reductions in initial review time with better consistency on flagging clauses, risks and anomalies. The agent does the routine extraction. Senior associates do the judgement.
Returns vary with implementation quality and how honestly the team measures. Klarna’s 2024 and 2025 metrics sit at one end of the range; McKinsey’s productivity research at the other. The numbers worth tracking internally are first-contact resolution, backlog burn-down, mean time to recovery and unit cost per transaction, all tied back to the business outcome the agent was meant to move.
Why AI Agents Fail: The Critical Role of Process Design and Quality Assurance

Public AI failures almost always trace back to process gaps, not model limits. The 2024 Air Canada chatbot ruling, where a Canadian tribunal held the airline liable for misleading bereavement fare advice generated by its bot (source), is the textbook non-autonomous example. The 2025 round of agentic incidents (Replit’s autonomous coding agent that wiped a customer’s database, the Cursor support bot fabricating policy that did not exist, multiple legal filings citing AI-hallucinated cases) all share the same pattern: tools the agent should not have been able to call, prompts with too much room for interpretation, approval gates that were either missing or set in the wrong places.
The other recurring failure is fuzzy objectives. Agents asked to “improve customer experience” or “boost productivity” deliver something, just not always the thing the business wanted. Implementations that work translate intent into measurable KPIs early: cut average handling time by 20% while keeping CSAT above 85, halve mean time to recovery without raising change failure rate. Counter-metrics catch the agent that hits the headline number by gaming the behaviour underneath.
Common Implementation Failures and Their Root Causes
Ambiguous prompts produce agents that operate outside the intent of the people who deployed them. “Resolve customer issues quickly” without bounds turns into oversized refunds and promises the company cannot keep. The fix is concrete constraints in the prompt: maximum refund threshold, approved resolution options, explicit escalation criteria.
Cost optimisation that targets the wrong layer is a quieter failure. Teams that aggressively trim per-call costs (cheap models, short context, low retrieval limits) often find the agent’s reasoning has collapsed and downstream costs (rework, escalations, lost customers) have grown. The right cost target is the workflow outcome, not the API call.
Change management is where most agent rollouts actually die. Insufficient training, processes left untouched, trust eroded by early visible mistakes. The HBR and MIT Sloan literature on the previous wave of digital transformation already wrote this script: capability building and role redesign are prerequisites. Treating an agent rollout as a technology project rather than organisational change is the most expensive way to do it.
The QA practices that prevent these failures are not new: pre-production testing with realistic edge cases, canary deployments with automatic rollback, monitoring with humans on call for high-impact decisions. The NIST AI Risk Management Framework (source) provides a structure for mapping risks to controls. LangSmith, Braintrust and Promptfoo handle the regression testing.
How to Build AI Agents: Essential Skills and Implementation Process
Prompt engineering is closer to systems writing than to creative writing. Effective prompts include explicit constraints, step-by-step policies, negative examples for disallowed behaviour and structured output (JSON schema or similar). They include the rationale for tool selection so an auditor can read why the agent did what it did, and they include self-checking steps so the agent catches its own contradictions before they reach the world. Prompt versions live in source control alongside the code.
Domain expertise is the bottleneck more often than model expertise. Agents have to embody business rules, regulatory constraints and process nuances that are not in the training data. Claims rules, KYC checks, change windows, refund policies, compliance carve-outs. Subject matter experts have to sit inside the team during build, capturing tacit knowledge as test scenarios and signing off before the agent touches a regulated workflow.
The technical foundations are conventional engineering: API design and integration, event-driven architecture, durable workflow orchestration (Temporal, Inngest, Vercel Workflow, Airflow), vector stores, secrets management, IAM, idempotent tool design with proper retry semantics. The LLM-specific layer adds tool definitions, function calling, evaluation harnesses and observability for tokens and latency.
How to Create AI Agents: A Step-by-Step Reference Implementation

An order refund agent is a good worked example because every component shows up: ingestion, retrieval, planning, tool execution, approvals, audit, testing.
Step 1: Define the ingestion layer
A webhook receives refund requests from the customer service portal. Schema validation confirms the required fields are present (order ID, reason code, customer auth token). Rate limiting prevents abuse. A queue absorbs traffic spikes so the downstream pipeline never sees them.
Step 2: Set up retrieval-augmented generation
A vector store holds refund policies, precedent decisions and product-specific rules. Semantic search retrieves relevant policy sections by reason code. A hybrid search layer combines exact-match on order IDs and SKUs with semantic matching, so context is accurate, not just relevant.
Step 3: Design the planning and decision layer
A LangGraph or Semantic Kernel agent gets custom tools for order lookups, refund calculation and approval routing. Prompts include explicit policy constraints, worked examples for edge cases and a structured reasoning format that produces auditable decision traces.
Step 4: Implement tool execution
Tools wrap the order system, payments and notifications. Each tool is idempotent, so a retried call does not refund the customer twice. Errors surface as structured responses the agent can reason about, not as opaque exceptions.
Step 5: Configure approval gates
A durable workflow engine (Temporal, Inngest or Vercel Workflow) orchestrates the refund. Standard cases under £500 that meet policy auto-approve. Larger amounts or unusual circumstances route to a human, with SLA timers and escalation paths.
Step 6: Establish audit logging
Every decision point logs full context: the original request, the retrieved policies, the agent’s reasoning, every tool call, the final outcome. Logs stream to the SIEM for compliance monitoring and feed back into the evaluation dataset for the next iteration.
Step 7: Test before deployment
Unit tests cover each tool. Integration tests cover the full workflow against a mock order system. End-to-end scenarios cover happy paths and edge cases. Adversarial tests probe for prompt injection. Load tests confirm the system holds under traffic. The metrics that matter post-launch are refund accuracy against policy, processing time per request, approval-rate alignment with historical baselines and post-refund CSAT.
What Sets Reliable Agents Apart
The agents that go the distance are the ones with measurable business impact, not the ones with the most novel architecture. They show up in metrics: cost per transaction down 40%, first-contact resolution up 30%, processing time halved. Those numbers come from being well-integrated with existing systems, having clear success criteria and being tuned continuously against what the production data shows.
Reliability separates production from demo. Production agents perform consistently across varied inputs, handle edge cases without melting, explain their decisions in a way an auditor can read and operate within documented SLAs. They include fallback paths and a smooth handoff to a human when confidence is low.
Trust is built through transparency. Users rely on the agents that expose their reasoning, let operators set approval thresholds per risk level and produce audit trails that hold up under scrutiny. People can see why the agent did what it did, and step in when context calls for human judgement.
Adaptability keeps an agent valuable as the business changes. Systems worth keeping support prompt updates without code deploys, easy integration of new tools and data sources, and gradual scope expansion as trust accumulates. They keep clean separation between business logic, integration code and model components so any one of them can change without breaking the others.
Security and compliance are not bolt-ons. Enterprise-grade agents include access controls, encrypted data handling, secure credential management and audit trails that evidence GDPR, HIPAA or FCA compliance. The agents that survive a regulator’s questions were built with this in from day one, not retrofitted after the first audit finding.
Benefits, Risks, and Security Considerations for AI Agents
Productivity numbers from the last two years are real, with caveats. Klarna’s 25% drop in average handling time and 700-FTE-equivalent capacity stand, alongside the company’s 2025 acknowledgement that quality required bringing humans back. GitHub’s 2022 Copilot studies hit 55% faster task completion; 2025/2026 follow-ons from GitHub, Cursor and Anthropic put full agentic coding gains in the same range. McKinsey’s 2023 and 2024 work on AI in operations points to double-digit productivity gains when workflows are redesigned around the agent rather than bolted on top.
By 2026 the threat picture includes AI-assisted attacks at scale. Personalised phishing, credible deepfakes, automated reconnaissance and agentic credential harvesting have collapsed attacker timelines. The Microsoft and OpenAI joint threat actor reports through 2024 and 2025 documented state-aligned groups using LLMs for reconnaissance, social engineering and code generation. Verizon’s annual DBIR (source) has tracked the corresponding shrinking dwell times. Defending requires the same automation discipline applied to abuse: red-team your own agent endpoints, monitor for tool misuse, treat the agent as part of the attack surface.
The new attack surfaces sit in the agent’s connectors. Tool integrations, prompt injection channels, callback URLs and external data sources are all entry points. The OWASP Top 10 for LLM and Generative AI Applications, updated in 2025 (source), lists the concrete risks: prompt injection (still LLM01), insecure output handling, data poisoning, SSRF via browsing tools, excessive agency. The controls are practical: input and output filters, strict tool allowlists, network egress restrictions, audit logging that traces a prompt through every system effect it produced.
Regulatory Compliance and Disclosure Requirements
The EU AI Act (source) came into force in August 2024. Prohibited practices started applying in February 2025. General-purpose AI obligations followed in August 2025. The high-risk AI obligations are scheduled to take effect in August 2026, meaning conformity assessments, CE marking and post-market monitoring are imminent operational requirements rather than future planning. The UK Information Commissioner’s Office has continued to issue agent-relevant data protection guidance, and ISO/IEC 42001 (source) (published December 2023) provides the management system framework that most organisations are now certifying against.
In the US, Texas signed TRAIGA (Responsible AI Governance Act) in mid-2025, most provisions effective January 2026, layered on top of the 2024 Texas Data Privacy and Security Act. Colorado’s AI Act, Utah’s AI Policy Act, the NYC local law on AI in hiring and the wave of state proposals through 2025 mean “where will this agent operate” is now a compliance question that needs answering before the architecture is locked.
Translating this into practice means risk classification frameworks that decide oversight depth, documented human accountability chains for agent decisions, record-keeping that supports a regulator audit, privacy impact assessments, bias testing and a clear user recourse mechanism. None of that is optional in a high-risk deployment.
Security Framework for AI Agent Protection
Layered security combines technical controls with governance. Role-based access control with least-privilege principles, approval gates for high-impact actions, immutable audit trails that support forensic analysis. The NIST AI Risk Management Framework provides the structure for identifying and mitigating risk. OWASP’s LLM and Generative AI guidance translates the general principles into agent-specific controls.
Operational security includes model and dataset documentation (model cards, system cards), change control for prompts and tools, scheduled red-team exercises against prompt injection and jailbreak vectors, third-party security assessments. ISO/IEC 42001 certification is increasingly the way to evidence systematic AI security management to enterprise procurement.
Incident response for agents covers a different surface than traditional IR. Playbooks need agent-specific scenarios: prompt injection events, tool abuse, decision appeals from affected users, cascading failures across integrated systems. Procedures should include a clear shutdown path, audit log preservation, stakeholder communication and a post-incident review that produces concrete control changes, not just a write-up.
Best Practices for Implementing AI Agents in Your Organisation
The work starts with outcomes and acceptance criteria agreed jointly with the business. Useful objectives look like “reduce support backlog by 30% within three months while keeping CSAT above the current floor” or “cut mean time to recovery by 25% without raising change failure rate”. Counter-metrics catch the agent gaming the headline number. SLOs give you something to monitor and alert on.
Testing has to happen at multiple levels. Unit tests on tool functions. End-to-end simulations against realistic and adversarial scenarios. Resilience testing against dependency failures. Canary deployments with control group comparisons so improvement claims are statistical, not anecdotal. Golden transcript regression tests catch quality drift after every prompt or tool change.
Monitoring needs detail: token consumption, tool call latency, error frequency, approval rate, the business outcome the agent is meant to move. Dashboards turn those signals into something humans can use. User feedback should be structured (reasons, not just thumbs).
Documentation has to live with the system. Prompt and policy docs that update as the prompts do. Role-specific training for operators and end-users. Changelogs that explain what changed and why, so the next person on call has context.
Governance covers the responsibility matrix for agent decisions, approval authority for high-impact actions and the conditions under which the agent gets paused or shut down. AI review boards (domain experts, security, compliance) should sign off on risk classification, data usage and deployment readiness before anything goes live.
Operational maintenance is a real workload. Scheduled prompt and tool reviews. Dependency updates behind feature flags. Knowledge base refreshes. Credential rotation. Quarterly objective reviews to confirm the agent is still aligned with what the business wants. Retirement procedures for agents whose ROI or risk profile no longer justifies running them.
Treat agents as engineered systems and they behave like engineered systems. The disciplines are the same ones that already produce reliable software: clear objectives, real testing, working monitoring, honest measurement, the patience to fix what breaks instead of pretending it never broke. The teams getting value from AI agents in 2026 are mostly the teams that have been doing software engineering well for the last decade. The technology is the smaller part of the story.


