Kamil Józwik

Agentic Capabilities of LLMs

Exploring how language models can act as autonomous agents through tool use, planning, memory, and more.

llm

You're building a customer service bot. Version one answers questions from training data. Version two can look up order status, check inventory, and process returns by calling your APIs. That second version isn't just a chatbot - it's an agent. The difference lies in agentic capabilities: features that transform a language model from a text generator into a system that can reason, plan, remember context, and take action.

This article explains what makes an LLM agentic, how these capabilities work conceptually, and when you actually need them.

The Core Capabilities

An agentic LLM goes beyond text generation. It perceives its environment, decides what to do next, executes actions through external tools, and learns from results. Five fundamental capabilities enable this: tool use, planning, memory, reflection, and multi-agent coordination.

Tool Use and Function Calling

Tool use enables LLMs to interact with external systems rather than relying solely on training data. When asked about today's weather, the model can't answer from memory - but with tool use, it invokes a weather API, receives current data, and formulates a response.

How It Works

You define available tools using structured schemas (typically JSON) describing what each tool does and what parameters it accepts. When the model needs external information, it generates a structured request specifying which tool to call. Your application executes that tool call and feeds results back to the model. The model never executes tools directly - this separation is crucial for security and control.
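
A minimal sketch in TypeScript may make that loop concrete. The schema shape follows the JSON Schema convention most providers accept, and `callModel` and `getWeather` are hypothetical stubs standing in for your provider SDK and your own API, not any specific library:

```typescript
// A tool definition: a name, a description, and a JSON Schema for parameters.
// Most providers accept something close to this shape, but exact field names differ.
const weatherTool = {
  name: "get_weather",
  description: "Get the current weather for a given city",
  parameters: {
    type: "object",
    properties: { city: { type: "string", description: "City name" } },
    required: ["city"],
  },
};

// The model either answers in text or requests a tool call.
type ModelReply =
  | { kind: "text"; text: string }
  | { kind: "tool_call"; name: string; arguments: { city: string } };

// Hypothetical stubs - swap these for your provider SDK and your own weather API.
async function callModel(prompt: string, tools: object[]): Promise<ModelReply> {
  if (prompt.includes("Tool result")) {
    return { kind: "text", text: "It's 21°C in Warsaw right now." };
  }
  return { kind: "tool_call", name: "get_weather", arguments: { city: "Warsaw" } };
}

async function getWeather(city: string): Promise<{ tempC: number }> {
  return { tempC: 21 }; // Your real weather API call goes here.
}

async function answerWeatherQuestion(question: string): Promise<string> {
  const reply = await callModel(question, [weatherTool]);

  // If the model answered directly, we're done.
  if (reply.kind === "text") return reply.text;

  // Otherwise the model *requested* a tool call - the application executes it.
  const result = await getWeather(reply.arguments.city);

  // Feed the result back so the model can formulate the final answer.
  const followUp = await callModel(
    `${question}\nTool result: ${JSON.stringify(result)}`,
    [weatherTool],
  );
  return followUp.kind === "text" ? followUp.text : "Unexpected second tool call";
}
```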

Different providers (OpenAI, Anthropic, Google, Cohere, Mistral) all offer native tool calling, but schema formats differ. The Berkeley Function Calling Leaderboard tests models across thousands of functions, evaluating not just single calls but parallel calls, multi-turn interactions, and knowing when to abstain.

While state-of-the-art models excel at single-turn calls, memory, dynamic decision-making, and long-horizon reasoning remain challenging. A model might perfectly handle "book me a flight" in isolation but struggle when that's step seven in a complex conversation requiring prior context.

When It Matters

Reach for tool use when your LLM needs information it can't know from training - current prices, account details, real-time data - or when it must perform actions with consequences, like sending emails or modifying databases. Skip it when the model can answer from training knowledge alone or when the schema overhead outweighs the benefit.

Planning and Reasoning

Planning capabilities enable models to decompose large goals into actionable steps. When asked to "analyze Q3 sales and identify opportunities," the model breaks this into subtasks: retrieve data, analyze patterns, research trends, synthesize recommendations.

Different prompting approaches yield different behaviors. Chain-of-Thought shows reasoning steps. Tree-of-Thoughts explores multiple paths simultaneously. ReAct interleaves reasoning with action, letting the model think, execute a tool, observe results, and decide next steps.
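
As a rough sketch (not any particular framework's API), a ReAct-style loop alternates a model decision with a tool execution until the model commits to a final answer:

```typescript
// A minimal ReAct-style loop: each turn the model either proposes an action
// (a tool call) or emits a final answer. All types and helpers are illustrative.
type Step =
  | { kind: "action"; tool: string; input: string }
  | { kind: "final"; answer: string };

async function reactLoop(
  task: string,
  callModel: (transcript: string) => Promise<Step>,
  runTool: (tool: string, input: string) => Promise<string>,
  maxSteps = 5,
): Promise<string> {
  let transcript = `Task: ${task}`;
  for (let i = 0; i < maxSteps; i++) {
    const step = await callModel(transcript); // the "reason" half of the loop
    if (step.kind === "final") return step.answer;

    const observation = await runTool(step.tool, step.input); // the "act" half
    transcript += `\nAction: ${step.tool}(${step.input})\nObservation: ${observation}`;
  }
  return "Step budget exhausted without a final answer";
}
```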

Extended Thinking Modes

Modern models like Claude 3.7 Sonnet and Claude 4 offer hybrid modes: near-instant responses or extended thinking where the model self-reflects before answering. This "test-time compute" allocates additional tokens for reasoning, with performance improving logarithmically as thinking budgets increase.
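
As a hedged example, enabling a thinking budget through the Anthropic SDK looks roughly like this; the `thinking` option and its field names follow the extended-thinking documentation at the time of writing, so verify them against your SDK version:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-3-7-sonnet-20250219",
  max_tokens: 4096, // must exceed the thinking budget
  thinking: { type: "enabled", budget_tokens: 2048 },
  messages: [
    { role: "user", content: "Plan a zero-downtime migration from REST to gRPC." },
  ],
});

// The content array holds "thinking" blocks followed by the final "text" blocks.
for (const block of response.content) {
  if (block.type === "text") console.log(block.text);
}
```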

Claude 4's "interleaved thinking" alternates between reasoning and tool calls within a single turn - think, use tool, think about results, use another tool. OpenAI's o1 and o3 use internal chain-of-thought reasoning, generating multiple solution paths and selecting the most consistent through self-verification.

Use planning for multi-step tasks: research workflows, software projects, procedural customer service, data pipelines. Skip it for simple queries, where planning adds overhead without improving on a direct answer.

Reflection and Self-Improvement

Reflection enables agents to evaluate outputs, learn from mistakes, and refine approaches. The agent generates output, evaluates it (by self-prompting or using an evaluator model), identifies shortcomings, and improves.

Some systems like Reflexion implement explicit loops: an Actor handles tasks, an Evaluator rates results and provides guidance for improvement. This creates structured refinement rather than hoping the model self-corrects.
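
A sketch of that Actor/Evaluator loop, with both roles passed in as plain functions (in practice they might be two differently prompted calls to the same model):

```typescript
// Generate, evaluate, refine - and stop once the evaluator is satisfied
// or the revision budget runs out. All names here are illustrative.
type Evaluation = { passed: boolean; feedback: string };

async function refineWithReflection(
  task: string,
  actor: (task: string, feedback?: string) => Promise<string>,
  evaluator: (task: string, output: string) => Promise<Evaluation>,
  maxRounds = 3,
): Promise<string> {
  let output = await actor(task);
  for (let round = 0; round < maxRounds; round++) {
    const review = await evaluator(task, output);
    if (review.passed) return output;

    // Feed the critique back so the next attempt addresses it explicitly.
    output = await actor(task, review.feedback);
  }
  return output; // best effort after exhausting the reflection budget
}
```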

But models are often poor judges of their own outputs. They might confidently assert incorrect code is correct. Reflection works best with verifiable feedback - test results, user corrections, execution outcomes - not self-assessment alone. Each reflection cycle adds latency and tokens. Worth it for high-stakes accuracy, overkill for routine tasks.
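
For example, the evaluator in the sketch above can be grounded in execution instead of self-assessment - here by running a test suite against generated code, with `runTests` as a hypothetical test-runner helper:

```typescript
// An evaluator grounded in verifiable feedback: it runs the generated code
// against a test suite and turns any failures into concrete guidance.
async function testBackedEvaluator(
  code: string,
  runTests: (code: string) => Promise<{ failures: string[] }>, // hypothetical test runner
): Promise<{ passed: boolean; feedback: string }> {
  const { failures } = await runTests(code);
  return {
    passed: failures.length === 0,
    feedback:
      failures.length > 0
        ? `These tests failed - fix the code:\n${failures.join("\n")}`
        : "All tests passed",
  };
}
```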

Multi-Agent Systems

Multi-agent systems deploy specialized agents that collaborate. A development workflow might use separate agents for writing code, running tests, reviewing quality, and writing documentation. Because each agent has a narrow role, it can be given the tools and prompts best suited to that function.

Collaboration patterns vary: hierarchical systems where coordinators delegate to specialists, peer-to-peer where agents communicate directly, or marketplace systems where agents compete based on capabilities.
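
A hierarchical version can be as small as a coordinator function that pipes work through specialists. In this sketch each agent is just an async function wrapping its own prompt and tools, and all names are illustrative:

```typescript
// A coordinator that routes subtasks to specialist agents in sequence.
type Agent = (input: string) => Promise<string>;

async function runFeaturePipeline(
  feature: string,
  agents: { coder: Agent; tester: Agent; reviewer: Agent },
): Promise<string> {
  const code = await agents.coder(`Implement: ${feature}`);
  const testReport = await agents.tester(`Write and run tests for:\n${code}`);

  // The reviewer sees both the code and the test results before judging.
  return agents.reviewer(`Review this code:\n${code}\nTest report:\n${testReport}`);
}
```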

The challenge is coordination. Inter-agent misalignment includes communication breakdowns, inconsistent understanding, memory failures, and protocol violations. Some research introduces Theory of Mind where agents track each other's goals, but this adds complexity.

Multi-agent setups excel when tasks decompose into distinct specializations, when subtasks require different tools, or when you need parallel processing. They're overkill for simple workflows: the orchestration overhead only pays off when the benefits of specialization outweigh its costs.

The Trade-offs

Agentic capabilities aren't free. Each layer - tool use, planning, memory, reflection, multi-agent coordination - adds latency, token costs, and complexity. An agent that plans, calls three tools, reflects, and updates memory can easily use an order of magnitude more tokens than a simple completion.

More moving parts mean more failure modes. Tools return unexpected data, memory surfaces irrelevant context, planning produces invalid sequences, agents miscommunicate. Debugging shifts from "why did the model say this?" to "why did the agent choose that tool based on that memory?"

But for tasks genuinely requiring agency - multi-step workflows, external system integration, feedback learning - these capabilities transform possibilities. The key is matching capability to need. Not every LLM application needs agency, and not every agent needs every capability.

Making the Decision

Ask what the model actually needs to do. Just generating text from static knowledge? No agentic capabilities needed. Current information or actions required? Tool use is essential. Multiple steps? Add planning. Continuity across sessions? Add memory. Quality improvement through iteration? Add reflection.

Consider constraints: latency tolerance, token budget, engineering complexity, failure handling. Frontier models (Claude 4, GPT-4/5, Gemini 2.5) offer these capabilities with different strengths. Claude excels at long-context and careful reasoning. GPT provides the broadest ecosystem and multimodal support. Gemini integrates with Google services.

What Remains

The distinction between specialized reasoning models and general-purpose models is blurring. Tool use grows more sophisticated with parallel calls and interleaved thinking. Memory systems evolve from retrieval to knowledge graphs.

But fundamentals remain stable: tools for world interaction, planning for problem decomposition, memory for context, reflection for improvement, multi-agent systems for coordination. Understanding these building blocks provides a foundation that persists as specific APIs change.

You'll still need to decide which capabilities your application requires, what trade-offs you'll accept, and how to architect maintainable systems. That decision-making framework matters most.