Accelerating AI Agent Development with LangSmith and OpenAI's Agent SDK

Introduction
Building AI-powered agents into your software product can unlock powerful new capabilities - from intelligent customer support bots to workflow assistants. However, integrating Large Language Model (LLM) agents in production is not as simple as plugging in an API. AI agents are non-deterministic by nature, often producing unexpected results that can be tricky to debug. Product managers and engineering leaders face challenges in understanding an agent's reasoning, refining its prompts, and ensuring it works reliably alongside traditional software components.
Let's explore how LangSmith, an observability and prompt management platform from the LangChain team, helps teams iterate quickly on AI agent applications. We'll focus on using LangSmith with OpenAI's new Agent SDK for Python - a lightweight framework for building agentic applications with OpenAI models.
Key workflows for success:
- Debugging AI agent chains and flows: Gaining visibility into an agent's step-by-step decisions, tool use, and LLM calls
- Collaboratively developing and managing prompts: Treating prompts as first-class, versioned assets
- Blending AI and non-AI components: Managing workflows that mix LLM calls with standard software logic
By the end, you'll understand how LangSmith improves observability, accelerates iteration, and helps ship more reliable AI features in enterprise products.
The Need for Observability in AI Agent Applications
AI agents differ from traditional software modules: their internal “chain of thought” isn't explicitly coded, but emerges from prompts and the model's reasoning. When an agent produces an incorrect answer or takes a wrong tool action, how do you diagnose the issue? Without proper tooling, you're often flying blind inside a black-box model.
LangSmith provides LLM-native observability, allowing teams to get meaningful insights into each step an agent takes. Observability is crucial across the development lifecycle - from early prototyping to production monitoring - to find and fix failures fast.
LangSmith is a unified platform to trace, debug, test, and monitor AI agent performance. Importantly, you don't have to be using the LangChain library to benefit - many teams use LangSmith with agents built on other frameworks, including OpenAI's Agent SDK. In short, LangSmith acts as an AI agent control center: it traces your agent's behavior, surfaces intermediate decisions, and enables evaluations and prompt management.
Debugging Agent Chains with LangSmith Tracing
One of LangSmith's core features is tracing: capturing a detailed log of an agent's execution flow. Think of it like an X-ray of your AI agent's reasoning. Every prompt it constructs, every decision or tool usage, and every output can be recorded as a structured trace in LangSmith's dashboard.
This is invaluable for debugging non-deterministic LLM behavior. Tracing lets engineers and cross-functional teammates see exactly what the agent is doing at each step, and then fix the issues that hurt latency and response quality.
When using OpenAI's Agent SDK, integrating LangSmith tracing is straightforward:
from agents import set_tracing_disabled, set_default_openai_client, set_trace_processors
from openai import AsyncOpenAI
from langsmith.wrappers import OpenAIAgentsTracingProcessor, wrap_openai
# 1. Disable OpenAI's default tracing (we'll use LangSmith instead)
set_tracing_disabled(True)
# 2. Wrap the OpenAI client to capture API calls for LangSmith
#    (the Agent SDK uses an async client under the hood, so we wrap AsyncOpenAI)
openai_client = wrap_openai(AsyncOpenAI(api_key="sk-..."))
# 3. Use this client for all agent LLM calls
set_default_openai_client(openai_client)
# 4. Register LangSmith's trace processor to capture agent flows
set_trace_processors([OpenAIAgentsTracingProcessor()])
# Define a simple tool as a Python function
from agents import Agent, Runner, function_tool
@function_tool
def add_numbers(a: int, b: int) -> int:
    """Add two numbers (non-LLM operation)."""
    return a + b
# Create an agent that uses the tool
math_agent = Agent(
    name="Calculator",
    model="gpt-4.1",
    instructions="You are a math assistant who uses tools for calculations.",
    tools=[add_numbers],
)
# Run the agent with an example query (run_sync is the synchronous entry point)
result = Runner.run_sync(math_agent, "What is 2 + 3?")
print(result.final_output)  # Should output '5'
In this snippet, we:
- Disable the OpenAI SDK's own tracing,
- Wrap the OpenAI API client so that every call to the LLM is recorded by LangSmith,
- Tell the Agent SDK to use this wrapped client by default,
- Attach LangSmith's tracing processor to log the agent's chain of steps.
Finally, we define a simple agent that can use a tool (a Python function add_numbers) to answer math questions. When we run the agent, LangSmith will capture the entire interaction: the user's question, the agent's reasoning, the call to the add_numbers tool (a non-AI operation), and the final answer.

With tracing enabled, the LangSmith UI provides a step-by-step view of the agent's behavior. You can expand the trace of an agent run to see which prompt was given as the system instructions, what the model output at each step, which tool it decided to invoke and with what inputs, and what the tool returned - all organized as a timeline of nested spans. In our math_agent example, you would see the agent's thought process (perhaps it decides to use the add_numbers tool), a record of the add_numbers function call and its result, and then the final answer “5” returned to the user.
Why is this useful to your team? It means faster debugging and better reliability. If the agent makes a wrong turn (say, calls the wrong tool or parses user input incorrectly), you'll spot it in the trace. You can quickly pinpoint which step or prompt caused the issue, instead of guessing. LangSmith's observability helps answer questions like: “What exactly did the AI see and decide before it gave this answer?” and “Where did the chain go off-track?” - crucial for non-deterministic systems. This level of insight is especially important when improving latency or quality; for example, if a certain tool call is slow, you'd see that duration in the trace and can investigate optimizing that step.
LangSmith's tracing doesn't just aid in development; it's also designed for production monitoring. You can trace real user sessions (sampling as needed) to catch issues that only appear in the wild. And importantly, using LangSmith does not add runtime latency to your application - traces are collected asynchronously, so product managers and engineers can instrument liberally without fear of slowing down user responses.
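For production deployments, much of this can be driven by configuration rather than code. Here is a minimal sketch of environment-based setup in Python - the variable names follow current LangSmith documentation, and the key and project name are placeholders, so verify them against your SDK version:
import os
# LangSmith picks up its API key and target project from environment variables.
os.environ["LANGSMITH_TRACING"] = "true"        # enable tracing for wrapped clients and traceable code
os.environ["LANGSMITH_API_KEY"] = "lsv2_..."    # placeholder API key
os.environ["LANGSMITH_PROJECT"] = "agent-prod"  # placeholder project name for grouping traces
With those set, the tracing code shown earlier works unchanged in production, and traces can be partitioned by project or environment.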
Collaborative Prompt Development and Management
Crafting the right prompts (system instructions, few-shot examples, etc.) is often as critical as writing code when building AI features. LangSmith treats prompts as first-class artifacts that teams can develop and refine collaboratively. Instead of burying a long prompt in code and doing ad-hoc edits, LangSmith provides a Prompt Hub and Playground for systematic prompt engineering.
In LangSmith's UI, your team can iteratively develop prompts in a visual Prompt Playground - trying out changes and immediately seeing how the model's output differs. You can experiment with different wording or formatting and compare outputs side by side. This is great for a product manager or content designer who wants to fine-tune the AI's tone or ensure it uses the right phrasing, without needing to run a full dev cycle. As LangChain's documentation notes, you can “experiment with models and prompts in the Playground, and compare outputs across different prompt versions”. Once a prompt is performing well, you can save it to the Prompt Hub with a version label.
Prompt Canvas, a feature within LangSmith, even allows you to collaborate with an AI assistant to refine your prompt. For example, you can highlight a section of the prompt and ask an LLM (in the sidebar) to suggest an alternative phrasing or adjust the tone. This novel UX lets you iterate faster on complex prompts by leveraging AI to help edit AI prompts. Of course, you can also edit prompts directly and use built-in “quick actions” (like adjusting reading level or length) in the Canvas. All changes can be reviewed before saving.
Perhaps most importantly for team collaboration, LangSmith's Prompt Hub provides version control for prompts. Every time you or a teammate saves an update to a prompt, LangSmith creates a new commit behind the scenes. This means you have a full audit trail of how a prompt has evolved - you can see what changed between versions (with a diff view), roll back if a certain revision underperforms, and understand the history of the prompt's development. In an enterprise setting, this auditability is crucial; it brings governance to prompt engineering. For instance, you might have an initial prompt that worked in testing, but after some production incidents, the prompt was revised. With LangSmith, you can trace which prompt version was used in any given trace, and correlate prompt changes with agent behavior changes.
From a workflow perspective, this setup encourages cross-functional collaboration. A UX writer or domain expert could propose prompt improvements via LangSmith's UI (using Prompt Canvas to prototype changes), and an engineer can then fetch the updated prompt from the Hub to deploy in the app. LangSmith's Python SDK allows pulling the latest prompt version programmatically, or even directly referencing a prompt by name and version in your code. This blended approach ensures that prompt tweaks don't always require code deployments, and developers and non-developers can work hand-in-hand to improve AI behavior. As the LangSmith website says, “any teammate can use the Prompt Canvas UI to directly recommend and improve prompts” - which can then be tested and rolled out with confidence.
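In code, fetching a managed prompt can look something like the sketch below - a minimal example that uses the LangSmith SDK's pull_prompt method and assumes a prompt saved in the Hub under the hypothetical name support-agent-system:
from langsmith import Client
client = Client()
# Pull the latest saved version of a prompt by name (the name is hypothetical).
prompt = client.pull_prompt("support-agent-system")
# Or pin a deployment to a specific saved commit so behavior only changes deliberately
# (the commit identifier here is a placeholder).
pinned_prompt = client.pull_prompt("support-agent-system:3f2a91b")
The pulled prompt object can then be rendered into the instructions string you hand to your agent, so prompt updates flow from the Hub into the app without a code change.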
Blending AI and Non-AI Components in Workflows
Real-world AI applications are rarely just a single LLM call - they often involve a workflow of LLM calls and standard software functions. For example, an agent might need to call a database, invoke an API, run a calculation, or hand off to another microservice as part of fulfilling a user request. Managing these hybrid workflows is a challenge: you need to make sure the AI and non-AI parts work in harmony and understand where issues arise (was it a model hallucination or a tool failure? a prompt issue or a code bug?). LangSmith helps manage these complex workflows by tracing the entire end-to-end process, not just the LLM's output.
In our earlier code example, the math_agent had a tool add_numbers (a simple function). This represents a non-AI component in the agent's reasoning process. LangSmith's tracing captured not only the LLM's thought (like “I should use the addition tool for this”) but also the invocation of add_numbers and its result. In a more elaborate agent, you might have tools like a WebSearchTool or database lookup; LangSmith would log those calls as part of the trace as well. Every intermediate span (operation) in the agent's workflow gets recorded - whether it's an LLM call or a traditional function. The OpenAI Agent SDK makes it easy to define such tools (using the @function_tool decorator as we showed), and LangSmith will trace these tool executions as part of the agent's span hierarchy. In the LangSmith UI, you'd see a tree of steps: e.g., Agent workflow -> Tool: WebSearch -> ... -> Response. This holistic view is crucial for debugging workflows that blend AI decisions with deterministic code.
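Deterministic code that runs outside the agent loop can be logged to LangSmith too. Here is a minimal sketch using LangSmith's traceable decorator - the post-processing function and its name are purely illustrative:
from langsmith import traceable
# A plain Python post-processing step (no LLM involved); the decorator records it
# as a LangSmith run so deterministic code can be inspected next to LLM and tool spans.
@traceable(run_type="chain", name="format_report")
def format_report(raw_answer: str) -> str:
    return f"Agent answer: {raw_answer.strip()}"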
Beyond just logging, LangSmith allows you to attach metadata and metrics to these traces. You can tag certain runs, log custom evaluation results, and even set up dashboards to monitor key metrics. For instance, you might track how often the agent had to fall back to a sub-agent or how long each tool call takes on average. LangSmith's monitoring features let you plot metrics like cost, latency, or success rate across many runs. For a technical leader, these insights help answer: Is our AI feature getting more efficient over time? Are the recent prompt changes reducing error rates? If something goes off-track - say latency spikes - you can drill down from a dashboard to the exact trace that is an outlier, then inspect the detailed sequence of steps to find the bottleneck.
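One lightweight way to attach that context is to wrap the request handler that invokes the agent with LangSmith's traceable decorator, passing tags and metadata - a sketch under the assumption that the earlier setup is in place; the tag and metadata values are only examples:
from agents import Runner
from langsmith import traceable
# Wrapping the entry point groups each user request into a named run and attaches
# searchable tags and metadata (the values below are illustrative).
@traceable(name="handle_math_request", tags=["math-agent", "prompt-v2"], metadata={"feature": "calculator"})
async def handle_request(question: str) -> str:
    result = await Runner.run(math_agent, question)  # math_agent from the earlier example
    return result.final_output
Those tags and metadata fields then become filters when you slice traces and metrics in the LangSmith UI.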
In enterprise scenarios, managing workflows with both AI and non-AI steps also raises the need for robust testing. LangSmith supports evaluation workflows where you can define criteria for a “good” output and automatically score agent outputs on historical traces (using either heuristic checks or even LLM-based evaluators). While beyond the scope of this post, it's worth noting that once you have traces of your agent's behavior, you can leverage them for regression tests - ensuring that as you tweak prompts or code, the agent still handles key scenarios correctly. In short, LangSmith not only helps blend AI with traditional software components, but also helps validate that the blend works as intended.
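Even though a full evaluation walkthrough is beyond the scope here, a small sketch conveys the shape of such a regression test. This uses the LangSmith SDK's evaluate API; the dataset name, input/output keys, and evaluator signature are assumptions to check against your SDK version:
from agents import Runner
from langsmith import Client
client = Client()
# Target under test: run the math agent on a single dataset example.
def run_math_agent(inputs: dict) -> dict:
    result = Runner.run_sync(math_agent, inputs["question"])  # math_agent from the earlier example
    return {"output": result.final_output}
# Simple heuristic evaluator: does the answer contain the expected value?
def contains_expected(run, example):
    answer = str(run.outputs.get("output", ""))
    expected = str(example.outputs.get("expected", ""))
    return {"key": "contains_expected", "score": float(expected in answer)}
# Score the agent against a (hypothetical) dataset of saved question/expected pairs.
client.evaluate(
    run_math_agent,
    data="math-agent-regression",
    evaluators=[contains_expected],
    experiment_prefix="prompt-v2",
)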
Seamless Integration with OpenAI's Agent SDK
The OpenAI Agent SDK is a welcome toolkit for building agents with minimal overhead, and LangSmith is designed to complement it seamlessly. We saw in code how easy it is to tie the two together. It's worth highlighting a few best practices for integration:
- Enable LangSmith Tracing: Use set_trace_processors([OpenAIAgentsTracingProcessor()]) to hook LangSmith into the agent's execution loop. This ensures all intermediate steps (LLM calls, tool actions, sub-agents, etc.) are streamed to the LangSmith tracer. The agent's entire decision tree will be recorded, which is invaluable for debugging and auditing.
- Disable OpenAI's Default Tracing: The OpenAI SDK has built-in tracing that sends data back to OpenAI (for example, to their own monitoring or logging system). If you plan to use LangSmith exclusively for observability, you can disable the default via set_tracing_disabled(True), as we did. This prevents duplicate logging and keeps data within your LangSmith project (which might be preferable for privacy or compliance if you're self-hosting LangSmith).
- Wrap the LLM API Calls: By calling wrap_openai() on the OpenAI client, you patch the client to automatically log each request/response to LangSmith. This is a one-time wrap that gives you transparency into every prompt sent to the model and every completion received. Whether your agent uses the Chat Completions API or the newer Responses API, wrap_openai supports both sync and async clients. This means even if you call openai_client.chat.completions.create() or openai_client.responses.create() directly, those calls become traceable events (see the sketch after this list).
- Configure the Default Client: The Agent SDK allows you to specify a custom OpenAI client (for example, if you want to point to a proxy or apply special configuration) via set_default_openai_client(). We used this to inject our wrapped client, so that all agent actions use the trace-enabled client. This way, developers don't have to remember to use a special object for LLM calls - it's configured globally for the agent runtime. You can also set which API (Responses vs Chat Completions) the agent uses by default with set_default_openai_api(...), but either way, LangSmith will catch the calls once the client is wrapped.
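As a quick illustration of those last two points, here is a minimal sketch - the model name and prompt are illustrative, and it assumes your LangSmith environment variables are already set:
from agents import set_default_openai_api
from openai import AsyncOpenAI
from langsmith.wrappers import wrap_openai
# Standardize the agent runtime on the Chat Completions API instead of the Responses API.
set_default_openai_api("chat_completions")
# Direct calls through a wrapped client are traced too, even outside an agent run.
openai_client = wrap_openai(AsyncOpenAI())
async def quick_check() -> str:
    response = await openai_client.chat.completions.create(
        model="gpt-4.1-mini",  # illustrative model choice
        messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    )
    return response.choices[0].message.content
Once the client is wrapped, both the agent's calls and these ad-hoc calls land in the same LangSmith project.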
In summary, integrating LangSmith with the OpenAI Agent SDK is a matter of a few lines of setup. From that point on, you get rich instrumentation of your AI agent. The integration is non-intrusive - your agent's logic doesn't need to change, and LangSmith stays out of the way of core execution (it runs asynchronously in the background). What you gain is confidence and speed: confidence that you can observe and debug the agent's reasoning, and speed in iterating on improvements.
Conclusion and Key Takeaways
Incorporating AI agents into products requires a shift in how we develop and maintain software. With the combination of OpenAI's Agent SDK and LangSmith, teams can bring modern engineering practices - observability, version control, and collaboration - to the world of AI-driven features. LangSmith serves as a critical co-pilot for your AI agents, ensuring that you can ship agents with confidence.
Key takeaways for technical leaders:
- Faster Debugging: LangSmith's tracing provides full visibility into an agent's decision-making process, making it easy to diagnose issues in complex AI chains. You can quickly find failure points by reviewing traces, rather than trying to reproduce elusive bugs.
- Improved Prompt Management: LangSmith offers a centralized Prompt Hub where prompts are versioned and can be collaboratively edited. Your team can experiment in a sandbox (Playground/Prompt Canvas) and deploy better prompts without guesswork, with an audit trail of every change.
- Unified Workflows (AI + Tools): Whether your agent is calling external APIs or performing calculations, LangSmith captures those steps alongside LLM interactions. This holistic observability ensures you understand the full picture of how AI and non-AI components work together, leading to more reliable and optimized workflows.
- Seamless Integration: Using LangSmith with OpenAI's Agent SDK is straightforward and minimally invasive. A few lines of setup (disabling OpenAI's tracing, wrapping the client, and adding LangSmith's processor) plug you into LangSmith's platform. It works even if you're not using LangChain in your codebase.
- Production-Ready Monitoring: LangSmith is built for production use. It introduces no added latency to your end-user requests, can be self-hosted for data privacy, and supports custom dashboards and alerts. This means you can confidently monitor AI features in real-time and catch issues or regressions early.
By leveraging LangSmith in tandem with the OpenAI Agent SDK, product and engineering teams can iterate faster on AI agents - from prompt tweaks to tool integrations - and deliver a better experience to users. In an era where AI capabilities are rapidly evolving, LangSmith provides the guardrails and insights needed to harness that innovation effectively. It enables your team to focus on building value with AI, while ensuring observability, collaboration, and reliability are never afterthoughts.
Integrating an AI agent into your product is a journey into new territory, but with the right tooling, it's a journey you can take with confidence and clarity. LangSmith helps turn the art of prompt and agent design into a manageable, trackable engineering process - so you can ship intelligent features that meet your users' needs, and keep improving them over time.