Gradient Flow

Deep Dive into OpenAI’s Agent Ecosystem


In recent weeks, I have been examining the rapid evolution of AI agents, a field where OpenAI’s latest offerings represent just one approach in an increasingly transformative and globally competitive landscape. As my analysis of Manus (the “general AI agent” from Chinese startup Monica.ai) revealed, significant innovation is emerging from diverse sources, with Manus even outperforming OpenAI’s offerings on the General AI Assistants (GAIA) benchmark. This isn’t a winner-take-all market; it’s a rapidly developing global ecosystem where both established AI labs and nimble startups are driving progress. OpenAI itself has identified agents as a major growth area, underscoring the strategic importance of this technology.

OpenAI’s new agent-building tools and Deep Research warrant analysis not because they necessarily represent the best solutions in their class, but because they crystallize key trends shaping the broader agent landscape. Their approach to web search, file search, and, critically, computer use tools (enabling GUI-based interaction) reflects a wider industry shift towards layered agent architectures and modular agent design. This move towards GUI interaction is particularly significant because it enables AI to interact with virtually any software through graphical interfaces, dramatically expanding the scope of automation beyond systems with specialized APIs. Manus exemplifies this modularity, reportedly leveraging a multi-agent system that incorporates models like Anthropic’s Claude and fine-tuned Qwen models. The emphasis is shifting from monolithic models to the effective orchestration of specialized agents.

While healthy skepticism towards any single vendor is warranted, the current moment is defined by the rapid translation of theoretical concepts into practical applications. We’re witnessing a convergence of GUI-based interaction, layered and modular architectures, and the maturation of Planner-Actor-Validator and Tool-Use design patterns. Furthermore, as Manus demonstrates, competitive advantage increasingly stems from effective product engineering and integration of existing models, rather than solely from foundational research breakthroughs. This intensifies competition and highlights the importance of execution speed. This is an exceptionally productive period for those building agentic applications – one where understanding the evolving landscape, including the critical challenges of accountability, safety, and real-world evaluation, has become essential for technologists and business leaders alike. The technical capabilities to create useful autonomous agents already exist; now the race is on to deliver reliable, safe, and truly effective implementations that address these challenges.


Table of Contents

I. OpenAI’s New Agent-Building Tools

II. Deep Research


I. OpenAI’s New Agent-Building Tools

OpenAI’s Core Building Blocks

OpenAI defines an agent as a system capable of independent action to perform tasks on a user’s behalf. They announced three core, built-in tools to facilitate agent development:

  1. Web Search Tool: Provides models with access to up-to-date information from the internet. It’s powered by a fine-tuned GPT-4o model (or a smaller variant) optimized for information retrieval and source citation. This is the same technology powering search functionality within ChatGPT.
  2. File Search Tool: Enables developers to upload and perform semantic searches over their own private documents. Crucially, it includes metadata filtering for precise queries and a direct search endpoint that bypasses model filtering, offering greater control and accuracy, especially for Retrieval-Augmented Generation (RAG).
  3. Computer Use Tool: Brings the capabilities of ChatGPT’s “Operator” feature to the API. It allows agents to control computers (including virtual machines and legacy applications) via their graphical user interfaces (GUIs). This enables automation of tasks without requiring direct API access. It uses the same model as Operator and has demonstrated strong performance on benchmarks like OS-World, WebArena, and WebVoyager.

These tools are designed to address the common challenge of integrating disparate, low-level APIs when building agent applications.
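As a sketch of how the three built-in tools might be declared in a single request, here is a minimal example. The tool type strings (`web_search_preview`, `file_search`, `computer_use_preview`) and parameters follow OpenAI’s documentation at announcement time and should be checked against the current API reference; the query and vector store ID are placeholders.

```python
# Sketch: assembling a Responses API payload that enables all three
# built-in tools in one request. Tool type names are assumptions based
# on OpenAI's announced conventions and may change.

def build_agent_request(query: str, vector_store_id: str) -> dict:
    """Build a request payload enabling web search, file search, and computer use."""
    return {
        "model": "gpt-4o",
        "input": query,
        "tools": [
            # Live web results with source citations
            {"type": "web_search_preview"},
            # Semantic search over previously uploaded private documents
            {"type": "file_search", "vector_store_ids": [vector_store_id]},
            # GUI control (the Operator model); display and environment
            # parameters describe the virtual screen the agent drives
            {
                "type": "computer_use_preview",
                "display_width": 1024,
                "display_height": 768,
                "environment": "browser",
            },
        ],
    }

request = build_agent_request("Summarize this week's AI agent news", "vs_example_id")
```

In practice this dictionary would be passed to the Responses API endpoint; the point here is that all three tools ride along in one `tools` array rather than requiring separate integrations.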


Responses API: The Evolution Beyond Chat Completions

The Responses API is a new, more flexible API designed as the eventual successor to the Chat Completions API. It’s built to support the complex, multi-turn interactions and tool use that are essential for sophisticated agents.

The Chat Completions API will continue to be supported with new models and capabilities. However, some new features and models, particularly those related to advanced agent functionality, will be exclusive to the Responses API. Migration from Chat Completions to Responses is intended to be straightforward.
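To make the migration surface concrete, here is a rough side-by-side of the same question expressed against both APIs. The payload shapes follow OpenAI’s published examples (messages array vs. a flattened `input` field), but the exact parameters should be verified against the current API reference.

```python
# Sketch: the same call against the Chat Completions API and the
# Responses API. Field names follow OpenAI's published examples;
# treat them as assumptions to verify against current docs.

chat_completions_payload = {
    "model": "gpt-4o",
    "messages": [
        {"role": "user", "content": "What changed in the agent tools launch?"}
    ],
}

# The Responses API flattens `messages` into `input` and accepts
# built-in tools directly, so multi-step tool use can happen inside
# a single API call rather than across many round trips.
responses_payload = {
    "model": "gpt-4o",
    "input": "What changed in the agent tools launch?",
    "tools": [{"type": "web_search_preview"}],
}
```

The structural similarity is why OpenAI describes the migration as straightforward: the conversational content carries over, while tool use moves from custom orchestration into the request itself.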


Web Search Tool: Enhancing AI with Real-Time Internet Access

The Web Search Tool allows models to retrieve and analyze current information from the internet, enhancing the factual accuracy and timeliness of responses. It leverages the same technology as ChatGPT’s search feature.

This ensures that AI applications can access real-time information beyond their training data.


Enhanced File Search: New Metadata and Direct Query Capabilities

The File Search Tool, previously part of the Assistants API, has been significantly enhanced with metadata filtering for precise queries and a direct search endpoint that bypasses model filtering.

These enhancements make RAG implementations more flexible and efficient for applications leveraging private knowledge bases.
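As an illustration of the metadata-filtering enhancement, here is a sketch of a `file_search` tool entry with an attribute filter attached. The filter grammar shown (comparison and compound operators like `eq`, `gte`, and `and`) follows OpenAI’s documented attribute-filter format, but the exact keys should be verified; the vector store ID and attributes are hypothetical.

```python
# Sketch: a file_search tool entry restricted by document metadata,
# so retrieval only considers finance documents from 2024 onward.
# Filter operators (eq/gte/and) follow OpenAI's documented format;
# the store ID and attribute names are made up for illustration.

file_search_tool = {
    "type": "file_search",
    "vector_store_ids": ["vs_quarterly_reports"],  # hypothetical store ID
    "filters": {
        "type": "and",
        "filters": [
            {"type": "eq", "key": "department", "value": "finance"},
            {"type": "gte", "key": "year", "value": 2024},
        ],
    },
    "max_num_results": 5,  # cap retrieved chunks for tighter context
}
```

Filtering at retrieval time like this is what makes RAG over large private corpora practical: the model never sees chunks that fail the metadata predicate, which improves both precision and token efficiency.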


Computer Use Tool: Bringing GUI Automation to AI Agents

The Computer Use Tool brings the functionality of ChatGPT’s “Operator” to the API. It allows AI agents to control computers by interacting with graphical user interfaces (GUIs), enabling automation of tasks in systems that lack direct API access.

The tool utilizes the same model powering Operator in ChatGPT, with strong performance on benchmarks such as OS-World, WebArena, and WebVoyager.
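The control pattern this tool implies can be sketched as a loop: the model sees a screenshot, proposes a GUI action, a harness executes it, and the resulting screenshot is fed back. The action names below (`click`, `done`) mirror the kinds of action types OpenAI documents, but this is an illustrative skeleton with stubbed functions, not the real integration.

```python
# Sketch of an Operator-style control loop: alternate model-proposed
# GUI actions with execution until the model signals completion.
# Both callbacks are stubs; a real harness would call the API for
# actions and drive a browser or VM to execute them.

def run_computer_use_loop(get_model_action, execute_action, max_steps=10):
    """Run the propose/execute/observe loop, returning the action transcript."""
    transcript = []
    for _ in range(max_steps):
        # Model proposes the next GUI action given what has happened so far,
        # e.g. {"type": "click", "x": 100, "y": 200}
        action = get_model_action(transcript)
        if action["type"] == "done":
            break
        # Harness performs the action and captures the new screen state
        screenshot = execute_action(action)
        transcript.append((action, screenshot))
    return transcript

# Stub demo: the "model" clicks once, then declares the task finished.
actions = iter([{"type": "click", "x": 10, "y": 20}, {"type": "done"}])
log = run_computer_use_loop(lambda t: next(actions), lambda a: "screenshot-bytes")
```

The `max_steps` cap is worth noting: because GUI agents can wander, production harnesses typically bound the loop and surface the transcript for human review.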



Agents SDK: Streamlining Multi-Agent Application Development

The Agents SDK (formerly “Swarm”) is an open-source framework (installable via pip install openai-agents, with JavaScript support coming soon) designed to simplify the orchestration of multiple agents within a single application.
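To show the orchestration idea the SDK formalizes, here is a minimal plain-Python sketch of the handoff pattern, where a triage agent routes work to specialists. This is deliberately not the real Agents SDK API; the class and method names are invented for illustration only.

```python
# Not the real Agents SDK: a minimal plain-Python illustration of the
# handoff pattern, where a triage agent delegates to specialist agents.

class Agent:
    def __init__(self, name, handles, handoffs=None):
        self.name = name
        self.handles = handles          # predicate: can this agent take the task?
        self.handoffs = handoffs or []  # specialist agents it may delegate to

    def run(self, task):
        # Hand off to the first specialist willing to take the task
        for specialist in self.handoffs:
            if specialist.handles(task):
                return specialist.run(task)
        # No handoff applies: handle the task directly
        return f"{self.name} handled: {task}"

billing = Agent("billing", lambda t: "invoice" in t)
support = Agent("support", lambda t: True)
triage = Agent("triage", lambda t: True, handoffs=[billing, support])

result = triage.run("invoice question")
```

The real SDK layers tracing, guardrails, and tool integration on top of this routing idea, but the core abstraction is the same: agents as composable units that can pass control to one another.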


Assistants API Migration: Transition Timeline and Strategy

OpenAI plans to sunset the Assistants API in 2026. Before this occurs:

  1. Feature Parity: All Assistants API functionality will be incorporated into the Responses API.
  2. Migration Guide: A comprehensive migration guide will be provided to assist developers in transitioning their applications smoothly, without data or functionality loss.
  3. Ample Time: Developers will have sufficient time to migrate their applications.

This consolidation aims for a unified and streamlined developer experience. The Responses API will maintain support for multimodal inputs and all agent-building blocks currently in the Assistants API.


Combining Agent Tools: Practical Use Cases and Solutions

In their announcement video, OpenAI shared several examples illustrating the combined power of these tools.

The Responses API allows these tools to be called within a single API response, streamlining development.


Advantages for Development Teams: Key Benefits Overview

These tools and the Responses API offer significant advantages for development teams.

These advancements enable teams to build more sophisticated and effective AI applications more quickly and efficiently, shifting the focus from simply answering questions to performing tasks autonomously.


Deep Research as a Model: Understanding the Practical Application of New Tools

Deep Research is an agent OpenAI has already built using the kinds of capabilities now being made available to developers through the API. Specifically:

  1. Deep Research is an existing agent that can condense a week’s worth of research into 15 minutes.
  2. The tools being announced (Web Search Tool, File Search Tool, Computer Use Tool) and the new Responses API are positioned as enabling developers to build similar agent capabilities to what’s already in Deep Research.
  3. Deep Research is described as a product that uses “multiple model turns and multiple tool calls behind the scenes” – which is precisely what the new Responses API is designed to facilitate for developers.

In essence, Deep Research serves as a concrete example of what developers can now build themselves using the newly announced tools and APIs.  See below for more on Deep Research.





II. Deep Research

Introduction to Deep Research

Deep Research is an AI agent developed by OpenAI, integrated within ChatGPT, that automates comprehensive online research. It’s designed for complex research tasks that usually take hours, delivering detailed reports with sources and citations in 5-30 minutes. It’s significantly more thorough than standard ChatGPT responses because it’s specifically optimized for tasks needing extensive web research and external context, going beyond the model’s pre-trained knowledge.


Deep Research’s Role in OpenAI’s Agent Vision

Deep Research and Operator are currently separate products, but they represent steps toward a unified AI agent that can seamlessly handle various tasks (web search, computer operation, etc.)—a “fusion agent” that combines web, API, and desktop interactions. Deep Research exemplifies this direction.


The Deep Research Engine: A Six-Stage Iterative Research Process

Deep Research uses a fine-tuned version of OpenAI’s o3 reasoning model, trained end-to-end with reinforcement learning on browsing and reasoning tasks. It has access to a browsing tool and a Python tool. The core process is iterative:

  1. Query Understanding: The model analyzes the user’s request.
  2. Search: It formulates and executes web searches.
  3. Information Extraction: It reads and extracts relevant information from web pages.
  4. Synthesis: It synthesizes the gathered information.
  5. Decision: It decides whether to continue searching or generate a report.
  6. Report Generation: If sufficient information is gathered, it creates a structured report with citations.

This iterative, end-to-end approach allows the model to learn complex research strategies that might not be apparent to human designers.
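The six stages above can be sketched as a control loop. In the real system the loop is learned by the model rather than hand-written, so this is only a structural illustration with stubbed search and extraction steps; the function names are invented.

```python
# Sketch of the six-stage research loop as explicit control flow.
# The real agent learns this behavior end-to-end; here every step
# is a stub so the iteration structure is visible.

def deep_research(query, search, extract, enough, max_rounds=5):
    """Iteratively search and extract until findings suffice, then report."""
    findings = []
    plan = query                                   # 1. query understanding (stubbed)
    for _ in range(max_rounds):
        results = search(plan)                     # 2. search
        findings += [extract(r) for r in results]  # 3. information extraction
        if enough(findings):                       # 4-5. synthesize and decide
            break
    return {"query": query, "citations": findings}  # 6. report with citations

report = deep_research(
    "agent benchmarks",
    search=lambda q: ["source-a", "source-b"],
    extract=lambda r: r,
    enough=lambda f: len(f) >= 2,
)
```

The interesting contrast with hand-coded pipelines is precisely that the `enough` decision and the reformulation of `plan` are learned behaviors in Deep Research, not fixed predicates as sketched here.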


User Profiles and Applications

Deep Research targets anyone doing “knowledge work,” both professionally and personally.

Surprising applications include coding (finding documentation, writing scripts) and medical research (finding literature, identifying clinical trials).


Accuracy Mechanisms: Citations, Verification, and Limitations

The primary mechanism used to evaluate accuracy is citation: the reports include references to the sources used. The training process emphasizes correct citation. The clarification flow also helps ensure the model understands the user’s needs. However, the model can still make mistakes or rely on unreliable sources. Users should always verify critical information using the provided citations. This is an ongoing area of improvement.
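A first-pass sanity check on citation quality can be automated before any manual verification. The helper below is illustrative and not part of Deep Research: it flags citations in a generated report that do not correspond to any source actually retrieved during the run.

```python
# Illustrative helper (not part of Deep Research): flag report citations
# that don't match any source retrieved during the research run, as a
# cheap precursor to manually verifying the claims themselves.

def uncited_sources(report_citations, retrieved_urls):
    """Return citations that don't correspond to any retrieved source."""
    retrieved = set(retrieved_urls)
    return [c for c in report_citations if c not in retrieved]

bad = uncited_sources(
    ["https://a.example/paper", "https://b.example/post"],
    ["https://a.example/paper"],
)
```

A check like this catches only fabricated or mismatched references; whether a genuinely retrieved source is reliable still requires human judgment, which is why the article stresses verifying critical information through the citations.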


Deep Research Usage Guide: Five Strategies for Effective Implementation


End-to-End Training vs. Modular Systems

The key differentiator behind Deep Research is end-to-end training using reinforcement learning. Most other approaches use a modular design, with language models acting as decision-making nodes within a pre-defined graph of operations. Deep Research, however, is trained holistically on complete research tasks. This gives it more flexibility and adaptability to handle edge cases, unexpected information, and complex queries. It learns to adjust its strategy based on the information it finds.

By optimizing directly for research outcomes through reinforcement learning, Deep Research can develop more sophisticated strategies than hand-coded systems. This approach of taking a state-of-the-art reasoning model, giving it access to tools, and optimizing it directly for outcomes is what makes Deep Research particularly powerful.


Architectural Decisions: Selecting Between End-to-End and Modular Approaches

The choice between end-to-end training and a modular design depends on the task’s characteristics.


Reinforcement Learning’s Critical Role

Reinforcement learning (RL) allows the agent to adapt its approach in real-time, unlike fixed scripts. It can pivot based on the information it finds, making it more flexible and effective for tasks with unpredictable search paths. RL is now viable because powerful pre-trained language models (the “cake” and “frosting” in the analogy) provide a strong foundation for RL (the “cherry on top”) to optimize for specific tasks.


Technical Obstacles: Training Data, Accuracy, and User Interaction Design

In building Deep Research, creating high-quality training datasets was a major challenge.

Ensuring factual accuracy and proper source attribution (citations) was another key challenge. The design of an effective “clarification flow” (where the model asks clarifying questions) was also crucial.


Next-Generation Features: The Future of AI Research Agents

Future developments include:


How Deep Research Will Transform Work and Education

These agents are tools to enhance human capabilities, not replace jobs. They automate time-consuming tasks, freeing people for higher-level work and enabling tasks that were previously impractical. In education, they offer personalized and efficient learning experiences, adapting to individual needs and providing a more engaging alternative to traditional methods.


Staying Current: A Practitioner’s Guide to AI Agent Developments


