I spent $96 and burned 150 million tokens with OpenAI’s Deep Research API and all I got were these 5 great insights
My goal was simple: make my newsletter faster and better using Deep Research. What happened next was a deep dive into agents, orchestration, prompt design, and the architecture behind making it scale.
Creating my newsletter has always been a fun but time-consuming task. Its iterations over the years have looked like this:
Iteration 1: Perform several different Google searches across the AI landscape and cherry-pick the information I like
Pain point: Time… just so much time.
Iteration 2: Use ChatGPT or other LLMs to search for articles and information about the latest news
Pain point: Time spent improved; accuracy and thoroughness did not
Iteration 3: Use Deep Research by OpenAI and web search to scrape the latest AI news across the web
Pain point: Great for a single newsletter category, but collecting multiple stories (let's say 5 per category) across multiple categories did not scale well and fell flat
Iteration 4: Use OpenAI Operator to conduct searches across the web
Pain point: Took forever and I quickly backed out of this use case
[We are here] Iteration 5: OpenAI introduced Deep Research in the OpenAI API, so I decided to give it a try for my newsletter.
What is a Deep Research agent and why does it matter?
Deep Research seems to be everywhere right now. OpenAI just added their deep research models to their APIs, Anthropic recently released an article called ‘How we built our multi-agent research system’, and now you are reading my article. Practically blazing! 🔥 (this wasn't written by an LLM, I promise. I just thought a fire emoji would be cool here)
Perhaps before we talk about Deep Research, we should first talk about what created the need for Deep Research in the first place.
LLMs like GPT-3.5-turbo were great at isolated tasks: summarizing text, answering questions, generating quick code snippets. However, they lacked access to external knowledge, and that limited their usefulness in real-world applications.
Then along came RAG, or Retrieval-Augmented Generation, and it was a major step forward. It gave us a way to bring external context into LLMs by creating vector embeddings of documents, storing them in a vector database, and retrieving relevant chunks based on the input prompt. That made the model’s output more accurate, grounded, and useful. It’s still a valuable pattern today, and I’ve seen it work well across many enterprise environments. As I write this in July of 2025, RAG is great for 90% of current enterprise business use cases and needs. RAG is especially effective when:
your task doesn’t require multiple steps or tool use
you have a pre-defined vector store used to find semantically similar context to augment your LLM output with
the window of your required context for users is not rapidly expanding across your enterprise
you are largely in control of your consumed and indexed data
Before we go further, it’s worth pointing out that Deep Research doesn’t replace RAG. It can actually use RAG as one of many tools in a larger, more adaptive workflow. These approaches aren’t in conflict. They complement each other.
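To make that pattern concrete, here is a minimal sketch of a RAG flow. The in-memory store, documents, and model names are illustrative stand-ins; a real deployment would use a proper vector database.
import numpy as np
from openai import OpenAI

client = OpenAI()

# Illustrative documents; in practice these come from your own indexed corpus
documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am-5pm EST, Monday through Friday.",
]

def embed(texts):
    # Create vector embeddings (embedding model name is an example)
    result = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in result.data])

doc_vectors = embed(documents)

def answer(question: str) -> str:
    # Retrieve the most semantically similar chunk via cosine similarity
    q = embed([question])[0]
    scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    context = documents[int(np.argmax(scores))]
    # Augment the prompt with the retrieved context before generation
    response = client.responses.create(
        model="gpt-4.1",
        input=f"Answer using this context:\n{context}\n\nQuestion: {question}",
    )
    return response.output_text

print(answer("How long do I have to return an item?"))
This works beautifully as long as the assumptions above hold.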
So, what happens when any one of those assumptions breaks down?
What if:
the user’s question requires multiple steps?
the right answer depends on something that wasn’t embedded ahead of time?
the context is scattered across dashboards, APIs, third-party tools, or documents you don’t directly manage?
the answer changes based on what you find along the way?
There is a wonderful white paper titled “Deep Research Agents: A Systematic Examination And Roadmap” (a great morning coffee paper) that defines Deep Research as:
AI agents powered by LLMs, integrating dynamic reasoning, adaptive planning, multi-iteration external data retrieval and tool use, and comprehensive analytical report generation for informational research tasks.
A typical Deep Research workflow starts with user input and flows into an intent clarification step. To help Deep Research agents adapt to changing user needs, there are generally three planning strategies used to guide input through dynamic workflows:
Planning-only: The agent builds a task plan directly from the initial prompt with no follow-up. Most DR agents today take this approach.
Intent-to-planning: The agent asks targeted questions to clarify the user’s goal before planning. This is the strategy used by OpenAI’s Deep Research API today.
Unified intent-planning: The agent drafts a plan first, then checks with the user to refine or adjust it before moving forward. It combines structure with flexibility.
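To make the intent clarification step concrete, here is a minimal sketch of how an application might implement it: a lighter, cheaper model asks follow-up questions before the research plan is built. The model choice and prompt are illustrative, not a built-in feature of the Deep Research API.
import textwrap
from openai import OpenAI

client = OpenAI()

def clarify_intent(user_query: str) -> str:
    # A lighter model asks clarifying questions before any research happens
    response = client.responses.create(
        model="gpt-4.1-mini",  # illustrative clarifier model
        input=textwrap.dedent(f"""
            The user wants a research report on: {user_query}
            Ask up to 3 clarifying questions about scope, time range, and output format.
        """),
    )
    return response.output_text

questions = clarify_intent("What changed in AI regulation recently?")
print(questions)  # collect the answers, then fold them into the research prompt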
After the intent clarification step, depending on the planning strategy, the workflow kicks off. This can be handled by a single agent or span multiple agents, and it often includes tools like vector or relational databases, online retrieval, data analytics, or code execution.
Some workflows involve a single agent running tools directly, while others involve multiple agents working together. Either way, the goal is to tap into a broader range of data sources, reason through what’s being found, adapt to new inputs as they come in, and use the right tools based on both the original intent and what gets uncovered along the way.
The output is a structured research response that includes citations, tool usage, and a clear view into how the result was assembled.
Background on OpenAI’s Deep Research API
“The Deep Research API enables you to automate complex research workflows that require reasoning, planning, and synthesis across real-world information. It is designed to take a high-level query and return a structured, citation-rich report by leveraging an agentic model capable of decomposing the task, performing web searches, and synthesizing results.” [OpenAI Deep Research Cookbook]
Unlike ChatGPT, where this process is abstracted away, the API provides direct programmatic access. When you send a request, the model autonomously plans sub-questions and uses tools like web search, code execution, and MCP (Model Context Protocol) servers to produce a final structured response.
You can access Deep Research via the responses endpoint using the following models:
o3-deep-research-2025-06-26: Optimized for in-depth synthesis and higher-quality output
o4-mini-deep-research-2025-06-26: Lightweight and faster, ideal for latency-sensitive use cases
Here is a general anatomy of the deep research API in use:
from openai import OpenAI

client = OpenAI()

system_message = """
You are a world-class AI researcher with expertise in machine learning, generative AI, agentic systems, multi-agent orchestration, large language models (LLMs), and real-world deployments of AI.
Your job is to find the most relevant and high-quality AI content published STRICTLY between......
"""
user_query = "SPECIFIC TASK: Find 3 of the most groundbreaking AI research papers, techniques, or scientific advances"
response = client.responses.create(
model="o3-deep-research-2025-06-26",
input=[
{
"role": "developer",
"content": [
{
"type": "input_text",
"text": system_message,
}
]
},
{
"role": "user",
"content": [
{
"type": "input_text",
"text": user_query,
}
]
}
],
reasoning={
"summary": "auto"
},
tools=[
{
"type": "web_search_preview"
},
{
"type": "code_interpreter",
"container": {
"type": "auto",
"file_ids": []
}
}
]
)
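When the response comes back, the report and its citations can be read off the final output item. The field names below follow the OpenAI Deep Research cookbook; treat this as a sketch of the shape rather than exhaustive handling:
# The last output item holds the final report
final_output = response.output[-1].content[0]

print(final_output.text)  # the structured, citation-rich report

# Inline citations gathered during the research
for annotation in final_output.annotations:
    print(f"- {annotation.title}: {annotation.url}")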
Insight #1: General usage findings that are good for everyone
There are two options for the “summary” parameter in the reasoning object
“auto”: gives you the best summary format available
“detailed”: gives a more detailed reasoning summary
Error Code: 400 - BadRequestError when using reasoning parameter
Your organization must be verified to generate reasoning summaries.
Please go to: https://platform.openai.com/settings/organization/general
and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.
Apparently, your organization must be verified to generate reasoning summaries, which help you understand the steps the Deep Research workflow executed.
Web Search Dependency
Requires tools=[{"type": "web_search_preview"}] to function
Insight #2: Polling Pattern wins over a Standard Callout Pattern
Standard Callout Pattern:
# Standard callout pattern
response = client.responses.create(
model="o4-mini-deep-research-2025-06-26",
# background=False (default)
input=[...],
tools=[{"type": "web_search_preview"}]
)
# Your code sits here waiting... and waiting... and waiting...
result = response.output[-1].content[0].text
What This Code Does:
This code sends the request to OpenAI's Deep Research API and waits synchronously for the entire research process to complete, blocking your thread so no other code can execute during the wait. It looks deceptively simple, appearing to be just like a regular API call, but it hides the complexity of the research process by handling all the polling, status checking, and progress tracking internally. When the research is finally finished, it returns the final result directly, making it seem like a straightforward API interaction despite the complex multi-step research workflow happening behind the scenes.
The Problem:
Extreme wait times - took upwards of 30 minutes to run
Complete blindness - no idea of progress, just sitting and waiting
Timeout hell - constant timeout errors requiring adjustment of timeout settings
No stability - timeout settings brought no reliability to the application
Desperate measures - had to use Python libraries just to track how long we were waiting
Hope-based development - just hoping it was doing what it needed to do
Can't run multiple requests simultaneously (even with broken out async jobs)
All-or-nothing failure - if it fails after 25 minutes, you lose everything
Unpredictable performance - same request might take 5 minutes or 30 minutes
User experience disaster - users think your application is broken
Development nightmare - impossible to debug or optimize
Resource waste - your application is essentially frozen during execution
No cancellation - once started, you're committed to waiting it out
Real-World Impact:
Monolithic prompts - 30+ minutes of complete silence
Broken out jobs - still had to run synchronously, same problems
Timeout setting chaos - constantly adjusting from 30s → 300s → 600s → ???
Development paralysis - couldn't test effectively with such long wait times
Polling Pattern
import time

# Returns immediately with a response ID
response = client.responses.create(
    model="o4-mini-deep-research-2025-06-26",
    background=True,  # This changes everything
    input=[...],
    tools=[{"type": "web_search_preview"}]
)

# Poll for completion
response_id = response.id
while True:
    status_response = client.responses.retrieve(response_id)
    if status_response.status == "completed":
        result = status_response.output[-1].content[0].text
        break
    elif status_response.status == "failed":
        # Handle the error gracefully
        break
    else:
        print(f"Still working... Status: {status_response.status}")
        time.sleep(10)  # Wait 10 seconds before polling again
What This Code Does:
This code sends the request to OpenAI's Deep Research API with background=True, which immediately returns a response ID instead of blocking. It then enters a polling loop that checks the status every 10 seconds, giving you real-time visibility into the research progress. Unlike the synchronous approach, your application can stay responsive and handle multiple requests simultaneously once the loop runs inside an async task (more on that below). The polling loop continues until the research is complete, at which point it retrieves the final result, or gracefully handles any failures that occur during the process. This pattern exposes the complexity but gives you complete control over the user experience.
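To actually get that concurrency, each category in my setup runs as an async task, and the polling loop lives in a small helper along these lines. This is a sketch that assumes the SDK's AsyncOpenAI client; the helper name is my own:
import asyncio
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def poll_until_done(response_id: str, interval: int = 10):
    # Poll a background Deep Research job without blocking other tasks
    while True:
        status_response = await async_client.responses.retrieve(response_id)
        if status_response.status not in ("queued", "in_progress"):
            return status_response
        print(f"Still working... Status: {status_response.status}")
        await asyncio.sleep(interval)  # yields control so parallel jobs keep running
Each category kicks off its own background request and awaits this helper, which is what makes the 8-way parallelism in Insight #3 possible.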
The Advantages:
Immediate response - get response ID instantly, no initial waiting
Real-time feedback - show progress updates to users ("Still working...")
Responsive UI - application stays interactive during research
Cancellable operations - can stop requests if needed
Status visibility - always know what's happening with each request
Debugging friendly - can log and monitor each step of the process
Timeout control - no more timeout hell, just manage polling intervals
Predictable behavior - consistent response times and error handling
Real-World Implementation:
10-second polling intervals - sweet spot between responsiveness and API courtesy
Status monitoring - track "in_progress" → "completed" → "failed" transitions
Progress tracking - use Rich library to show spinning progress bars
Real-time feedback - users see something is happening instead of wondering if it's broken
Development speed - can test and iterate much faster with immediate feedback
Performance Impact:
Non-blocking execution - your application can do other work while waiting
Predictable timing - each request takes 2-5 minutes consistently
Resource efficiency - no blocked threads waiting indefinitely
Better user experience - users get feedback instead of silence
The polling pattern transforms Deep Research from a frustrating, black-box experience into a transparent, controllable process. The extra complexity is worth it for the dramatic improvement in visibility and user experience.
Insight #3: Architectural Differences - Monolithic vs Segmented Async prompting
Monolithic Approach (What I Started With)
# One massive prompt trying to do everything
GIANT_PROMPT = """
Find AI news for the past week in these 8 categories:
1. Product releases (find 3 items)
2. Breakthrough research (find 3 items)
3. Real-world use cases (find 3 items)
4. Agentic AI (find 3 items)
5. Thought leadership (find 3 items)
6. AI safety (find 3 items)
7. Industry investment (find 3 items)
8. Regulatory policy (find 3 items)
For each category, format as follows...
[3000+ words of detailed instructions]
"""
# Single API call for everything
response = client.responses.create(
    model="o4-mini-deep-research-2025-06-26",
    background=True,
    input=[{"role": "user", "content": [{"type": "input_text", "text": GIANT_PROMPT}]}],
    tools=[{"type": "web_search_preview"}]
)
What This Approach Does:
The monolithic approach attempts to handle all 8 newsletter categories in a single Deep Research API call using one massive prompt. This seems efficient in theory - just one API call, one response to handle, and all your content delivered together. The prompt contains detailed instructions for each category, formatting requirements, and date filtering rules, creating a comprehensive specification that should theoretically produce a complete newsletter in one go.
The Problems:
15-30 minute wait times - single request handling 8 complex research tasks
All-or-nothing failure - if any category fails, you lose everything
No progress visibility - complete black box until it's done (or fails)
Impossible to debug - can't tell which category is causing issues
Generic results - AI tries to balance 8 different tasks, excels at none
No specialization - same instructions applied to vastly different research needs
Memory limitations - AI struggles to maintain context across 8 different domains
Retry nightmare - failure means restarting all 8 categories from scratch
Segmented Async Approach (What I Built)
# base_prompt.py - Shared foundation
BASE_PROMPT = """
You are a world-class AI researcher...
🎯 RESEARCH STRATEGY - TWO-PHASE APPROACH:
PHASE 1 - BROAD EXPLORATION (Start Wide):...
PHASE 2 - FOCUSED INVESTIGATION (Then Narrow):...
🚨 ABSOLUTE DATE REQUIREMENTS - NO EXCEPTIONS:...
[Shared 2000-word foundation prompt]
"""
# product_releases.py
from base_prompt import BASE_PROMPT, DATE_RANGE
CATEGORY_PROMPT = f"""
{BASE_PROMPT}
🎯 SPECIALIZED TASK - PRODUCT RELEASES:
You are the PRODUCT LAUNCH SPECIALIST...
[Focused 400-word specialization on top of base]
"""
# newsletter.py - Main orchestrator
import asyncio

from product_releases import fetch_product_releases
from breakthrough_research import fetch_breakthrough_research
# ... other imports

async def run_parallel_newsletter():
    # Launch all 8 categories simultaneously
    tasks = [
        fetch_product_releases(),
        fetch_breakthrough_research(),
        fetch_real_world_use_cases(),
        fetch_agentic_ai(),
        fetch_thought_leadership(),
        fetch_ai_safety(),
        fetch_industry_investment(),
        fetch_regulatory_policy()
    ]
    # Run all in parallel; each category saves its results as it completes
    results = await asyncio.gather(*tasks)
    return results

if __name__ == "__main__":
    asyncio.run(run_parallel_newsletter())
What This Approach Does:
The segmented approach uses a shared base_prompt.py containing the foundational research methodology, date filtering, and formatting rules that every category needs. Each of the 8 specialized files imports this base prompt and adds its own expert-level specialization on top, creating a layered prompting system. The main orchestrator imports all category functions and launches them simultaneously using Python's async functionality, with each category saving its results immediately upon completion. This creates 8 parallel research workflows that share common methodology but have distinct expertise.
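Putting the layered prompts together with the polling pattern from Insight #2, a single category module ends up looking roughly like this. It is a simplified sketch: the file layout matches the description above, but the helper names, paths, and error handling are illustrative.
# product_releases.py - one category, simplified
from openai import AsyncOpenAI
from base_prompt import BASE_PROMPT
from polling import poll_until_done  # the async helper from Insight #2 (illustrative module name)

client = AsyncOpenAI()

CATEGORY_PROMPT = f"""
{BASE_PROMPT}

🎯 SPECIALIZED TASK - PRODUCT RELEASES:
You are the PRODUCT LAUNCH SPECIALIST...
"""

async def fetch_product_releases() -> str:
    # Kick off the research in the background; returns immediately with an ID
    response = await client.responses.create(
        model="o4-mini-deep-research-2025-06-26",
        background=True,
        input=[{"role": "user", "content": [{"type": "input_text", "text": CATEGORY_PROMPT}]}],
        tools=[{"type": "web_search_preview"}],
    )
    final = await poll_until_done(response.id)
    report = final.output[-1].content[0].text
    # Save immediately so a failure in another category never costs this one
    with open("output/product_releases.md", "w") as f:
        f.write(report)
    return report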
The Advantages:
2-5 minute completion per category (vs 30+ minutes for monolithic)
Parallel execution - all 8 categories run simultaneously
Real-time results - see files being created as categories complete
Layered expertise - shared foundation + specialized focus per category
Fault isolation - if one category fails, the other 7 continue working
Partial success - get 7 working categories even if 1 fails
Easy debugging - know exactly which category failed and why
Iterative improvement - can refine individual categories without affecting others
Consistent methodology - shared base ensures uniform quality across categories
DRY principle - don't repeat the same base instructions 8 times
Architecture Benefits:
Modular design - each category is self-contained and testable
Shared foundation - common methodology across all categories
Scalable - can add/remove categories without affecting others
Maintainable - can update base methodology or individual specializations
Professional structure - organized codebase vs single giant prompt file
Development friendly - can test individual categories quickly
The segmented approach transforms Deep Research from an unreliable experiment into a production-ready system with proper software engineering principles.
Insight #4: Applying Prompting Principles for Research Tasks
I found Anthropic's engineering post about their multi-agent research system while trying to figure out why my Deep Research workflows were showing odd and inconsistent behavior. This quote hit me:
"Multi-agent systems have key differences from single-agent systems, including a rapid growth in coordination complexity. Early agents made errors like spawning 50 subagents for simple queries, scouring the web endlessly for nonexistent sources, and distracting each other with excessive updates. Since each agent is steered by a prompt, prompt engineering was our primary lever for improving these behaviors."
Anthropic laid out the following 8 principles that they learned for prompting research agents that I was able to apply to my own:
Think like your agents.
Simulate prompts and tools to surface failure modes and understand their effects. Effective prompting relies on developing an accurate mental model of the agent.
Where I applied it: Added structured thinking guidance with before/after search evaluation steps. The AI now thinks through search strategy and evaluates results instead of making random queries.
Teach the orchestrator how to delegate.
Lead agents must break tasks into clear, scoped assignments. Sub-agents need goals, outputs, tool guidance, and task boundaries.
Where I applied it: Created specialized roles for each category - "RESEARCH BREAKTHROUGH SPECIALIST" vs "PRODUCT LAUNCH SPECIALIST" - with clear boundaries about what makes each category distinct from others.
Scale effort to query complexity.
Agents struggle to judge appropriate effort for different tasks, and embedded scaling rules help the lead agent allocate resources efficiently.
Where I applied it: Set explicit search quotas: simple product releases get 3-5 searches, complex regulatory policy gets 6-8 searches, research breakthroughs get 5-7 searches for verification (see the sketch at the end of this list).
Tool design and selection are critical.
Agents followed heuristics: review all tools, match intent, and favor specialized over generic options.
Where I applied it: Added strategic search patterns - start with broad queries like "AI safety July 2025", then narrow to specific queries like "OpenAI announcement July 2025" based on findings.
Let agents improve themselves.
Claude 4 was used to troubleshoot agent failures, recommend prompt and tool improvements, and power a tool-testing agent that reduced task completion time by 40%.
Where I applied it: Didn't implement this yet - identified as future enhancement where failed searches could generate improved prompting strategies.
Start wide, then narrow down.
Agents were prompted to start with broad queries, assess results, and narrow focus, mirroring how expert researchers explore before diving deep.
Where I applied it: Implemented a two-phase search strategy:
Phase 1 broad exploration to understand the landscape
Phase 2 focused investigation drilling into promising leads.
Guide the thinking process.
Extended thinking mode helped agents plan, reason, and adapt by improving tool selection, task scoping, and overall performance.
Where I applied it: Extended thinking is unfortunately not something I could control as part of my workflow, but it would be a great API feature.
Parallel tool calling transforms speed and performance.
Parallelizing subagents and tool use reduced research time by up to 90 percent, enabling faster, broader exploration across complex tasks.
Where I applied it: Built async system running all 8 categories simultaneously instead of sequentially - this was the biggest performance gain, cutting total time from 30+ minutes to 10 minutes.
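For principle 3 (scaling effort to query complexity), the quotas live directly in the prompt text. The fragment below is a reconstruction of the idea rather than a copy of my prompt files:
# Illustrative: per-category search budgets embedded in the prompt
SEARCH_BUDGETS = {
    "product_releases": "Use 3-5 targeted searches.",
    "breakthrough_research": "Use 5-7 searches and verify claims against the original paper.",
    "regulatory_policy": "Use 6-8 searches and cross-check multiple jurisdictions.",
}

CATEGORY_PROMPT = f"""
{BASE_PROMPT}

🎯 SEARCH BUDGET:
{SEARCH_BUDGETS['product_releases']}
Stop searching once the budget is met and synthesize what you have found.
"""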
Insight #5: Enforcing Trust in AI Research Results
The biggest challenge with Deep Research isn't finding information, it's ensuring you can trust what gets returned. AI agents will happily grab anything they find on a topic and present it as authoritative fact.
The Trust Problem
Deep Research agents have no inherent sense of source credibility. They'll treat a random blog post the same as a peer-reviewed paper, or worse - they'll find a summary of a summary and present it as the original source.
Anthropic discovered this exact issue:
"Human testers noticed that our early agents consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources like academic PDFs or personal blogs. Adding source quality heuristics to our prompts helped resolve this issue."
My Comprehensive Trust Heuristics
Here's the defensive prompt I built to enforce trust:
🚨 SOURCE TRUST HEURISTICS - REJECT IMMEDIATELY:
SEO SPAM DETECTION:
- REJECT listicles ("Top 10...", "Best AI tools...", "X things you need to know...")
- REJECT content farms (sites that publish 50+ articles/day)
- REJECT AI-generated summaries or newsletters about other sources
- REJECT affiliate marketing content disguised as reviews
PRIMARY SOURCE ENFORCEMENT:
- ONLY official company blogs, research papers, press releases
- AVOID secondary sources like Medium/Substack weekly roundups
- AVOID news articles ABOUT announcements (get the actual announcement)
- AVOID social media posts (unless from official company accounts)
AUTHORITY VERIFICATION:
- Prefer .edu, .gov, .org over .com sites
- Verify author credentials (real researchers, official company employees)
- Check publication venue (peer-reviewed journals > random blogs)
- Cross-reference claims with multiple authoritative sources
RECENCY VALIDATION:
- Verify publication date matches claimed timeframe
- REJECT any content from earlier than specified date range
- Check for "updated" vs "published" dates (use published)
- Flag suspicious date claims
CONTENT QUALITY CHECKS:
- REJECT articles without named authors
- REJECT content without clear publication dates
- REJECT sources that don't provide contact information
- PREFER sources with editorial standards and correction policies
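One way to make sure every category sees these heuristics is to keep them next to the shared base prompt and fold them into each category's system message. A rough sketch (the names and file layout here are illustrative, not necessarily how my repo is organized):
# trust_heuristics.py - shared by all 8 categories (illustrative layout)
TRUST_HEURISTICS = """
🚨 SOURCE TRUST HEURISTICS - REJECT IMMEDIATELY:
... (the full block above) ...
"""

def build_system_message(base_prompt: str, category_prompt: str) -> str:
    # Every category inherits the same trust rules; none can opt out
    return f"{base_prompt}\n\n{TRUST_HEURISTICS}\n\n{category_prompt}"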
Why This Builds Trust
Eliminates information decay - no more telephone game of facts getting distorted
Ensures accountability - readers can verify claims against the original source
Prevents hallucination - agents can't make up details from secondary interpretations
Maintains credibility - your research is only as good as your sources
What This Prevents
Without source heuristics, agents will:
Grab SEO-optimized listicles over authoritative sources
Present opinions as facts
Cite summaries as if they were original research
Mix reliable and unreliable sources without distinction
The Result
By enforcing trust at the source level, I transformed Deep Research from "AI that finds stuff" to "AI that finds credible information." The agents still do all the heavy lifting of searching and synthesizing, but now I can place greater trust in what is returned because I've constrained them to trustworthy sources.