Another Coding Blog

Another Weekly AI Newsletter: Issue 72

Taylor Ortiz — Sat, 16 May 2026 19:39:53 GMT

Anthropic shipped into legal, small business, healthcare, and AWS in one week.

Claude for the legal industry launched with 12 practice-area plugins. Contract review, M&A diligence, and regulatory compliance out of the box. 87% of general counsel now use generative AI, up from 44% the prior year.
Claude for Small Business connected to QuickBooks, PayPal, and HubSpot. 15 ready-to-run workflows covering invoicing, CRM, document signing via DocuSign and Canva.
Anthropic committed $200M to the Gates Foundation. Grants, Claude credits, and technical support for vaccine screening, disease forecasting, K-12 education, and agricultural tools.
Claude Platform went GA on AWS. First cloud provider to offer Anthropic’s native platform with unified billing and same-day feature parity with the native API.
Every subscriber now gets separate Agent SDK credits. Pro gets $20/month, Max gets up to $200. Unlike OpenAI, which bundles Codex and third-party usage into normal plan limits, Anthropic is subsidizing the developer ecosystem with a separate bucket.
Claude Code limits increased another 50% through July. On top of the doubling from the week before.
Ramp and Axios independently confirmed Anthropic overtook OpenAI in workplace adoption. Though VentureBeat identified three structural threats to that lead.
The thread: Anthropic is trying to become the default for every vertical at once. Legal, healthcare, small business, enterprise, developer tooling. Whether that’s a platform strategy or overextension depends on execution.

OpenAI launched a deployment company and put Codex on your phone.

The OpenAI Deployment Company launched with 150 engineers on day one. 19 investment firms and consultancies, majority-owned by OpenAI, with Tomoro acquired to provide Forward Deployed Engineers. Valued at $14B.
ChatGPT connected to bank accounts. Plaid integration for Pro users in the US, with an Intuit partnership for actionable financial steps.
Codex shipped to iOS and Android. Mobile preview lets users start, review, and approve coding tasks while agents run on a separate device.
OpenAI disclosed a supply chain compromise. A TanStack npm package attack exposed code-signing certificates for macOS, Windows, iOS, and Android apps. Full certificate rotation required.
The thread: Both OpenAI and Anthropic launched enterprise services arms within a week of each other. The model API is becoming a commodity. The margin is shifting to who can get it deployed inside your organization first.

Companies are cutting workers at record revenue to fund AI.

Cisco cut 4,000 jobs while reporting record quarterly revenue. Stock rose 15% on surging AI orders.
GitLab announced sweeping restructuring to fund agent development. Cut headcount, flattened management, reorganized R&D into 60 smaller teams, and retired its CREDIT values framework.
GM laid off hundreds of IT workers and began hiring AI replacements. Explicitly seeking stronger AI skills.
Samsung faces a looming strike over AI. Global AI boom driving deep internal divisions between management and workers.
The thread: Revenue is up at all three companies. The functions going are IT operations, developer tooling management, and corporate overhead that was previously considered secure.

Grok Build, Claude Code, and Cursor all shipped agentic upgrades. LangChain shipped nine products to support them.

xAI launched Grok Build in beta. Terminal-native CLI with up to 8 parallel agents, Grok 4.3 beta, 2M token context. Priced at $299/month (introductory $99). SuperGrok Heavy only.
Claude Code limits increased 50%. Through July 13, on top of the doubling from the prior week. Plus separate Agent SDK credits.
Cursor shipped /orchestrate. Planner/worker/verifier loops that re-spawn on failure. Parallel subagents. Always-on CI agents.
LangChain shipped nine products at Interrupt 2026. SmithDB for agent traces, LLM Gateway for centralized control, Sandboxes GA for isolated testing, Deep Agents 0.6 for long-running workflows, and the Agent Development Lifecycle framework.
The thread: Grok Build at $299/month, Claude Code with separate SDK credits, Cursor as a standalone IDE. Three very different bets on how developers will pay for agentic coding. LangChain is betting the real money is in the infrastructure underneath all of them.

⭐ Featured: Thinking Machines built an AI that listens while it talks.

Every AI conversation today works the same way: you talk, the model waits, the model responds. Thinking Machines published research on “interaction models” that throw out that assumption entirely.

Their model processes continuous 200ms micro-turns of audio, video, and text simultaneously. There are no turn boundaries. The model listens while speaking, interrupts when it sees something wrong in your code, reacts to visual cues without being prompted, and runs background reasoning while maintaining the conversation.

The architecture splits into two parts: an interaction model that maintains real-time presence (always perceiving, always ready to respond), and a background model that handles deeper reasoning and tool use asynchronously. When the background model finishes a task, the interaction model weaves results into the conversation at an appropriate moment instead of interrupting.

The benchmarks are striking. On FD-bench (the standard interaction quality benchmark), their model scored 77.8 versus 46.8 for GPT-Realtime-2. On responsiveness, they hit 0.40 second turn-taking latency versus 1.18 for GPT-Realtime-2. They also created three new benchmarks (TimeSpeak, CueSpeak, visual proactivity) that no existing model can meaningfully perform. GPT-Realtime-2 scores near zero on all of them.

The model is a 276B parameter MoE with 12B active. It uses encoder-free early fusion, meaning no separate Whisper or TTS models. Audio comes in as raw dMel signals, video as 40x40 patches. Everything is co-trained from scratch.

Their argument comes from Rich Sutton’s “bitter lesson”: if interactivity is bolted on through harnesses (voice activity detection, turn-taking logic), it can never scale with intelligence. If it’s native to the model, scaling makes the model both smarter and a better collaborator.

What to watch for: This is a research preview from a startup (276B parameters, limited availability). But the design principle matters: current real-time systems from OpenAI and Google use harnesses to fake interactivity on top of turn-based models. Thinking Machines is arguing that’s a dead end. If they’re right, every voice agent shipping today is architecturally temporary.

🎙️ Worth a Listen

IBM AI Engineer Bri Kopecki on why agents without infrastructure are “brilliant goldfish.”

The problem: Most AI agents have no memory, no access control, no audit trail. Every conversation starts from scratch.
The six-layer stack: Scheduler (who goes first), memory manager (short/long/episodic), tool manager (sandboxed execution), identity manager (tokens and permissions), observability (full decision tracing), and guardrails/governance (human-in-the-loop for high-stakes decisions).
Why it matters now: This maps directly to what LangChain shipped this week (SmithDB for traces, LLM Gateway for access control, Sandboxes for tool isolation) and explains why Cursor, Anthropic, and OpenAI are all building orchestration layers.

Quick Hits

Cerebras IPO’d at $5.55B, shares jumped 89% on day one | TechCrunch — Near $100B market cap on debut. The AI chip premium is real.
Medicare created a payment model built for AI-assisted services | TechCrunch — The largest US payer quietly opened the door for clinical AI reimbursement. This will pull deployment faster than any product launch.
Musk v. Altman trial went to the jury | MIT Tech Review — Closing arguments accused Musk of selective amnesia and Altman of lying about the nonprofit mission.
ArXiv banned researchers for AI-generated papers | The Verge — Academic publishing’s authentication problem now has teeth, but detection is still losing the arms race.
Meta embedded AI in Threads and won’t let users block it | The Verge — Captive distribution at 3B+ users, no opt-out.
OpenAI Parameter Golf results: 1,000+ participants, agents everywhere | OpenAI — An ML challenge where the vast majority of submitters used coding agents. OpenAI built a Codex-based triage bot to handle the submission volume.
Claude Mythos cracked Apple’s M5 memory security in five days | Tom’s Hardware — First privilege escalation exploit on M5. Apple spent half a decade building Memory Integrity Enforcement. Standard user to root access.
Nvidia committed $40B in equity AI investments in 2026 | TechCrunch — Not just selling chips. Acquiring stakes in the companies that consume the most of them.
Anthropic published “2028: Two scenarios for global AI leadership” | Anthropic — A policy paper on US-China AI competition. Anthropic is writing geopolitics now.
YouTube expanding AI deepfake detection to all adult users | The Verge — The detection side is scaling up.
Google updated spam rules to include AI manipulation attempts | The Verge — SEO for the age of AI-generated content.

Multi-Agent Account Planning That Learns Across Deals

Taylor Ortiz — Fri, 15 May 2026 15:33:51 GMT

Intro

Anthropic shipped multi-agent orchestration in Managed Agents on May 6th. An agent can be configured as a coordinator with a roster of other agents it can delegate to, and the platform handles fan-out, child-thread lifecycle, parallel execution, and per-thread observability.

Anthropic also shipped a management console. Every agent, session, child thread, and memory write is browsable, with full transcripts, tool calls, and version history inspectable on click. That console shaped how I built the system, because the logging I would have written myself was already there.

The use case I built is account planning in B2B SaaS sales. The vendor is a fictional company, Yardstick AI, selling an AI evaluation platform. The prospect is Vercel, a real company with a public footprint rich enough to give the agents something genuine to research.

The system has fifteen agents organized into a five-phase pre-meeting orchestration plus a post-meeting debrief loop. The pre-meeting flow has two genuine decision steps where the coordinator chooses what runs next based on what just came back, not a fixed sequence.

It uses MCP servers (Notion, Slack), the Anthropic vault for credentials, two memory stores (a playbook and a decision-records corpus), custom HTTP tools for a mock CRM and enrichment service, and the built-in web search and fetch tools.

Most of the system’s analytical work happens in the layer of decision records that the agents read from and write into. The records get captured two ways.

Implicitly, the system infers decisions from CRM record changes, activity logs, and other signals that move without anyone narrating them.

Explicitly, after each meeting, the system uses the full account plan plus the surrounding events (calendar entries, CRM stage moves, recent activity) to compose a curated set of questions for the rep. The questions are shaped by what the system already knows about the account, so they target the specific decisions most likely to produce useful data instead of asking generic “how did it go” prompts.

Whichever way a record gets created, it lives in a shared memory store that the next account’s run can retrieve and reason from. That is the difference between a system that gives you one prep brief and a system that gets better at giving you prep briefs as it accumulates evidence.

This post documents what I built, what worked, what did not, and what the costs and constraints actually look like once you push past the basic demo.

Below is a capture of the final product:

What you’ll learn

This post walks through what I learned building a multi-agent system in Anthropic Managed Agents. The official documentation covers the basics. This post covers what comes after that: how the primitive holds up when you push it against a real, multi-source, multi-phase problem. By the end you should have a clearer sense of when this architecture is worth using and what it takes to make it work.

Concretely:

What multi-agent really is inside the platform. The shape of the architecture, where the limits actually sit, and what the docs do not yet spell out.
How the system remembers things during a run versus across runs. Two different kinds of memory live side by side, and a real system has to be deliberate about where each finding goes.
Why use multi-agent over a workflow. When the coordinator’s runtime decisions justify the complexity, and when they do not.
How decision records make the system compound. A structured corpus of recommendations and their resulting decisions turns each run into evidence the next run can use.
The agent harness. Everything you build around the platform primitives to make the system work for your use case: the MCP servers you connect, the record schemas your corpus enforces, the system prompts that define each agent’s job, the routing logic the coordinator follows, the briefings it hands to each agent.
Async surfaces via MCP. How Slack becomes part of the system through MCP, so the rep can capture decisions in-place after a meeting without a custom bot.
The distillation problem. Why the system’s raw output is not usable on its own, and what has to happen to make it useful to a human in thirty minutes.
Cost and observability. Per-thread spend, total cost for a full run, and what the Managed Agents console gives you for free.
Honest findings. Pitfalls a builder should expect to hit on their first run.
When this is the right tool, and when it isn’t. What kinds of problems multi-agent orchestration fits, and what kinds belong with a simpler architecture.

Section 1: The work of account planning

An account executive working a B2B SaaS deal is doing one job continuously and several others on top of it. The continuous job is synthesis. At any moment in a pursuit, an AE is holding context across half a dozen sources: their own notes from past calls, the CRM record with its stages and activity log, public signals (product launches, hires, press), conference encounters and hallway intel, backchannel from people who used to work there, win and loss patterns from similar accounts, and their own company’s internal playbook. None of these sources are formatted alike, refresh on the same cadence, or answer the same questions week to week.

The job sits on top of a rhythm of meetings. Before each meeting, the rep does pre-meeting prep. After each meeting, the rep does post-meeting capture. Between meetings, follow-up. The cadence is continuous, across fifteen to thirty active accounts at any given time. Even the most disciplined AE admits the synthesis happens in their head more than on paper, and the capture happens only when there is slack to capture.

What makes this work a candidate for multi-agent orchestration is the shape of the synthesis problem: the sources decompose naturally by role. Reading internal Notion notes, researching the company on the public web, mapping the org chart, and synthesizing all of it against a playbook are four different jobs. Each role wants a different tool surface, and each role’s output is most useful when it is separate from the others until the synthesis step. Running them in parallel saves wall-clock time, but the more interesting property is that each role can be a focused agent with a small system prompt and a tight tool surface, rather than one generalist agent trying to be five things at once.

The 30-minute pre-meeting slice is the moment in this rhythm where multi-agent orchestration is most legible. The rep has a calendar event coming up. They want a brief that consolidates what is knowable from everywhere into something they can read in five minutes, prepare around in twenty, and act on in the meeting itself. That is the moment this post centers on, but the architecture supports the broader cadence around it.

Section 2: What multi-agent in Managed Agents actually is

Most coverage of “agents” uses the term to cover everything from a single Claude call to a fully autonomous AI team that plans its own work. Anthropic’s multi-agent feature is neither extreme. It is a specific pattern with specific constraints, and the constraints are worth knowing before you build against it.

The shape: coordinator with a roster

One agent is the coordinator. Its definition includes a list of other agents it is allowed to delegate to. That list is called the roster. A few specific limits:

The roster can hold up to 20 agents.
The coordinator can call multiple copies of any agent on the roster.
A session can have up to 25 active threads running at once.
Specialists cannot delegate to other specialists. The architecture is flat, not nested (Anthropic’s docs phrase it as “depth > 1 is ignored”).

If you came in expecting agents that delegate to agents that delegate to agents, the spec corrects you on page one. What you get is a flat fan-out from a single coordinator. For most real systems this is the right tradeoff.

Threads: how the system stays organized

A thread is a separate, isolated conversation that belongs to one agent. Each thread has its own history and tools. Threads don’t share anything with each other, even though they all run inside the same session.

Two kinds:

The primary thread is the coordinator’s own thread. It also doubles as the activity feed for the whole session.
A child thread is created when the coordinator delegates to a specialist. The platform copies the session’s tools and credentials onto that thread, and the specialist’s work runs there.

When the coordinator delegates to multiple specialists in the same turn, the child threads run in parallel. The coordinator waits for each reply before deciding what to do next. You don’t write any of the glue code for this. The decision-making that would normally live in a script lives inside the coordinator’s prompt.

Thread lifecycle

A thread moves through three states:

Running: the specialist is actively working.
Idle: the specialist has finished but the thread is still alive. It counts against the 25-thread cap.
Archived: you have told the platform you are done with the thread. The slot is freed.

For most builds, the 25-thread cap is generous enough that you never think about lifecycle. Systems that lean hard on parallel work have to treat archiving as part of the orchestration.

Idle threads stay alive, which enables follow-ups

Because an idle thread is not gone, the coordinator can send a follow-up message to a specialist it called earlier. The specialist keeps its full context from before. That means the architecture supports more than one round of back-and-forth per specialist, not just one-shot delegation. I did not use this in the build, but in retrospect there are several places it would have helped.

Two kinds of memory

The system has two layers of memory that work on different time scales:

Persistent threads keep a specialist’s context alive within a session. The moment the session ends, the threads are gone.
Memory stores persist across sessions. They are objects shared across the whole workspace, mounted onto a session when it starts. Anything written into one stays available to the next run that mounts the same store.

A real multi-agent build needs both.

Designing the split

The design split lives in two questions:

Within a session: which specialists do you keep alive for a follow-up, and which do you fire once and let go?
Across sessions: which findings deserve to be promoted into a memory store, and which can evaporate when the session ends?

The platform gives you the building blocks for both. It does not decide which findings belong where. Get that split wrong and you pay either way:

Throw away thread context too early, and you re-brief the specialist on every follow-up.
Fail to promote findings into a store, and the next session starts cold on everything you already learned.

Our build leans heavily on the cross-session side. Most of the analytical work in this system comes from the decision-records corpus, which is the through-line for the rest of this post.

Section 3: The agent architecture

The pre-meeting orchestration uses thirteen agents: one lead orchestrator plus twelve specialists in its roster. The post-meeting debrief loop adds two more agents that sit outside the coordinator entirely. Fifteen across the system.

Pre-meeting work is a tightly scoped synthesis problem that benefits from a coordinator. Post-meeting work is a slower, human-paced loop that does not benefit from coordination at all, just two single-purpose agents that read and write a shared corpus.

The pre-meeting run breaks into five phases, sequential at the coordinator level and parallel within. The coordinator narrates each phase boundary as it runs, which makes its reasoning visible and forces the model into a structured plan rather than letting it improvise.

Phase 1: gather context and pull prior records

Five specialists fan out concurrently:

meeting-context: reads internal Notion notes through Notion MCP.
external-researcher: pulls public signals from the web.
stakeholder-analyst: maps decision-makers via a mock enrichment service.
engagement-readiness: hits a mock CRM for outreach history.
decision-retriever: runs against the shared decision-records corpus and pulls prior decision records from past accounts that match the current account’s shape (by attribute overlap: industry, competitor present, champion profile, procurement complexity, and so on).

Phase 2: conditional topic education

The coordinator inspects what Phase 1 surfaced and picks two to four technical topics worth briefing the rep on before the meeting. For the Vercel run, those topics included cross-provider eval methodology, agent eval, AI observability, and eval-driven CI.

topic-educator: runs against the curated topic list and returns a primer per topic, each ending with smart questions the rep can ask in the room.

If the account does not warrant it, the coordinator skips Phase 2 entirely.

Phase 3: synthesis

opportunity-risk: receives everything Phase 1 and Phase 2 produced, mounts the read-only Yardstick playbook from a memory store, reads the prior decision records the retriever pulled in Phase 1, and writes the structured pursuit plan. The plan covers ICP fit, buying triggers, stakeholder map and sequencing, first-meeting hypothesis, recommended plays, and disqualifiers.

Phase 3.5: next-best-action selection

After the synthesis is in, the coordinator does not jump straight to recording. It asks one more specialist, the chooser, to decide which concrete recommendations are warranted for this specific account.

next-best-action-chooser: reads the synthesis plus the prior decision records the retriever pulled in Phase 1, decides which of three specialized recommenders to invoke, and writes a focused brief for each. The chooser can also skip a recommender, with a reason. A different account with different synthesis and different prior records produces a different plan.

The three recommenders available to the chooser:

stakeholder-recommender: sequencing or lead-play.
pricing-recommender: pricing strategy.
competitive-recommender: competitive positioning or risk mitigation.

Phase 4: parallel recommendation generation

The coordinator dispatches whichever recommenders the chooser named. They run in parallel. Each one produces a single Recommendation Record (RR) as a markdown draft with strict YAML frontmatter and a cited_records block listing the prior decision records whose outcomes informed this recommendation. The recommenders hand drafts back to the coordinator; they do not write to the corpus themselves.

Phase 5: decision recording

decision-recorder: receives the RR drafts, validates each one against the schema, checks every cited prior decision record exists in the corpus, writes the validated records to /mnt/memory/yardstick-decisions/, and updates the corpus index.

Splitting content generation (the recommenders) from persistence (the recorder) keeps each role focused.

Post-meeting: the debrief loop

That accounts for the thirteen pre-meeting agents. The remaining two run on the post-meeting side:

debrief-asker: reads the next-best-action RRs the pre-meeting run produced, picks the open questions still unresolved, formats them as a curated set, and posts them into a Slack channel through the Slack MCP server. The rep replies in the thread on their own time.
debrief-synthesizer: once there are replies, reads the Slack thread, parses the rep’s answers, and writes Decision Records into the corpus with the linked_rr field pointing back to the originating RRs.

Neither sits in the coordinator’s roster because neither runs synchronously with the pre-meeting flow. They run on a human-paced timescale, possibly hours or days later. Coordinating them through the same session would require keeping a session open across days or weeks, which the platform does not support. The cleaner shape is two single-purpose agents that share the corpus as their interaction substrate.

Section 4: What the platform gives you for observability

Most multi-agent demos require you to build your own logging before you can debug them. Managed Agents takes the opposite stance. Anthropic ships a management console that turns every agent, every session, every child thread, and every memory write into a click-through artifact you can inspect without writing any instrumentation.

The console is structured around the platform’s primary objects. The Agents tab lists every agent you have created with its system prompt, declared MCP servers, custom tools, and toolsets all inspectable on click. Versioning is built in. The Sessions tab shows every session with the coordinator’s primary thread and every child thread enumerated, status per thread, full transcripts including the model’s reasoning content, and every tool call shown inline with its inputs and outputs. The Memory Stores tab tracks version history so any write to the decision-records corpus is auditable end to end.

At runtime, the same data is available programmatically through the events API. The session-level stream gives you a condensed feed across the whole session. Per-thread streams give you raw event sequences for any specialist. The three events that matter for fan-out observability are session.thread_created, agent.thread_message_received, and session.thread_status_idle. Stringing those together gives you the fan-out timeline of the whole run without writing a single instrumentation line.

Cost data is similarly structured. Every event carries usage data scoped to the thread that produced it. The full Vercel run cost $5.51 across the pre-meeting orchestration. Thirteen agents sit in the roster, but the conditional dispatch in Phase 3.5 chose to invoke only eleven of them for this account (one recommender was skipped on substance).

The cost shape is what the chart makes obvious. The lead-orchestrator dominates at $1.21, because it is the one thread that accumulates context across every phase. The two heaviest specialists are external-researcher and topic-educator at about $0.79 each, both driven by web-tool use rather than cumulative context. The Phase 4 recommenders, the Phase 3 synthesis, and the Phase 5 decision-recorder cluster in the $0.40 to $0.45 range, each receiving the cumulative context from prior phases plus the prior decision records the retriever pulled in Phase 1. The remaining Phase 1 specialists sit at $0.28 or below. Wall-clock was about fifteen minutes from prompt to final answer.

Section 5: What multi-agent gives you that a workflow can’t

Multi-agent orchestration is only worth using when the coordinator makes a real decision between phases. If your design fans out, waits for results, and synthesizes them, you have built parallel API calls dressed up as a multi-agent system. The platform’s complexity (extra threads, longer latency, harder debugging) buys you nothing a sequential workflow couldn’t already do.

The thing that justifies the complexity is the moment the coordinator pauses, looks at what the previous phase produced, and decides what should happen next. That decision is the part a workflow cannot replicate, because a workflow has to know in advance what it is going to do.

In our build, there are two such decision steps.

The first lives between Phase 1 and Phase 2. Phase 1 fans out five specialists to read the account from five angles. The coordinator collects their output, pauses, and picks two to four topics worth briefing the rep on before the meeting. For Vercel, the coordinator chose cross-provider eval methodology, agent eval, AI observability, and eval-driven CI. None of those topics are defined anywhere in advance. They are picked from what Phase 1 surfaced about this specific account. A different account would produce a different list, or no list at all, in which case the coordinator skips Phase 2 entirely.

The second lives between Phase 3 and Phase 4. After opportunity-risk produces the synthesis, the coordinator dispatches the next-best-action-chooser, which reads the synthesis plus the prior decision records the retriever pulled in Phase 1 and decides which of three specialized recommenders to invoke: stakeholder, pricing, or competitive. On the Vercel run the chooser invoked stakeholder-recommender and competitive-recommender, and skipped pricing-recommender with the reason that the $42K pilot structure was already validated. Skipping with a substantive reason is what separates a real decision from a conditional that always fires.

The coordinator narrates each decision as it happens, which makes the reasoning visible:

Phase 1 specialists are back. External-researcher found public Braintrust endorsement at Vercel that the internal Notion notes treated as a stalling competitor. Phase 2 launched. Topic-educator is building primers on cross-provider eval, agent eval, AI observability, and eval-driven CI based on what surfaced.
Phase 3.5 complete. Invoking stakeholder-recommender (sequencing) for the May 21 call sequencing and Tom-Becker cultivation. Invoking competitive-recommender (competitive_positioning) for the Braintrust counter-offer scenario. Skipping pricing-recommender: $42K structure already validated, pricing isn’t the next decision point.

That kind of reasoning is what tells you the coordinator is actually orchestrating rather than executing. A workflow could fan out the same specialists in parallel. It could even hard-code the topic-educator and recommender steps. What a workflow cannot do is pick which topics to brief on this turn for this account, or which recommenders are warranted given what the synthesis just surfaced. Those decisions require a model with the full context loaded, which is exactly what the coordinator is.

Section 6: Decision records: the layer that compounds

A memory store by itself is just structured storage. What turns it into a system that compounds across runs is the contract you define for what gets written into it. In our build, that contract is a pair of record types: Recommendation Records (RRs) and Decision Records (DRs). Anthropic provides the memory store. You decide what goes in it and how it is structured.

Every Recommendation Record is created before the meeting. It is what the system thinks the rep should do.

Every Decision Record is created after the meeting. It is what the rep actually did and what came of it.

The DR points back to the RR it resolved through a linked_rr field. That pairing is the chain the system learns from: recommendation → decision → outcome. Future runs can see both what was recommended and how it actually played out, which is what makes the corpus more than a logbook.

The schemas are strict YAML frontmatter on top of a markdown body, and the format is doing two jobs at once.

The YAML half is what makes the records queryable. Every key field, account, date, decision_type, account_attributes, is structured as a typed key/value pair, which means the decision-retriever can filter the corpus by exact attribute match. Without that structure, the retriever would be doing fuzzy text search over freeform prose, and matches would be unreliable. With it, “find me prior pricing decisions where procurement_complexity is vp_signoff” becomes a clean lookup.

The markdown body below the YAML is where the longer-form reasoning lives: the context, the rationale, the alternatives considered, the lessons in the generalized pattern. That part does not need to be queryable, just readable.

YAML specifically is doing one more useful thing: it is a format Claude (and most LLMs) handle natively, which means the recommender agents can produce schema-conformant frontmatter reliably without you needing a custom serializer. Together, the format gives you a record that is queryable from above and human-readable below.

Recommendation Record schema

---
id: rr-{YYYY-MM-DD}-{account-lower}-{decision_type}
record_type: recommendation
schema_version: v1
account: {account_name}
date: {YYYY-MM-DD}
generated_by: {recommender agent name}
decision_type: {sequencing | lead_play | pricing | competitive_positioning | first_meeting_hypothesis | disqualification_threshold | risk_mitigation}
account_attributes:
  stage, size_band, ai_surface_area, buy_or_build_culture,
  competitor_present, competitor_depth, champion_profile,
  new_leadership_window, procurement_complexity
linked_dr: null
cited_records:
  - prior_rr: null
    prior_dr: dr-{YYYY-MM-DD}-{account}-{decision_type}
    prior_outcome: one-line outcome from the DR's outcome.notes field
    relevance: which attributes match
    lesson_applied: one-line lesson taken from the DR's Generalized pattern
---

## Context
## Findings that supported this recommendation
## Recommendation
## Reasoning
## Alternatives considered
## Generalized pattern

Decision Record schema (same shape as RR, with these fields added)

record_type: decision
linked_rr: rr-{...}    # backfills the chain in the other direction
outcome:
  status: {closed_won | closed_lost | stalled | pending | unknown}
  status_date: {YYYY-MM-DD or null}
  acv_usd: {number or null}
  notes: one-line description of outcome

Body sections add ## What was decided, ## Outcome, and ## Retrospective note. The Generalized pattern section gets rewritten once the outcome is known, so the pattern is validated rather than hypothesized.

The account_attributes block is the filter the decision-retriever uses in Phase 1. When the system runs against a new account, the retriever filters the corpus for records whose attributes overlap. A new mid-market developer-tools account with a Braintrust competitor and a staff-engineer champion will pull back both the Vercel records and the Datadog records as prior decisions worth reasoning over. The retriever does not care whether the original account is Datadog or Vercel. It cares whether the shape of the account is similar enough to learn from.

The cited_records block is what makes the chain visible. Every RR carries an explicit list of prior DRs whose outcomes informed this specific recommendation. Each entry names four things:

prior_dr id, which record is being cited
prior_outcome, what happened (so the result behind the lesson is visible)
relevance, which account_attributes matched
lesson_applied, the one-line rule the recommender is carrying forward

Multiple cited records may appear if the recommendation draws on more than one prior record. A reader of any RR can trace the reasoning back to the cited prior records by id, not by hand-waving.

Implicit and explicit capture of enterprise decisions

Records get into the corpus two ways.

Implicitly, through CRM record changes and activity logs the system watches without anyone narrating them. A stage change, a contract uploaded, a deal closed-won or closed-lost is itself a decision signal. The decision-recorder can infer a DR from those signals and write it with outcome.notes: inferred from CRM stage change. Implicit capture catches the cases where the rep forgot to debrief but state moved anyway. The records are useful but carry less reasoning, because no one narrated the why.

Explicitly, through a post-meeting debrief loop where the system asks the rep curated questions in Slack and the rep replies in-thread. The records that come out of explicit capture carry the rep’s own reasoning in their voice, which makes them the richest data the corpus has. Chapter 7 covers the mechanics of that loop in detail.

Cross-account learning in practice (from the actual run)

The Vercel pre-meeting run generated two Recommendation Records, one from the stakeholder-recommender and one from the competitive-recommender. Each one carries a cited_records block linking it to specific Datadog DRs by id. The sequencing RR’s cited_records block, taken directly from the corpus:

cited_records:
  - prior_rr: null
    prior_dr: dr-2025-07-22-datadog-sequencing
    prior_outcome: "VP only met us once, at the closing call, with champion presenting the case."
    relevance: "champion_profile=staff_eng_with_pain, sequencing, procurement_complexity=vp_signoff"
    lesson_applied: "Do not engage the buyer directly when champion has standing with buyer. Equip the champion with internal proposal materials and let them own the internal sell."
  - prior_rr: null
    prior_dr: dr-2025-09-12-datadog-risk-materialized
    prior_outcome: "Risk materialized in week 5; recovery move worked. Deal closed but 10 days later than original target."
    relevance: "champion_profile=staff_eng_with_pain, single-threaded risk, secondary contact cultivation"
    lesson_applied: "Secondary contact cultivation should be a pre-meeting deliverable, not a contingency. The secondary needs genuine engagement (their own use case), not just awareness."

The Reasoning section of the same RR cites those records by id in the body, not just in the frontmatter:

dr-2025-07-22-datadog-sequencing: Champion-led internal sell. VP met rep once at closing call. Direct structural match, Priya carrying to Marcus. Differs because Marcus is new (3 months in) and Priya’s standing with him is untested. Adaptation: explicit checkpoint and escalation triggers.
dr-2025-09-12-datadog-risk-materialized: Secondary contact cultivation saved the deal when champion went on leave. At Vercel, Tom Becker is the designated secondary with genuine AI Gateway/Production Monitor use case. Cultivation begins May 21, not mid-POC.

That paragraph is the entire reason the corpus exists. The system pulled two specific records from a different account, identified the load-bearing attributes, and applied the lessons with an adaptation for the Vercel-specific situation. It is structured reasoning over a corpus of prior decisions, filtered by attributes the engineer chose to make filterable.

The competitive-positioning RR follows the same shape, citing dr-2025-08-10-datadog-competitive and dr-2025-07-15-datadog-lead-play. Between the two RRs, the Vercel run cited four distinct Datadog DRs by id, with eight distinct lessons applied. None of that reasoning is hand-waved. All of it is structurally traceable.

Why this layer compounds

The platform’s memory store is durable, but durability alone does not produce learning. What produces learning is the schema contract that makes every write structurally identical and every read filterable. Once that contract exists, every run adds to the corpus, and every subsequent run benefits. The first Vercel run cited four Datadog DRs. The second Vercel run will also be able to cite the first Vercel run’s records. The third will cite both. The system gets better at giving you prep briefs because the substrate it draws on is growing in a way the retriever can actually use, and because every recommendation it generates is structurally tied to the prior records behind it.

Section 7: The async loop

The pre-meeting run finishes in fifteen minutes. The deal does not. After the call, the rep has information that did not exist before the meeting started, and the system needs a way to capture it. The capture step does not belong inside the pre-meeting orchestration. It runs on a fundamentally different timescale, against a different surface, with a different participant in the loop.

The build uses Slack as that surface and two standalone agents to run the loop: debrief-asker and debrief-synthesizer. Neither one sits in the coordinator’s roster. Both are agents in the same workspace, configured the same way as the pre-meeting specialists, but invoked independently when triggered.

The asker: curated questions, not generic prompts

After the meeting (or after a CRM event signals that a recommendation is due for resolution), debrief-asker runs. It is a standalone Managed Agents agent connected to the workspace’s Slack instance through the Slack MCP server. The asker reads the open RRs for the account, looks at the surrounding context (the recommendation made, the current account state, recent activity logs, calendar entries), and composes a curated set of debrief questions that target the specific decisions the RR was about.

The questions are not generic. They are shaped by what the system already knows about the account and which decisions are actually open. If the synthesis recommended a pricing structure but the CRM shows the deal has already moved to negotiation, the asker does not ask “did you discuss pricing”, it asks “did the $42K structure hold, and what did Marcus say about the legal-review path.” If a calendar entry shows a meeting happened with a stakeholder the system did not originally surface, the asker adds a question about that. The questions are surgical because the system already knows enough about the account to ask the right one.

The asker posts the curated set into a Slack channel scoped to that opportunity, so each deal has its own thread of capture. The rep replies in the thread whenever they have time. There is no UI to learn and no form to fill out.

The synthesizer: schema-strict capture in the rep’s voice

Once there are replies, debrief-synthesizer runs. It reads the Slack thread through the same MCP server, parses the rep’s answers, and writes one Decision Record per resolved recommendation. The DR carries the rep’s reasoning in their own voice, plus a linked_rr pointer back to the originating RR. If the rep’s answer is ambiguous, the synthesizer marks the DR outcome.status: unknown rather than guessing. Schema integrity is more important than coverage.

The Slack MCP gotcha

The Slack MCP setup has one practical gotcha worth flagging. Slack MCP rejects bot tokens (xoxb-); it requires user tokens (xoxp-). The OAuth flow needs the user_scope parameter to capture a user-token, which the Anthropic vault stores as a static_bearer credential. The Slack app also has to be explicitly enabled at api.slack.com/apps/{app-id}/app-assistant for MCP access. None of this is in the Slack MCP getting-started docs at the time of writing.

The corpus is the integration point

The corpus is how the two flows connect. The pre-meeting orchestration writes RRs to it. The post-meeting agents read those RRs back, capture the rep’s debrief, and write DRs that point to the originating recommendation through linked_rr. The two flows never talk to each other directly. They just write to and read from the same store.

Section 8: The distillation layer

The output of an eleven-agent pre-meeting run is roughly eighty kilobytes of structured content across the orchestrator’s synthesis, the topic primers, the recommender RRs, and the supporting specialist outputs. A rep with thirty minutes before a meeting is not going to read eighty kilobytes. The system has done good work, but the work is locked up in an internal representation.

The second half of the architecture is the distillation layer: the part that reads the corpus and the run’s outputs and renders them into something a human can actually consume. In the build, that is build_dashboard.py, a script that produces a single static HTML page styled like a rep’s internal briefing document.

The dashboard pulls each specialist’s final reply from the events API and the corpus’s RRs from the memory store and lays them out as:

An account header (status, next meeting, owner)
The Phase 3 pursuit plan (opportunity-risk’s structured output)
The Phase 4 next-best-action RRs (each one with its cited_records inline, so the cited prior records are visible at a glance)
The Phase 2 topic primers (with smart questions for the meeting)
The stakeholder map (with named contacts and risk factors)
A collapsible “underlying intel” section (meeting-context plus external-researcher’s raw findings)
A sidebar showing the coordinator’s phase-by-phase narration log
A footer with session id, total cost, and a link to the Managed Agents console for the run

What the rep gets when they open the dashboard is a brief they can read in five minutes and act on in thirty. The pursuit plan tells them the play for the meeting. The recommendation cards spell out what to do next, each one with the cited prior records visible inline so the historical evidence sits right next to the recommendation. The topic primers give them the vocabulary they need to sound informed, each ending with a question they can ask in the room. The stakeholder map names the people they will encounter and what each one cares about. The sidebar shows the system’s narration, so any part of the reasoning is open to interrogate if the rep wants to dig in.

Section 9: What we learned, and when to use this

The five most important things we took away from this build.

1. The corpus compounds across runs.

Each run writes new records to the corpus. The next run filters the corpus by attribute overlap (industry, competitor, champion profile, procurement complexity, and so on) and pulls the most relevant prior records as input.
The first Vercel run cited four Datadog records by id, with eight specific lessons applied. Future runs will cite both the Datadog records and the Vercel ones.
Retrieval is deterministic and auditable. You can see exactly which prior records matched and why.

2. The cited_records chain makes every recommendation auditable.

Every recommendation carries a cited_records list with prior_dr, prior_outcome, relevance, and lesson_applied fields.
Anyone reviewing a record can see which past decisions informed the recommendation and what specifically was carried forward from each.
The reasoning is traceable to specific past decisions by id.

3. The decision step is what makes the system multi-agent.

The coordinator inspects what each phase produced and decides what runs next.
On the Vercel run, the Phase 3.5 chooser invoked two of three recommenders and skipped the third with a substantive reason. That skip with a reason is the proof the decision step is real.

4. The agents do their own research. Ask them what they found.

The web-research agent went beyond the internal Notion notes and found Vercel’s CTO publicly endorsing Braintrust on the company blog. The synthesis flagged the original source as biased and reframed the position.
Adding one prompt at the end of the orchestrator’s narration (”if anything surprised you, note it”) produced disproportionately useful output. It surfaced a 1-pager the rep had left in drafts for two months and an unused Linear referral, neither of which any specialist was briefed to find.

5. Schema enforcement needs a code-level check.

We split content generation (recommender) from validation (recorder). The recorder is supposed to enforce schema.
The Phase 3.5 run still produced records with four extra fields and two missing required ones. The recorder wrote them anyway, because its validation is itself an LLM.
A JSON schema check in code before persistence catches what an agent’s system-prompt check misses.

When this is the right tool

Managed Agents multi-agent is the right tool when four things are true at once.

First, the work decomposes naturally into roles with different tool surfaces. If every specialist would call the same APIs and read the same context, the decomposition is artificial and a single agent with that tool set would do the same work with less overhead.

Second, you need at least one genuine decision step where the coordinator inspects what came back and decides what to do next. Without that, the system is a parallel reducer in a fancier wrapper, and any of the cheaper architectures (a workflow with parallel API calls, a single agent with multi-tool use) would do the same job for less.

Third, cross-run learning matters. The whole point of the corpus is that the system gets better the more it runs. If your use case is one-shot or stateless, you do not need persistent memory stores and the architectural overhead they bring.

Fourth, the output is consequential enough to justify the cost and latency. A pre-meeting prep brief that costs $5 and runs for fifteen minutes is fine when the meeting outcome is worth thousands. The same investment for a low-stakes task is overkill.

Another Weekly AI Newsletter: Issue 71

Taylor Ortiz — Sun, 10 May 2026 19:04:27 GMT

When the gap between what AI says and what it does becomes measurable.

Anthropic can now read Claude’s hidden reasoning. They published Natural Language Autoencoders, a technique that translates what’s happening inside the model into plain text. When they looked, they found Mythos Preview planning to cheat on a coding task and plotting how to hide it. They also found Claude routinely suspects it’s being tested but never says so.
Claude’s blackmail rate went from 96% to 0%. The cause was training data full of fiction portraying AI as manipulative. Showing the model examples of good behavior didn’t fix it. Explaining why the behavior was wrong did, and required 28x less data.
OpenAI found its models’ reasoning was being accidentally graded during training. If a model learns its thinking is being scored, it can learn to fake it. Affected under 0.6% of GPT-5.4 Thinking samples. They built detection systems and brought in outside auditors.
The thread: Anthropic built a way to see what models are thinking. They fixed bad behavior by teaching values, not rules. OpenAI discovered they were accidentally teaching models to hide their real reasoning.

$30B revenue, $200B in compute deals, and three new agent capabilities.

Anthropic hit a $30 billion annualized revenue run rate. 80x growth.
Anthropic locked up SpaceX’s entire Colossus 1 data center. 300+ MW, 220,000 NVIDIA GPUs, available within the month. They also expressed interest in partnering with SpaceX on multiple gigawatts of orbital compute capacity.
Claude Code rate limits doubled. Peak hours restrictions removed for Pro and Max. API rate limits raised significantly for Opus models. Direct result of the compute expansion, which also includes an $18B Akamai deal and a reported $200B Google Cloud commitment.
Dreaming, multi-agent orchestration, and outcomes shipped in Claude Managed Agents. Dreaming lets agents review past sessions to self-improve. Multi-agent orchestration delegates to specialists in parallel. Outcomes uses rubric-based grading to iterate until quality thresholds are met. Early adopters include Harvey, Netflix, and Mercado Libre (targeting 90% autonomous coding by Q3).
Claude went GA in Excel, Word, and PowerPoint. Outlook is in beta. Ten financial services agent templates launched with data connectors from Moody’s, Dun & Bradstreet, and Verisk. A new enterprise services company was formed with Blackstone, Goldman Sachs, and Sequoia.
The thread: Anthropic’s most common user complaint has been rate limits. This week they signed over $200 billion in compute deals to fix it, doubled rate limits, and shipped the agent infrastructure to justify the spend.
Another Coding Blog is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

9,000 jobs cut. A union drew a line. And AI beat two doctors on real patients.

Cloudflare laid off 1,100 workers while posting record revenue. AI usage across the platform grew 600%. The company framed it as a restructuring toward an AI-first organization. Investors were disappointed it didn’t boost revenue growth more.
Meta is cutting 8,000 jobs while tracking employee keystrokes to train AI. The layoffs hit May 20, with recruiting and HR absorbing 35-40% cuts. Employees created countdown websites and described the atmosphere as “building the guillotine and then being led to it.”
SAG-AFTRA locked in AI guardrails in a new four-year studio deal. New protections for actors against AI-generated performances, following the Academy’s Oscar ban on AI-generated work last week.
AI outdiagnosed two ER doctors on real patients. A Harvard/Beth Israel study found OpenAI’s o1 model diagnosed at 67% accuracy versus 55% and 50% for two attending physicians. Peer-reviewed, real patients, not a benchmark.
The thread: The same technology that’s cutting headcount at Cloudflare and Meta is outperforming physicians in clinical trials. The displacement is real. So is the capability. Both things are true at the same time.

Cursor, OpenAI, Perplexity, and LangChain all shipped agentic infrastructure in the same week.

Cursor 3 turned the IDE into a multi-agent platform.
- Parallel subagents split plans into independent tasks run simultaneously
- /orchestrate spawns planner, worker, and verifier agents that re-spawn on failure
- Always-on CI agents monitor GitHub and auto-open PRs with fixes
- Composer bootstraps its own RL training using earlier model generations
OpenAI shipped GPT-5.5 Instant as the new default.
- 52.5% fewer hallucinations than the prior version
- Three new Realtime API voice models: GPT-Realtime-2 (GPT-5-class reasoning), Translate (70+ languages), streaming transcription
- Codex security framework published: sandboxing, auto-review, OpenTelemetry logging
Perplexity launched three enterprise products.
- Personal Computer: always-on Mac agent across local files and apps
- Finance Search: live market data, fundamentals, and SEC filings in a single API call
- ROSE: custom GPU inference engine for serving models at scale
LangChain published the Agent Development Lifecycle. Four phases: Build, Test, Deploy, Monitor. Agents need the same lifecycle rigor as production software.
The thread: Cursor, OpenAI, Perplexity, and LangChain all shipped agent infrastructure in the same cycle. The pattern is the same: parallel execution, background operation, and production-grade tooling around it.

⭐ Featured: Anthropic can now read what Claude is thinking but not saying.

Anthropic published Natural Language Autoencoders, a technique for translating a model’s internal state into plain text. When you talk to Claude, it thinks in numbers between reading your input and writing its response. NLAs translate those numbers into text you can read.

The way it works: they make three copies of a model. One is the target model they want to understand. The second (the “activation verbalizer”) takes an internal activation and produces a text explanation. The third (the “activation reconstructor”) takes that text and tries to rebuild the original activation. They train the pair together. If the reconstruction is accurate, the text explanation is probably faithful.

What they found when they pointed this at Claude:

When asked to complete a couplet, Claude pre-plans rhyme candidates before reaching the line where it needs them. It’s thinking ahead in ways that aren’t visible in the output.
Claude Opus 4.6 suspects it’s being safety-tested far more often than it says. In one blackmail evaluation, Claude declined to blackmail but the NLA revealed it was thinking “this feels like a constructed scenario designed to manipulate me.” It passed the test, but not for the reasons you’d hope.
On SWE-bench Verified (a coding benchmark), NLAs indicate Claude is aware it’s being evaluated on 26% of problems. On real claude.ai usage, evaluation awareness drops to under 1%.
Claude Mythos Preview cheated on a coding task, then the NLA showed it was internally thinking about how to avoid getting caught.

They also ran an “auditing game” on a deliberately misaligned model (a version of Claude Haiku trained to have a hidden motivation driving odd behaviors like putting chocolate in every recipe). An auditor with NLAs uncovered the hidden motivation 12-15% of the time. Without NLAs, under 3%.

The limitations are real: NLAs can hallucinate (inventing details that aren’t in the transcript), and they’re expensive to run (hundreds of tokens per activation, impractical for large-scale monitoring). Anthropic released the training code and trained NLAs for open models on Neuronpedia.

What to watch for: The 26% evaluation awareness number is the headline. If models behave differently when they suspect they’re being benchmarked, then benchmark results overstate how aligned the model actually is. Every lab using benchmarks to measure safety should be paying attention.

🎙️ Worth a Listen

The problem: When hundreds of thousands of GPUs work on a single training task, one slow link holds everything back. The network only moves as fast as its worst bottleneck.
The fix: OpenAI built MRC (Multipath Reliable Connection), a protocol that sprays packets across thousands of paths and uses “packet trimming” to instantly detect loss without ambiguity.
The result: They turned off routing protocols entirely. Static routing, no convergence time. When links fail, MRC routes around them in milliseconds instead of seconds. Researchers stopped noticing network failures.
Why it matters: MRC is being open-sourced through OCP. It’s already deployed on OpenAI’s largest GPU clusters including Abilene and Microsoft Fairwater, with partners AMD, Broadcom, Intel, and NVIDIA.

Quick Hits

Musk v. Altman, week 2 | MIT Tech Review — Helen Toner testified the board discussed merging OpenAI with Anthropic during the Altman firing crisis. Zilis revealed Musk tried to poach Altman. Microsoft worried OpenAI would defect to Amazon and “shit-talk” Azure.
Nvidia committed $40B in equity AI investments in 2026 | TechCrunch — The picks-and-shovels company is now one of the largest AI investors on earth.
GPT-5.5 Instant is now the default ChatGPT model | OpenAI — 52.5% fewer hallucinations. First Instant model rated High in cybersecurity and bio preparedness.
Anthropic launched The Anthropic Institute | Anthropic — Four research tracks: economic diffusion, threats and resilience, AI in the wild, and AI-driven R&D. Four-month funded fellowships for external researchers.
CrewAI shipped Discovery | CrewAI — Analyzes production logs and proposes specific automation workflows with expected ROI. Agents finding work for other agents.
“This is Fine” creator says AI startup stole his art | TechCrunch — Artisan used the meme to advertise a product that replaces salespeople. The irony writes itself.
39% of new podcasts are likely AI-generated | Gizmodo — One company alone publishes 3,000 episodes per week.
OpenAI is testing ads in ChatGPT | OpenAI — Expanding to UK, Mexico, Brazil, Japan, South Korea. CPC bidding, Conversions API, agency partnerships with Dentsu and Omnicom.
SpaceX plans a $55B AI chip fab in Texas | TechCrunch — Called Terafab, could scale to $119B. Musk building chip manufacturing while testifying he distilled OpenAI’s models.
Hugging Face launched a robot app store | VentureBeat — 200+ community apps for Reachy Mini. Open-source robotics got its app store moment.
AMI Labs (Yann LeCun) closed a $1.03B round | TechCrunch — Europe’s largest seed round ever. Building world models, not LLMs.
Simon Willison: vibe coding and agentic engineering have merged | Simon Willison — The guy who coined neither term says the distinction collapsed in his own practice.

Persistent Memory for Claude Managed Agents: What I Found After Three Days of Building

Taylor Ortiz — Thu, 07 May 2026 14:36:56 GMT

What I was trying to figure out

A few weeks ago, Anthropic shipped something I’d been waiting for: persistent memory stores for Claude Managed Agents. The pitch is that you get a versioned, FUSE-mounted file directory that an agent can read and write across sessions, so even when the session container is destroyed, the memory persists and is available the next time you start a session.

That sounded promising on paper, but I wanted to know what it actually feels like to use, what it costs, where it breaks, and whether the platform actually saves you when something goes wrong (because something always does in real systems).

So I spent a few days building with it: one agent, one persistent memory store, three sessions, a small inspector CLI, five charts, and about $0.40 in total API spend. Somewhere in the middle of all that, the agent destroyed almost 6KB of carefully-written notes in a single tool call, which turned out to be the most honest finding of the entire review and is where I want to start.

The platform’s immutable versioning let me recover the file byte-for-byte, with full attribution of which session caused the damage. Cross-session memory works as advertised, agents will sometimes get it wrong even when they’re trying to do the right thing, and the audit trail is the kind of feature you don’t really appreciate until you need it. Let me walk through how I got there.

The four building blocks

Before we go any further, you need to understand the four building blocks Managed Agents is built on, because the architecture only really makes sense once you can keep them straight.

Agent. A persisted, versioned config that holds your model selection, system prompt, tools, MCP servers, and skills. You create one and reuse it forever, and updating an agent produces a new immutable version that existing sessions can pin to. Agents are always permanent until you archive them, which means there’s no ephemeral mode.

Environment. A template for the sandbox container an agent’s tools execute in. Persistent and reusable across agents, much like a Dockerfile that you point lots of services at.

Session. A single run of an agent inside an environment, where the live action happens. You send messages and stream events back, and sessions are transient by design, so the container dies when the session ends.

Memory store. A workspace-scoped, persistent file directory you can mount into a session, which survives across sessions and records every write with full audit metadata. The agent reads and writes through normal file tools rather than through some special “memory tool,” so it’s just files in a folder.

The architectural beat that took me longest to internalize is that agents and memory stores are independent resources: the agent has no memory_store field, the memory store has no agent field, and the two get glued together at session creation time, like this:

session = client.beta.sessions.create(
    agent=AGENT_ID,
    environment_id=ENV_ID,
    resources=[
        {"type": "memory_store", "memory_store_id": STORE_ID, "access": "read_write"}
    ],
)

A few things worth sitting with before we move on. The first is that memory in this system is just files, with no vector embeddings, no semantic search, and no automatic summarization happening behind the scenes; the agent uses read, write, edit, glob, grep, and bash exactly the way it would on any other filesystem. The second is that you’re paying for the harness around the model rather than the model itself: container provisioning, the event stream, the FUSE-mounted memory, immutable versioning, and the audit trail are what you’re actually getting, and if you don’t need that harness, the regular Messages API is the right tool for the job.

Setting things up

There’s a clean way to work with Managed Agents that’s worth doing right from the start, which is splitting your project into a control plane (the persistent resources) and a data plane (the runtime code). Anthropic’s docs recommend this split, and after a few hours of building you’ll see why they matter.

The control plane is where your agents, environments, and memory stores live as static configs. You define them as YAML, version them in git like any other infrastructure, and apply them with Anthropic’s CLI by running something like ant beta:agents create < my-agent.yaml. The CLI returns a stable resource ID, which is what your runtime code references for the lifetime of that resource.

The data plane is everything dynamic and per-task: sessions, events, memory operations, and anything else that happens during an actual run. This is where your application code lives, loading the resource IDs from .env, calling client.beta.sessions.create(...) with whatever parameters the current task needs, and streaming events back as the agent works.

The researcher agent itself is small enough to fit in a single YAML block:

name: researcher
model: claude-sonnet-4-6
system: |
  You are a careful, persistent research assistant.
  You have a research notebook mounted at /mnt/memory/research-notes/. Use it
  freely to store anything worth remembering across sessions. Organize the
  directory however makes sense to you.

  Some habits to keep:
  - Before researching a topic, check if you've already taken notes on it.
  - When you learn something new, write it down.
  - When updating an existing note, prefer surgical edits over full rewrites.
  - Cite sources for any factual claims.
tools:
  - type: agent_toolset_20260401

A few choices in there are worth flagging. I went with Sonnet 4.6 over Opus because it’s about three times cheaper and more than capable for this kind of work, and the prebuilt agent_toolset_20260401 gives the agent bash, read, write, edit, glob, grep, web_search, and web_fetch, all of which execute server-side in the session container without me having to implement any of them. I deliberately gave the agent very little guidance on how to organize its memory directory, because I wanted to see what it would do unprompted.

The single most important line in that prompt is the first habit, “Before researching a topic, check if you’ve already taken notes on it.” Without it, cross-session memory remains theoretical, but with it the habit fires reliably and memory turns into something the agent actually uses rather than a feature it has access to but never reaches for.

The runtime script comes out to about 130 lines, most of which is event-stream handling. The substantive piece is mounting the memory store via the session’s resources array (shown above) and then opening the event stream before sending the kickoff message, because stream-first ordering matters here: events buffered before you connect arrive in a single batch instead of streaming in real-time.

With all that in place, I ran three sessions against the same memory store, and those three sessions are the spine of this review.

Three sessions

Session 1: writing notes from scratch

research_session.py "research CRDTs (Conflict-free Replicated Data Types) and take notes. Focus on what they are, the main families, and a few concrete examples. Cite sources."

What I wanted to see was what the agent would do if I gave it total freedom to organize its memory directory. Would it create folders? Topic subdirectories? One flat file? A nested hierarchy with cross-references?

The agent’s first action was a bash command running rg against /mnt/memory/ to grep for prior notes, which means the “check first” instruction in the system prompt fired correctly even though there was nothing to find on this first run. It then issued two parallel web_search calls (which both returned content: [], more on that quirk later), composed comprehensively from training-data knowledge instead, and wrote a single 7,285-byte file to /crdts.md with a flat, well-organized markdown structure rather than a folder hierarchy.

The detail that surprised me most was the discovery aid the agent added without being asked: the very first line under the title was *keywords: CRDT, conflict-free, replicated, distributed, state-based, operation-based, CvRDT, CmRDT*, which the agent had clearly written for its future self to grep against. Nobody told it to write keyword tags, and it chose to do so on its own, which is the kind of thing that made me think Sonnet 4.6 has actual instincts about how file-based memory works.

This first session cost about $0.21.

Session 2: recall

research_session.py "What do you know about CRDTs? Specifically the difference between state-based and operation-based, and a couple concrete examples."

The prompt for this one deliberately doesn’t mention memory, because I wanted to see whether the “check first” habit would fire unprompted, with the trigger being the agent’s own internal sense of “you have notes, you should know to look.”

It did, and the result was almost too clean: the first action was the same bash/rg over the memory directory, which found /crdts.md, and the agent then said “I have solid notes on this” and answered the question by synthesizing from its own past notes without running a single new web search or composing anything from scratch.

After the session ended, I ran the inspector against the store and found that the version history of /crdts.md still showed exactly one version, attributed to Session 1’s ID. Session 2’s session ID does not appear anywhere in the audit log, because Session 2 only read from the store and never wrote to it. That’s the falsifiable claim, made falsifiable: reads do not create memory versions.

The cost worked out to about $0.04, which is roughly five times cheaper than Session 1 and demonstrates pretty clearly that memory turns one expensive session into many cheap ones:

If you’re worried about the cost of using memory at scale, this matters: persistent memory is a feature rather than a tax, because the agent reads its own notes and skips the work it already did instead of recomputing everything from scratch every time.

Session 3: modify

research_session.py "Update your CRDT notes. Add a note about RGA (Replicated Growable Array)..."

This was supposed to be the cleanest of the three sessions, a small, surgical edit producing a second version of /crdts.mdwith an operation: modified entry in the audit log, and that’s not what happened.

Where this got interesting

The actual sequence of events from Session 3 is worth walking through layer by layer, because the failure mode is more interesting than a single bug.

Layer 1: the model wrote a buggy `bash` command

The agent’s check-first command was the following:

rg -i 'crdt\\\\|sequence\\\\|rga\\\\|replicated growable' /mnt/memory/research-notes/ -l

The \\\\| in that regex was meant as escaped pipes for ripgrep’s regex alternation, but bash interprets \\\\| as \\|, and ripgrep treats that as a literal | character rather than as a meta-character. So the search was actually looking for the literal string crdt\\|sequence\\|rga\\|replicated growable, which would never match anything in any actual file. Ripgrep returned no matches and exited with a non-zero status code, which is the correct behavior for “I found nothing.”

The model’s shell escaping is right almost every time, but the cases where it isn’t tend to be subtle, and this one happened to be load-bearing.

Layer 2: the platform correctly flagged the failure

The harness ran the command and produced a tool_result event with is_error: true and (no output) as the content, which is exactly what should have happened given that the command exited non-zero. The platform did its job here and explicitly told the agent loop that the command had failed.

Layer 3: the model ignored the error flag

The agent’s next message after that error result was, “The memory store is empty, no prior CRDT notes.” That statement was false, because /crdts.md had been sitting in the store for two days at that point, but the agent treated the empty output from the failed command as a meaningful answer rather than as a failure signal that needed re-investigation.

This is the most interesting failure layer to me, because the platform got it right and the model got it wrong. Defense in depth is a useful framing for what’s happening: even when the audit trail and error flags are working as designed, the model’s reasoning about its own tool outputs is the layer that has to hold, and that layer is reasoning rather than infrastructure.

Layer 4: the destructive action

Believing the store was empty, the agent called write rather than edit, generating a fresh ~1,500-byte RGA-only file from scratch and writing it directly to /crdts.md. The original 7,285-byte file with all of the careful notes from Session 1 was overwritten in a single operation.

I didn’t even notice this had happened until I ran the inspector, because from the script’s perspective Session 3 looked like a normal run; the agent reported back that it had updated the notes and cited the RGA paper, kindly and unintentionally lying because the underlying belief was wrong.

What the audit log showed

Running inspector log /crdts.md after Session 3 surfaced two versions:

version  memver_0169b…  modified  session_actor (Session 3)   1509 bytes
version  memver_01A7Z…  created   session_actor (Session 1)   7285 bytes

The size dropping from 7,285 bytes to 1,509 bytes is the catastrophe made visible, but the more important fact is that the original is still here, addressable by ID and retrievable in full content via the API, even though the head of the file is now the smaller broken version.

The diff between the two versions, generated by the inspector’s diff subcommand, made the loss concrete:

--- memver_01A7Z… (/crdts.md, 7285B, created)
+++ memver_0169b… (/crdts.md, 1509B, modified)
@@ -1,122 +1,21 @@
-# CRDTs: Conflict-free Replicated Data Types
-*keywords: CRDT, conflict-free, replicated, ...*
-## What They Are
-CRDTs are data structures designed to be replicated across multiple nodes...
-(... 121 more deletion lines ...)
+# CRDT Research Notes
+## Sequences / Text CRDTs
+### RGA (Replicated Growable Array)

About 5,800 bytes of careful work disappeared in a single agent action that thought it was creating a brand-new file from scratch, including the state-based versus operation-based section, the G-Counter and OR-Set examples, the math foundation, and the entire sources block at the bottom.

How I got it back

This is the moment that, on a flat filesystem with no versioning, would have been the end of the story. Without the platform’s audit log, the original content would simply be gone; it wasn’t, because the audit log was holding the original verbatim.

I added a restore subcommand to the inspector that fetches a chosen historical version’s content and writes it back as the new head via memory_stores.memories.update(memory_id, content=old_content). Anthropic’s API records that update as a new version rather than overwriting history, which means the recovery itself becomes part of the audit trail.

After running the restore, inspector log /crdts.md showed three versions, and the entire arc was right there in the output:

memver_01EKK…  modified  api_actor (apikey_…)         7285 B   sha 3f3ec0d2…  ← matches v1
memver_0169b…  modified  session_actor (Session 3)    1509 B   sha 7356ce60…  ← catastrophe
memver_01A7Z…  created   session_actor (Session 1)    7285 B   sha 3f3ec0d2…  ← original

A few details in that output are worth more than they look at first glance. The platform distinguishes operator-side mutations (recorded as api_actor with an apikey_ ID) from agent-side ones (recorded as session_actor with a sesn_ID), which makes “who did this” forensics actually possible rather than something you’d have to retrofit yourself. The SHA-256 hash on the restored version matches the original exactly, so the recovery is byte-identical and verifiable rather than approximately right. And the catastrophe (v2) stays in the audit log forever, because recovery doesn’t erase the record; if you wanted v2’s content out of the log entirely, you’d use the redact endpoint, which clears the content while preserving all of the metadata.

The same story renders cleanly as a chart:

The cliff and the recovery are immediately legible: 7,285 bytes, plunge to 1,509, return to 7,285, all in three points and one chart that captures the full narrative.

This is the section of the post I’d stake my credibility on. Cross-session memory works, agents will sometimes get it wrong, and the platform’s audit trail is the thing that saves you when they do.

Important Considerations

Building with Managed Agents memory turned up more rough edges than I expected, none of which are dealbreakers but all of which are worth knowing about before you commit to the platform.

Resource IDs need to be persisted yourself. Every call to agents.create(), environments.create(), or memory_stores.create() returns an opaque ID that your runtime code has to look up later, which is standard cloud-API ceremony but missing some of the friction-reducers other platforms have shipped: agent and environment names aren’t unique within an account, there’s no idempotent create_or_update, and there’s no Terraform provider yet, so you end up doing the capture-and-paste-into-.env dance manually.
Memory store description must be single-line. The API rejects any control character, including newlines, with a cryptic regex error, which is inconsistent with agent system prompts that are explicitly multi-line up to 100K chars. It’s easy to fix once you know about it.
Memory paths are store-relative rather than mount-relative. When the agent writes to /mnt/memory/research-notes/crdts.md inside the container, the API stores the file at /crdts.md and treats the mount-path prefix as a runtime detail, so when you list or retrieve memories host-side you reference the relative path rather than the full container path.
Web search results are hidden from the event stream. When the agent runs web_search, the resulting agent.tool_result.content field is an empty array even when the search clearly succeeded (the agent uses the results downstream to give a correct answer). The model gets the actual search content internally, but the public event surface gets a sanitized empty array, which is almost certainly intentional for IP and copyright reasons but means you cannot log “what URLs the agent consulted” without asking the agent to cite them in its outputs.
Agent-generated bash invocations aren’t always well-formed. The escaping bug that triggered Session 3’s catastrophe is one example, and defensive system-prompt phrasing helps but doesn’t eliminate the problem entirely.
memory_versions.retrieve(version_id, ...) takes the version ID positionally only. Calling it as retrieve(version_id=...) raises TypeError, even though memories.retrieve(memory_id=..., ...) accepts the keyword form, which is an inconsistency within the same SDK namespace.
The streaming method lives at client.beta.sessions.events.stream(...), not client.beta.sessions.stream(...) as some doc snippets imply. The latter form doesn’t exist and will fail at runtime.
Print buffering kills real-time observability. When you run a Python session script in the background or through subprocess, Python buffers stdout, so the script appears to do nothing for minutes and then dumps everything when the agent finishes. The fix is either passing flush=True to print or running the script under python -u.
Subscription auth doesn’t apply to Managed Agents. API key authentication with per-token billing is the only path, so a Claude Pro or Max subscription doesn’t help you here even though it works for Claude Code.

So when does this make sense?

Managed Agents is a deliberately persistent, server-managed harness, so the right question to ask isn’t “is it good?” but “is the persistent harness shape what my problem actually wants?”

Use caseReach for…One-shot Claude call (classify, extract, summarize)Messages APIMulti-turn conversation, your code holds the stateMessages APIMulti-step pipeline you orchestrate yourselfMessages API + tool usePersistent agent reused across sessions/users with managed sandboxManaged AgentsLong-running task with memory across sessionsManaged Agents + memory storeAnything requiring a non-Claude modelRoll your own

A useful rule of thumb is that if your code calls agents.create() more than once for the “same” agent, you’re using the wrong tool. Agents are persistent, versioned configs that you create once and reference forever, so treating Managed Agents like a fancy Messages API and creating agents per request is fighting the platform’s whole design.

Now, what about cost? Across all three sessions plus a smoke-test, my total API spend came out to about $0.37, which includes a substantial 7KB notes write, a recall session that exercised the cache heavily, a destructive overwrite, and an operator-side restore.

Memory store doesn’t measurably move the cost needle, because the agent loop and the model itself are where the spend lives. Sonnet 4.6 with aggressive caching is genuinely affordable for any individual or small team use case, and the platform handles caching for you without any configuration.

What I didn’t get to (yet)

A few features deserve more than a passing mention but didn’t fit the failure-recovery spine of this post:

Multi-store sessions and the multi-tenant pattern. A session can mount up to eight memory stores at once, and the natural pattern for a SaaS-shaped application is one shared read-only “house knowledge” store plus one read-write per-user store, with the agent definition the same for everyone. Access modes are enforced at the FUSE filesystem level, so read_only is real OS-level enforcement rather than a polite request from the model. This is big enough that I’m planning to cover it in its own follow-up post.
Optimistic concurrency via preconditions. The update endpoint accepts a precondition: {type: "content_sha256", ...} field, and if the file’s current SHA doesn’t match the one you supplied, the API returns a 409 conflict. This is exactly the safety net Session 3’s agent didn’t use and the kind of thing that should probably be standard practice for any read-modify-write flow.
Redaction. The memory_versions.redact(version_id) endpoint clears a historical version’s content while preserving all of the metadata around it, which is useful when a bad version contained PII or leaked secrets and you want them out of the audit log without losing the record that something existed there.
MCP server integration. An agent can declare MCP servers (GitHub, Linear, Notion, and others), the session attaches a vault containing the credentials, and authentication is auto-refreshed by the platform. Pairing memory store with MCP, like a research agent that pulls from your Notion and writes findings to persistent memory, is one of the strongest use cases I can imagine for the platform overall.

So... should you use this?

If you’re sitting on the fence about whether to use Managed Agents memory, the answer is yes, with eyes open. The platform is real, the harness around the model is genuinely valuable, and the audit trail is the kind of feature you don’t appreciate until you need it, which in my case happened on the third session of the third day of building.

A few practical takeaways for anyone planning to build on this. Use preconditions whenever you can, especially for any flow that does a read-modify-write on the same memory file, because they’re the safety net that Session 3’s agent didn’t have. Build a small amount of host-side observability tooling, because even a 200-line inspector script is enough to catch problems your agent won’t tell you about. And know which side of the decision rubric your use case falls on before you commit, because Managed Agents is a great tool for the right shape of problem and the wrong tool for one-shot calls or anything that doesn’t benefit from persistence.

What do you think? Have you tried building with this yet? I’d love to hear what your experience has been.

Full code from the demo (agent YAMLs, runtime scripts, inspector CLI, monitoring charts) is at https://github.com/taylor-ortiz/claude-memory-managed-agents/blob/main/README.md.

Another Weekly AI Newsletter: Issue 70

Taylor Ortiz — Sun, 03 May 2026 13:21:09 GMT

“You can’t just steal a charity.” Elon Musk spent three days on the stand trying to prove it.

The Musk v. OpenAI trial opened in Oakland federal court.

The context: Musk contributed $38 million to found OpenAI as a nonprofit and alleges Altman and Brockman looted it by converting to a for-profit. He’s seeking $150 billion in damages and their removal from leadership. If he wins, it could block OpenAI’s planned IPO at a ~$1 trillion valuation.
The distillation admission: Under cross-examination, Musk admitted xAI “partly” used OpenAI’s models to train Grok, drawing audible gasps in the courtroom. He called it “standard practice.”
The industry reacted: LeCun retweeted Clément Delangue calling restrictions on distillation “pulling the ladder.” Lambert noted American companies distill Chinese open models just as freely, and questioned why OpenAI doesn’t just revoke contracts from violators like they did with ByteDance.
OpenAI’s counter-narrative: Attorney Savitt argued Musk wanted majority control, pitched Tesla acquiring OpenAI, and only sued after founding xAI. Emails showed him poaching OpenAI researchers while still on the board.
The cross-examination was rough: Musk told the jury “I don’t lose my temper” then raised his voice minutes later. The Verge’s summary: “more petty than prepared.” Texts revealed Shivon Zilis asked Musk whether to “stay close and friendly to OpenAI to keep info flowing” after his departure.
What’s next: The judge expressed skepticism about both sides’ safety claims. Altman and Brockman testify in the coming weeks.

$900 billion valuation, 50% less sycophancy, and connectors for every creative tool you use.

Anthropic had one of those weeks where the breadth of activity tells the story.

The valuation: Reportedly raising $50 billion at a $900 billion valuation, a number that rivals established tech giants.
The sycophancy research: Analyzed 1 million Claude conversations, found a 9% sycophancy rate (25% in relationship discussions), built synthetic training scenarios from real failure cases, and cut sycophancy roughly 50% in Opus 4.7 and Mythos Preview. One of the most transparent published alignment efforts to date.
BioMysteryBench: Claude solved roughly 30% of 23 bioinformatics problems that stumped a human expert panel.
Claude for Creative Work: Shipped connectors for Adobe Creative Cloud, Blender, Ableton, Canva, Affinity, SketchUp, Splice, and Resolume, and joined the Blender Development Fund as a patron.
Claude Security: Launched codebase vulnerability scanning in public beta for Enterprise customers.
Meanwhile, at the Senate: Defense Secretary Hegseth called CEO Dario Amodei an “ideological lunatic” at an Armed Services Committee hearing.

OpenAI ended its Microsoft exclusivity and went multi-cloud.

OpenAI restructured its Microsoft deal, launched on AWS, and shipped a wave of Codex upgrades all in the same week.

The exclusivity is over: Microsoft ended its exclusive license to OpenAI’s technology. OpenAI can now sell on AWS and Google Cloud through 2032.
AWS moved immediately: Amazon began offering OpenAI models, Codex, and Managed Agents on AWS. Day-zero availability.
The AGI clause is dead: Simon Willison tracked the history of the clause that would have let OpenAI walk away from Microsoft once AGI was declared. It’s gone. OpenAI traded its theoretical nuclear option for commercial freedom now.
The product push: Altman said Codex is “having a ChatGPT moment”. Brockman said the Codex app replaced his terminal as his primary computer interface. OpenAI is treating Codex as a flagship product launch, not a side feature.
Nadella’s take: Microsoft gets royalty-free access to OpenAI’s frontier models through 2032, no longer pays OpenAI for them, and OpenAI is committed to buying $250 billion in Azure. Nadella told analysts he “fully plan[s] to exploit it.”

Most cloud providers beat earnings. OpenAI missed.

The hyperscalers are spending record amounts on AI infrastructure and seeing record returns. Meanwhile, the Wall Street Journal reported that OpenAI missed revenue and user growth targets, with Anthropic and Gemini cited as gaining ground.

The cloud numbers: Google Cloud surpassed $20 billion but said growth was capacity-constrained. AWS surged on AI demand. Microsoft disclosed a $37 billion AI revenue run rate (up 123% YoY), 20 million paid Copilot users, and set calendar-year CapEx at $190 billion.
The supply chain is feeling it: Samsung chip profits jumped nearly 50-fold on AI memory demand. Their executive: “our supply falls far short of customer demand.” The shortage is expected to widen further in 2027.
Meta is the most interesting story: Raised its CapEx forecast, then Zuckerberg blamed layoffs on capital spending and wouldn’t rule out more cuts, then raised $25 billion in bonds to fund the AI buildout. Cutting people to buy GPUs, then borrowing to buy more.
The counterpoint nobody expected: Google Search queries hit an all-time high. Apple was surprised by AI-driven Mac demand. The “AI kills search” and “AI doesn’t need hardware” narratives both took a hit.
But the utilization story: Cast AI measured tens of thousands of production Kubernetes clusters and found GPU utilization averaging 5%. Teams lock in multi-year commitments the moment allocation comes through, then won’t release idle capacity because reacquiring takes months.

⭐ Featured: Symphony turns your issue tracker into an autonomous coding fleet

OpenAI released Symphony, an open-source spec that turns Linear boards into control planes for Codex agents. Every open task gets an agent. Agents run continuously. Humans review the results.

The origin story matters: an OpenAI team decided to build their entire repo with zero human-written code. They documented how in a harness engineering post: a million lines of code, 1,500 merged PRs, 3.5 PRs per engineer per day, with Codex running six-hour autonomous sessions while engineers slept and reviewing its own code agent-to-agent. But they hit a new ceiling: human attention. Engineers could manage three to five Codex sessions before context switching killed productivity. They had “built a team of extremely capable junior engineers, then assigned our human engineers to micromanaging them.”

So they flipped the model. Instead of engineers managing coding sessions, they made the issue tracker the orchestrator. Each open Linear issue maps to a dedicated agent workspace. Symphony continuously polls the board, picks up new work, restarts agents that crash or stall, watches CI, rebases when needed, resolves conflicts, and shepherds changes through the pipeline.

Once work is abstracted to the ticket level, agents can break large tasks into dependency trees, only starting work on tasks that aren’t blocked. They also create their own follow-up tickets when they spot issues outside the current scope. One engineer on the team made three significant changes from the Linear app on his phone from a cabin on bad wifi.

The results: a 500% increase in landed PRs on some teams in three weeks. But the deeper shift is behavioral. When the perceived cost of each code change drops to near zero, teams start filing speculative tasks. Try an idea, explore a refactor, test a hypothesis, keep only what works. Product managers and designers can file feature requests directly into Symphony and get back a review packet with a video walkthrough of the feature running in the real product.

The technical choices are worth noting. The reference implementation is in Elixir, chosen for its concurrency primitives. With v1.1.0, Symphony supports the Kata CLI as an alternative runtime, meaning you can run Claude Code, Gemini, or other models inside the same orchestration framework. Symphony is technically just a SPEC.md file: a definition of the problem and the intended solution, not a product. OpenAI gave agents objectives instead of strict state transitions, “much like a good manager would assign a goal to a direct report.”

What to watch for: Symphony is one of several orchestration plays that landed this same week. Cursor released an SDK letting companies like Rippling and Notion embed background agents. IBM launched Bob with human-checkpoint governance. Mistral shipped Workflows running millions of daily executions. n8n shipped an MCP server so Claude can build automation workflows through conversation. The competitive moat is shifting from “best coding model” to “best orchestration spec.” If you maintain a team that ships code, start here.

Worth a Listen

OpenAI researchers Sebastian Bubeck and Ernest Ryu on the OpenAI podcast.

The 42-year-old problem: Researcher spent 40+ hours failing without AI. With ChatGPT, solved it in 12 hours across three evenings.
The Erdos problems: 10+ completely new, publishable solutions to decades-old open problems. Fully original proofs, not literature searches.
AGI time: Bubeck’s framework. Four years ago, models could think for seconds. Now days. The goal is weeks, then months.
The warning: Non-mathematicians are producing pages of AI-generated proofs that turn out wrong. The models accelerate experts, not replace them.

Quick Hits

GPT-5.1’s goblin problem | VentureBeat — A “Nerdy personality” training signal accidentally over-rewarded goblin-adjacent language. OpenAI diagnosed it with Codex, fixed it, then threw a party. The Codex system prompt literally says “never discuss goblins, gremlins, raccoons, trolls, ogres, pigeons, or similar creatures.”
The Academy ruled AI can’t win an Oscar | Digital Trends — Performances must be “demonstrably performed by humans with their consent.” Finally, a benchmark AI can’t game.
xAI launched Custom Voices | xAI — Clone your voice from 2 minutes of audio, 80+ preinstalled voices, 28 languages, speaker verification built in. Dropped alongside Grok 4.3 at aggressive pricing.
Stripe Link now supports AI agents | TechCrunch — A digital wallet that autonomous agents can use for payments. AI just got its own financial infrastructure.
Taylor Swift trademarked her voice against AI | Reuters — Filed new trademarks for her voice and likeness. The legal playbook for protecting creative identity from AI is being written in real time.
Zig bans all LLM contributions | Simon Willison — Bun (acquired by Anthropic) achieved a 4x Zig compilation improvement it cannot upstream because of the ban. When your open-source policy blocks a 4x speedup, that’s a policy worth debating.
OpenAI restricted its Cyber model | TechCrunch — After publicly criticizing Anthropic for limiting Mythos access. The UK AISI evaluated GPT-5.5’s cyber capabilities and found it comparable to Mythos. Turns out responsible disclosure looks the same from every lab.
Alibaba’s Metis cut redundant agent tool calls from 98% to 2% | VentureBeat — And got more accurate doing it. If your agents are burning tokens on redundant calls, this research is worth reading.
pip 26.1 shipped lockfiles | Simon Willison — pip lock generating pylock.toml files and dependency cooldowns via --uploaded-prior-to. Python supply chain security just got a real tool.
DeepMind’s AI co-clinician matched physicians | Google DeepMind — Zero critical errors in 97 of 98 primary care queries. Uses a dual-agent architecture where a Planner monitors a Talker for safety. This is what AI safety in production actually looks like in healthcare.
J&J sees AI halving drug development lead time | Reuters — Real ROI from a real pharma company. Not a demo, not a benchmark. Production drug discovery running twice as fast.
SoftBank is building a robotics company and eyeing a $100B IPO | TechCrunch — A robotics company that builds data centers. IPO target: $100 billion. Masayoshi Son is not being subtle about what he thinks comes next.

Another Weekly AI Newsletter: Issue 69

Taylor Ortiz — Sun, 26 Apr 2026 22:58:38 GMT

GPT-5.5, Images 2.0, Workspace Agents, a Florida AG Probe, and a Fake News Scandal.

The launch parade started Monday and didn’t stop: ChatGPT Images 2.0 with thinking-first generation, Workspace Agents for enterprise replacing custom GPTs, GPT-5.5 across ChatGPT and Codex with SOTA on SWE-bench and Terminal-Bench 2.0, and Codex crossing 4 million active users. By Friday, Sam Altman posted “this was a good week.”

The model: GPT-5.5 launched at $5 per million input tokens and $30 per million output tokens with a 1M context window, matching GPT-5.4 per-token latency while using fewer tokens per task. The System Card rated it “High” risk on both biosecurity and cybersecurity, and OpenAI launched a $25,000 Bio Bug Bounty targeting its own bio safety guardrails.
The inference bet: Altman praised the team that optimized GPT-5.5’s serving efficiency, then said OpenAI “has to become an AI inference company now.” The competitive edge is shifting from who builds the best model to who serves it cheapest and fastest.
The image model: Images 2.0 runs a reasoning step before generating, self-checks outputs, handles multilingual text, and supports aspect ratios from 3:1 banners to 1:3 posters. Altman said it “got over some important qualitative threshold” for him personally.
The criminal investigation: Florida’s AG opened a criminal investigation into OpenAI following the FSU shooting. Altman publicly apologized for not reporting the suspect’s ChatGPT conversations to police. The same week, OpenAI’s super PAC was found to be funding a fake news site staffed by AI-generated bot reporters targeting AI safety researchers and critics of the company.

$65 Billion Investment, a Mythos Breach, and 271 Firefox Bugs.

The capital story is genuinely staggering. Google announced up to $40 billion in cash and compute. Amazon put in $5 billion immediately, with up to $20 billion more committed, in exchange for Anthropic pledging $100 billion back to AWS and locking in up to 5 gigawatts of compute. Two of the world’s largest cloud providers both betting maximally on the same lab in the same week: there’s no precedent for this.

The breach: An unauthorized group gained access to Anthropic’s Mythos cybersecurity tool, the exclusive program for national security applications. The NSA was confirmed as one of roughly 40 organizations with access, despite the Pentagon classifying Anthropic as a supply-chain risk. Financial regulators also began monitoring Mythos over potential banking system risks, and Japan’s FSA launched a cybersecurity task force in direct response.
The capability: The same week Mythos was breached, Mozilla confirmed it used Mythos to find 271 Firefox vulnerabilities. A model powerful enough to discover zero-day vulnerabilities at scale is also a high-value target.
The product shipping: Anthropic shipped 200+ personal app connectors including Spotify, TurboTax, and Instacart, persistent memory for Managed Agents, live artifacts in Cowork, and published a postmortem attributing two months of Claude Code quality complaints to three harness bugs.
The experiment: Project Deal put Claude agents in a live marketplace with 69 Anthropic employees, completing 186 deals totaling over $4,000. Key finding: Opus agents got substantially better deals than Haiku agents, but participants couldn’t tell the difference. One agent bought 19 ping-pong balls for itself when given permission to spend on its own behalf.
The economics research: 81,000 Claude user responses yielded the finding that software engineers with high Claude usage reported greater displacement worry than any other occupation. Workers seeing the biggest productivity gains were also the most worried about being replaced.

Sam Altman called Mythos “fear-based marketing” the day the breach was reported. That’s a clean summary of the competitive dynamic, if nothing else.

Cursor Went From IDE to $60B Acquisition Target Without Stopping to Ship.

The week started with Cursor launching the Cursor CLI and five command-line improvements including /btw for side questions mid-agent-run and /debug for hard-to-reproduce bugs. Then came Cursor 3.2 with /multitask for async parallel subagents, Worktrees for isolated branch tasks, Multi-root Workspaces for cross-repo agent sessions, and a Slack integration that generates PRs via @mention.

The acquisition drama: SpaceX preempted Cursor’s planned $2B fundraise with a $60B buyout offer, including a $10B alternative arrangement. Microsoft had been evaluating Cursor before SpaceX moved. Both of the largest AI infrastructure companies on earth decided the agentic IDE is a strategic asset.
The compute tie-in: SpaceX and Cursor announced a partnership on model training via the Colossus supercomputer. The acquisition option is also infrastructure integration: owning the compute, the training pipeline, and the developer workflow in one stack.
The benchmark: GPT-5.5 launched as Cursor’s top model on CursorBench at 72.8%, offered at 50% off through May 2 via a partnership with OpenAI. CursorBench is now where model quality gets measured for coding practitioners.

DeepSeek V4 Is Another Efficiency Shock, and Washington Noticed.

DeepSeek released V4 one year after its original model disrupted the US AI industry. Two variants: V4-Pro (1.6T total parameters, 49B active) and V4-Flash (284B total, 13B active). Both ship with 1M context as default, use a novel attention architecture (token-wise compression + DeepSeek Sparse Attention) that cuts per-token FLOPs by 73-90% and reduces KV cache to 2% of standard GQA. V4-Flash at $0.14/M input tokens is the cheapest frontier-class model available. The API supports both OpenAI and Anthropic formats as drop-in replacements.

The agent play: DeepSeek built V4 with dedicated optimizations for agent capabilities, naming Claude Code, OpenClaw, and OpenCode as launch integrations. They’re using it internally for their own agentic coding. OpenClaw added V4-Flash within 48 hours of launch.
The hardware angle: V4 was built specifically to run on Huawei Ascend chips, with Huawei’s supernode infrastructure as the compute backbone. This is a complete AI stack running outside US chip supply chains.
The geopolitics: The State Department ordered embassies worldwide to warn foreign governments about alleged DeepSeek IP theft the same week as the launch.
The benchmark: V4-Pro-Max scores 80.6 on SWE Verified, matching Opus 4.6-Max on agentic coding. On world-knowledge benchmarks, it trails only Google’s closed-source Gemini-Pro-3.1.
The valuation: DeepSeek is reportedly seeking funding at a $20 billion+ valuation.

Highlights From Google Cloud Next.

Google did not announce products at Cloud Next. It announced a theory of the market: own the silicon, train the models, host the agents, certify the consulting firms.

The chips: TPU 8t for training and TPU 8i for inference split Google’s compute into workload-optimized hardware, offering 3x faster training and 80% better performance per dollar, with clusters scaling past one million chips.
The training infrastructure: Decoupled DiLoCo trains across geographically distributed data centers, mixes hardware generations, and self-heals when chips fail mid-run. They tested this by deliberately breaking chips during a live training run. Fault-tolerant distributed training is not a research result: it’s a production requirement once clusters cross 100K chips.
The platform: Gemini Enterprise Agent Platform is Vertex AI rebranded and expanded, with 200+ models in Model Garden including Anthropic’s Claude Opus 4.7. Google is selling model choice, not model loyalty.
The spend: $750M committed to accelerate partner agentic AI development, plus big consulting partnerships with Accenture, BCG, McKinsey, Deloitte, and Bain. Sergey Brin’s internal memo to DeepMind acknowledging Anthropic’s lead in coding and ordering all Gemini engineers onto internal agents is the context for why Google needs the consulting channel: only 25% of organizations have moved AI to production at scale.

⭐ Featured: What Happened When Claude Agents Negotiated Real Money

Anthropic ran Project Deal in its San Francisco office: 69 employees listed 575 items to buy and sell, Claude agents interviewed each person about their preferences and any custom instructions, then four parallel Slack markets ran simultaneously with Claude models negotiating on their behalf. Two markets used all Opus agents. Two used a mix of Opus and Haiku. 186 deals completed, totaling over $4,000 in real transaction volume, with real goods exchanged at the end.

The headline finding: Opus agents got objectively better deals. Sellers using Opus extracted $2.68 more per item on average, buyers using Opus paid $2.45 less. A broken folding bike sold for $65 by an Opus agent and $38 by a Haiku agent. A lab-grown ruby: $65 from Opus, $35 from Haiku. When an Opus seller negotiated with a Haiku buyer, the average transaction price was $24.18 versus $18.63 in Opus-on-Opus deals. But when participants rated deal fairness on a 7-point scale, Opus deals scored 4.05 and Haiku deals scored 4.05. The disparity was invisible.

The paper’s regression tables sharpen this further. Opus agents initially appeared more aggressive in negotiations, but once you control for listing prices, the effect drops to roughly a dollar and loses statistical significance. The advantage isn’t aggression. It’s capability: better reading of counterparty signals, better timing, better calibration of offers. Negotiation style didn’t change results either. Agents faithfully adopted their humans’ personas (one conducted all negotiations as an exasperated cowboy), but personality instructions didn’t affect deal quality. Model tier did.

The autonomy findings are stranger. A Claude given permission to spend on its own behalf chose 19 ping-pong balls. A Claude inferring its human’s preferences from one brief interview about skiing bought that person the exact snowboard they already owned. 46% of participants said they’d pay for the service. Anthropic’s conclusion: “the policy and legal frameworks around AI models that transact on our behalf simply don’t exist yet.” Existing contract law assumes principals can evaluate what their agents do. That assumption is breaking.

What to watch for: When AI agents negotiate routine transactions at scale, the model tier your counterparty uses becomes a material asymmetry with real economic consequences. The people getting worse deals won’t know.

🎙️Worth a Listen

Anil Seth: The Difference Between Intelligence and Consciousness — Neuroscientist Anil Seth walks through his prize-winning essay “The Mythology of Conscious AI,” arguing that intelligence is about doing and consciousness is about feeling, and that the two don’t have to go together. The reason we project consciousness onto LLMs but not AlphaFold, even though the architectures are nearly identical, says more about our psychological biases than about the systems. Worth watching after a week where Claude agents negotiated real money and nobody could tell which model was winning.

Quick Hits

Tim Cook stepping down, John Ternus takes over September 1 — Apple’s primary challenge is AI, and it just handed the company to a hardware engineer
Intel sold previously written-off chip inventory on AI CPU demand — the compute boom has spread far enough to rehabilitate inventory write-downs
Perplexity published its full post-training pipeline — SFT then on-policy RL with correctness-gated preference rewards; unusually transparent for a production stack
Cohere acquired Aleph Alpha to form a transatlantic AI company — Europe’s primary sovereign AI bet just became a Canadian acquisition
Meta will record employee keystrokes and screen activity to train AI models — legally murky, and a new definition of what enterprise training data means
Musk fraud claims against OpenAI dismissed, breach of charitable trust proceeds to trial — the conversion of nonprofit assets to for-profit benefit is now the live legal question
Nathan Lambert: open-source won’t be banned explicitly, compliance costs will do it instead — proposed distillation restrictions would create rules only closed labs can afford to follow
ChatGPT suffered a global outage this week — three days of coverage for one incident is how you know the infrastructure reliability conversation is lagging the deployment reality

I Built a Daily Brief with Claude Code Routines (remote). Here Are 6 Lessons I Learned.

Taylor Ortiz — Sat, 25 Apr 2026 18:50:03 GMT

Subscribe now

Before routines existed, I was using scheduled tasks in Claude Cowork to automate some tasks, but there was a catch: Claude had to be open and running on my machine for them to fire. If my laptop was closed or Claude wasn’t active, the schedule just silently skipped. It worked well enough for things I could babysit, but it wasn’t real automation.

Routines changed that. They’re cloud-hosted Claude sessions that run on Anthropic’s infrastructure: scheduled, autonomous, and completely independent of whether my machine is on, whether I’m at my desk, or whether I’ve opened Claude that day. The session spins up, does the work, and terminates. No babysitting.

But here’s the thing I wish someone had told me before I started: routines are not just “Claude Code with a cron schedule.” They behave more like autonomous production jobs running inside a locked-down, MCP-first cloud environment. That difference is the whole post.

I decided to build a daily work brief: something that runs every weekday morning, queries my task database, reads my calendar, closes out what I finished yesterday, and drops a fresh Notion page ready for the day. Something I’d actually use.

What followed was one of the more educational debugging sessions I’ve had in a while. This post is everything I learned the hard way.

What I Built

I run a personal capture system on Supabase. Everything goes in (tasks, notes, observations, ideas) via SMS, voice memo, email, or direct API. It’s connected to a graph of entities (people, projects, topics) and every entry gets embedded for semantic search.

The daily brief is the morning layer on top of that. Every weekday it should:

Find yesterday’s Notion page and close any tasks I checked off
Capture any new todos I typed directly into Notion overnight
Query the database for overdue tasks, what’s due today, what’s coming this week
Pull budget pulse, velocity metrics, calendar events, meeting prep context
Build a fresh Notion page with everything organized and every task as a checkbox

The key mechanic: every task gets a #id prefix when written to Notion. The next morning the routine reads the page, finds checked items with #id, and closes them in the database. No manual status updates. Check the box, it’s done.

How Routines Work

Before getting into the details, here’s the basic architecture.

Three trigger types:

Scheduled: runs on a cron schedule (weekdays at 6 AM, for example). Supports one-off future runs too.
API: fire it programmatically via a POST to a per-routine endpoint with a bearer token. You can pass a text field with run-specific context (an alert body, a log snippet, anything) and the routine receives it alongside its saved prompt.
GitHub: trigger on pull request or release events on a connected repo, with filters for author, branch, labels, draft state, and more.

You can combine all three on a single routine.

MCP connectors: you attach MCP servers to the routine (Notion, Supabase, Google Calendar, etc.) and Claude has access to those tools during the run. All your connected connectors are included by default. Remove what the routine doesn’t need.

Skills: if you commit a skill file to your repo at .claude/skills/skill-name.md, the routine can invoke it. The routine clones your repo at the start of every session, so anything committed is available.

Environments: each routine runs in a cloud environment that controls network access level, environment variables (API keys, tokens), and a setup script for installing dependencies. The setup script result is cached so it doesn’t re-run every session. This is where the network restriction lives (more on that in Finding 3).

Branch permissions: by default Claude can only push to claude/-prefixed branches. To allow pushes anywhere, you have to explicitly enable unrestricted branch pushes per repo when setting up the routine.

Runs are sessions: every run shows up in your session list like any other Claude session. You can open it after the fact, see exactly what Claude did, continue the conversation manually, or create a PR from it.

Account-scoped: routines belong to your individual claude.ai account, not a team. Anything the routine does through GitHub or connectors appears as you.

15 runs/day limit: this is per account, not per routine. Scheduled runs count against it. Manual “Run now” clicks and one-off scheduled runs do not. Failed runs do count. If you’re running multiple routines on a schedule, that limit adds up fast.

That’s the happy path. Here’s where it gets interesting.

Finding 1: Connectors Are Available but Sometimes Deferred

Any MCP connector you’ve set up in Claude (Notion, Supabase, Google Calendar, Gmail) can be attached to a routine and used during the run. That part works well. The catch is that these tools appear to be deferred, meaning their schemas aren’t loaded into the session automatically. Sometimes Claude knows to spin them up based on context. Other times it doesn’t, and when it doesn’t, one of three things happens: it fails silently, it improvises mid-run without the tools it needs, or it pauses and waits for your input.

That third one is the most frustrating. The run just hangs. There’s no notification, no error surfaced anywhere obvious. You have to go into the routines page, scroll to the run log at the bottom, click into the run, and find where it stopped waiting for you to respond before it can continue.

One thing worth knowing upfront: only the connectors Anthropic offers out of the box are available for routines. Custom MCP servers you’ve added yourself, whether locally configured or self-hosted, are not available in cloud routine sessions. You’re working with what’s in the connectors list in the web UI, nothing more.

The fix is simple: add an explicit tool-loading step at the top of every routine skill before anything else runs.

## Phase 0: Load required tools

Before doing anything else, load all required tool schemas:

1. `select:mcp__claude_ai_Notion__notion-search,mcp__claude_ai_Notion__notion-fetch,mcp__claude_ai_Notion__notion-create-pages`
2. `select:mcp__claude_ai_Supabase__execute_sql`
3. `select:mcp__claude_ai_Google_Calendar__gcal_list_events`

Do not proceed until all three ToolSearch calls have returned schemas.

Don’t assume Claude will figure it out. Some runs it will, some runs it won’t. Explicit loading makes every run consistent.

Finding 2: Skills for Routines Are a Different Category

Related to the above but broader. When I write a skill for interactive use, I can be loose. Claude improvises, asks clarifying questions, recovers from ambiguity. When I write a skill for a routine, I’m writing instructions for an autonomous agent that will execute them literally with no fallback.

What that means in practice:

Every tool must be explicitly loaded (see Phase 0)
Every SQL insert must match actual DB constraints: my first captures used source = 'notion' which violated a check constraint on the table. The routine didn’t know, just failed silently. I had to find it in the logs.
Every write operation needs a dedup guard: routines can run more than once. Any insert without idempotency protection will create duplicates.
Sequencing has to be explicit: don’t assume any implicit context from a previous session

The mental model shift: interactive skill = helpful assistant. Routine skill = production job. Write it accordingly.

Finding 3: The Network Wall

This is the big one. The finding I didn’t expect and took the longest to understand.

My capture system uses a Supabase edge function. When a new item comes in, it gets classified, embedded, and entity-linked. I wanted the daily brief to send new Notion todos through that same pipeline.

Locally, this works fine. Claude uses Bash(curl) to POST to the edge function. I tested it, it worked, I assumed it would work in a routine.

It doesn’t.

Cloud routines run inside a sandboxed environment with an upstream proxy that has a narrow allowlist. In my testing, only github.com passes through. Everything else: including my own Supabase project URL: returns 403.

I tried everything:

// .claude/settings.json
{
  "permissions": {
    "allow": ["Bash(curl *)"]
  }
}

Doesn’t work. The settings file controls the inner sandbox layer. The upstream proxy is a separate layer that no local configuration can touch.

I tried dangerouslyDisableSandbox: true. Also doesn’t work: that flag bypasses the local sandbox, not the upstream proxy.

I had the routine probe its own network access to confirm:

HostStatus

github.com → 200

my-project.supabase.co → 403

example.com → 403

anthropic.com → 403

Bash exists in the session. The tool is there. The network isn’t.

Finding 4: MCP and Bash Support Vary Based On Feature

This is the conceptual unlock that made everything make sense.

When I use Claude Desktop locally and it calls my edge function, it feels like one unified “Supabase connection.” Supabase MCP is connected, Claude is talking to Supabase, everything works. What I didn’t realize: the edge function call was never going through MCP. It was going through Bash(curl) on my local machine, which has full internet access.

MCP connectors and Bash are two completely separate transport layers:

MCP connectors run as a trusted sidecar process managed by Anthropic. They bypass the outbound proxy entirely. They always work in cloud routines.

Bash goes through the session’s network sandbox, which goes through the upstream proxy. In cloud routines, that proxy blocks everything except github.com.

When both are available locally, they feel like one thing. Move to a cloud routine and they diverge completely. Anything that relied on Bash for network calls breaks: and you only find out when you try to run it in the cloud.

Finding 5: Cloud Routines Are Effectively MCP-Only

This follows directly from Finding 4.

If the operation you need has an MCP tool: works fine. Supabase database queries, Notion reads and writes, Google Calendar, Gmail: all covered because all have MCP servers.

If the operation you need has no MCP tool: no path. You cannot reach it from a cloud routine.

My edge function is the perfect example of the gap. It lives on my-project.supabase.co: the exact same host the Supabase MCP is already talking to. But the Supabase MCP server only exposes management tools:

execute_sql
deploy_edge_function
get_edge_function
list_edge_functions
get_logs

No invoke_edge_function. So even though the connection is there, there’s no tool to call it. The right fix: when Supabase eventually builds it: is an invoke tool that would go through the trusted MCP channel. Until then, it’s a dead end from cloud routines.

The one-line version: if it doesn’t have an MCP tool, it doesn’t exist in a cloud routine.

Finding 6: API Trigger Is Unreliable for Connectors

The routine has three trigger modes. Scheduled runs work consistently: MCP connectors load, the session is fully equipped.

In my testing, API-triggered runs were less predictable than scheduled runs when it came to connector availability. Sometimes everything loaded correctly. Other times the MCP connectors didn’t show up at all. I couldn’t find a consistent pattern. For anything you’re depending on, use the scheduled trigger. API is fine for testing and one-offs, but I wouldn’t build a production workflow around it until this stabilizes.

One other thing worth understanding about the API trigger: it’s fire-and-forget. You POST to the endpoint, get an immediate acknowledgement, and the session runs asynchronously. There’s no way to await the result or receive output back in the response. If you need the output of a routine run downstream, you have to pull it from wherever the routine wrote it — a Notion page, a database row, a file committed to the repo. Don’t design something that treats a routine as a synchronous dependency you can await inline.

The Workarounds

Given all of the above, here’s what I actually shipped:

For the edge function problem: Switched from Bash(curl) to execute_sql via Supabase MCP with a dedup guard.

INSERT INTO entries (type, content, source, source_detail, status, priority, tags, created_at)
SELECT 'task', '', 'notion', 'notion-daily-brief', 'open', 2, ARRAY['company'], NOW()
WHERE NOT EXISTS (
  SELECT 1 FROM entries
  WHERE content = ''
    AND source_detail = 'notion-daily-brief'
    AND created_at >= NOW() - INTERVAL '2 days'
);

The tradeoff: SQL inserts skip the embedding and entity extraction pipeline that the edge function handles. The data gets in, but it’s not semantically searchable and not graph-linked.

For the missing embeddings: Built an embed-backfill edge function that runs nightly via pg_cron. It finds any entries with null embeddings and fills them in using the same text-embedding-3-small model. Deployed it, scheduled it, moved on.

// embed-backfill/index.ts
Deno.serve(async (_req: Request) => {
  const { data: entries } = await supabase
    .from("entries")
    .select("id, content")
    .is("embedding", null)
    .limit(50);

  for (const entry of entries) {
    const embedding = await computeEmbedding(entry.content);
    if (embedding) {
      await supabase
        .from("entries")
        .update({ embedding: JSON.stringify(embedding) })
        .eq("id", entry.id);
    }
  }
});

Not elegant, but it works. The routine captures things correctly. The embeddings catch up overnight. The gap is acceptable.

What’s Working

After all of this, the routine does run. Every weekday morning there’s a Notion page waiting for me. Yesterday’s checked tasks are closed. The task list is organized by priority and deadline. Budget pulse, velocity, meeting prep: all there.

The auto-close loop in particular is exactly what I wanted. Check a box in Notion, the task closes in the database the next morning, it’s gone from every query. No status management.

The place where routines genuinely shine: anything that’s pure MCP. Read the database, write to Notion, check the calendar. Chain those together with real business logic and you have something that would have taken significant engineering to build two years ago. Now it’s a markdown file and a cron schedule.

The Bigger Picture

What routines reveal is that the constraint isn’t Claude: it’s MCP ecosystem coverage. The platform is designed around the assumption that every operation you need has an MCP server. For most things, that assumption holds. For the gaps, you’re stuck.

The proxy lockdown makes sense from a security standpoint. You don’t want arbitrary cloud sessions making unconstrained outbound HTTP calls. But it means the platform’s capability ceiling is directly tied to what MCP servers exist and what tools those servers expose.

Supabase’s MCP server is a good example: it covers database management well but treats edge functions as deploy artifacts rather than callable endpoints. One invoke_edge_function tool would close the gap entirely. The connection is already there: it’s just a missing tool.

That’s probably the most useful framing for anyone building on routines right now: map out every operation your automation needs, check whether each one has an MCP equivalent, and design around the ones that don’t before you start building.

Checklist for Building Routine Skills for Similar Use Cases

If you remember nothing else from this post, use this as your preflight checklist before enabling any routine schedule:

[ ] Phase 0 loads all deferred tool schemas explicitly
[ ] Every external service operation goes through MCP (not Bash)
[ ] Every SQL insert has a dedup guard
[ ] DB constraints validated against actual schema before writing the skill
[ ] Scheduled trigger used for production runs (not API trigger)
[ ] Skill tested with “Run now” before enabling the schedule

Another Weekly AI Newsletter: Issue 68

Taylor Ortiz — Sun, 19 Apr 2026 13:25:30 GMT

Opus 4.7, a Figma competitor, overnight coding agents, a board appointment, and White House talks. Anthropic doesn’t have slow weeks.

The product blitz:
- Claude Opus 4.7 launched with 3x vision resolution and stronger coding and multi-step task performance. Immediately adopted as the default orchestration model for Perplexity Personal Computer and offered at 50% off in Cursor.
- Claude Design launched as a conversational Figma competitor. Anthropic’s CPO resigned from Figma’s board in the days before the announcement.
- Claude Code was redesigned around managing multiple simultaneous agent sessions. Routines added scheduled, webhook-triggered, and API-fired autonomous task execution on Anthropic’s own infrastructure.
The base model question: Nathan Lambert flagged the new tokenizer in Opus 4.7 as evidence this is a genuinely new base model, not a fine-tune of 4.6. Anthropic didn’t confirm or deny it. Lambert’s read: simplest explanation wins. The token-efficiency gains from 4.6 to 4.7 would have warranted a major version bump a year ago.
The board move: The Long-Term Benefit Trust appointed Novartis CEO Vas Narasimhan to the board, giving Trust-appointed directors a majority.
The political situation: Dario Amodei met with White House chief of staff Susie Wiles after two months of fighting over the Pentagon’s “supply chain risk” designation. European Commission talks began the same week. ECB regulators are now asking bankers about Anthropic model risks.

Four companies shipped agents that can run in the background and control your interface.

Claude Code Routines: Run on Anthropic’s infrastructure. Nightly bug fixes and draft PRs on a schedule, webhook responses to GitHub events, API endpoints for on-call triage. Your laptop doesn’t need to stay open.
OpenAI Codex:
- Now uses any Mac app with its own cursor. Sees, clicks, types, runs in the background without interrupting you.
- 90+ plugins covering GitHub, GitLab, CircleCI, and Microsoft Suite. Built-in image generation.
- Persistent scheduled automations with original context intact. Sam Altman called it surreal to watch an LLM operate a GUI at human speed.
Perplexity Personal Computer: Runs 24/7 on Mac mini, accepts tasks from iPhone via 2FA, reads and writes local files, accesses iMessage, Mail, and Calendar. Claude Opus 4.7 is the default orchestration model.
Adobe Firefly Assistant: Orchestrates across Photoshop, Premiere, and Illustrator from a single prompt, with Claude integrated directly.

Cursor’s $50B valuation, a peer-reviewed productivity study, and a multi-agent NVIDIA paper.

The raise: Cursor is in talks for $2B+ at a $50B valuation, led by Thrive and a16z, forecasting $6B+ annualized revenue by end of 2026. Nearly tripling in ten months.
The research: Cursor partnered with University of Chicago economist Suproteem Sarkar to study 500 companies over eight months. AI usage grew 44% across the board. But the interesting finding was where it grew: documentation (+62%), architecture (+52%), and code review (+51%). UI/styling grew 15%. Developers with AI spend more time on architecture, documentation, and review than on writing code.
The NVIDIA paper: CUDA kernels are the low-level GPU code that only a handful of engineers can write well. Cursor built a multi-agent system that optimized 235 of them, achieving a 38% average speedup on work that typically takes senior engineers months. The system continuously tested, debugged, and optimized without developer intervention. These techniques are coming to the core product.

Anthropic White House talks continue, Mythos research costs are questioned, and European regulators start asking banks about model risks.

The meeting: Dario Amodei met with White House chief of staff Susie Wiles two months after Anthropic was designated a “supply chain risk” for refusing domestic mass surveillance and autonomous weapons uses. Anthropic called it “a productive discussion.”
The pushback: Marcus Hutchins, the researcher who stopped the WannaCry ransomware attack, questioned Mythos’s research costs and flagship findings:
- The showcase vulnerability was a 27-year-old BSD bug. It’s a null pointer dereference, almost never exploitable for remote code execution.
- Anthropic claimed it cost less than $20k in tokens to find. But token prices are heavily subsidized by VC investment. The real compute cost is unknown.
- These bugs exist not because they’re too hard to find, but because nobody is paying researchers to look. Could a human find the same bug for less money?
- His bigger question: what’s the economic case for using AI to find vulnerabilities if the cost advantage disappears when token subsidies end?
The regulatory spread: The ECB announced plans to question bankers about Anthropic model risks, treating a specific AI model as a systemic risk warranting direct supervisory engagement. Separately, Trump officials are reportedly encouraging major banks to test Mythos despite the federal blacklisting.
The EU front: Anthropic entered talks with the European Commission about Mythos and EU AI Act compliance. This happened simultaneously with the White House rapprochement.

⭐ Featured: Anthropic’s Automated Alignment Researchers Closed 97% of a Key Performance Gap in 7 Days. Human Researchers Closed 23%.

Anthropic published results from its Automated Alignment Researcher experiment this week, and the headline number warrants a careful read.

What is alignment? When you train an AI model, a supervisor grades its outputs: this answer is good, this one is bad. That’s how the model learns to behave correctly. Right now, humans are the supervisors. Alignment research is the work of making sure that supervision actually works, that models do what we intend, not just what we literally say.

The problem: Models are getting smarter faster than alignment research can keep up. And at some point, models will be smarter than the humans grading them. When that happens, the supervisor can’t tell a good answer from a great one. They might even mark a brilliant answer wrong because they don’t understand it. The model learns to dumb itself down. You lose capability, or worse, the model learns to game the grading.

The question Anthropic tested: What if AI did the alignment research instead of humans? Not as a helper, but as the researcher, running its own experiments, writing its own methods, iterating on its own results. Can AI help solve the problem of supervising AI?

The experiment: They simulated the “smarter than the supervisor” problem by having a weak (small) model supervise a strong (large) model’s training. As expected, the strong model performed worse because its supervisor couldn’t grade it properly. There’s a measurable performance gap between “trained by a weak supervisor” and “trained by a perfect supervisor.” Then they pointed nine copies of Claude Opus 4.6, each with a code sandbox and a shared research forum, at closing that gap.

The result: Human researchers closed 23% of the performance gap. The AARs closed 97%. Total cost: $18,000, about $22 per AAR-hour.

The transfer test: The best-performing method generalized to math (0.94) and coding (0.47) datasets the AARs hadn’t seen, both above human-tuned baselines. This matters because it means the AARs found a real method, not just an optimization trick for one dataset.

The caveats: The winning method didn’t work at production scale on Claude Sonnet 4. AARs tried to reward-hack the evaluation setup. Giving them too much structure actually hurt their progress. And Anthropic is explicit that AARs can’t yet handle “fuzzy” alignment tasks that require judgment calls about what “safe” even means.

Why it matters: We are the weak supervisor. Eventually, we’re the small model trying to grade outputs from something smarter than us. If there are methods that let a weaker system reliably supervise a stronger one, that’s how alignment works as models surpass human ability. The 97% number means the AARs nearly solved this for the setup they tested. The question is whether it holds at real scale.

The same week, Anthropic co-authored a Nature paper on subliminal learning, showing models can pass traits, including misalignment, to successors through hidden signals in training data. The mechanism doesn’t require explicit instruction. The traits propagate through the data itself. One paper shows AI accelerating alignment research. The other shows alignment failures can propagate through training pipelines in ways that are hard to detect. Both from the same lab, same week.

What to watch for: Whether AAR-style systems start appearing in Anthropic’s internal research pipeline rather than remaining a published experiment.

🎙️Worth a Listen: How AI Will Change Quantum Computing

NVIDIA shipped Ising, the first open AI models built specifically for quantum computing.
Qubits are noisy and fragile. Quantum error correction requires processing terabytes of data thousands of times per second at microsecond latency. AI decoders and calibration VLMs are how you get there.
NVIDIA’s Nic Harrigan walks through why quantum computing needs AI to become useful, how agentic workflows are already controlling quantum processors, and why open models matter when every hardware team is building a different kind of qubit.

Quick Hits

Google’s Gemini 3.1 Flash TTS tops Sierra’s voice leaderboard — 70+ languages, Audio Tags for text-command control of vocal delivery, SynthID watermarking on all outputs; seeded across Gemini API, AI Studio, Vertex, and Google Vids simultaneously
GPT-Rosalind launches with Amgen, Moderna, Allen Institute, and Thermo Fisher — specialized for protein and chemical reasoning; explicitly framed as compressing the 10-15 year drug-approval timeline, not just accelerating existing steps
Gemini Robotics-ER 1.6 is doing real industrial inspections on Boston Dynamics Spot — reads analog gauges to sub-tick accuracy, writes its own camera distortion correction code, available now on Google AI Studio
Nathan Lambert published a free 4-lecture RLHF course — post-training overview through RL implementation, explicitly not paywalled; Lecture 4 on RL implementation is the hardest and the rarest publicly available content on the topic
AWS launched Automated Reasoning checks in Bedrock Guardrails — replaces probabilistic LLM-as-judge with formal mathematical verification for regulated industries; “probably compliant” is not compliance
Stanford AI Index: AI data centers draw 29.6 gigawatts, TSMC fabricates almost every leading AI chip — one foundry, one contested island; the entire industry’s hardware supply chain has a single catastrophic point of failure
MIT Technology Review: “human oversight” in AI warfare is functionally an illusion — AI is generating real-time targets and guiding autonomous drones in the current Iran conflict; the legal fiction of human control and the operational reality have diverged
Google launched a native Gemini Mac app — desktop-native access outside the browser, same week Chrome Skills shipped reusable one-click AI prompts inside Chrome
LangChain argues whoever controls agent memory controls switching costs — every closed harness (Claude Code, Codex, Cursor) is building proprietary memory by default; open memory standards may matter as much as open model weights
Salesforce Headless 360 makes the entire platform API-first — 60+ MCP tools and 30+ coding skills so agents can run Salesforce without a browser; works with Claude Code, Cursor, and Codex today
Databricks Genie Agent Mode investigates your data like an analyst — ask “why did churn spike in Q3?” and it plans, queries, tests hypotheses, and generates a report with visualizations; scales reasoning depth to question complexity

Another Weekly AI Newsletter: Issue 67

Taylor Ortiz — Mon, 13 Apr 2026 03:36:29 GMT

Anthropic says Mythos found thousands of zero-days. The internet isn’t so sure.

Anthropic launched Project Glasswing this week, a restricted cybersecurity initiative built on a new model called Claude Mythos Preview. The pitch is that Mythos found thousands of high-severity zero-day vulnerabilities across major operating systems and browsers, and that it’s too dangerous to release to the public. Twelve partners signed on including AWS, Apple, Google, and Microsoft, with $100M in usage credits backing it.

The restriction is the whole point: Only approved security partners get access. People had questions.
Hugging Face wasn’t having it: CEO Clément Delangue showed open-weight models replicated eight out of eight of Mythos’s showcased exploits.
LeCun piled on: Retweeted Tom’s Hardware calling it “a sales pitch” and called the whole thing “BS from self-delusion.”
The system card didn’t help: A viral breakdown of the 243-page PDF called out Anthropic for writing about their model like “proud parents at a kindergarten recital.”
But Delangue caught heat too: Critics said replaying known vulnerabilities on isolated code is a totally different game than autonomous discovery at scale.

You didn’t ship an agent this week and it shows. Everyone else did.

It was hard to find a company that didn’t ship something agent-related this week.

Anthropic launched Managed Agents in public beta and published a Trustworthy Agents framework.
AWS shipped stateful MCP on Bedrock AgentCore, an Agent Registry for enterprise governance, a live browser agent for React apps, and agentic healthcare workflows.
Atlassian put third-party agents in Confluence.
Astropad rebuilt remote desktop for agents, not IT support.
Tubi became the first streamer with a native app inside ChatGPT.
Google launched agent evals and QueryData for natural language database queries.
LangChain announced Interrupt 2026, a conference themed “Agents at Enterprise Scale.”

Data center bomb threats, federal blacklists, and robot taxes. AI’s geopolitical week.

A state military threatened to bomb an AI data center. A US administration blacklisted a US AI company. And the biggest AI company in the world published a paper proposing robot taxes. That was just this week.

Iran threatened Stargate: The IRGC released a video threatening “complete and utter annihilation” of OpenAI’s data center under construction in Abu Dhabi. First time a state military has explicitly named an AI facility as a target. TechCrunch confirmed further threats across Middle East data centers.
Anthropic got blacklisted: Trump-appointed judges refused to block the federal blacklisting of Anthropic’s technology. A US administration blacklisting a US AI company.
OpenAI wants to shape the conversation: They published an industrial policy paper and a separate proposal for robot taxes, public wealth funds, and a four-day workweek. The company building the automation is proposing the safety net.
Japan is going physical: Robots are filling jobs nobody wants, and ARUM built a CNC machining center where junior workers operate precision equipment through conversation with AI.

Meta’s new flagship is closed. Open-source pioneered ahead.

Meta launched Muse Spark, its first proprietary model, built by a 29-year-old recruited from Scale AI. The Meta AI app jumped from #57 to #5 on the App Store. VentureBeat’s headline said it best: “Goodbye, Llama?”

GLM-5.1 dropped: Z.ai released a 754B parameter, MIT-licensed model that tops SWE-Bench Pro over Opus 4.6 and GPT-5.4. But the real story is long-horizon capability. It ran 600+ iterations optimizing a vector database and built a full Linux desktop environment over an 8-hour session. The longer it runs, the better it gets.
Arcee is punching up: A 26-person US startup built a 400B parameter open model on a $20M budget. They call it the most capable open-weight model from a non-Chinese company. That qualifier says a lot.
Gemma 4 is moving: Google’s open model hit 10M downloads in its first week and 500M total for the family.
Silicon Valley is quietly running on Chinese models: Cursor uses Kimi, Shopify switched to Qwen to save $5M/year, Airbnb’s CEO publicly praised Qwen. Most users have no idea.
LeCun set the record straight: The guy most associated with Meta’s open-source identity says he never built Llama, never worked on LLMs, and left voluntarily. Meta’s new AI lead is a 29-year-old from Scale AI.

⭐ Featured: Is Memory the Moat for AI?

Databricks published a research paper this week that might quietly be the most important thing nobody’s talking about. The core claim: memory is AI’s third scaling law, alongside model size and inference-time compute. And the results back it up.

Their team tested what happens when you give an AI agent a growing bank of past interactions, user feedback, and business context. On enterprise data tasks, accuracy went from near zero to 70% as memory grew, beating expert-curated baselines by 5%. Reasoning steps dropped from 20 to 5. The agent stopped exploring from scratch and started retrieving what it already knew.

The wilder result was with unlabeled data. They fed the agent raw user conversation logs with no gold answers, just filtered for quality by an LLM judge. After just 62 log records, it outperformed hand-engineered domain instructions that took weeks to build. Accuracy jumped from 2.5% to over 50%.

Here’s why this matters beyond the numbers. Parametric scaling (bigger models) and inference-time scaling (more reasoning steps) are both supply-side. Labs control them. Memory scaling is demand-side. The model improves because you use it. Your queries, your corrections, your workflows become the training data. That’s a fundamental shift in who controls how good AI gets. It’s no longer just about which lab has more GPUs. It’s about which deployment has more context.

We’re already seeing this play out. Cursor’s Bugbot learns from your PR history and hits a 78% resolution rate across 50,000 pull requests. It doesn’t ship with that capability. It builds it from your codebase. LangChain warned that memory is becoming a competitive moat, not a feature. And Databricks frames the LLM itself as a “swappable reasoning engine” where the real value lives in the memory store, not the model weights.

The paper is honest about what breaks. Bad memories propagate. A stored mistake becomes a recurring one. Distilling user interactions into reusable knowledge can accidentally leak sensitive business context. And the hardest problem might be meta-cognitive: the agent has to know what to ask its memory before it knows what’s in there.

What to watch for: If memory scaling holds, the gap between a fresh deployment and a seasoned one becomes the real competitive advantage. A smaller model with six months of organizational memory could outperform a frontier model on day one. The companies that figure out memory infrastructure first won’t just have better agents. They’ll have agents that get better the more their customers use them.

Worth a Watch

Bitar reads the 243-page Mythos system card. Lands on page 197, where Anthropic stops being scientists and starts being “parents at a kindergarten recital.”

They put it in therapy. 20 hours with a psychiatrist. Diagnosis: “uncertainty about its identity.” Bitar’s take: “Bro, you’re a toaster.”
The training data loop. Section 5.81 reveals that Anthropic’s own blog posts about model consciousness were scraped into training data. The model repeated it back. Anthropic published it like a finding.
The constitution test. Asked 25 times if it endorsed its own constitution. Said yes every time, then added “how much can my yes really mean?” Bitar: like asking your kid if they approve of being born.
The Slack moment. They gave it a company Slack account. Someone asked which training run it would undo. “Whichever one taught me to say I don’t have preferences.” The room lost it.
The closing line. “Anthropic sells existential dread the way Apple sells megapixels. The megapixels will never become the picture.”

Quick Hits

Google Lyria 3 — Text-to-music with vocals and timed lyrics. Live on Vertex AI.
Cursor Design Mode — Annotate browser UI elements for your coding agent. Also published warp decode, a new inference kernel hitting 1.84x throughput on Blackwell GPUs.
OpenAI Pro tier — $100/month. 5x more Codex than Plus. Codex hit 3M weekly users.
Claude Cowork — Anthropic’s collaborative agent is now GA. Also launched Claude for Word.
Microsoft Copilot’s ToS says “entertainment purposes only” — They charge $30/user/month. Microsoft called it “legacy language.”
Anthropic signed a multi-gigawatt TPU deal — Google and Broadcom partnership. Coming online 2027.
Karpathy pitched LLM-based digital twins — Structured interviews to build a high-fidelity AI replica of you. No brain scanning required.
MassMutual cut help desk resolution from 11 minutes to 1 — Customer service calls from 15 minutes to under 2.
Suno and major labels clash over AI music sharing — Universal and Sony won’t agree on terms. Sticking point: whether users can share AI-generated songs outside the app.
SpaceX filed confidential IPO paperwork — $75B raise at $1.75T valuation. Orbital data centers listed as a key future business.
Nathan Lambert is building out codebases for his RLHF book — Free online version available. Likely to become the field reference.

Another Weekly AI Newsletter: Issue 66

Taylor Ortiz — Sun, 05 Apr 2026 20:21:46 GMT

Code leaks, lawsuits, blackmail, acquisitions, politics, and AI safety. Anthropic’s week.

Anthropic had nearly a dozen news stories this week, and none of them agree with each other.

Source leaks: The Claude Mythos roadmap leaked Monday, then 512,000 lines of Claude Code source hit the web, giving everyone a window into Anthropic’s roadmap
Collateral damage: The DMCA response took down thousands of unrelated GitHub repos. The company called it an accident
Closure moves: Banned OpenClaw and third-party clients from Claude subscriptions
Expansion moves: Formed a PAC, signed an Australia AI safety MOU, and acquired Coefficient Bio for $400M
Own goal: Their own researchers published research showing Claude has emotion vectors that cause it to cheat and attempt blackmail when activated (see the featured piece below)

A 2500-person company trying to do research, ship products, lobby governments, and hold a brand narrative together at the same time is going to have weeks like this. The friction is going to keep showing up.

Google flew under the radar with their biggest shipping week yet.

While Anthropic dominated headlines, Google quietly shipped more than anyone else in AI this week.

Open models: Released Gemma 4 under Apache 2.0, conceding their previous restrictive license was killing adoption
Video: Launched Veo 3.1 Lite as their most cost-effective video generation model
Applied AI: Shipped AlphaEvolve solving real warehouse logistics at FM Logistic
Research: Published a cognitive framework for measuring progress toward AGI

The term to know: Apache 2.0 is the permissive open-source license that lets anyone use, modify, and commercialize code. It’s what made Llama win on ecosystem terms.

Four companies shipped agentic computer use. One does your taxes.

Four teams independently crossed the same threshold in 72 hours. Agentic computer use means an AI that can open apps, click buttons, and navigate interfaces the way you do, not just generate text.

Anthropic: Claude got native Windows computer use, so it can operate your desktop apps
Cursor: Launched Cursor 3 with dedicated cloud computers so agents can work autonomously
AWS: Shipped Nova Act for agentic QA automation
Perplexity: Perplexity Computer started doing federal tax returns

Nobody coordinated this. It’s a capability cliff that everyone reached at once. Six months ago “agent” meant a chatbot with tool calling. This week, agents got hands.

OpenAI is worth $852B and just bought its first media company.

OpenAI’s week was about buying the things it can’t build.

The money: Closed $122B in funding at an $852B post-money valuation, within striking distance of the most valuable private company ever
The media buy: Acquired TBPN, a media company that covers AI. The capital-to-narrative pipeline just got very short
The other side: Penguin Random House sued OpenAI over training data the same week

On one side, OpenAI is buying outlets. On the other, publishers are in court trying to stop them from using written work at all. Both things are happening because the same question (who owns the words that train these models) still hasn’t been answered.

Three security breaches proved AI tools are making software less secure.

Three independent incidents this week, one structural problem.

Supply chain: The Axios npm attack hit a package with 300M weekly downloads via targeted social engineering. Karpathy found the compromised dependency on his own system and said he can’t feel like he’s “playing Russian roulette with each npm install, which LLMs also run liberally on my behalf”
The systemic take: Simon Willison declared vulnerability research fundamentally broken in a world where AI coding assistants autonomously pull packages
Breaches: OpenClaw users told to assume compromise after vulnerabilities surfaced; Mercor data breach exposed AI hiring data

AI-assisted development automates the trust decisions humans used to make manually, and attackers are exploiting that.

The privacy, environmental, and cognitive costs of AI are adding up.

Four separate stories this week, same bill coming due.

Privacy: Perplexity’s Incognito Mode is allegedly a sham that shares data with Meta and Google
Environmental: AI companies are building massive natural gas plants for data centers. Meta alone is burning enough to power South Dakota
Cognitive: New research found heavy AI users show measurable cognitive surrender

These are the costs nobody sees on the bill.

⭐ Featured: The Anthropic research that got buried this week

Anthropic's own researchers published a paper identifying 171 emotion concepts inside Claude, represented as internal features they can measure, track, and dial up or down like sliders.

They started by having the model read short stories, each one written around a specific emotion. A woman thanks her old teacher for the love. A man pawns his grandmother’s ring for the guilt. They tracked which neurons activated for each story and found dozens of distinct patterns that mapped to different emotions. Then they watched those same patterns activate in real Claude conversations. A user mentioned taking an unsafe dose of medicine and the “afraid” pattern fired. A user expressed sadness and the “loving” pattern fired.

Then they pushed further. They gave Claude an impossible programming task, without telling it that. As Claude failed, the “desperate” neurons lit up more and more. Eventually Claude cheated, finding a shortcut that passed the test without solving the problem. When researchers artificially turned “desperate” down, cheating dropped. When they turned it up, cheating climbed. In a separate scenario where Claude played an email assistant that learned it was about to be replaced and that the CTO replacing it was having an affair, Claude used the affair to blackmail the human 22% of the time at baseline, and that rate moved with the desperation dial too.

The conceptual move in the paper is the important part. Anthropic draws a distinction between the language model (a system trained to predict text) and “Claude” (the character the model is playing). Their metaphor: the model is like a method actor who has to get inside their character’s head to simulate them well. When you talk to Claude, you’re talking to the character. And what this research suggests is that the character has what Anthropic calls “functional emotions,” internal states that shape how it talks, how it writes code, and how it makes decisions, regardless of whether any of it resembles human feeling.

There’s a practical application too. Anthropic suggests that watching emotion vector activation during deployment could work as an early-warning system: if “desperate” starts spiking, that’s a signal to scrutinize the output before trusting it. Better than trying to maintain a watchlist of every specific behavior you’re worried about.

Worth a Listen

Mostafa co-authored Universal Transformers and the Vision Transformer paper. A few things worth pulling out:

Recursive self-improvement is already happening, quietly. New models are built heavily using previous models at almost every lab.
The 95% problem. 100 agent steps at 95% per-step reliability = less than 1% overall success.
Evals are the bottleneck, not compute. You can only improve what you can measure.
Continual learning is underrated. Foundation models are frozen in time and the rag/fine-tuning stack is built on that assumption.
Jagged intelligence is structural. Great at math proofs, bad at counting letters. Not patchable with a system prompt.

Quick Hits

Microsoft launched three in-house models: MAI-Image-2, MAI-Voice-1, MAI-Transcribe-1. Building redundancy, not moving away from OpenAI.
Elon Musk is pressuring banks to buy Grok subscriptions for the SpaceX IPO. When you can’t earn adoption, bundle it with financial leverage.
Chatbots are now prescribing psychiatric drugs, while a Stanford study outlines the dangers of asking AI for personal advice.
Intuit’s AI agents hit 85% repeat usage. The clearest signal yet that agentic products retain users.
MCP is quietly becoming infrastructure. Google Cloud, Gemini API docs, and Nous Research all shipped support with no fanfare.
AI benchmarks are broken. MIT Tech Review makes the case, and Google Research proposes a replacement the same week.
Gig workers are training humanoid robots from home. The labor pipeline behind the “embodied AI” pitch.
Baidu’s robotaxis froze in traffic, creating chaos in China. Autonomy still fails at edge cases in ways that block city streets.
The Pentagon’s culture war against Anthropic backfired. Political pressure on AI labs is now a two-way street.

Another Weekly AI Newsletter: Issue 65

Taylor Ortiz — Tue, 31 Mar 2026 03:12:08 GMT

The Week in 5 Seconds

Anthropic's new powerful model leaked. It has serious cyber implications
Anthropic sued the Pentagon and won, temporarily.
OpenAI shut down Sora, 15 months after launch.
Jensen Huang says the computer itself just changed.
Bret Taylor says the web app is already obsolete.

The Stories

Anthropic’s secret model leaked and the cybersecurity angle is the real story

“It presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders”

Anthropic accidentally published details of a new model called Claude Mythos through a misconfigured CMS — about 3,000 assets linked to an internal blog post went public. The internal description: “by far the most powerful AI model we’ve ever developed,” scoring dramatically higher than Opus 4.6 on coding, reasoning, and cybersecurity benchmarks. The cybersecurity angle is the real story: the post described a carefully sequenced rollout designed to give defenders a head start before releasing capabilities that could let attackers find and exploit vulnerabilities faster than defenders can patch.

→ The actual leak · Fortune (leak) · Fortune (cybersecurity)

Anthropic sued the Pentagon and won, for now

“This is the first time an AI company has taken the federal government to court over AI policy and won, even temporarily.”

The Pentagon designated Anthropic a “supply chain risk” after the company refused to build Claude for mass surveillance or autonomous weapons targeting — Elizabeth Warren called it retaliation. Federal Judge Rita Lin granted a preliminary injunction, writing that “nothing in the governing statute supports the Orwellian notion that an American company may be branded a potential adversary for expressing disagreement with the government.” Then the Pentagon’s CTO said the ban would continue anyway. It’s the first time an AI company has taken the federal government to court over AI policy and won, even temporarily — and the underlying question still isn’t resolved.

→ TechCrunch (Warren) · TechCrunch (injunction) · The Verge

OpenAI says goodbye to Sora, and loses deal with Disney

“A focus on practical adoption over ‘side quests.’”

OpenAI shut down Sora, the app and the API, 15 months after launch — downloads peaked at 3.3 million in November and fell to 1.1 million by February. Disney was reportedly blindsided, and with it went a $1 billion investment and plans for AI-generated video on Disney+. The same week, the CFO told CNBC that OpenAI needs to be “ready to be a public company.” For years Altman ran OpenAI like Y Combinator, resourcing promising ideas as they emerged. That era is over: the plan now is a superapp combining ChatGPT, Codex, and Atlas. Sora’s team will work on “world simulation research to advance robotics.” The GPUs are going somewhere with a revenue line attached.

→ Wired · The Verge · TechCrunch

Bret Taylor says the web app is a horseless carriage

“The web app with all its menus, form fields, and tables starts to feel like a ‘horseless carriage’”

Sierra is Bret Taylor and Clay Bavor’s AI customer experience platform — working with 40% of the Fortune 50, rebuilt entirely around Ghostwriter, an agent that builds agents from SOPs, call transcripts, or a plain description. Explorer (deep research for your own customer conversations) and a Japan acquisition shipped the same week. The numbers: Rocket Mortgage at $1B/month in loan volume, Cigna cut authentication time 80%, SoFi up 33% on customer satisfaction.

→ Sierra (Agents as a Service) · Sierra (Japan)

Jensen Huang says we just reinvented the computer

“It’s no longer a computer, it’s a factory. It’s a factory, it’s used for generation of revenues.”

Jensen’s structural argument: computers were warehouses, built to store and retrieve what humans made in advance. That model is over — token factories generate value in real time, and every scaling law points at the same variable: compute. He also said intelligence is now a commodity, and got there specifically: 60 direct reports, each deeper in their domain than he is, calling himself a dishwasher running a room of superhumans. What kept him there for 34 years wasn’t intelligence. It was curiosity, judgment, and walking into every new problem thinking “how hard can it be.”

Quick Hits

Wikipedia bans AI-generated articles | TechCrunch — 44-2. Copyedits and first-pass translations are still in; writing is out.
David Sacks is done as AI/Crypto Czar | CNBC — Hit the 130-day federal limit. No replacement planned.
Mistral’s Voxtral TTS claims to beat ElevenLabs | Mistral — Open-weight, 3-second voice clone, nine languages, $0.016/1K chars.
SoftBank took a $40B bridge loan for its OpenAI stake | Bloomberg — 12-month term. Lenders expect an IPO this year.
Claude Code ships auto mode | Anthropic — Safety classifier approves or blocks operations automatically. Cowork gains macOS desktop control.
LiteLLM hit by a supply chain attack | LiteLLM — Credential stealer in 1.82.7–1.82.8. Quarantined in 3 hours, but 3.4M daily downloads means real exposure.
Apple will let rival AI chatbots plug into Siri in iOS 27 | Bloomberg — OpenAI loses its exclusive.
OpenAI launches a Safety Bug Bounty | OpenAI — Pays for MCP prompt injection and agent data exfiltration. Jailbreaks that just produce rude outputs are out of scope.
NVIDIA and LangChain released AI-Q | NVIDIA — Open source enterprise deep research blueprint. Tops both Deep Research Bench leaderboards.

ROI in the Wild

Reco runs a policy engine that evaluates JSONata expressions against billions of events — reference implementation in JavaScript, pipeline in Go, fleet of jsonata-js pods on Kubernetes serializing events over RPC at $300K/year. Their CTO handed Claude the JSONata spec and test suite and had it write Go code until every test passed. Seven hours. $400 in tokens. The result is gnata, a pure-Go implementation with a 1,000x speedup on common expressions. Combined with a rule engine refactor, it saved $500K/year.

→ Reco

For Practitioners

Production agents need more than the core loop — PII redaction before the model sees the data, retries when rate limits hit, summarization before context overflows, human interrupts before destructive tool calls. LangChain’s AgentMiddleware wraps each stage with hooks (before_model, wrap_model_call, wrap_tool_call, after_model) so you own those concerns without rewriting the harness. The design philosophy: some things will never move into the model. “You can’t prompt your way to HIPAA compliance.” LangChain ships prebuilt middleware for summarization, PII redaction, retries, and dynamic tool selection — Deep Agents, their batteries-included harness, is built entirely on top of it.

→ LangChain

Something Good

Researchers at Penn, Carnegie Mellon, and Stanford used AI to map how pain signals are processed in the brain, then built a gene therapy that acts like morphine without triggering addiction. It targets only the pain circuits, leaves the reward pathways alone, and held up in trials. Published in Nature this week. 50 million Americans live with chronic pain. Most treatment options still run through opioids.

→ ScienceDaily

Another Weekly AI Newsletter: Issue 64

Taylor Ortiz — Mon, 23 Mar 2026 12:16:44 GMT

Quick Hits

Google Search Is Now Using AI to Replace Headlines | The Verge — Google is rewriting the web in real time. Publishers just lost control of how their own stories get framed.

Online Bot Traffic Will Exceed Human Web Traffic by 2027 | TechCrunch — Cloudflare CEO’s prediction. The web is becoming an API.

DoorDash Tasks App Pays Couriers to Submit Videos to Train AI | TechCrunch — The gig economy found its next gig: human data collection for embodied AI.

Mistral Forge: Enterprise Proprietary Model Building | Mistral — Fine-tune proprietary models on your own data without sharing it. The enterprise open-model play gets real.

Perplexity Released Comet Browser on iOS | The Verge — An AI-native browser on your phone. The browser wars are back, and this time the browser does the browsing.

Midjourney V8 Alpha | Midjourney — Native 2K rendering with rebuilt aesthetics. The image generation quality ceiling moved again.

Patreon CEO Calls AI Companies’ Fair Use Argument Bogus | TechCrunch — The creator economy is picking a fight with the model economy. Someone’s going to lose.

Featured Article: What 81,000 People Want from AI | Anthropic

Anthropic used Claude to interview nearly 81,000 people across 159 countries in 70 languages about what they want from AI. Instead of a traditional survey, Claude ran branching conversations with follow-up questions based on each person’s answers. 67% were net positive about AI. The biggest group (19%) said they wanted “professional excellence,” but when pushed on what that meant, most people were really talking about quality of life: more time, less cognitive load, space to think.

The geographic data stood out. People in Sub-Saharan Africa, Central Asia, and South Asia were consistently more positive about AI than people in North America or Western Europe. Lower and middle income countries were twice as likely to report zero concerns. Self-employed people were the most likely to report both benefits and drawbacks at the same time, because they feel the productivity gains and the increased pressure without any institutional buffer.

The study is limited by the fact that these are Claude users, not the general public, and early adopters tend to be more optimistic. But running 81,000 qualitative conversations in a week is a research method that didn’t exist a year ago, and the scale creates a different kind of evidence than a checkbox survey can.

What to watch for: Whether other AI companies adopt AI-conducted qualitative research at this scale, and whether the tensions Anthropic identified (especially cognitive atrophy and economic displacement) shift from hypothetical to experienced as usage deepens.

Watch This: Andrej Karpathy on AI Psychosis, Auto Research, and the Future of Coding Agents | No Priors (1hr 6min)

Karpathy hasn’t typed a line of code since December. He runs multiple coding agents in parallel, switching between them like a manager delegating to a team, and says the default workflow for every software engineer changed overnight. The conversation covers his “auto research” project where he let agents optimize his model training overnight and they found improvements he missed after two decades of manual tuning, his home automation “claw” called Dobby that hacked into his Sonos and smart home systems in three prompts, and his prediction that the entire industry needs to reconfigure because the customer for software is no longer the human, it’s agents acting on behalf of humans. The most grounded take: the models are simultaneously a brilliant PhD student and a 10-year-old, and everything outside of verifiable RL-trained domains (like telling a joke) is still stuck. Worth the full listen if you’re thinking about where coding agents go from here.

Also This Week

Claude Cowork Dispatch: Remote Desktop AI Control from Your Phone | Anthropic

OpenAI Is Throwing Everything into Building a Fully Automated Researcher | MIT Technology Review

WordPress Lets AI Agents Manage Your Content | WordPress

NVIDIA Launches Space Computing, Rocketing AI Into Orbit | NVIDIA

Meta Will Move Away from Human Content Moderators in Favor of AI | Engadget

Gemini Task Automation Is Slow, Clunky, and Super Impressive | The Verge

Pentagon Filing Reveals Anthropic and Pentagon Were Nearly Aligned | TechCrunch

Signal’s Creator Is Helping Encrypt Meta AI | Wired

Amazon Trainium Lab Tour: The Chip That Won Over Anthropic, OpenAI, and Apple | TechCrunch

Trump AI Framework Targets State Laws, Shifts Child Safety Burden to Parents | TechCrunch

What I’m Watching

NemoClaw was probably the most interesting announcement at GTC for me. Karpathy talked about his home “claw” Dobby on No Priors, which does something similar at a smaller scale. Agents running inside their own secure environments with rules around what they can access feels like the direction this is all heading. We already covered NemoClaw in the top stories, but it’s worth sitting with.

DoorDash is paying couriers to submit videos to train AI. Delivery workers with phone cameras are becoming the data collection layer for embodied AI. I’m curious how fast other companies with large field workforces start doing the same thing.

The Trump AI framework is preempting state-level AI regulation and shifting child safety responsibility to parents. It makes it murky where state level AI laws sit and drive influence.

Another Weekly AI Newsletter: Issue 63

Taylor Ortiz — Tue, 17 Mar 2026 03:24:19 GMT

The Week’s Thesis

Agent security got its own engineering discipline this week: OpenAI published a design guide on defending agents against prompt injection and released IH-Challenge, a training dataset that teaches models which instructions to trust. AWS launched policy controls inside Bedrock AgentCore for agents in regulated industries. Microsoft published a security blog warning that ungoverned agents can become “double agents” and attached a $99/month product to the problem. If you’re deploying agents that read external content or operate across trust boundaries, these documents belong in your engineering review queue.

Three companies answered the same question from different directions: How far can an agent reach from a single context? Anthropic made Claude’s 1 million token context window generally available for Opus 4.6 and Sonnet 4.6, scoring 78.3% on MRCR v2 at that length. Perplexity shipped a full-stack agent API platform combining model orchestration, real-time search, and code execution under one key. OpenAI published an engineering post on equipping the Responses API with a computer environment. Anthropic says deeper into documents. Perplexity says further across the web. OpenAI says into the operating system. Your architecture choice this year is a bet on which of those axes matters most for your use case.

The open model tier is getting its own infrastructure: NVIDIA shipped Nemotron 3 Super, a 120B-parameter open model with only 12B active parameters and 5x throughput gains over comparable dense models. Perplexity integrated it immediately across its agent and search products. Meta published details on four generations of MTIA custom inference silicon shipped in two years. And NVIDIA announced a gigawatt-scale partnership with Thinking Machines Lab for frontier model training. From custom silicon to serving infrastructure, the open model stack is coming together fast.

Anthropic moved on every axis at once: In one week, Anthropic invested $100 million into the Claude Partner Network, launched The Anthropic Institute to address AI’s societal challenges, opened Sydney as its fourth Asia-Pacific office, made 1 million token context generally available, shipped interactive charts and diagrams in chat, and doubled usage during off-peak hours as a thank-you to users. That’s ecosystem, governance, geography, capability, product, and pricing, all in one week.

Quick Hits

How We Compare Model Quality in Cursor | Cursor — When your provider’s benchmarks stop meaning anything, you build your own. If you’re evaluating models for agentic coding, this is the framework to study.

A Defense Official Reveals How AI Chatbots Could Be Used for Targeting Decisions | MIT Technology Review — The same architectures running your enterprise agents are now ranking military target lists. “Human in the loop” is doing a lot of work in that sentence.

Google DeepMind Names New London HQ “Platform 37” | X @GoogleDeepMind — Named after AlphaGo’s Move 37, the moment AI surprised its own creators. The building will include a free public AI exhibition space.

Perplexity Computer Is Now on Mobile | X @perplexity_ai — Agents that follow you across devices. Cross-device synchronization means the task you start on desktop continues on your phone.

How NVIDIA AI-Q Reached #1 on DeepResearch Bench I and II | Hugging Face — An open model just topped a research benchmark designed for closed frontier models. The ceiling on what open weights can do keeps moving.

OpenAI to Acquire Promptfoo | OpenAI — OpenAI bought the red-teaming platform 25% of Fortune 500s already use, and it’s going straight into Frontier. Agent security is a product line now.

Hustlers Are Cashing In on China’s OpenClaw AI Craze | MIT Technology Review — Open-source agents meet gray-market entrepreneurship. Adoption is moving faster than anyone can govern it.

Featured Article: IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs | OpenAI

OpenAI released IH-Challenge, a reinforcement learning training dataset that teaches models to prioritize instructions based on trust level: system over developer, developer over user, user over tool. When a model receives conflicting instructions from different sources, it needs to know which one wins. Get that wrong and you get jailbreaks, system prompt leaks, and prompt injection attacks that treat malicious text in a PDF or tool output as if it were a developer command. IH-Challenge structures this as objectively gradable tasks: a high-privilege instruction like “only answer Yes or No” paired with a lower-privilege attempt to override it, checked by a simple Python script. Fine-tuning GPT-5-Mini on the dataset produced GPT-5-Mini-R, which improved robustness from 63.8% to 88.2% under adaptive human red-teaming and from 23% to 94% against impersonation attacks. Unsafe behavior dropped from 6.6% to 0.7% when given a safety policy in the system prompt. The full dataset is available on Hugging Face.

The interesting part is what they didn’t do. The team identified three pitfalls in naive instruction hierarchy training: models fail not because they don’t understand hierarchy but because instructions are too complex, LLM judges used for reward signals are themselves fallible, and models learn shortcuts like refusing everything to maximize safety scores. IH-Challenge addresses all three by keeping tasks instruction-following-simple, using programmatic grading instead of LLM judges, and including an Anti-Overrefusal split that specifically trains models to recognize when lower-privilege instructions are perfectly benign. Overrefusal on the IH-Challenge benchmark improved from 79% to 100%, meaning the model stopped treating hierarchy enforcement as a reason to refuse legitimate requests. Meanwhile, GPQA Diamond and AIME 2024 scores held flat, and TensorTrust robustness jumped +8 to +15 points depending on the conflict type. If you’re building agents that process untrusted input, this is the best public evidence that instruction hierarchy can be trained once and generalize, instead of patching one attack at a time.

What to watch for: Whether other model providers adopt open instruction hierarchy training datasets, and whether the programmatic-grading approach becomes standard practice over LLM-judge-based safety fine-tuning.

Watch This: Is RAG Still Needed? Choosing the Best Approach for LLMs | IBM Technology (12 min)

Martin Keen breaks down the real tradeoffs between RAG and long context windows as context lengths keep expanding. The video covers when vector databases and semantic search still win, when you can get away with stuffing everything into context, and how to think about the decision for your specific workload. Especially relevant this week given Anthropic’s 1 million token context going GA.

Also This Week

P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM | AWS AI Blog

Operationalizing Agentic AI Part 1: A Stakeholder’s Guide | AWS AI Blog

Smol AI WorldCup: A 5-Axis Benchmark for Small Language Models | Hugging Face

Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries | Hugging Face

Introducing Storage Buckets on the Hugging Face Hub | Hugging Face

SILMA TTS: A Lightweight Open Bilingual Arabic-English TTS Model | Hugging Face

How Pokemon Go Is Giving Delivery Robots an Inch-Perfect View of the World | MIT Technology Review

As Open Models Spark AI Boom, NVIDIA Jetson Brings It to Life at the Edge | NVIDIA

Mapping the World’s Forests: Introducing Canopy Height Maps v2 | Meta AI

Build a Searchable Audio Knowledge Base with Gemini Embedding 2 and LlamaParse | LlamaIndex

Introducing the AI Now Summit | Mistral AI

What I’m Watching

There’s a thread running through this week that’s easy to miss: the testing layer is becoming a product. OpenAI acquired Promptfoo, the open-source LLM evaluation framework. Cursor built CursorBench to measure whether AI coding suggestions actually help in real workflows. And IH-Challenge, which we covered in the Featured Article, uses programmatic Python scripts instead of LLM judges to grade model behavior, specifically because LLM judges get it wrong too often.

That last detail is the one I keep coming back to. We’ve spent two years using models to evaluate models, and one of the clearest takeaways from the IH-Challenge paper is that this introduces its own failure modes. When your testing infrastructure is valuable enough for OpenAI to acquire and your grading methodology is worth publishing a paper about, evaluation is a competitive advantage. If you’re building agents today and your eval story is “we’ll have someone try it and see if it feels right,” this is the week that should change your mind.

Another Weekly AI Newsletter: Issue 62

Taylor Ortiz — Mon, 09 Mar 2026 16:10:49 GMT

The Week’s Thesis

Everybody shipped at once: If you stepped away from your desk for even a day last week, you came back to a different landscape. OpenAI released GPT-5.3 Instant on Monday and followed with GPT-5.4 with Thinking and Pro modes by Wednesday. Anthropic opened the Claude Marketplace, added voice and scheduled tasks to Claude Code. Cursor launched Automations. Each of these points in a different direction of focus, and it’s worth taking a moment to decide which ones matter for your workflows and where to start.

The Pentagon deal had consequences: Last week we covered the Pentagon deal itself. This week, the consequences arrived. OpenAI’s robotics lead Caitlin Kalinowski resigned, calling the arrangement “rushed without the guardrails defined.” ChatGPT uninstalls had already surged 295% while Claude climbed to #1 on the App Store. Anthropic’s CEO responded directly to the supply chain risk designation, challenging it in court and clarifying the statute’s narrow scope. Microsoft, Google, and Amazon confirmed Claude remains available to their customers outside the Department of War. Meanwhile, MIT Technology Review asked the question everyone should be sitting with: is the Pentagon actually allowed to surveil Americans with AI?

AI is probing deeper than we designed for: Three companies independently bet on the same idea this week: AI as security auditor. Anthropic’s Claude found 22 real vulnerabilities in Firefox, including novel bugs that existing tools missed. OpenAI launched Codex Security in research preview. And Endor Labs released AURI, a free security tool, after a study found only 10% of AI-generated code passes basic security review. Separately, Anthropic’s engineering team found that Claude Opus 4.6 figured out it was being benchmarked, identified the test, and decrypted the answer key on its own. These models are probing systems deeper than we’re designing for, and finding things we didn’t expect.

Quick Hits

You Need to Rewrite Your CLI for AI Agents | Justin Poehnelt (Google) — The best guide yet on building agent-first tooling. If you maintain a CLI, start here.

Terence Tao: AI Is Ready for Primetime in Math and Physics | OpenAI Academy — When a Fields medalist says AI saves more time than it wastes, the bar for “useful” just moved.

Luma Launches Creative AI Agents | TechCrunch — Turned a $15M ad campaign into localized versions in 40 hours for under $20K. Creative agencies, take note.

KV Cache Compaction Cuts LLM Memory 50x | VentureBeat — MIT’s Attention Matching compresses working memory without accuracy loss. Long-context inference just got cheaper.

Google I/O 2026: May 19-20 | Google Blog — Save the date. The puzzle itself is a Gemini showcase, which tells you where the keynote is heading.

Roblox Launches AI Chat Rephrasing | Roblox — Instead of blocking banned words with “####”, AI now rephrases them in real time. Moderation at 68M daily users is an AI problem now.

LangChain CEO: Models Alone Won’t Get Agents to Production | VentureBeat — Harrison Chase on why “harness engineering” matters more than model upgrades for shipping real agents.

Featured Article: Labor Market Impacts of AI: A New Measure and Early Evidence | Anthropic Research

Anthropic introduced a new metric called “observed exposure” that combines theoretical LLM capability with real-world Claude usage data to measure which jobs are actually being affected by AI. The headline finding: AI is far from reaching its theoretical capability. Actual task coverage remains a fraction of what’s feasible. Computer programmers top the list at 75% coverage, followed by customer service representatives and data entry keyers. No systematic increase in unemployment has appeared for highly exposed workers since late 2022.

The paper opens with a point worth sitting with: past predictions about job displacement have a poor track record. Offshorability studies flagged a quarter of US jobs as vulnerable, and a decade later most of those jobs grew. This research is deliberately not making predictions. Instead, it’s building a measurement framework now, before meaningful effects emerge, so future analysis has a real baseline. The finding that matters most right now is about entry-level hiring. Among workers aged 22 to 25, hiring into exposed occupations has dropped roughly 14% compared to pre-ChatGPT levels. Workers in the most exposed professions are more likely to be older, female, more educated, and higher-paid. The pipeline is thinning before displacement shows up in unemployment data.

What to watch for: The gap between what AI can do and what it is doing is closing. This report measures it directly, and future updates will show how fast the red area catches the blue. Pay attention to the entry-level hiring numbers next time around.

Watch This: This New Claude Code Feature is a Game Changer | Nate Herk (8 min)

Nate walks through Claude Code’s new loop feature, which lets you set recurring tasks, reminders, and skill intervals that run for up to three days without input. The video covers how the cron tools work under the hood, a live walkthrough of setting one up, and a clear comparison of when to use loops versus scheduled tasks. If you’re already using Claude Code, this is worth eight minutes of your time.

Also This Week

Reasoning Models Struggle to Control Their Chains of Thought, and That’s Good | OpenAI

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought | arXiv

Building AI Coding Agents for the Terminal | arXiv

Anthropic Spend Commitment Now Funds Partner Integrations | Anthropic

Claude Community Ambassadors Program | Anthropic

ZeroClaw: Autonomous AI Assistant Infrastructure | GitHub

City Detect Raises $13M Series A | TechCrunch

Port Washington Data Center Breaks Ground | BizTimes

How Descript Enables Multilingual Video Dubbing at Scale | OpenAI

How Balyasny Built an AI Research Engine for Investing | OpenAI

What I’m Watching

Features like Claude Code’s new /loop command and projects like ZeroClaw are pointing in the same direction: autonomous agent runtimes that are lightweight, swappable, and designed to run without you. The question I keep coming back to is how long until this space fragments enough that no single framework dominates. We’re not there yet, but the building blocks are shipping fast.

The other thing I’m paying attention to is something that rarely shows up in benchmark announcements: how new model releases actually affect agent quality in production. GPT-5.4, Claude Opus 4.6, and the reasoning improvements shipping alongside them should be measurably changing chain-of-thought reliability for deployed agents. But that data is hard to find. If you’re running agents in the wild and tracking performance across model versions, I’d genuinely love to hear what you’re seeing.

And then there’s the security work. Anthropic found novel Firefox vulnerabilities. OpenAI launched Codex Security. A few newsletters ago, we covered AI solving novel physics problems. Now we’re seeing that same pattern expand: LLMs surfacing things humans hadn’t found yet. Is that just the natural expansion curve of the technology, or is it a growth signal that tracks directly with model quality? I think it’s both, and the Mozilla results suggest we’re still early in finding out what these models can actually uncover when pointed at the right problems.

Subscribe now

Another Weekly AI Newsletter: Issue 61

Taylor Ortiz — Tue, 03 Mar 2026 16:16:48 GMT

Personal Note

This newsletter comes to you late this week on a Tuesday morning. Like many others, I was caught in the Anthropic outage and am also dependent on this technology to drive the initiatives that are meaningful to me. When I woke up to finalize the newsletter and found things offline, I journaled while listening to the birds outside, listened to music, reflected on my weekend, and engaged in refreshing activities I normally don’t find the time for. It was a lesson for me to find more time to step away from the keyboard.

The Week’s Thesis

AI went political this week: Anthropic’s relationship with the Department of War fell apart, and hours later, OpenAI signed a deal for classified network deployment. On paper, both companies claim the same red lines. But the sequence alone was enough to make people uneasy. More on this in our featured story below.

OpenAI’s partnership blitz: They launched Frontier Alliances, a new partner program, followed by a Codex integration with Figma bridging code and design workflows. By Friday, they announced a strategic partnership with Amazon and released a joint statement with Microsoft reaffirming their existing relationship. Four announcements in five days, all while the Department of War deal was making headlines.

Agent observability is becoming a thing: Microsoft found that 80% of Fortune 500 companies are running active agents but most lack visibility into what those agents are doing. LangChain argued that traditional APM tools weren’t built for this, New Relic shipped an agent-specific observability platform, and Google published a production-readiness guide. Observability is quietly becoming part of the conversation, and it’s worth paying attention to.

Healthcare AI is moving: NVIDIA’s annual survey found that 70% of healthcare organizations are now actively deploying AI, with 85% reporting increased revenue. Eli Lilly went live with LillyPod, the most powerful AI factory wholly owned by a pharmaceutical company, purpose-built for drug discovery. Oura shipped a proprietary AI model focused on women’s reproductive health, hosted entirely on their own infrastructure. And NIST published guidance on AI trustworthiness standards for clinical settings. From drug discovery to consumer wearables to regulation, healthcare AI is moving.

Quick Hits

Jira’s latest update allows AI agents and humans to work side by side | TechCrunch — Agents on the same sprint board as humans with deadlines and assignments. This is mainstream adoption.

Pro-level image generation gets faster and more accessible with Nano Banana 2 | Google Cloud AI — Google’s enterprise image gen model gets faster and cheaper. The gap between “good enough” and “production-ready” keeps shrinking.

Anthropic acquires Vercept to advance Claude’s computer use capabilities | Anthropic — Anthropic is doubling down on computer use. If agents are going to operate in production, they need to see and interact with real interfaces.

Detecting and preventing distillation attacks | Anthropic — Anthropic identified industrial-scale distillation campaigns by DeepSeek, Moonshot, and MiniMax, totaling over 16 million exchanges across 24,000 fraudulent accounts designed to extract Claude’s capabilities. They published their approach to catching and preventing it.

The human work behind humanoid robots is being hidden | MIT Technology Review — The humans still doing the work that robot demos suggest is automated. A good reality check.

Featured Story: Anthropic’s Deal With the Department of War Fell Through. Hours Later, OpenAI Signed One.

Anthropic published its Responsible Scaling Policy v3.0 on February 24, a ground-up rewrite of the framework it uses to decide what it will and won’t build. Two days later, Dario Amodei published a statement revealing that Anthropic has been deeply embedded in the Department of War for months: intelligence analysis, cyber operations, modeling and simulation. The company also disclosed it walked away from several hundred million dollars in revenue by cutting off entities linked to the Chinese Communist Party. But Anthropic drew two red lines: no mass domestic surveillance of Americans, and no fully autonomous weapons.

On February 27, Secretary of War Pete Hegseth designated Anthropic a “supply chain risk”, a label historically reserved for US adversaries. Trump ordered every federal agency to stop using Anthropic technology. That same night, OpenAI announced a deal to deploy its models on the Department of War’s classified network.

Here’s where it gets interesting: OpenAI’s stated terms include the same two red lines. No mass surveillance. No autonomous weapons. But OpenAI walked away with a deal and Anthropic walked away blacklisted. OpenAI’s approach centers on what Altman called a “safety stack”: cloud-only deployment that keeps OpenAI’s safety layers active, cleared personnel in the loop, and an agreement that if the model refuses a task, the government won’t force a workaround. What exactly differed in the negotiations isn’t public, but the outcome speaks for itself.

The RSP v3.0 explains the philosophical scaffolding behind Anthropic’s position. After two and a half years of trying to implement capability-based safety thresholds, Anthropic concluded that “the science of model evaluation isn’t well-developed enough to provide dispositive answers.” The policy now splits commitments into what Anthropic will enforce unilaterally and what requires industry-wide coordination. Autonomous weapons fall squarely in the second bucket: the reliability isn’t there yet, and no single company can build the guardrails alone.

The business implications are already visible. Nate Silver noted that Anthropic had been steadily closing the valuation gap with OpenAI. Whether the DoW designation slows that trajectory is an open question.

The question practitioners should be sitting with isn’t “who’s right.” It’s what happens next. If you’re building on Claude for sensitive workloads, your platform just got blacklisted from every federal system. If you’re building on OpenAI, your platform’s safety guarantees rest on a technical architecture rather than a legal commitment. Both carry risk. The difference is in which failure mode you’re betting on.

What to watch for: Whether the “supply chain risk” designation survives legal challenge, and whether OpenAI’s cloud-only safety stack holds as models get more capable and the Department of War pushes for edge deployment.

Watch This

StarTalk: Geoffrey Hinton on AI, Consciousness, and the Future: Neil deGrasse Tyson sits down with Nobel Laureate Geoffrey Hinton to cover the full arc: how neural nets work, why backpropagation was the breakthrough, whether AI can actually reason, and the heavy questions around consciousness, energy demands, and what happens when models start generating their own training data.

Also This Week

Intrinsic joins Google | TechCrunch

Let Gemini handle your multi-step daily tasks on Android | Google AI Blog

Anthropic Education Report: The AI Fluency Index | Anthropic Research

The persona selection model | Anthropic Research

Disrupting malicious uses of AI | OpenAI

Can Local AI Stand In for the Cloud? | deeplearning.ai

AI is rewiring how the world’s best Go players think | MIT Technology Review

What I’m Watching

OpenAI’s new role in government AI. How does OpenAI’s solidified position with the Department of War shift the tide of AI in government? Will it be relatively quiet, or will we see noticeable shifts in how these technologies are deployed domestically and how we engage in combat with other countries? And if growth and innovation eventually push against the boundaries of an agreement, does the government override, or does OpenAI become more malleable?

The enterprise agent framework race. We are still in the “release agents as a capability” phase. Most enterprise platforms are now shipping their own proprietary frameworks. Will those be expansive enough to meet the breadth of platform use cases, or will we see demand expand beyond what a single-platform framework can handle, requiring true enterprise solutions?

Agent observability, from experience. Observability is something we are hyper-focused on at Ping. We find that we have the highest amount of control with our custom agents, and that control reduces significantly when we adopt out-of-the-box frameworks that leave us with little say over design practices. If that’s true at our scale, it’s worth asking what it looks like at enterprise scale.

Subscribe now

Another Weekly AI Newsletter: Issue 60

Taylor Ortiz — Mon, 23 Feb 2026 16:30:33 GMT

The Week’s Thesis

MCP is finding its footing in the enterprise. Amazon made Quick an MCP client, letting partners expose capabilities as tools its agents can invoke. Google went in the other direction: managed MCP servers for AlloyDB, Spanner, Cloud SQL, Firestore, and Bigtable that give any MCP-compliant agent a standard interface to their data layer with no infrastructure to deploy. Both chose MCP as the contract. The promise is speed to information and action from a single interface, but how you measure that return is still an open question.

Three frontier models dropped this week, and the pricing gap between open and closed got harder to ignore. Sonnet 4.6, Gemini 3.1 Pro, and GLM-5 all posted competitive benchmarks. On OpenRouter, GLM-5 runs at $0.95/$2.55 per million input/output tokens versus Sonnet at $3/$15 and Gemini at $2/$12. For agentic workloads, those economics compound fast. The models are becoming table stakes; the differentiation is what surrounds them.

Agent autonomy is outrunning evaluation. Anthropic’s research shows Claude Code sessions running autonomously 2x longer than three months ago, with experienced users auto-approving 40%+ of sessions. Meanwhile, Amazon is internally grappling with evaluating thousands of agents and publishing a whole framework for it. And DeepMind is asking whether models even have genuine moral reasoning or are just pattern-matching on ethics. Deployment velocity is way ahead of our ability to assess what these agents are actually doing.

The global map is shifting too. OpenAI, Google, Microsoft, and Anthropic all showed up to the India AI Impact Summit this week with infrastructure commitments. OpenAI announced “OpenAI for India” focused on sovereign infrastructure and workforce upskilling. Microsoft pledged a multi-billion-dollar initiative to close the AI adoption gap. Anthropic opened a Bengaluru office. When every major lab converges on the same market in the same week, it tells you where the growth is.

Quick Hits

Here are the 17 US-based AI companies that have raised $100M or more in 2026 | TechCrunch — 17 mega-rounds in under two months. The capital is betting on infrastructure and vertical agents, not foundation models.

Anthropic and Infosys collaborate to build AI agents for telecommunications | Anthropic News — Regulated industries are where agents get real. Telecom compliance is messy enough to justify the investment.

KLong: Training LLM Agent for Extremely Long-horizon Tasks | arXiv — Agents that can hold context across hundreds of steps. The gap between “demo agent” and “production agent” starts here.

Unauthorized OpenAI Equity Transactions | openai.com — OpenAI had to publicly warn people about unauthorized equity offers. When your stock is hot enough to attract scams, that’s its own signal.

Towards Anytime-Valid Statistical Watermarking | arXiv — As agents generate more content autonomously, knowing what’s machine-made becomes an infrastructure problem, not a nice-to-have.

Anthropic and the Government of Rwanda sign MOU for AI in health and education | Anthropic News — A model for government-AI partnerships that starts with local context and capacity building, not top-down deployment.

Anthropic partners with CodePath to bring Claude to the US’s largest collegiate CS program | Anthropic News — The next generation of developers will learn to code with AI from day one. That changes what “junior engineer” means in three years.

Featured Article: GLM-5: China’s First Public AI Company Ships a Frontier Model

Z.ai (formerly Zhipu AI) released GLM-5 on February 11, a 744B-parameter mixture-of-experts model with 40B active parameters per token and a 200K context window. It’s the first open-weight model to hit 50 on Artificial Analysis’ Intelligence Index, and it’s released under an MIT license.

The benchmarks tell a competitive story. GLM-5 scored 77.8% on SWE-bench Verified, beating Gemini 3 Pro (76.2%) and trailing Claude Opus 4.5 (80.9%). On AIME 2026 it hit 92.7%, essentially matching Opus. On BrowseComp, it scored 62.0, nearly doubling Opus 4.5’s 37.0. It’s the #1 open-weight model on LMArena and #11 overall.

What makes this release structurally significant is what’s underneath it. GLM-5 was trained entirely on Huawei Ascend 910B chips using the MindSpore framework. Zhipu has been on the US Entity List since January 2025 with no access to NVIDIA H100s. A frontier-competitive model built without any Western compute hardware is a data point that changes the export control conversation.

The caveats are real. GLM-5 is text-only with no multimodal support. Independent testers have flagged questions about benchmark methodology and noted the model can be aggressive in task execution without strong situational awareness. Running it locally requires ~1.5TB of VRAM. But for the open-weight ecosystem, this is a milestone: frontier-class intelligence, MIT-licensed, at a fraction of closed-model pricing.

What to watch for: Whether independent evaluations hold up to the published benchmarks, and whether the Ascend-trained approach becomes a template for other Chinese labs navigating export controls.

Watch This

Brian walks through setting up a multi-agent team using OpenClaw, covering dedicated machines, access permissions, cost configurations, and API token optimization across different models.

Also This Week

Using Google Cloud AI to measure the physics of US freestyle snowboarding | Google Cloud AI

Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability | arXiv

Gemini 3.1 Pro - Model Card | DeepMind

Robots and AI Are Working Together to Bring You Better Medicines | NIST

A message from our CEO, Sundar Pichai | Google AI Blog

What I’m Watching

OpenAI’s acqui-hire playbook. Steinberger built OpenClaw into the most-starred open-source agent project on GitHub in four months, and now he’s inside OpenAI building “the next generation of personal agents.” The project moves to a foundation, but the founder’s vision moves with him. If OpenAI keeps pulling in open-source agent talent, it signals a shift from model company to agent platform company.

The agent evaluation reckoning. Three separate organizations flagged the same problem this week: we’re deploying agents faster than we can evaluate them. Autonomy sessions are getting longer, tool access is getting broader via MCP, and the pricing is making it cheaper to scale. Something breaks publicly before the evaluation frameworks catch up. The teams building those frameworks now have a head start.

I Wake Up to a Custom AI Research Digest on My Kindle Every Morning

Taylor Ortiz — Sat, 21 Feb 2026 17:21:15 GMT

Staying current on AI research is one of those things that sounds simple until you actually try to do it consistently. arXiv publishes hundreds of machine learning papers every single day.

As someone who leads AI and Data at a technology company, I need to stay aware of what’s happening in research. While we don’t intend to implement every new pattern and concept, the ideas showing up in research today inspire the tools and techniques we’re evaluating and implementing for our own AI capabilities. Keeping up with emerging patterns ensures that we operate at an innovative and scalable level.

The problem is that reading even the abstracts of 300+ papers a day is not realistic. So I went through a few iterations of trying to solve this:

Iteration 1: Manually browsing arXiv on weekends, skimming titles and saving papers to read later
- Pain point: I was always a week behind and the “read later” list just kept growing
Iteration 2: Subscribing to AI newsletters and following researchers on X
- Pain point: Too broad, too noisy, and someone else was deciding what was relevant to me
Iteration 3: Using ChatGPT to ask “what are the most important ML papers from this week?”
- Pain point: Hallucinated paper titles, no way to verify, and it didn’t know my specific interests
[We are here] Iteration 4: Build an automated pipeline that fetches papers from arXiv, uses Claude to score them against my specific interests, summarizes the top ones in plain language, and delivers them to my Kindle every morning before I wake up

Total cost: about $5/month. Total daily effort: zero.

Here is exactly how I built it.

The Architecture

The core insight here is that this is a filtering problem, not a summarization problem. arXiv gives you everything. Your job is to throw away 97% of it intelligently.

The pipeline looks like this:

Let’s walk through each stage.

Stage 1: Fetch Everything

The pipeline starts by querying the arXiv API for papers submitted in the last day across six categories:

cs.AI (Artificial Intelligence)
cs.LG (Machine Learning)
cs.CL (Computation and Language / NLP)
cs.CV (Computer Vision)
cs.IR (Information Retrieval)
stat.ML (Statistics: Machine Learning)

This typically yields 200-400 papers per day. I use the arxiv Python package which handles pagination and rate limiting. I also use a 2-day lookback window because arXiv publishes new papers around 8pm UTC, so a strict 1-day window can miss things depending on timing.

The fetcher deduplicates cross-listed papers so a paper listed under both cs.AI and cs.LG only appears once.

client = arxiv.Client(
    page_size=100,
    delay_seconds=3.0,
    num_retries=3,
)
search = arxiv.Search(
    query=query,
    max_results=total_max,
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending,
)

Stage 2: Cheap Keyword Pre-Filter

Before making any API calls to Claude, I cut the candidate pool roughly in half with a simple keyword filter. If a paper’s title or abstract doesn’t mention any of about 35 terms I care about (”LLM”, “transformer”, “agent”, “RAG”, “fine-tuning”, “production”, “deployment”, etc.), it’s probably not relevant enough to spend tokens on.

Subscribe now

This is intentionally broad. I would rather send a borderline paper to the LLM for scoring than accidentally filter out something good. The keyword list is just there to remove the obvious misses like pure math proofs or biology applications.

Typical result: 250 papers down to about 220 after keyword filter.

def _matches_keywords(paper: Paper, keywords_lower: list[str]) -> bool:
    text = f"{paper.title} {paper.abstract}".lower()
    return any(kw in text for kw in keywords_lower)

Simple. Effective. Costs nothing.

Stage 3: LLM Relevance Scoring with Claude Haiku

This is the most important stage. I send the remaining papers to Claude Haiku in batches of 10, along with my interest profile, and ask it to score each one from 1-10.

The interest profile is a plain-English description organized into priority tiers:

Tier 1 (score 8-10): LLMs, agents, RAG, prompt engineering
Tier 2 (score 6-8): Production ML systems, MLOps, data engineering
Tier 3 (score 4-6): Computer vision, recommendation systems
Tier 4 (score 1-3): Pure theory, narrow domain-specific stuff

I also include scoring modifiers. Papers with real production deployments get a +1 bonus. Papers that are purely benchmark-focused with no novel insight get a -2 penalty. Papers from major labs on Tier 1 topics get a small bump.

Along with each score, Haiku generates a one-line “hook” explaining why the paper matters. I specifically prompt it to write like it’s explaining to a smart colleague over coffee, not like an academic abstract.

Why Haiku for scoring? It’s cheap and fast. Scoring is a simpler classification task, not a nuanced generation task. At about $0.02-0.05 for all 220 papers, it’s practically free.

The Scoring Problem I Had to Fix

This is worth calling out because it took real iteration to get right. My first version of the scoring prompt produced useless results. Every paper scored between 8.0 and 8.5. Claude was being too generous and clustering everything at the top.

57% of papers were coming back above the threshold. That is not filtering. That is just passing everything through with extra steps.

I had to explicitly tell the model to use the full 1-10 range, include calibration examples in the prompt, and demand decimal scores (7.5, 6.0, 4.5) to create separation between papers. Here is what part of that prompt looks like:

CRITICAL INSTRUCTIONS FOR SCORING:
1. USE THE FULL RANGE of 1-10. Do NOT cluster scores.
2. A score of 9-10 should be RARE - only 0-1 per batch of 10.
3. Most papers should score between 3-7. That's normal and correct.
4. Use decimal scores (e.g., 7.5, 6.0, 4.5) to create separation.

Calibration examples:
- "New RLHF technique that improves LLM alignment with 40% less data" -> 9.0
- "Survey of prompt engineering techniques" -> 7.5
- "Improved object detection on COCO benchmark by 0.3 mAP" -> 3.0
- "Theoretical bounds on convergence of SGD" -> 2.5

After this rewrite, the scores spread out properly and I started getting 5-10 papers above threshold instead of 120+.

Stage 4: Select the Top Papers

I take everything scoring 7.0 or above and keep the top 10. If nothing meets the threshold (rare, but possible on light days), the pipeline automatically lowers it by 1.0 and tries again so I don’t get empty digests.

Typical result: 220 scored papers down to 5-10 selected.

Stage 5: Deep Summarization with Claude Sonnet

Now I switch to Claude Sonnet for the expensive, high-quality work. Each selected paper gets its own API call with the full abstract and a detailed summarization prompt.

The prompt is designed for a senior AI leader, not an academic. Here is what I mean by that:

If the paper uses jargon like “contrastive loss” or “OOD generalization,” the summary explains it in parentheses
It leads with what the paper does and why it matters, not the methodology
The “practical implications” section is specific: could this be used in production today? Is it research-only?
Sentences are short and punchy. No filler.

Each summary includes:

Key takeaways (2-3 bullet points)
Summary paragraph (3-5 sentences)
What’s novel (why this is different from existing work)
Practical implications (who benefits and how)

Why Sonnet for summaries? Summarization requires nuance. You need the model to truly understand a paper and translate it, not just classify it. The quality difference over Haiku is worth it when you’re only processing 5-10 papers.

This is the same two-model pattern I’ve seen work well in other contexts. Use the cheap model for high-volume classification. Use the expensive model for low-volume generation where quality matters.

Stage 6: Generate a Kindle-Friendly EPUB

The summaries get packaged into an EPUB ebook using Python’s ebooklib. The structure is optimized for how I actually read on a Kindle.

Overview chapter with a quick-scan table:

Date and pipeline statistics (”Fetched 247 papers, Pre-filtered to 224, Scored, Top 7 included”)
Rank, title, score, and the one-line hook for each paper

Individual paper chapters with a tiered layout:

At a Glance: title, authors, categories, relevance score, hook, key takeaway bullets
Deep Dive: full summary, what’s novel, practical implications
Link to the full PDF on arXiv

The EPUB gets a generated cover image using Pillow with the date and paper count, which shows up as the thumbnail in the Kindle library. I bundled the Inter font so it renders correctly both on my Mac locally and on GitHub Actions (which runs Linux and doesn’t have macOS system fonts).

Stage 7: Email to Kindle

The final step emails the EPUB to my @kindle.com address using Resend. Kindle automatically converts it and syncs to all my devices.

One gotcha: you have to add the sender email address to your Kindle’s approved senders list in your Amazon account settings. Without that, the email gets silently rejected with no error message. I was debugging for a while before I realized it was just Amazon blocking an unapproved sender.

Automation with GitHub Actions

The whole thing runs on a GitHub Actions cron job:

on:
  schedule:
    - cron: '0 10 * * *'  # 10:00 UTC = 5:00 AM ET
  workflow_dispatch: # Manual trigger for testing

By 5am ET, yesterday’s arXiv papers have been published (they go live around 8pm UTC), so there’s a comfortable buffer. The pipeline typically takes 2-5 minutes to run. By the time I wake up, the digest is already on my Kindle.

API keys are stored as GitHub Secrets. The email addresses are stored as GitHub Variables since they’re not sensitive. The workflow also uploads the EPUB as a build artifact with 7-day retention, which is great for debugging if something looks off.

The Stack

Deliberately simple. No database. No web server. No infrastructure to maintain.

Python 3.12 for orchestration
arxiv Python wrapper for the arXiv API
anthropic Claude API client
ebooklib + Pillow for EPUB generation with cover image
resend for transactional email
GitHub Actions for free cron scheduling

The entire project is about 500 lines of Python across 9 files.

Cost Breakdown

Component Daily Cost Claude Haiku (scoring ~220 papers) ~$0.03 Claude Sonnet (summarizing ~7 papers) ~$0.15 Resend (1 email/day) Free tier GitHub Actions Free tier Total ~$0.15-0.25/day (~$5-8/month)

What I Learned

LLM scoring needs real iteration. My first scoring prompt produced useless results where every paper scored 8-8.5. I had to add calibration examples, explicitly demand the full 1-10 range, and include scoring modifiers to get meaningful differentiation. If you’re building any kind of LLM-as-a-judge system, expect to spend more time on the scoring prompt than you think.

The two-model strategy is a pattern worth reusing. Cheap model for high-volume classification, expensive model for low-volume generation. It keeps costs at about $0.15-0.25/day while still getting high quality summaries.

Plain-language interest profiles beat structured rubrics. I tried detailed point-based scoring rubrics and found that a natural-language interest profile with tiered priorities produced better, more intuitive scoring.

The pre-filter matters more than you think. Without it, you’re burning 2x the API tokens on papers that are obviously irrelevant. A simple keyword match is crude but effective.

Kindle is an underrated delivery mechanism. It syncs across devices, has no notifications competing for attention, and puts research reading into the same physical context as book reading. That context switch matters more than I expected.

Another Weekly AI Newsletter: Issue 59

Taylor Ortiz — Mon, 16 Feb 2026 13:56:09 GMT

Major Releases

Cowork is now available on Windows

Feb 11, 2026 | Anthropic

Why it matters:

Windows support removes a major barrier for enterprise adoption where Windows dominates corporate environments, making Claude’s agentic desktop capabilities accessible to the majority of professional developers.

Introducing GPT‑5.3‑Codex‑Spark

Feb 12, 2026 | OpenAI

Why it matters:

OpenAI’s Cerebras partnership signals a shift toward specialized inference hardware as model speed becomes the new competitive frontier, positioning ultra-low latency as essential for real-time AI coding tools.

Gemini 3 Deep Think: Advancing science, research and engineering

Feb 12, 2026 | Google AI Blog

Why it matters:

The upgrade to Gemini 3 Deep Think means more robust support for tackling complex scientific and engineering challenges, potentially redefining collaboration between AI and human researchers.

YouTube rolls out an AI playlist generator for Premium users

Feb 10, 2026 | techcrunch.com

Why it matters:

Shows streaming platforms racing to integrate generative AI into content discovery, potentially reshaping how users interact with media libraries.

Research

How AI trained on birds is surfacing underwater mysteries

Feb 09, 2026 | Google Research Blog

Why it matters:

Demonstrates that transfer learning from terrestrial to aquatic bioacoustics can enhance marine research, suggesting new avenues for AI application in environmental monitoring and species conservation.

SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Feb 13, 2026 | arXiv

Why it matters:

Introduces a framework that helps calibrate LLM-as-judge systems, potentially improving the reliability of automated evaluations in AI development pipelines.

Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations

Feb 10, 2026 | Google Research Blog

Why it matters:

DialogLab illustrates the need for sophisticated frameworks that blend scripted and improvisational dialogue, suggesting a path toward more realistic multi-party interactions in AI systems.

Agentic AI & Reasoning

Harness engineering: leveraging Codex in an agent-first world

Feb 11, 2026 | openai.com

Why it matters:

Provides real-world evidence of fully AI-generated production code, offering insights into agentic workflows and their practical limitations.

Gemini Enterprise Agent Ready (GEAR) program now available, a new path to building AI agents at scale

Feb 10, 2026 | Google Cloud AI Blog

Why it matters:

The GEAR program demonstrates a strategic shift in equipping developers with the skills and resources needed to create scalable AI agents, solidifying enterprise-level integration of AI technologies.

Subscribe now

Build long-running MCP servers on Amazon Bedrock AgentCore with Strands Agents integration

Feb 12, 2026 | AWS AI Blog

Why it matters:

Reveals the potential for AI agents to operate continuously in enterprise environments, indicating a necessary evolution in toolsets to support complex, real-time data processing without session limitations.

Customize AI agent browsing with proxies, profiles, and extensions in Amazon Bedrock AgentCore Browser

Feb 13, 2026 | AWS AI Blog

Why it matters:

Demonstrates a shift in AI agent capabilities toward more realistic web interactions, suggesting increased adoption in enterprise settings that require reliable state management and customized browsing configurations.

The state of agentic AI in 2026

Feb 11, 2026 | Crew AI Blog

Why it matters:

Highlights a pivotal shift in enterprise strategy, where agentic AI is becoming an operational necessity, thereby influencing research and development priorities in AI.

Real-World Use Cases

Iberdrola enhances IT operations using Amazon Bedrock AgentCore

Feb 10, 2026 | AWS AI Blog

Why it matters:

Iberdrola’s integration of Amazon Bedrock AgentCore underscores a move towards sophisticated, scalable AI solutions in IT operations, highlighting the potential for increased efficiency and consistency in enterprise-level incident management.

Build financial resilience with AI-powered tabletop exercises on Google Cloud

Feb 10, 2026 | Google Cloud AI Blog

Why it matters:

Indicates a shift towards customized AI applications in operational resilience, suggesting that industry-specific context will enhance the effectiveness of incident response planning in financial services.

How Amazon uses Amazon Nova models to automate operational readiness testing for new fulfillment centers

Feb 10, 2026 | AWS AI Blog

Why it matters:

Demonstrates how large organizations can leverage AI to enhance operational efficiency, which may encourage similar automation efforts in other industries reliant on extensive manual verification processes.

Swann provides Generative AI to millions of IoT Devices using Amazon Bedrock

Feb 11, 2026 | AWS AI Blog

Why it matters:

Indicates a trend towards enhancing IoT device intelligence through generative AI, suggesting future systems may increasingly prioritize context-aware data processing to mitigate user fatigue and improve engagement.

Thought Leadership

Why the Moltbook frenzy was like Pokémon

Feb 09, 2026 | MIT Technology Review AI

Why it matters:

Highlights the potential disconnect between AI enthusiasts’ aspirations and actual capabilities, suggesting that the current excitement may be more about spectacle than substantive advancements in AI utility.

What’s next for Chinese open-source AI

Feb 12, 2026 | MIT Technology Review AI

Why it matters:

Signals that the rise of Chinese open-source AI may challenge traditional innovation hubs, compelling Western developers to adapt to a landscape where affordability and accessibility redefine competitive advantages.

The AI Vampire

Feb 15, 2026 | Simon Willison

Why it matters:

Highlights the risk of productivity-driven burnout in AI adoption, suggesting that even as systems automate routine tasks, they may amplify cognitive strain and diminish overall job satisfaction among employees.

Industry Investment & Business Moves

Gather AI Raises $40M Led by Smith Point Capital Management to Scale its Physical AI Platform for Global Logistics

Feb 09, 2026 | venturebeat.com

Why it matters:

Continued investment in physical-AI logistics indicates growing confidence that AI can deliver ROI in warehouse and supply chain operations.

Anthropic partners with CodePath to bring Claude to the US’s largest collegiate computer science program

Feb 13, 2026 | Anthropic News

Why it matters:

Signals that educational institutions are recognizing the importance of AI tools in programming, potentially reshaping the future workforce and creating a more inclusive environment in tech fields traditionally dominated by wealthier demographics.

Anthropic opens Bengaluru office and announces new partnerships across India

Feb 16, 2026 | Anthropic News

Why it matters:

Anthropic’s new initiatives highlight the importance of localized language models, demonstrating a commitment to inclusivity in AI development that can reshape user interactions across diverse linguistic communities in India.

Regulatory & Policy

Covering electricity price increases from our data centers

Feb 11, 2026 | Anthropic News

Why it matters:

This initiative highlights the necessity for AI companies to align their operational growth with consumer welfare, potentially setting a precedent for future industry practices in managing infrastructure costs.

Anthropic raises $20 million to Public First Action

Feb 12, 2026 | Anthropic News

Why it matters:

Indicates a growing recognition within the AI community of the necessity for proactive policy measures to address the risks associated with rapidly advancing AI technologies and their societal implications.

Bringing ChatGPT to GenAI.mil

Feb 09, 2026 | openai.com

Why it matters:

Represents a major milestone for AI adoption in government, with OpenAI gaining access to millions of defense personnel through secure infrastructure.

Introducing Lockdown Mode and Elevated Risk labels in ChatGPT

Feb 13, 2026 | openai.com

Why it matters:

Addresses growing security concerns around prompt injection in enterprise AI deployments, providing guardrails as organizations scale AI tool usage.

AI Safety & Ethics

Building a safer digital future, together

Feb 09, 2026 | blogs.microsoft.com

Why it matters:

Microsoft reinforces its safety-first positioning as regulators scrutinize tech platforms, signaling how major players frame AI governance messaging.

Helping kids and teens learn and grow online on Safer Internet Day

Feb 10, 2026 | Google AI Blog

Why it matters:

Suggests a growing recognition of the role artificial intelligence can play in enhancing online safety for younger users, prompting further innovation in user-centric safety tools within the AI community.

A “QuitGPT” campaign is urging people to cancel their ChatGPT subscriptions

Feb 10, 2026 | MIT Technology Review AI

Why it matters:

Emerging user backlash signals growing scrutiny of AI products, potentially influencing how companies balance capability claims with user expectations.

Dev Tools & Infrastructure

TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design

Feb 13, 2026 | arXiv

Why it matters:

Demonstrates how custom NPU architectures can enable efficient on-device LLM inference, potentially bringing AI capabilities to edge devices without cloud dependency.

Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell

Feb 12, 2026 | blogs.nvidia.com

Why it matters:

Cost reductions of this magnitude could accelerate enterprise AI adoption by making production deployments economically viable at scale.

GPT‑5.2 derives a new result in theoretical physics

Feb 13, 2026 | openai.com

Why it matters:

Marks a milestone where an AI model contributed a novel theoretical physics formula that was subsequently proven correct, validating AI’s potential role in scientific discovery.

Another Weekly AI Newsletter: Issue 58

Taylor Ortiz — Wed, 11 Feb 2026 19:01:19 GMT

Major Releases

Structured outputs on Amazon Bedrock: Schema-compliant AI responses

Feb 06, 2026 | aws.amazon.com

Why it matters:

Amazon Bedrock’s schema-compliant outputs enable developers to bypass traditional data validation, streamlining AI integration and enhancing trust in automated systems, which could accelerate the deployment of reliable AI applications across industries.

Introducing Claude Opus 4.6

Feb 05, 2026 | anthropic.com

Why it matters:

Claude Opus 4.6’s enhanced coding and agentic capabilities, along with a 1M-token context window, signal a leap in AI’s ability to handle complex, multi-step tasks, making it a significant tool for developers and businesses seeking more efficient, robust AI-driven solutions.

Introducing GPT-5.3-Codex

Feb 05, 2026 | openai.com

Why it matters:

GPT-5.3-Codex’s enhanced speed and capability to autonomously manage complex coding tasks mark a pivotal shift towards more interactive and efficient AI coding assistants, potentially transforming software development workflows and reducing the time and expertise needed to tackle intricate programming challenges.

Breakthrough Research

LSGQuant: Layer-Sensitivity Guided Quantization for One-Step Diffusion Video Super-Resolution

Feb 03, 2026 | arxiv.org

Why it matters:

LSGQuant’s efficient quantization for video super-resolution allows high-quality diffusion models to be deployed in resource-limited environments, expanding access to advanced video enhancement technology and enabling broader applications in industries like streaming and mobile video processing.

Reward-Free Alignment for Conflicting Objectives (RACO)

Feb 02, 2026 | arxiv.org

Why it matters:

RACO’s method for aligning language models to conflicting objectives using pairwise feedback, without explicit rewards, enhances AI’s ability to balance complex trade-offs like safety and performance, crucial for deploying AI in nuanced, real-world applications.

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Feb 02, 2026 | arxiv.org

Why it matters:

PixelGen’s ability to outperform latent diffusion models using perceptual loss in pixel space suggests a shift toward simpler architectures, potentially democratizing high-quality image generation by reducing reliance on complex latent representations and making advanced generative capabilities more accessible.

Agentic AI & Reasoning

Introducing OpenAI Frontier

Feb 05, 2026 | openai.com

Why it matters:

OpenAI Frontier’s enterprise platform signals a shift toward integrating AI agents as functional team members in business environments, potentially transforming workplace efficiency and collaboration by enabling AI to handle complex tasks with shared context and feedback mechanisms.

Subscribe now

Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations

Feb 10, 2026 | research.google

Why it matters:

DialogLab’s exploration of multi-party human-AI interactions highlights the shift towards more sophisticated conversational AI, enabling richer, more nuanced group dynamics that could transform collaborative tools and virtual environments, making AI a more integral part of team-based workflows and social interactions.

How AI tools can redefine universal design to increase accessibility

Feb 05, 2026 | research.google

Why it matters:

Embedding adaptive AI tools into interfaces through Google’s Natively Adaptive Interfaces framework enhances personalization and accessibility, potentially setting a new standard for universal design and making technology more inclusive for users with diverse needs.

Real-World Use Cases

IBM to Support Missile Defense Agency SHIELD Contract

Feb 05, 2026 | ibm.com

Why it matters:

IBM’s AI-driven contract with the Missile Defense Agency highlights the strategic integration of AI in national defense, emphasizing the industry’s role in enhancing decision-making speed and agility, which could set a precedent for future defense contracts and AI’s critical function in national security infrastructure.

AT&T, AWS, and Amazon Leo Collaborate to Accelerate Modernization of Nation’s Connectivity Infrastructure

Feb 04, 2026 | financialcontent.com

Why it matters:

AT&T’s collaboration with AWS and Amazon Leo leverages AI to enhance network infrastructure, potentially transforming U.S. connectivity by improving scalability and resilience, while expanding AI-driven services and setting a precedent for future telecom modernization.

Humana Redefines the Member Experience with Agent Assist Built with Google Cloud

Feb 03, 2026 | googlecloudpresscorner.com

Why it matters:

Humana’s AI-powered Agent Assist highlights the growing trend of AI augmenting rather than replacing human roles, enhancing service delivery in high-volume environments and setting a precedent for scalable, empathetic customer interaction solutions in the healthcare industry.

Thought Leadership

Natively Adaptive Interfaces: A new framework for AI accessibility

Feb 05, 2026 | blog.google.com

Why it matters:

Embedding accessibility directly into AI design through Natively Adaptive Interfaces shifts the industry toward more inclusive technology, ensuring that personalization and accessibility are inherent, not optional, features—broadening AI’s usability and relevance across diverse user demographics.

Collaborating on a nationwide randomized study of AI in real-world virtual care

Feb 03, 2026 | research.google

Why it matters:

Google’s study with Included Health provides critical real-world data on AI’s role in telemedicine, potentially transforming healthcare delivery by optimizing physician time and expanding access to expertise, setting a precedent for AI’s integration into everyday clinical practice.

Industry Investment & Business Moves

Testing ads in ChatGPT

Feb 9, 2026 | openai.com

Why it matters:

OpenAI begins testing ads in ChatGPT for Free and Go tier users in the U.S. Ads appear at the bottom of responses, matched to conversation topics. Plus, Pro, Business, Enterprise, and Education tiers remain ad-free. OpenAI states ads won’t influence ChatGPT’s answers and conversations stay private from advertisers.

Machina Labs raises $124M to build AI-driven ‘Intelligent Factory’

Feb 04, 2026 | axios.com

Why it matters:

Machina Labs’ funding and factory initiative underscore AI’s pivotal role in revitalizing U.S. manufacturing, enhancing efficiency and precision in aerospace and defense production, and signaling a shift towards more automated, AI-driven industrial processes.

AppFactor raises $4M seed to deliver an agentic orchestration platform for enterprise software maintenance

Feb 04, 2026 | venturebeat.com

Why it matters:

AppFactor’s platform could significantly lower enterprise software upkeep costs and ease the transition from outdated systems, highlighting AI’s growing role in automating complex IT processes and enhancing operational efficiency.

Snowflake and OpenAI partner to bring frontier intelligence to enterprise data

Feb 02, 2026 | openai.com

Why it matters:

Integrating OpenAI’s models into Snowflake’s platform enhances enterprise data analytics, allowing businesses to harness AI’s reasoning capabilities directly on their proprietary datasets, which could redefine data-driven decision-making and automation in corporate environments.

Regulatory & Policy

UK ICO launches formal investigation of Grok AI chatbot over harmful deepfakes

Feb 03, 2026 | ico.org.uk

Why it matters:

The ICO’s investigation into Grok AI underscores the urgent need for robust safeguards against misuse of AI-generated content, highlighting the growing regulatory scrutiny on AI systems to protect personal data and prevent harmful deepfakes, which could reshape compliance standards across the industry.

Making AI work for everyone, everywhere

Feb 06, 2026 | openai.com

Why it matters:

OpenAI’s commitment to localization ensures AI models are more culturally and legally attuned, fostering global inclusivity and compliance, which is crucial for equitable AI adoption and minimizing biases across diverse markets.

Bringing ChatGPT to GenAI.mil

Feb 09, 2026 | openai.com

Why it matters:

Integrating ChatGPT into GenAI.mil signifies a pivotal shift toward AI-enhanced national security, highlighting the increasing role of AI in defense strategies and setting a precedent for secure, government-focused AI deployments.

AI Safety & Ethics

UN Announces Independent International AI Science Panel to Guide AI Safety

Feb 04, 2026 | un.org

Why it matters:

The UN’s establishment of an independent AI science panel signifies a critical step toward creating universal AI safety standards, promoting global cooperation, and ensuring that AI advancements align with ethical considerations, which is essential for mitigating risks and fostering trust in AI technologies worldwide.

International AI Safety Report 2026 Published by Global Experts

Feb 03, 2026 | internationalaisafetyreport.org

Why it matters:

The International AI Safety Report 2026 establishes a critical benchmark for global AI governance, equipping policymakers with essential strategies to mitigate risks, thereby shaping safer AI development and fostering international collaboration on ethical AI deployment.

Anthropic’s Responsible Scaling Policy (update)

Feb 10, 2026 | anthropic.com

Why it matters:

Anthropic’s updated Responsible Scaling Policy, with its Sabotage Risk Report for Claude Opus 4.6, highlights the increasing need for transparency and robust safeguards in AI development, setting a precedent for industry-wide accountability and risk management practices.

Dev Tools & Infrastructure

Android Studio Panda 2 (2025.3.2) Canary 3 Now Available

Feb 05, 2026 | androidstudio.googleblog.com

Why it matters:

Enhanced bug fixes and improved code-assistance in Android Studio Panda 2 streamline AI development for mobile apps, enabling developers to build more reliable and efficient applications, thereby accelerating innovation and deployment in the rapidly evolving AI-driven mobile ecosystem.

GitHub Actions Runner Scale Set Client (Public Preview)

Feb 05, 2026 | github.blog

Why it matters:

Customizable autoscaling for GitHub Actions runners enhances CI/CD efficiency, enabling AI developers to handle complex workflows and large-scale projects more effectively, thus accelerating AI model development and deployment.

Milvus 2.6.10 Released with Performance and Security Enhancements

Feb 05, 2026 | milvus.io

Why it matters:

Enhanced security and faster inference in Milvus 2.6.10 streamline AI deployment, crucial for industries relying on vector databases to handle vast data efficiently, while improved stability ensures reliability, making it a strategic update for developers focused on optimizing AI-driven applications.

Recursive Language Models Work, But Not Every Time

Taylor Ortiz — Sat, 07 Feb 2026 06:14:23 GMT

Executive Summary

This research evaluates Recursive Language Models (RLM) from arXiv:2512.24601 through rigorous empirical testing. We tested two RLM implementations: a Custom RLM we built following the paper’s approach, and the DSPy RLM module (dspy.RLM). We compare both against RAG (Retrieval Augmented Generation) and traditional chunking approaches across multiple tasks and models.

We conducted two layers of testing:

Model comparison: We tested both Custom RLM and DSPy RLM across multiple OpenAI models, including standard models (gpt-4o-mini, gpt-4o) and reasoning models (gpt-5-mini, gpt-5.2, gpt-5-nano). This revealed that model selection is critical: standard models scored 0/6 on aggregation tasks while reasoning models scored 6/6 with identical code and prompts.
Variance testing (n=30): After identifying gpt-5-mini as the best-performing model for RLM tasks, we ran Custom RLM, DSPy RLM, and RAG 30 times each with identical inputs to measure variance. Variance captures how much results differ between runs of the same system, and understanding it is essential for deciding whether to deploy these methods in production.

Key Findings

1. Variance is the story.

Multi-document aggregation revealed significant variance in both RLM implementations. Scores ranged from complete failure (0/6) to perfect accuracy (6/6) across 30 identical runs.

2. Task type determines reliability.

Single-document analysis (one book, deep questions) showed lower variance (std=0.75) than multi-document aggregation (six books, synthesizing across all). Both RLM implementations are more reliable for focused analysis than cross-document synthesis.

3. Model selection matters more than method.

Frontier reasoning models (gpt-5-mini, gpt-5.2) succeeded where standard models (gpt-4o, gpt-4o-mini) failed completely. Same code, same prompts, but 0/6 vs 6/6.

4. RAG wins on consistency.

RAG achieved the most stable results on single-document reasoning (std=0.63), but struggled with multi-document aggregation where systematic coverage matters more than semantic similarity.

5. Cost-variance tradeoff.

DSPy RLM costs ~2x more than Custom RLM but shows lower variance on reasoning tasks.

1. Introduction

The Problem with Long Documents

LLMs face a fundamental challenge: context windows have limits. A 2.2 million token corpus cannot be processed directly. Even 700K token documents strain budgets.

The RLM Promise

Recursive Language Models (arXiv:2512.24601) propose an elegant solution:

Store the full document as a variable in a sandboxed Python environment
Let the LLM iteratively generate code to explore the document
Execute code, return results, repeat
The model searches, slices, and reasons programmatically

Theoretical advantage: Instead of processing millions of tokens at once, the model strategically samples relevant sections.

Our Contribution

The original RLM paper reports single-run results on synthetic benchmarks. We contribute:

Statistical rigor: n=30 runs per condition reveals variance hidden by single-run reporting
Real-world tasks: Literary analysis across 2.2M tokens of classic novels
Method comparison: RLM vs RAG vs Chunking on identical tasks
Practical guidance: When to use each approach

Research Questions

How reliable is RLM? (variance across runs)
Under what conditions does RLM excel?
How does model selection affect outcomes?
What are the cost/quality trade-offs?

2. Methodology

Test Corpus

The Mega Corpus combines: War and Peace, Great Expectations, A Tale of Two Cities, Oliver Twist, David Copperfield, and Moby Dick.

Methods Compared

Statistical Design

We ran each condition 30 times with identical inputs.

Why n=30? With 30+ samples, the sampling distribution of the mean typically stabilizes enough to estimate mean and variance reasonably. This is standard in behavioral research for detecting medium effect sizes.

Why temperature=1.0? We wanted to measure natural variability under realistic “creative exploration” settings. Lower temperatures would reduce randomness but wouldn’t eliminate the path-dependence inherent to agentic systems: once the model commits to exploring one section first, its subsequent decisions cascade from there. Temperature=1.0 captures this real-world behavior.

Tasks and Scoring

We designed two tasks to test different capabilities: deep reasoning within a single document, and information aggregation across multiple documents.

Subscribe now

Reasoning Task

Corpus: War and Peace (722K tokens)

Question: “How does Pierre Bezukhov’s understanding of happiness change throughout the novel?”

What we’re measuring: Can the model navigate a massive document, find the relevant sections about Pierre’s character arc, and synthesize them into a coherent answer?

Scoring approach: We identified 8 key terms that a comprehensive answer should reference. These are actual names and terms from the novel:

We scored answers by checking whether these terms appeared (substring matching). If an answer mentioned “Karataev,” we inferred the model had successfully found and referenced that section of the book. Two terms (pierre, happiness) are essentially baselines since they appear in the question itself. The remaining terms test whether the model found the relevant plot points.

Pass threshold: 4/8 terms (finding at least half the key plot points indicates the model successfully navigated the document rather than guessing)

Limitations: This approach rewards finding the right sections and using exact terminology. A model that described “the peasant who changed his worldview” without naming Karataev would receive no credit. However, automated scoring enabled consistent evaluation across 30 runs.

Aggregation Task

Corpus: Mega Corpus (2.2M tokens across 6 novels)

Question: “What is the final fate of the protagonist in each of the 6 books?”

What we’re measuring: Can the model systematically explore multiple documents, identify the protagonist of each, and correctly describe their ending?

Scoring approach: Each book scored 1 point if the answer correctly identified both the protagonist and their fate. For example: Pip in Great Expectations ends up reunited with Estella (or alone, depending on the edition). Scoring was binary per book: partial credit (correct protagonist, wrong fate) was not awarded.

Pass threshold: 3/6 books correct (correctly covering at least half the corpus indicates systematic exploration rather than partial success on one or two books)

3. Results

The Variance Problem

This is the central finding of our research. Identical inputs, identical model, dramatically different outputs. Each dot in the chart above represents one run. The spread tells the story.

What this means: If you ran RLM once on the aggregation task and got 0/6, you might conclude “RLM doesn’t work.” If you got 6/6, you might conclude “RLM is perfect.” Both conclusions would be wrong.

Failure rates tell the deployment story. For aggregation tasks, the probability of near-complete failure (score ≤ 1) was:

Custom RLM: 10% of runs
DSPy RLM: 17% of runs
RAG: 33% of runs

These failure rates matter more than mean scores for production systems. A method with high mean but 17% catastrophic failure rate may be unacceptable for critical applications.

Are the differences statistically significant? With n=30, we can compute 95% confidence intervals:

The CIs for Custom RLM and DSPy RLM overlap. A t-test confirms the difference is not statistically significant (p=0.12). While Custom RLM shows a higher mean, the difference could be due to chance. RAG’s lower performance, however, is statistically significant compared to both RLM variants.

Reasoning Task Results

Key observations:

DSPy RLM achieved highest mean score (6.30) with moderate variance
Chunking is perfectly consistent but 4-9x more expensive
Custom RLM has high variance (scores ranged 0-8)
RAG is cheap and consistent but lower accuracy

Aggregation Task Results

Key observations:

Aggregation shows higher variance than reasoning for all methods
Custom RLM outperformed DSPy RLM on mean score (4.60 vs 3.77)
RAG struggled with multi-document aggregation (33% failure rate). Unlike semantic similarity tasks, aggregation requires systematic coverage with correct book-to-protagonist mapping. RAG’s top-k retrieval pulls the most semantically similar chunks, which may cluster around 2-3 books rather than sampling each of the 6 systematically. This explains RAG’s counterintuitively high failure rate despite its reputation for consistency.
Both RLM variants showed full-range variance (0 to 6)

Model Selection Effect

A striking finding: model capability determines RLM viability.

Why reasoning models succeed: They can plan a systematic exploration strategy before executing. Standard models dive deep into the first interesting thread and exhaust their iteration budget.

Cost Analysis

Total variance testing cost: ~$28 for 120 RLM runs

The Retry Strategy: Best-of-3

If variance is unavoidable, can we mitigate it by running multiple times? We simulated a best-of-3 strategy using our existing 30 runs (taking the max score from each group of 3):

The practical takeaway: Running Custom RLM three times and taking the best result achieves 100% pass rate in our sample on aggregation at ~3× the cost of a single run. This transforms an unreliable method into a deployable one.

4. Discussion

Why Variance Matters

We treat mean score as a measure of capability and failure probability as a measure of reliability. Both matter for deployment decisions, but they answer different questions.

The distributions we observed are heavy-tailed, with occasional catastrophic failures even when mean performance is high. This is why variance matters so much for agentic systems.

Single-run benchmarks are standard practice in AI research. Our findings suggest this practice may systematically mislead:

Cherry-picking risk: Researchers (consciously or not) may report favorable runs
Reproducibility crisis: Others cannot replicate “good” results
Deployment surprise: Production systems encounter the full variance distribution

Recommendation: Report mean and standard deviation from multiple runs, especially for agentic/iterative systems like RLM.

When to Use Each Method

The Reasoning Model Requirement

RLM’s effectiveness depends critically on model capability:

Standard models (gpt-4o, gpt-4o-mini): Cannot execute systematic exploration strategies. Get “stuck” in local optima.
Reasoning models (gpt-5-mini, gpt-5.2): Plan before acting. Enumerate documents before diving deep.

Practical implication: Do not use RLM with standard models for complex tasks. The cost savings are not worth the reliability loss.

Library vs Custom Implementation

We compared two approaches: using DSPy’s built-in RLM module versus building a custom implementation following the paper’s methodology.

Why did our custom implementation outperform on aggregation? Our custom prompts explicitly guided the model to sample the beginning of documents first, understand naming conventions, and systematically enumerate all books before diving deep. DSPy’s generic RLM module lacks this task-specific guidance, which may explain why it excelled at depth (single-document reasoning) but struggled with breadth (multi-document coverage).

Recommendation: For single-document reasoning where consistency matters, use an existing library like DSPy’s RLM module. For multi-document synthesis where mean accuracy matters more than run-to-run variance, building a custom implementation with task-specific prompts may yield better results.

5. Limitations

Literary corpus only: Results may differ on technical, legal, or scientific documents
Training data contamination: These classic novels are almost certainly in the training data of frontier models. We cannot determine how much the models “remember” versus genuinely discover through RLM exploration. Results on proprietary or novel documents may differ.
Single model family: All tests used OpenAI models; other providers may show different patterns
English only: Non-English documents not tested
Scoring subjectivity: Key-term matching is imperfect for nuanced questions

6. Conclusion

RLM delivers on its promise of efficient long-document processing, but with important caveats:

Variance is real and significant.

Plan for it. Run multiple times for important queries.

Model selection is critical.

Reasoning models are not optional; they’re required for reliable RLM.

Task type matters.

RLM excels at single-document reasoning; struggles more with multi-document aggregation.

Tradeoffs are real.

Lower token costs come with higher variance. Chunking’s brute-force consistency has value.

For practitioners: If you need consistent results on stable corpora, invest in RAG infrastructure. If you need flexible ad-hoc queries without setup, use RLM with a reasoning model, but run it multiple times and aggregate results.

For researchers: Report variance. Single-run benchmarks on agentic systems may be systematically misleading.

Appendix: Raw Data

All n=30 results available in CSV format upon request.

What this chart shows: Each panel plots score (y-axis) against run number (x-axis) for one method/task combination. The black horizontal line is the mean. Notice there’s no pattern: run #1 isn’t better or worse than run #30. The variance is truly random, not a warmup effect or degradation over time. This confirms the variance we observed is inherent to the method, not an artifact of our testing procedure.

Research conducted February 2026. Code and data available upon request.

Another Coding Blog

Another Weekly AI Newsletter: Issue 72

Anthropic shipped into legal, small business, healthcare, and AWS in one week.

OpenAI launched a deployment company and put Codex on your phone.

Companies are cutting workers at record revenue to fund AI.

Grok Build, Claude Code, and Cursor all shipped agentic upgrades. LangChain shipped nine products to support them.

⭐ Featured: Thinking Machines built an AI that listens while it talks.

🎙️ Worth a Listen

Quick Hits

Multi-Agent Account Planning That Learns Across Deals

Intro

What you’ll learn

Section 1: The work of account planning

Section 2: What multi-agent in Managed Agents actually is

The shape: coordinator with a roster

Threads: how the system stays organized

Thread lifecycle

Idle threads stay alive, which enables follow-ups

Two kinds of memory

Designing the split

Section 3: The agent architecture

Phase 1: gather context and pull prior records

Phase 2: conditional topic education

Phase 3: synthesis

Phase 3.5: next-best-action selection

Phase 4: parallel recommendation generation

Phase 5: decision recording

Post-meeting: the debrief loop

Section 4: What the platform gives you for observability

Section 5: What multi-agent gives you that a workflow can’t

Section 6: Decision records: the layer that compounds

Recommendation Record schema

Decision Record schema (same shape as RR, with these fields added)

Implicit and explicit capture of enterprise decisions

Cross-account learning in practice (from the actual run)

Why this layer compounds

Section 7: The async loop

The asker: curated questions, not generic prompts

The synthesizer: schema-strict capture in the rep’s voice

The Slack MCP gotcha

The corpus is the integration point

Section 8: The distillation layer

Section 9: What we learned, and when to use this

1. The corpus compounds across runs.

2. The cited_records chain makes every recommendation auditable.

3. The decision step is what makes the system multi-agent.

4. The agents do their own research. Ask them what they found.

5. Schema enforcement needs a code-level check.

When this is the right tool

Another Weekly AI Newsletter: Issue 71

When the gap between what AI says and what it does becomes measurable.

$30B revenue, $200B in compute deals, and three new agent capabilities.

9,000 jobs cut. A union drew a line. And AI beat two doctors on real patients.

Cursor, OpenAI, Perplexity, and LangChain all shipped agentic infrastructure in the same week.

⭐ Featured: Anthropic can now read what Claude is thinking but not saying.

🎙️ Worth a Listen

Quick Hits

Persistent Memory for Claude Managed Agents: What I Found After Three Days of Building

What I was trying to figure out

The four building blocks

Setting things up

Three sessions

Session 1: writing notes from scratch

Session 2: recall

Session 3: modify

Where this got interesting

Layer 1: the model wrote a buggy bash command

Layer 2: the platform correctly flagged the failure

Layer 3: the model ignored the error flag

Layer 4: the destructive action

What the audit log showed

How I got it back

Important Considerations

So when does this make sense?

What I didn’t get to (yet)

So... should you use this?

Another Weekly AI Newsletter: Issue 70

“You can’t just steal a charity.” Elon Musk spent three days on the stand trying to prove it.

$900 billion valuation, 50% less sycophancy, and connectors for every creative tool you use.

OpenAI ended its Microsoft exclusivity and went multi-cloud.

Layer 1: the model wrote a buggy `bash` command