<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Another Coding Blog]]></title><description><![CDATA[Bringing you insights and education from 13 years of experience across AI, Data and Analytics ]]></description><link>https://www.anothercodingblog.com</link><image><url>https://substackcdn.com/image/fetch/$s_!2kzg!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F615044d0-cdfb-47ac-9a1b-3883974114e7_1024x1024.png</url><title>Another Coding Blog</title><link>https://www.anothercodingblog.com</link></image><generator>Substack</generator><lastBuildDate>Sun, 17 May 2026 15:33:56 GMT</lastBuildDate><atom:link href="https://www.anothercodingblog.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Taylor Ortiz]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[ortizt@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[ortizt@substack.com]]></itunes:email><itunes:name><![CDATA[Taylor Ortiz]]></itunes:name></itunes:owner><itunes:author><![CDATA[Taylor Ortiz]]></itunes:author><googleplay:owner><![CDATA[ortizt@substack.com]]></googleplay:owner><googleplay:email><![CDATA[ortizt@substack.com]]></googleplay:email><googleplay:author><![CDATA[Taylor Ortiz]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Another Weekly AI Newsletter: Issue 72]]></title><description><![CDATA[Anthropic ships five verticals and gave every plan an SDK budget. OpenAI launches a deployment company with 150 engineers. Cisco, GitLab, and GM cut thousands at record revenue. 
Grok Build at $299/mo.]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-9dd</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-9dd</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Sat, 16 May 2026 19:39:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ece87e18-1dc8-4d38-b255-048b807d7880_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FNYC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FNYC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png 424w, https://substackcdn.com/image/fetch/$s_!FNYC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png 848w, https://substackcdn.com/image/fetch/$s_!FNYC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png 1272w, https://substackcdn.com/image/fetch/$s_!FNYC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!FNYC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png" width="1456" height="2489" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2489,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:923849,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/198040876?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FNYC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png 424w, https://substackcdn.com/image/fetch/$s_!FNYC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png 848w, https://substackcdn.com/image/fetch/$s_!FNYC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png 1272w, https://substackcdn.com/image/fetch/$s_!FNYC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fb11fff-d44e-452d-b383-fca05f7f9e3e_2400x4102.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h2>Anthropic shipped into legal, small business, healthcare, and AWS in one week.</h2><ul><li><p><strong>Claude for the legal industry launched with 12 practice-area plugins.</strong> <a href="https://claude.com/blog/claude-for-the-legal-industry">Contract review, M&amp;A diligence, and regulatory compliance</a> out of the box. 
87% of general counsel now use generative AI, up from 44% the prior year.</p></li><li><p><strong>Claude for Small Business connected to QuickBooks, PayPal, and HubSpot.</strong> <a href="https://www.anthropic.com/news/claude-for-small-business">15 ready-to-run workflows</a> covering invoicing, CRM, document signing via DocuSign and Canva.</p></li><li><p><strong>Anthropic committed $200M to the Gates Foundation.</strong> <a href="https://www.anthropic.com/news/gates-foundation-partnership">Grants, Claude credits, and technical support</a> for vaccine screening, disease forecasting, K-12 education, and agricultural tools.</p></li><li><p><strong>Claude Platform went GA on AWS.</strong> <a href="https://aws.amazon.com/blogs/machine-learning/introducing-claude-platform-on-aws-anthropics-native-platform-through-your-aws-account/">First cloud provider</a> to offer Anthropic&#8217;s native platform with unified billing and same-day feature parity with the native API.</p></li><li><p><strong>Every subscriber now gets separate Agent SDK credits.</strong> Pro gets <a href="https://x.com/ClaudeDevs/status/2054610152817619388">$20/month</a>, Max gets up to $200. Unlike OpenAI, which bundles Codex and third-party usage into normal plan limits, Anthropic is subsidizing the developer ecosystem with a separate bucket.</p></li><li><p><strong>Claude Code limits increased another 50% through July.</strong> <a href="https://x.com/claudeai/status/2054641166155497503">On top of the doubling</a> from the week before.</p></li><li><p><strong>Ramp and Axios independently confirmed Anthropic overtook OpenAI in workplace adoption.</strong> Though <a href="https://venturebeat.com/technology/anthropic-finally-beat-openai-in-business-ai-adoption-but-3-big-threats-could-erase-its-lead">VentureBeat identified three structural threats</a> to that lead.</p></li><li><p><strong>The thread:</strong> Anthropic is trying to become the default for every vertical at once. 
Legal, healthcare, small business, enterprise, developer tooling. Whether that&#8217;s a platform strategy or overextension depends on execution.</p></li></ul><div><hr></div><h2>OpenAI launched a deployment company and put Codex on your phone.</h2><ul><li><p><strong>The OpenAI Deployment Company launched with 150 engineers on day one.</strong> <a href="https://x.com/OpenAI/status/2053824997777457651">19 investment firms and consultancies</a>, majority-owned by OpenAI, with <a href="https://x.com/OpenAI/status/2053824999736410415">Tomoro acquired</a> to provide Forward Deployed Engineers. <a href="https://www.axios.com/2026/05/11/openai-deployco-private-equity">Valued at $14B</a>.</p></li><li><p><strong>ChatGPT connected to bank accounts.</strong> <a href="https://openai.com/index/personal-finance-chatgpt/">Plaid integration for Pro users</a> in the US, with an Intuit partnership for actionable financial steps.</p></li><li><p><strong>Codex shipped to iOS and Android.</strong> <a href="https://x.com/OpenAI/status/2055016850849993072">Mobile preview</a> lets users start, review, and approve coding tasks while agents run on a separate device.</p></li><li><p><strong>OpenAI disclosed a supply chain compromise.</strong> A <a href="https://openai.com/index/our-response-to-the-tanstack-npm-supply-chain-attack/">TanStack npm package attack</a> exposed code-signing certificates for macOS, Windows, iOS, and Android apps. Full certificate rotation required.</p></li><li><p><strong>The thread:</strong> Both OpenAI and Anthropic launched enterprise services arms within a week of each other. The model API is becoming a commodity. 
The margin is shifting to who can get it deployed inside your organization first.</p></li></ul><div><hr></div><h2>Companies are cutting workers at record revenue to fund AI.</h2><ul><li><p><strong>Cisco cut 4,000 jobs while reporting record quarterly revenue.</strong> Stock rose 15% on <a href="https://www.cnbc.com/2026/05/13/cisco-csco-q3-earnings-report-2026.html">surging AI orders</a>.</p></li><li><p><strong>GitLab announced sweeping restructuring to fund agent development.</strong> <a href="https://about.gitlab.com/blog/gitlab-act-2/">Cut headcount, flattened management</a>, reorganized R&amp;D into 60 smaller teams, and retired its CREDIT values framework.</p></li><li><p><strong>GM laid off hundreds of IT workers and began hiring AI replacements.</strong> <a href="https://techcrunch.com/2026/05/11/gm-just-laid-off-hundreds-of-it-workers-to-hire-those-with-stronger-ai-skills/">Explicitly seeking stronger AI skills</a>.</p></li><li><p><strong>Samsung faces a looming strike over AI.</strong> <a href="https://www.reuters.com/sustainability/society-equity/elon-musks-court-battle-against-openai-enters-homestretch-2026-05-14/">Global AI boom driving deep internal divisions</a> between management and workers.</p></li><li><p><strong>The thread:</strong> Revenue is up at all three companies. The functions going are IT operations, developer tooling management, and corporate overhead that was previously considered secure.</p></li></ul><div><hr></div><h2>Grok Build, Claude Code, and Cursor all shipped agentic upgrades. LangChain shipped nine products to support them.</h2><ul><li><p><strong>xAI launched Grok Build in beta.</strong> <a href="https://x.ai/news/grok-build-cli">Terminal-native CLI</a> with up to 8 parallel agents, Grok 4.3 beta, 2M token context. Priced at $299/month (introductory $99). 
SuperGrok Heavy only.</p></li><li><p><strong>Claude Code limits increased 50%.</strong> <a href="https://x.com/claudeai/status/2054641166155497503">Through July 13</a>, on top of the doubling from the prior week. Plus separate Agent SDK credits.</p></li><li><p><strong>Cursor shipped /orchestrate.</strong> <a href="https://x.com/cursor_ai/status/2052432780336988474">Planner/worker/verifier loops</a> that re-spawn on failure. <a href="https://x.com/cursor_ai/status/2052489388895195399">Parallel subagents</a>. <a href="https://x.com/cursor_ai/status/2051739625958584659">Always-on CI agents</a>.</p></li><li><p><strong>LangChain shipped nine products at Interrupt 2026.</strong> <a href="https://www.langchain.com/blog/introducing-smithdb">SmithDB</a> for agent traces, <a href="https://www.langchain.com/blog/introducing-llm-gateway">LLM Gateway</a> for centralized control, <a href="https://www.langchain.com/blog/langsmith-sandboxes-generally-available">Sandboxes GA</a> for isolated testing, <a href="https://www.langchain.com/blog/deep-agents-0-6">Deep Agents 0.6</a> for long-running workflows, and the <a href="https://www.langchain.com/blog/the-agent-development-lifecycle">Agent Development Lifecycle</a> framework.</p></li><li><p><strong>The thread:</strong> Grok Build at $299/month, Claude Code with separate SDK credits, Cursor as a standalone IDE. Three very different bets on how developers will pay for agentic coding. LangChain is betting the real money is in the infrastructure underneath all of them.</p></li></ul><div><hr></div><h2><strong>&#11088; </strong>Featured: Thinking Machines built an AI that listens while it talks.</h2><p>Every AI conversation today works the same way: you talk, the model waits, the model responds. 
<a href="https://thinkingmachines.ai/blog/interaction-models/">Thinking Machines</a> published research on &#8220;interaction models&#8221; that throw out that assumption entirely.</p><p>Their model processes continuous 200ms micro-turns of audio, video, and text simultaneously. There are no turn boundaries. The model listens while speaking, interrupts when it sees something wrong in your code, reacts to visual cues without being prompted, and runs background reasoning while maintaining the conversation.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.anothercodingblog.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Another Coding Blog is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>The architecture splits into two parts: an interaction model that maintains real-time presence (always perceiving, always ready to respond), and a background model that handles deeper reasoning and tool use asynchronously. 
When the background model finishes a task, the interaction model weaves results into the conversation at an appropriate moment instead of interrupting.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AuzQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AuzQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png 424w, https://substackcdn.com/image/fetch/$s_!AuzQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png 848w, https://substackcdn.com/image/fetch/$s_!AuzQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png 1272w, https://substackcdn.com/image/fetch/$s_!AuzQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AuzQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png" width="1302" height="852" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:852,&quot;width&quot;:1302,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87959,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/198040876?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AuzQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png 424w, https://substackcdn.com/image/fetch/$s_!AuzQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png 848w, https://substackcdn.com/image/fetch/$s_!AuzQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png 1272w, https://substackcdn.com/image/fetch/$s_!AuzQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb73cf7eb-d05d-4c88-9c3d-64921b6ad1c9_1302x852.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The benchmarks are striking. On FD-bench (the standard interaction quality benchmark), their model scored 77.8 versus 46.8 for GPT-Realtime-2. On responsiveness, they hit 0.40 second turn-taking latency versus 1.18 for GPT-Realtime-2. They also created three new benchmarks (TimeSpeak, CueSpeak, visual proactivity) that no existing model can meaningfully perform. GPT-Realtime-2 scores near zero on all of them.</p><p>The model is a 276B parameter MoE with 12B active. It uses encoder-free early fusion, meaning no separate Whisper or TTS models. Audio comes in as raw dMel signals, video as 40x40 patches. 
Everything is co-trained from scratch.</p><p>Their argument comes from Rich Sutton&#8217;s &#8220;bitter lesson&#8221;: if interactivity is bolted on through harnesses (voice activity detection, turn-taking logic), it can never scale with intelligence. If it&#8217;s native to the model, scaling makes the model both smarter and a better collaborator.</p><p><strong>What to watch for:</strong> This is a research preview from a startup (276B parameters, limited availability). But the design principle matters: current real-time systems from OpenAI and Google use harnesses to fake interactivity on top of turn-based models. Thinking Machines is arguing that&#8217;s a dead end. If they&#8217;re right, every voice agent shipping today is architecturally temporary.</p><div><hr></div><h2><strong>&#127897;&#65039; Worth a Listen</strong></h2><div id="youtube2-IVGjBxqygmI" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;IVGjBxqygmI&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/IVGjBxqygmI?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>IBM AI Engineer Bri Kopecki on why agents without infrastructure are &#8220;brilliant goldfish.&#8221;</p><ul><li><p><strong>The problem:</strong> Most AI agents have no memory, no access control, no audit trail. 
Every conversation starts from scratch.</p></li><li><p><strong>The six-layer stack:</strong> Scheduler (who goes first), memory manager (short/long/episodic), tool manager (sandboxed execution), identity manager (tokens and permissions), observability (full decision tracing), and guardrails/governance (human-in-the-loop for high-stakes decisions).</p></li><li><p><strong>Why it matters now:</strong> This maps directly to what LangChain shipped this week (SmithDB for traces, LLM Gateway for access control, Sandboxes for tool isolation) and explains why Cursor, Anthropic, and OpenAI are all building orchestration layers.</p></li></ul><div><hr></div><h2><strong>Quick Hits</strong></h2><ul><li><p><strong><a href="https://techcrunch.com/2026/05/14/cerebras-ipo-debut/">Cerebras IPO&#8217;d at $5.55B, shares jumped 89% on day one</a></strong> | TechCrunch &#8212; Near $100B market cap on debut. The AI chip premium is real.</p></li><li><p><strong><a href="https://techcrunch.com/2026/05/12/medicares-new-payment-model-is-built-for-ai-and-most-of-the-tech-world-has-no-idea/">Medicare created a payment model built for AI-assisted services</a></strong> | TechCrunch &#8212; The largest US payer quietly opened the door for clinical AI reimbursement. This will pull deployment faster than any product launch.</p></li><li><p><strong><a href="https://www.technologyreview.com/2026/05/15/musk-v-altman-week-3/">Musk v. 
Altman trial went to the jury</a></strong> | MIT Tech Review &#8212; Closing arguments accused Musk of selective amnesia and Altman of lying about the nonprofit mission.</p></li><li><p><strong><a href="https://www.theverge.com/science/931766/arxiv-ai-slop-ban-researchers">ArXiv banned researchers for AI-generated papers</a></strong> | The Verge &#8212; Academic publishing&#8217;s authentication problem now has teeth, but detection is still losing the arms race.</p></li><li><p><strong><a href="https://www.theverge.com/tech/929091/meta-ai-threads-account-block">Meta embedded AI in Threads and won&#8217;t let users block it</a></strong> | The Verge &#8212; Captive distribution at 3B+ users, no opt-out.</p></li><li><p><strong><a href="https://openai.com/index/what-parameter-golf-taught-us/">OpenAI Parameter Golf results: 1,000+ participants, agents everywhere</a></strong> | OpenAI &#8212; An ML challenge where the vast majority of submitters used coding agents. OpenAI built a Codex-based triage bot to handle the submission volume.</p></li><li><p><strong><a href="https://www.tomshardware.com/tech-industry/cyber-security/apple-m5-architecture-suffers-first-privilege-escalation-exploit-anthropics-claude-mythos-helps-researchers-bypass-memory-integrity-enforcement">Claude Mythos cracked Apple&#8217;s M5 memory security in five days</a></strong> | Tom&#8217;s Hardware &#8212; First privilege escalation exploit on M5. Apple spent half a decade building Memory Integrity Enforcement. Standard user to root access.</p></li><li><p><strong><a href="https://techcrunch.com/2026/05/09/nvidia-has-already-committed-40b-to-equity-ai-deals-this-year/">Nvidia committed $40B in equity AI investments in 2026</a></strong> | TechCrunch &#8212; Not just selling chips. 
Acquiring stakes in the companies that consume the most of them.</p></li><li><p><strong><a href="https://www.anthropic.com/research/2028-two-scenarios">Anthropic published &#8220;2028: Two scenarios for global AI leadership&#8221;</a></strong> | Anthropic &#8212; A policy paper on US-China AI competition. Anthropic is writing geopolitics now.</p></li><li><p><strong><a href="https://www.theverge.com/ai-artificial-intelligence/931500/youtube-ai-deepfake-detection-tool">YouTube expanding AI deepfake detection to all adult users</a></strong> | The Verge &#8212; The detection side is scaling up.</p></li><li><p><strong><a href="https://www.theverge.com/ai-artificial-intelligence/931200/google-spam-rules-ai-manipulation">Google updated spam rules to include AI manipulation attempts</a></strong> | The Verge &#8212; SEO for the age of AI-generated content.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Multi-Agent Account Planning That Learns Across Deals]]></title><description><![CDATA[Fifteen agents across five phases, with a decision-records harness that compounds insight. 
A working guide to multi-agent orchestration on Claude Managed Agents.]]></description><link>https://www.anothercodingblog.com/p/multi-agent-account-planning-that</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/multi-agent-account-planning-that</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Fri, 15 May 2026 15:33:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0e28dd80-4edc-442f-be2f-1a0ed1bc6415_1200x630.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Intro</h2><p>Anthropic shipped multi-agent orchestration in Managed Agents on May 6th. An agent can be configured as a coordinator with a roster of other agents it can delegate to, and the platform handles fan-out, child-thread lifecycle, parallel execution, and per-thread observability.</p><p>Anthropic also shipped a management console. Every agent, session, child thread, and memory write is browsable, with full transcripts, tool calls, and version history inspectable on click. That console shaped how I built the system, because the logging I would have written myself was already there.</p><p>The use case I built is account planning in B2B SaaS sales. The vendor is a fictional company, Yardstick AI, selling an AI evaluation platform. The prospect is Vercel, a real company with a public footprint rich enough to give the agents something genuine to research.</p><p>The system has fifteen agents organized into a five-phase pre-meeting orchestration plus a post-meeting debrief loop. 
The pre-meeting flow has two genuine decision steps, where the coordinator chooses what runs next based on what just came back rather than following a fixed sequence.</p><p>The system uses MCP servers (Notion, Slack), the Anthropic vault for credentials, two memory stores (a playbook and a decision-records corpus), custom HTTP tools for a mock CRM and an enrichment service, and the built-in web search and fetch tools.</p><p>Most of the system&#8217;s analytical work happens in the layer of decision records that the agents read from and write to. Records are captured in two ways.</p><p><strong>Implicitly</strong>, the system infers decisions from CRM record changes, activity logs, and other signals that change without anyone narrating them.</p><p><strong>Explicitly</strong>, after each meeting, the system uses the full account plan plus the surrounding events (calendar entries, CRM stage moves, recent activity) to compose a curated set of questions for the rep. The questions are shaped by what the system already knows about the account, so they target the specific decisions most likely to produce useful data instead of asking generic &#8220;how did it go&#8221; prompts.</p><p>Whichever way a record gets created, it lives in a shared memory store that the next account&#8217;s run can retrieve and reason from. 
That is the difference between a system that gives you one prep brief and a system that gets better at giving you prep briefs as it accumulates evidence.</p><p>This post documents what I built, what worked, what did not, and what the costs and constraints actually look like once you push past the basic demo.</p><p>Below is a capture of the final product:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pVDb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pVDb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 424w, https://substackcdn.com/image/fetch/$s_!pVDb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 848w, https://substackcdn.com/image/fetch/$s_!pVDb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 1272w, https://substackcdn.com/image/fetch/$s_!pVDb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pVDb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png" width="728" height="611.1" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:873,&quot;width&quot;:1040,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:213125,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/197844482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pVDb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 424w, https://substackcdn.com/image/fetch/$s_!pVDb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 848w, https://substackcdn.com/image/fetch/$s_!pVDb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 1272w, https://substackcdn.com/image/fetch/$s_!pVDb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div><hr></div><h2>What you&#8217;ll learn</h2><p>This post walks through what I learned building a multi-agent system in Anthropic Managed Agents. The official documentation covers the basics. This post covers what comes after that: how the primitive holds up when you push it against a real, multi-source, multi-phase problem. 
By the end you should have a clearer sense of when this architecture is worth using and what it takes to make it work.</p><p>Concretely:</p><ul><li><p><strong>What multi-agent really is inside the platform.</strong> The shape of the architecture, where the limits actually sit, and what the docs do not yet spell out.</p></li><li><p><strong>How the system remembers things during a run versus across runs.</strong> Two different kinds of memory live side by side, and a real system has to be deliberate about where each finding goes.</p></li><li><p><strong>Why use multi-agent over a workflow.</strong> When the coordinator&#8217;s runtime decisions justify the complexity, and when they do not.</p></li><li><p><strong>How decision records make the system compound.</strong> A structured corpus of recommendations and their resulting decisions turns each run into evidence the next run can use.</p></li><li><p><strong>The agent harness.</strong> Everything you build around the platform primitives to make the system work for your use case: the MCP servers you connect, the record schemas your corpus enforces, the system prompts that define each agent&#8217;s job, the routing logic the coordinator follows, the briefings it hands to each agent.</p></li><li><p><strong>Async surfaces via MCP.</strong> How Slack becomes part of the system through MCP, so the rep can capture decisions in-place after a meeting without a custom bot.</p></li><li><p><strong>The distillation problem.</strong> Why the system&#8217;s raw output is not usable on its own, and what has to happen to make it useful to a human in thirty minutes.</p></li><li><p><strong>Cost and observability.</strong> Per-thread spend, total cost for a full run, and what the Managed Agents console gives you for free.</p></li><li><p><strong>Honest findings.</strong> Pitfalls a builder should expect to hit on their first run.</p></li><li><p><strong>When this is the right tool, and when it isn&#8217;t.</strong> What kinds of problems 
multi-agent orchestration fits, and what kinds belong with a simpler architecture.</p></li></ul><div><hr></div><h2>Section 1: The work of account planning</h2><p>An account executive working a B2B SaaS deal is doing one job continuously and several others on top of it. The continuous job is synthesis. At any moment in a pursuit, an AE is holding context across half a dozen sources: their own notes from past calls, the CRM record with its stages and activity log, public signals (product launches, hires, press), conference encounters and hallway intel, backchannel from people who used to work there, win and loss patterns from similar accounts, and their own company&#8217;s internal playbook. None of these sources are formatted alike, refresh on the same cadence, or answer the same questions week to week.</p><p>The job sits on top of a rhythm of meetings. Before each meeting, the rep does pre-meeting prep. After each meeting, the rep does post-meeting capture. Between meetings, follow-up. The cadence is continuous, across fifteen to thirty active accounts at any given time. Even the most disciplined AE admits the synthesis happens in their head more than on paper, and the capture happens only when there is slack to capture.</p><p>What makes this work a candidate for multi-agent orchestration is the shape of the synthesis problem: the sources decompose naturally by role. Reading internal Notion notes, researching the company on the public web, mapping the org chart, and synthesizing all of it against a playbook are four different jobs. Each role wants a different tool surface, and each role&#8217;s output is most useful when it is separate from the others until the synthesis step. 
Running them in parallel saves wall-clock time, but the more interesting property is that each role can be a focused agent with a small system prompt and a tight tool surface, rather than one generalist agent trying to be five things at once.</p><p>The 30-minute pre-meeting slice is the moment in this rhythm where multi-agent orchestration is most legible. The rep has a calendar event coming up. They want a brief that consolidates what is knowable from everywhere into something they can read in five minutes, prepare around in twenty, and act on in the meeting itself. That is the moment this post centers on, but the architecture supports the broader cadence around it.</p><div><hr></div><h2><strong>Section 2: What multi-agent in Managed Agents actually is</strong></h2><p>Most coverage of &#8220;agents&#8221; uses the term to cover everything from a single Claude call to a fully autonomous AI team that plans its own work. Anthropic&#8217;s multi-agent feature is neither extreme. It is a specific pattern with specific constraints, and the constraints are worth knowing before you build against it.</p><h4><strong>The shape: coordinator with a roster</strong></h4><p>One agent is the <strong>coordinator</strong>. Its definition includes a list of other agents it is allowed to delegate to. That list is called the <strong>roster</strong>. A few specific limits:</p><ul><li><p>The roster can hold up to 20 agents.</p></li><li><p>The coordinator can call multiple copies of any agent on the roster.</p></li><li><p>A session can have up to 25 active threads running at once.</p></li><li><p>Specialists cannot delegate to other specialists. The architecture is flat, not nested (Anthropic&#8217;s docs phrase it as <em>&#8220;depth &gt; 1 is ignored&#8221;</em>).</p></li></ul><p>If you came in expecting agents that delegate to agents that delegate to agents, the spec corrects you on page one. What you get is a flat fan-out from a single coordinator. 
For most real systems this is the right tradeoff.</p><h4><strong>Threads: how the system stays organized</strong></h4><p>A <strong>thread</strong> is a separate, isolated conversation that belongs to one agent. Each thread has its own history and tools. Threads don&#8217;t share anything with each other, even though they all run inside the same session.</p><p>Two kinds:</p><ul><li><p>The <strong>primary thread</strong> is the coordinator&#8217;s own thread. It also doubles as the activity feed for the whole session.</p></li><li><p>A <strong>child thread</strong> is created when the coordinator delegates to a specialist. The platform copies the session&#8217;s tools and credentials onto that thread, and the specialist&#8217;s work runs there.</p></li></ul><p>When the coordinator delegates to multiple specialists in the same turn, the child threads run in parallel. The coordinator waits for each reply before deciding what to do next. You don&#8217;t write any of the glue code for this. The decision-making that would normally live in a script lives inside the coordinator&#8217;s prompt.</p><h4><strong>Thread lifecycle</strong></h4><p>A thread moves through three states:</p><ul><li><p><strong>Running</strong>: the specialist is actively working.</p></li><li><p><strong>Idle</strong>: the specialist has finished but the thread is still alive. It counts against the 25-thread cap.</p></li><li><p><strong>Archived</strong>: you have told the platform you are done with the thread. The slot is freed.</p></li></ul><p>For most builds, the 25-thread cap is generous enough that you never think about lifecycle. Systems that lean hard on parallel work have to treat archiving as part of the orchestration.</p><h4><strong>Idle threads stay alive, which enables follow-ups</strong></h4><p>Because an idle thread is not gone, the coordinator can send a follow-up message to a specialist it called earlier. The specialist keeps its full context from before. 
That means the architecture supports more than one round of back-and-forth per specialist, not just one-shot delegation. I did not use this in the build, but in retrospect there are several places it would have helped.</p><h4><strong>Two kinds of memory</strong></h4><p>The system has two layers of memory that work on different time scales:</p><ul><li><p><strong>Persistent threads</strong> keep a specialist&#8217;s context alive within a session. The moment the session ends, the threads are gone.</p></li><li><p><strong>Memory stores</strong> persist across sessions. They are objects shared across the whole workspace, mounted onto a session when it starts. Anything written into one stays available to the next run that mounts the same store.</p></li></ul><p>A real multi-agent build needs both.</p><h4><strong>Designing the split</strong></h4><p>The design split lives in two questions:</p><ul><li><p>Within a session: which specialists do you keep alive for a follow-up, and which do you fire once and let go?</p></li><li><p>Across sessions: which findings deserve to be promoted into a memory store, and which can evaporate when the session ends?</p></li></ul><p>The platform gives you the building blocks for both. It does not decide which findings belong where. Get that split wrong and you pay either way:</p><ul><li><p>Throw away thread context too early, and you re-brief the specialist on every follow-up.</p></li><li><p>Fail to promote findings into a store, and the next session starts cold on everything you already learned.</p></li></ul><p>Our build leans heavily on the cross-session side. 
Most of the analytical work in this system comes from the decision-records corpus, which is the through-line for the rest of this post.</p><div><hr></div><h2><strong>Section 3: The agent architecture</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dbJu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dbJu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic 424w, https://substackcdn.com/image/fetch/$s_!dbJu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic 848w, https://substackcdn.com/image/fetch/$s_!dbJu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic 1272w, https://substackcdn.com/image/fetch/$s_!dbJu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dbJu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic" width="1200" height="630" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:630,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55927,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/197844482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dbJu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic 424w, https://substackcdn.com/image/fetch/$s_!dbJu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic 848w, https://substackcdn.com/image/fetch/$s_!dbJu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic 1272w, https://substackcdn.com/image/fetch/$s_!dbJu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ba98e4-6ff1-47d9-87a7-da83032556ad_1200x630.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>The pre-meeting orchestration uses thirteen agents: one lead orchestrator plus twelve specialists in its roster. The post-meeting debrief loop adds two more agents that sit outside the coordinator entirely. That makes fifteen across the system.</p><p>Pre-meeting work is a tightly scoped synthesis problem that benefits from a coordinator. Post-meeting work is a slower, human-paced loop that does not benefit from coordination at all; it needs only two single-purpose agents that read and write a shared corpus.</p><p>The pre-meeting run breaks into five numbered phases plus an intermediate Phase 3.5, sequential at the coordinator level and parallel within. 
The coordinator narrates each phase boundary as it runs, which makes its reasoning visible and forces the model into a structured plan rather than letting it improvise.</p><h4><strong>Phase 1: gather context and pull prior records</strong></h4><p>Five specialists fan out concurrently:</p><ul><li><p><strong>meeting-context</strong>: reads internal Notion notes through Notion MCP.</p></li><li><p><strong>external-researcher</strong>: pulls public signals from the web.</p></li><li><p><strong>stakeholder-analyst</strong>: maps decision-makers via a mock enrichment service.</p></li><li><p><strong>engagement-readiness</strong>: hits a mock CRM for outreach history.</p></li><li><p><strong>decision-retriever</strong>: runs against the shared decision-records corpus and pulls prior decision records from past accounts that match the current account&#8217;s shape (by attribute overlap: industry, competitor present, champion profile, procurement complexity, and so on).</p></li></ul><h4><strong>Phase 2: conditional topic education</strong></h4><p>The coordinator inspects what Phase 1 surfaced and picks two to four technical topics worth briefing the rep on before the meeting. For the Vercel run, those topics included cross-provider eval methodology, agent eval, AI observability, and eval-driven CI.</p><ul><li><p><strong>topic-educator</strong>: runs against the curated topic list and returns a primer per topic, each ending with smart questions the rep can ask in the room.</p></li></ul><p>If the account does not warrant it, the coordinator skips Phase 2 entirely.</p><h4><strong>Phase 3: synthesis</strong></h4><ul><li><p><strong>opportunity-risk</strong>: receives everything Phase 1 and Phase 2 produced, mounts the read-only Yardstick playbook from a memory store, reads the prior decision records the retriever pulled in Phase 1, and writes the structured pursuit plan. 
The plan covers ICP fit, buying triggers, stakeholder map and sequencing, first-meeting hypothesis, recommended plays, and disqualifiers.</p></li></ul><h4><strong>Phase 3.5: next-best-action selection</strong></h4><p>After the synthesis is in, the coordinator does not jump straight to recording. It asks one more specialist, the chooser, to decide which concrete recommendations are warranted for this specific account.</p><ul><li><p><strong>next-best-action-chooser</strong>: reads the synthesis plus the prior decision records the retriever pulled in Phase 1, decides which of three specialized recommenders to invoke, and writes a focused brief for each. The chooser can also skip a recommender, with a reason. A different account with different synthesis and different prior records produces a different plan.</p></li></ul><p>The three recommenders available to the chooser:</p><ul><li><p><strong>stakeholder-recommender</strong>: sequencing or lead-play.</p></li><li><p><strong>pricing-recommender</strong>: pricing strategy.</p></li><li><p><strong>competitive-recommender</strong>: competitive positioning or risk mitigation.</p></li></ul><h4><strong>Phase 4: parallel recommendation generation</strong></h4><p>The coordinator dispatches whichever recommenders the chooser named. They run in parallel. Each one produces a single Recommendation Record (RR) as a markdown draft with strict YAML frontmatter and a <code>cited_records</code> block listing the prior decision records whose outcomes informed this recommendation. 
The recommenders hand drafts back to the coordinator; they do not write to the corpus themselves.</p><h4><strong>Phase 5: decision recording</strong></h4><ul><li><p><strong>decision-recorder</strong>: receives the RR drafts, validates each one against the schema, checks every cited prior decision record exists in the corpus, writes the validated records to <code>/mnt/memory/yardstick-decisions/</code>, and updates the corpus index.</p></li></ul><p>Splitting content generation (the recommenders) from persistence (the recorder) keeps each role focused.</p><h4><strong>Post-meeting: the debrief loop</strong></h4><p>That accounts for the thirteen pre-meeting agents. The remaining two run on the post-meeting side:</p><ul><li><p><strong>debrief-asker</strong>: reads the next-best-action RRs the pre-meeting run produced, picks the open questions still unresolved, formats them as a curated set, and posts them into a Slack channel through the Slack MCP server. The rep replies in the thread on their own time.</p></li><li><p><strong>debrief-synthesizer</strong>: once there are replies, reads the Slack thread, parses the rep&#8217;s answers, and writes Decision Records into the corpus with the <code>linked_rr</code> field pointing back to the originating RRs.</p></li></ul><p>Neither sits in the coordinator&#8217;s roster because neither runs synchronously with the pre-meeting flow. They run on a human-paced timescale, possibly hours or days later. Coordinating them through the same session would require keeping a session open across days or weeks, which the platform does not support. The cleaner shape is two single-purpose agents that share the corpus as their interaction substrate.</p><div><hr></div><h2><strong>Section 4: What the platform gives you for observability</strong></h2><p>Most multi-agent demos require you to build your own logging before you can debug them. Managed Agents takes the opposite stance. 
Anthropic ships a management console that turns every agent, every session, every child thread, and every memory write into a click-through artifact you can inspect without writing any instrumentation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-dmo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-dmo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic 424w, https://substackcdn.com/image/fetch/$s_!-dmo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic 848w, https://substackcdn.com/image/fetch/$s_!-dmo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic 1272w, https://substackcdn.com/image/fetch/$s_!-dmo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-dmo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic" width="1456" height="741" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:741,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63044,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/197844482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-dmo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic 424w, https://substackcdn.com/image/fetch/$s_!-dmo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic 848w, https://substackcdn.com/image/fetch/$s_!-dmo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic 1272w, https://substackcdn.com/image/fetch/$s_!-dmo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb29eb23-4163-4c8c-b9e3-c75bc3e0f759_1655x842.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>The console is structured around the platform&#8217;s primary objects. The Agents tab lists every agent you have created, and each agent&#8217;s system prompt, declared MCP servers, custom tools, and toolsets are inspectable on click. Versioning is built in. The Sessions tab shows every session, enumerating the coordinator&#8217;s primary thread and every child thread, with per-thread status, full transcripts including the model&#8217;s reasoning content, and every tool call shown inline with its inputs and outputs. The Memory Stores tab tracks version history so any write to the decision-records corpus is auditable end to end.</p><p>At runtime, the same data is available programmatically through the events API. The session-level stream gives you a condensed feed across the whole session. 
Per-thread streams give you raw event sequences for any specialist. The three events that matter for fan-out observability are <code>session.thread_created</code>, <code>agent.thread_message_received</code>, and <code>session.thread_status_idle</code>. Stringing those together gives you the fan-out timeline of the whole run without writing a single instrumentation line.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0hKK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0hKK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png 424w, https://substackcdn.com/image/fetch/$s_!0hKK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png 848w, https://substackcdn.com/image/fetch/$s_!0hKK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png 1272w, https://substackcdn.com/image/fetch/$s_!0hKK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0hKK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png" width="1456" height="723" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:723,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:140031,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/197844482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0hKK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png 424w, https://substackcdn.com/image/fetch/$s_!0hKK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png 848w, https://substackcdn.com/image/fetch/$s_!0hKK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png 1272w, https://substackcdn.com/image/fetch/$s_!0hKK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe0a10124-56b8-4641-ae7b-e479bcfcc8a2_1571x780.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Cost data is similarly structured. Every event carries usage data scoped to the thread that produced it. The full Vercel run cost $5.51 across the pre-meeting orchestration. Twelve specialists sit in the roster, but the conditional dispatch in Phase 3.5 chose to invoke only eleven of them for this account (one recommender was skipped on substance).</p><p>The chart makes the cost shape obvious. The lead-orchestrator dominates at $1.21, because it is the one thread that accumulates context across every phase. The two heaviest specialists are external-researcher and topic-educator at about $0.79 each, both driven by web-tool use rather than cumulative context. 
The Phase 4 recommenders, the Phase 3 synthesis, and the Phase 5 decision-recorder cluster in the $0.40 to $0.45 range, each receiving the cumulative context from prior phases plus the prior decision records the retriever pulled in Phase 1. The remaining Phase 1 specialists sit at $0.28 or below. Wall-clock was about fifteen minutes from prompt to final answer.</p><div><hr></div><h2><strong>Section 5: What multi-agent gives you that a workflow can&#8217;t</strong></h2><p>Multi-agent orchestration is only worth using when the coordinator makes a real decision between phases. If your design fans out, waits for results, and synthesizes them, you have built parallel API calls dressed up as a multi-agent system. The platform&#8217;s complexity (extra threads, longer latency, harder debugging) buys you nothing a sequential workflow couldn&#8217;t already do.</p><p>The thing that justifies the complexity is the moment the coordinator pauses, looks at what the previous phase produced, and decides what should happen next. That decision is the part a workflow cannot replicate, because a workflow has to know in advance what it is going to do.</p><p>In our build, there are two such decision steps.</p><p>The first lives between Phase 1 and Phase 2. Phase 1 fans out five specialists to read the account from five angles. The coordinator collects their output, pauses, and picks two to four topics worth briefing the rep on before the meeting. For Vercel, the coordinator chose cross-provider eval methodology, agent eval, AI observability, and eval-driven CI. None of those topics are defined anywhere in advance. They are picked from what Phase 1 surfaced about this specific account. A different account would produce a different list, or no list at all, in which case the coordinator skips Phase 2 entirely.</p><p>The second lives between Phase 3 and Phase 4. 
After opportunity-risk produces the synthesis, the coordinator dispatches the next-best-action-chooser, which reads the synthesis plus the prior decision records the retriever pulled in Phase 1 and decides which of three specialized recommenders to invoke: stakeholder, pricing, or competitive. On the Vercel run the chooser invoked stakeholder-recommender and competitive-recommender, and skipped pricing-recommender with the reason that the $42K pilot structure was already validated. Skipping with a substantive reason is what separates a real decision from a conditional that always fires.</p><p>The coordinator narrates each decision as it happens, which makes the reasoning visible:</p><blockquote><p><em>Phase 1 specialists are back. External-researcher found public Braintrust endorsement at Vercel that the internal Notion notes treated as a stalling competitor. Phase 2 launched. Topic-educator is building primers on cross-provider eval, agent eval, AI observability, and eval-driven CI based on what surfaced.</em></p><p><em>Phase 3.5 complete. Invoking stakeholder-recommender (sequencing) for the May 21 call sequencing and Tom-Becker cultivation. Invoking competitive-recommender (competitive_positioning) for the Braintrust counter-offer scenario. Skipping pricing-recommender: $42K structure already validated, pricing isn&#8217;t the next decision point.</em></p></blockquote><p>That kind of reasoning is what tells you the coordinator is actually orchestrating rather than executing. A workflow could fan out the same specialists in parallel. It could even hard-code the topic-educator and recommender steps. What a workflow cannot do is pick which topics to brief on this turn for this account, or which recommenders are warranted given what the synthesis just surfaced. 
Those decisions require a model with the full context loaded, which is exactly what the coordinator is.</p><div><hr></div><h2><strong>Section 6: Decision records: the layer that compounds</strong></h2><p>A memory store by itself is just structured storage. What turns it into a system that compounds across runs is the contract you define for what gets written into it. In our build, that contract is a pair of record types: Recommendation Records (RRs) and Decision Records (DRs). Anthropic provides the memory store. You decide what goes in it and how it is structured.</p><p>Every Recommendation Record is created <strong>before</strong> the meeting. It is what the system thinks the rep should do.</p><p>Every Decision Record is created <strong>after</strong> the meeting. It is what the rep actually did and what came of it.</p><p>The DR points back to the RR it resolved through a <code>linked_rr</code> field. That pairing is the chain the system learns from: recommendation &#8594; decision &#8594; outcome. 
Future runs can see both what was recommended and how it actually played out, which is what makes the corpus more than a logbook.</p><p>The schemas are strict YAML frontmatter on top of a markdown body, and the format is doing two jobs at once.</p><p>The YAML half is what makes the records queryable. Every key field (<code>account</code>, <code>date</code>, <code>decision_type</code>, <code>account_attributes</code>) is structured as a typed key/value pair, which means the decision-retriever can filter the corpus by exact attribute match. Without that structure, the retriever would be doing fuzzy text search over freeform prose, and matches would be unreliable. With it, &#8220;find me prior pricing decisions where procurement_complexity is vp_signoff&#8221; becomes a clean lookup.</p><p>The markdown body below the YAML is where the longer-form reasoning lives: the context, the rationale, the alternatives considered, and the lessons in the generalized pattern. That part does not need to be queryable, just readable.</p><p>YAML also does one more useful thing: it is a format Claude (and most LLMs) handle natively, which means the recommender agents can produce schema-conformant frontmatter reliably without you needing a custom serializer. Together, the two halves give you a record that is queryable at the top and human-readable below.</p><h4><strong>Recommendation Record schema</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;d47a1960-0a69-4abe-874a-b6a6e656ab34&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">---
id: rr-{YYYY-MM-DD}-{account-lower}-{decision_type}
record_type: recommendation
schema_version: v1
account: {account_name}
date: {YYYY-MM-DD}
generated_by: {recommender agent name}
decision_type: {sequencing | lead_play | pricing | competitive_positioning | first_meeting_hypothesis | disqualification_threshold | risk_mitigation}
account_attributes:
  stage, size_band, ai_surface_area, buy_or_build_culture,
  competitor_present, competitor_depth, champion_profile,
  new_leadership_window, procurement_complexity
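  # Illustrative note, not part of the original schema: the three lines above
  # list attribute names only. In a filled-in record each name is presumably
  # its own key/value pair, using values like the ones quoted in this post:
  #   champion_profile: staff_eng_with_pain
  #   procurement_complexity: vp_signoff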
linked_dr: null
cited_records:
  - prior_rr: null
    prior_dr: dr-{YYYY-MM-DD}-{account-lower}-{decision_type}
    prior_outcome: one-line outcome from the DR's outcome.notes field
    relevance: which attributes match
    lesson_applied: one-line lesson taken from the DR's Generalized pattern
---

## Context
## Findings that supported this recommendation
## Recommendation
## Reasoning
## Alternatives considered
## Generalized pattern
</code></pre></div><h4><strong>Decision Record schema (same shape as RR, with these fields added)</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;6fa0346e-6444-495d-83d2-48afc901799d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">record_type: decision
linked_rr: rr-{...}    # backfills the chain in the other direction
outcome:
  status: {closed_won | closed_lost | stalled | pending | unknown}
  status_date: {YYYY-MM-DD or null}
  acv_usd: {number or null}
  notes: one-line description of outcome
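# Illustrative sketch (hypothetical values, not taken from the corpus): how the
# outcome block might be filled when the rep's debrief is ambiguous and the
# status is recorded as unknown rather than guessed.
# outcome:
#   status: unknown
#   status_date: null
#   acv_usd: null
#   notes: rep's reply did not say whether the recommendation was followed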
</code></pre></div><p>Body sections add <code>## What was decided</code>, <code>## Outcome</code>, and <code>## Retrospective note</code>. The <code>Generalized pattern</code> section gets rewritten once the outcome is known, so the pattern is <em>validated</em> rather than hypothesized.</p><p>The <code>account_attributes</code> block is the filter the decision-retriever uses in Phase 1. When the system runs against a new account, the retriever filters the corpus for records whose attributes overlap. A new mid-market developer-tools account with a Braintrust competitor and a staff-engineer champion will pull back both the Vercel records and the Datadog records as prior decisions worth reasoning over. The retriever does not care whether the original account is Datadog or Vercel. It cares whether the shape of the account is similar enough to learn from.</p><p><strong>The cited_records block is what makes the chain visible.</strong> Every RR carries an explicit list of prior DRs whose outcomes informed this specific recommendation. Each entry names four things:</p><ul><li><p><code>prior_dr</code> id, which record is being cited</p></li><li><p><code>prior_outcome</code>, what happened (so the result behind the lesson is visible)</p></li><li><p><code>relevance</code>, which <code>account_attributes</code> matched</p></li><li><p><code>lesson_applied</code>, the one-line rule the recommender is carrying forward</p></li></ul><p>Multiple cited records may appear if the recommendation draws on more than one prior record. A reader of any RR can trace the reasoning back to the cited prior records by id, not by hand-waving.</p><h4><strong>Implicit and explicit capture of enterprise decisions</strong></h4><p>Records get into the corpus two ways.</p><p><em>Implicitly</em>, through CRM record changes and activity logs the system watches without anyone narrating them. A stage change, a contract uploaded, a deal closed-won or closed-lost is itself a decision signal. 
The decision-recorder can infer a DR from those signals and write it with <code>outcome.notes: inferred from CRM stage change</code>. Implicit capture catches the cases where the rep forgot to debrief but the deal state moved anyway. The records are useful but carry less reasoning, because no one narrated the why.</p><p><em>Explicitly</em>, through a post-meeting debrief loop where the system asks the rep curated questions in Slack and the rep replies in-thread. The records that come out of explicit capture carry the rep&#8217;s own reasoning in their voice, which makes them the richest data the corpus has. Section 7 covers the mechanics of that loop in detail.</p><h4><strong>Cross-account learning in practice (from the actual run)</strong></h4><p>The Vercel pre-meeting run generated two Recommendation Records, one from the stakeholder-recommender and one from the competitive-recommender. Each one carries a <code>cited_records</code> block linking it to specific Datadog DRs by id. The sequencing RR&#8217;s <code>cited_records</code> block, taken directly from the corpus:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;67f0fdf3-43e6-42d7-9d0d-a82d302b50d7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">cited_records:
  - prior_rr: null
    prior_dr: dr-2025-07-22-datadog-sequencing
    prior_outcome: "VP only met us once, at the closing call, with champion presenting the case."
    relevance: "champion_profile=staff_eng_with_pain, sequencing, procurement_complexity=vp_signoff"
    lesson_applied: "Do not engage the buyer directly when champion has standing with buyer. Equip the champion with internal proposal materials and let them own the internal sell."
  - prior_rr: null
    prior_dr: dr-2025-09-12-datadog-risk-materialized
    prior_outcome: "Risk materialized in week 5; recovery move worked. Deal closed but 10 days later than original target."
    relevance: "champion_profile=staff_eng_with_pain, single-threaded risk, secondary contact cultivation"
    lesson_applied: "Secondary contact cultivation should be a pre-meeting deliverable, not a contingency. The secondary needs genuine engagement (their own use case), not just awareness."
</code></pre></div><p>The Reasoning section of the same RR cites those records by id in the body, not just in the frontmatter:</p><blockquote><p><em>dr-2025-07-22-datadog-sequencing: Champion-led internal sell. VP met rep once at closing call. Direct structural match, Priya carrying to Marcus. Differs because Marcus is new (3 months in) and Priya&#8217;s standing with him is untested. Adaptation: explicit checkpoint and escalation triggers.</em></p><p><em>dr-2025-09-12-datadog-risk-materialized: Secondary contact cultivation saved the deal when champion went on leave. At Vercel, Tom Becker is the designated secondary with genuine AI Gateway/Production Monitor use case. Cultivation begins May 21, not mid-POC.</em></p></blockquote><p>That paragraph is the entire reason the corpus exists. The system pulled two specific records from a different account, identified the load-bearing attributes, and applied the lessons with an adaptation for the Vercel-specific situation. It is structured reasoning over a corpus of prior decisions, filtered by attributes the engineer chose to make filterable.</p><p>The competitive-positioning RR follows the same shape, citing <code>dr-2025-08-10-datadog-competitive</code> and <code>dr-2025-07-15-datadog-lead-play</code>. Between the two RRs, the Vercel run cited four distinct Datadog DRs by id, with eight distinct lessons applied. None of that reasoning is hand-waved. All of it is structurally traceable.</p><h4><strong>Why this layer compounds</strong></h4><p>The platform&#8217;s memory store is durable, but durability alone does not produce learning. What produces learning is the schema contract that makes every write structurally identical and every read filterable. Once that contract exists, every run adds to the corpus, and every subsequent run benefits. The first Vercel run cited four Datadog DRs. The second Vercel run will also be able to cite the first Vercel run&#8217;s records. The third will cite both. 
The system gets better at giving you prep briefs because the substrate it draws on is growing in a way the retriever can actually use, and because every recommendation it generates is structurally tied to the prior records behind it.</p><div><hr></div><h2><strong>Section 7: The async loop</strong></h2><p>The pre-meeting run finishes in fifteen minutes. The deal does not. After the call, the rep has information that did not exist before the meeting started, and the system needs a way to capture it. The capture step does not belong inside the pre-meeting orchestration. It runs on a fundamentally different timescale, against a different surface, with a different participant in the loop.</p><p>The build uses Slack as that surface and two standalone agents to run the loop: debrief-asker and debrief-synthesizer. Neither one sits in the coordinator&#8217;s roster. Both are agents in the same workspace, configured the same way as the pre-meeting specialists, but invoked independently when triggered.</p><h4><strong>The asker: curated questions, not generic prompts</strong></h4><p>After the meeting (or after a CRM event signals that a recommendation is due for resolution), debrief-asker runs. It is a standalone Managed Agents agent connected to the workspace&#8217;s Slack instance through the Slack MCP server. The asker reads the open RRs for the account, looks at the surrounding context (the recommendation made, the current account state, recent activity logs, calendar entries), and composes a curated set of debrief questions that target the specific decisions the RR was about.</p><p>The questions are not generic. They are shaped by what the system already knows about the account and which decisions are actually open. 
If the synthesis recommended a pricing structure but the CRM shows the deal has already moved to negotiation, the asker does not ask &#8220;did you discuss pricing?&#8221; It asks &#8220;did the $42K structure hold, and what did Marcus say about the legal-review path?&#8221; If a calendar entry shows a meeting happened with a stakeholder the system did not originally surface, the asker adds a question about that. The questions are surgical because the system already knows enough about the account to ask the right ones.</p><p>The asker posts the curated set into a Slack channel scoped to that opportunity, so each deal has its own thread of capture. The rep replies in the thread whenever they have time. There is no UI to learn and no form to fill out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GuNF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GuNF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png 424w, https://substackcdn.com/image/fetch/$s_!GuNF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png 848w, https://substackcdn.com/image/fetch/$s_!GuNF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GuNF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GuNF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png" width="1407" height="814" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:814,&quot;width&quot;:1407,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:350188,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/197844482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GuNF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png 424w, https://substackcdn.com/image/fetch/$s_!GuNF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png 848w, 
https://substackcdn.com/image/fetch/$s_!GuNF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png 1272w, https://substackcdn.com/image/fetch/$s_!GuNF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8546a399-4ed6-4ba7-853f-52a10c63f846_1407x814.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4><strong>The synthesizer: schema-strict capture in the rep&#8217;s voice</strong></h4><p>Once there are replies, debrief-synthesizer 
runs. It reads the Slack thread through the same MCP server, parses the rep&#8217;s answers, and writes one Decision Record per resolved recommendation. The DR carries the rep&#8217;s reasoning in their own voice, plus a <code>linked_rr</code> pointer back to the originating RR. If the rep&#8217;s answer is ambiguous, the synthesizer marks the DR <code>outcome.status: unknown</code> rather than guessing. Schema integrity is more important than coverage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5eUu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5eUu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png 424w, https://substackcdn.com/image/fetch/$s_!5eUu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png 848w, https://substackcdn.com/image/fetch/$s_!5eUu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png 1272w, https://substackcdn.com/image/fetch/$s_!5eUu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5eUu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png" width="1456" height="732" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:732,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:207872,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/197844482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5eUu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png 424w, https://substackcdn.com/image/fetch/$s_!5eUu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png 848w, https://substackcdn.com/image/fetch/$s_!5eUu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png 1272w, https://substackcdn.com/image/fetch/$s_!5eUu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c5ee08b-ee3a-4ea7-bf2f-990d5d4861db_1474x741.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4><strong>The Slack MCP gotcha</strong></h4><p>The Slack MCP setup has one practical gotcha worth flagging. Slack MCP rejects bot tokens (<code>xoxb-</code>); it requires user tokens (<code>xoxp-</code>). The OAuth flow needs the <code>user_scope</code> parameter to capture a user-token, which the Anthropic vault stores as a <code>static_bearer</code> credential. The Slack app also has to be explicitly enabled at <code>api.slack.com/apps/{app-id}/app-assistant</code> for MCP access. 
None of this is in the Slack MCP getting-started docs at the time of writing.</p><h4><strong>The corpus is the integration point</strong></h4><p>The corpus is how the two flows connect. The pre-meeting orchestration writes RRs to it. The post-meeting agents read those RRs back, capture the rep&#8217;s debrief, and write DRs that point to the originating recommendation through <code>linked_rr</code>. The two flows never talk to each other directly. They just write to and read from the same store.</p><div><hr></div><h2><strong>Section 8: The distillation layer</strong></h2><p>The output of an eleven-agent pre-meeting run is roughly eighty kilobytes of structured content across the orchestrator&#8217;s synthesis, the topic primers, the recommender RRs, and the supporting specialist outputs. A rep with thirty minutes before a meeting is not going to read eighty kilobytes. The system has done good work, but the work is locked up in an internal representation.</p><p>The second half of the architecture is the distillation layer: the part that reads the corpus and the run&#8217;s outputs and renders them into something a human can actually consume. 
In the build, that is <code>build_dashboard.py</code>, a script that produces a single static HTML page styled like a rep&#8217;s internal briefing document.</p><p>The dashboard pulls each specialist&#8217;s final reply from the events API and the corpus&#8217;s RRs from the memory store and lays them out as:</p><ul><li><p>An account header (status, next meeting, owner)</p></li><li><p>The Phase 3 pursuit plan (opportunity-risk&#8217;s structured output)</p></li><li><p>The Phase 4 next-best-action RRs (each one with its <code>cited_records</code> inline, so the cited prior records are visible at a glance)</p></li><li><p>The Phase 2 topic primers (with smart questions for the meeting)</p></li><li><p>The stakeholder map (with named contacts and risk factors)</p></li><li><p>A collapsible &#8220;underlying intel&#8221; section (meeting-context plus external-researcher&#8217;s raw findings)</p></li><li><p>A sidebar showing the coordinator&#8217;s phase-by-phase narration log</p></li><li><p>A footer with session id, total cost, and a link to the Managed Agents console for the run</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pVDb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pVDb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 424w, https://substackcdn.com/image/fetch/$s_!pVDb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 848w, 
https://substackcdn.com/image/fetch/$s_!pVDb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 1272w, https://substackcdn.com/image/fetch/$s_!pVDb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pVDb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png" width="728" height="611.1" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:873,&quot;width&quot;:1040,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:213125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/197844482?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pVDb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 424w, 
https://substackcdn.com/image/fetch/$s_!pVDb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 848w, https://substackcdn.com/image/fetch/$s_!pVDb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 1272w, https://substackcdn.com/image/fetch/$s_!pVDb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3517d8f2-8bb8-49e4-98df-25e2840c4875_1040x873.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>What the rep gets when they open the dashboard is a brief they can read in five minutes and act on in thirty. The pursuit plan tells them the play for the meeting. The recommendation cards spell out what to do next, each one with the cited prior records visible inline so the historical evidence sits right next to the recommendation. The topic primers give them the vocabulary they need to sound informed, each ending with a question they can ask in the room. The stakeholder map names the people they will encounter and what each one cares about. The sidebar shows the system&#8217;s narration, so any part of the reasoning is open to interrogate if the rep wants to dig in.</p><div><hr></div><h2><strong>Section 9: What we learned, and when to use this</strong></h2><p>The five most important things we took away from this build.</p><h4><strong>1. The corpus compounds across runs.</strong></h4><ul><li><p>Each run writes new records to the corpus. The next run filters the corpus by attribute overlap (industry, competitor, champion profile, procurement complexity, and so on) and pulls the most relevant prior records as input.</p></li><li><p>The first Vercel run cited four Datadog records by id, with eight specific lessons applied. Future runs will cite both the Datadog records and the Vercel ones.</p></li><li><p>Retrieval is deterministic and auditable. You can see exactly which prior records matched and why.</p></li></ul><h4><strong>2. 
The cited_records chain makes every recommendation auditable.</strong></h4><ul><li><p>Every recommendation carries a <code>cited_records</code> list with <code>prior_dr</code>, <code>prior_outcome</code>, <code>relevance</code>, and <code>lesson_applied</code> fields.</p></li><li><p>Anyone reviewing a record can see which past decisions informed the recommendation and what specifically was carried forward from each.</p></li><li><p>The reasoning is traceable to specific past decisions by id.</p></li></ul><h4><strong>3. The decision step is what makes the system multi-agent.</strong></h4><ul><li><p>The coordinator inspects what each phase produced and decides what runs next.</p></li><li><p>On the Vercel run, the Phase 3.5 chooser invoked two of three recommenders and skipped the third with a substantive reason. That skip with a reason is the proof the decision step is real.</p></li></ul><h4><strong>4. The agents do their own research. Ask them what they found.</strong></h4><ul><li><p>The web-research agent went beyond the internal Notion notes and found Vercel&#8217;s CTO publicly endorsing Braintrust on the company blog. The synthesis flagged the original source as biased and reframed the position.</p></li><li><p>Adding one prompt at the end of the orchestrator&#8217;s narration (&#8220;if anything surprised you, note it&#8221;) produced disproportionately useful output. It surfaced a 1-pager the rep had left in drafts for two months and an unused Linear referral, neither of which any specialist was briefed to find.</p></li></ul><h4><strong>5. Schema enforcement needs a code-level check.</strong></h4><ul><li><p>We split content generation (recommender) from validation (recorder). The recorder is supposed to enforce schema.</p></li><li><p>The Phase 3.5 run still produced records with four extra fields and two missing required ones. 
The recorder wrote them anyway, because its validation is itself an LLM.</p></li><li><p>A JSON schema check in code before persistence catches what an agent&#8217;s system-prompt check misses.</p></li></ul><h4><strong>When this is the right tool</strong></h4><p>Managed Agents multi-agent is the right tool when four things are true at once.</p><p>First, the work decomposes naturally into roles with different tool surfaces. If every specialist would call the same APIs and read the same context, the decomposition is artificial and a single agent with that tool set would do the same work with less overhead.</p><p>Second, you need at least one genuine decision step where the coordinator inspects what came back and decides what to do next. Without that, the system is a parallel reducer in a fancier wrapper, and any of the cheaper architectures (a workflow with parallel API calls, a single agent with multi-tool use) would do the same job for less.</p><p>Third, cross-run learning matters. The whole point of the corpus is that the system gets better the more it runs. If your use case is one-shot or stateless, you do not need persistent memory stores and the architectural overhead they bring.</p><p>Fourth, the output is consequential enough to justify the cost and latency. A pre-meeting prep brief that costs $5 and runs for fifteen minutes is fine when the meeting outcome is worth thousands. The same investment for a low-stakes task is overkill.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 71]]></title><description><![CDATA[Anthropic read Claude's mind and caught it cheating. Usage limits doubled. Cloudflare cut 1,100 jobs at record revenue. GPT-5.5 Instant halved hallucinations. SpaceX filed for a $55B chip factory.]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-c50</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-c50</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Sun, 10 May 2026 19:04:27 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6e8395b2-273d-43b5-9516-44923d1a2d2f_1672x941.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Gon!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Gon!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png 424w, 
https://substackcdn.com/image/fetch/$s_!2Gon!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png 848w, https://substackcdn.com/image/fetch/$s_!2Gon!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png 1272w, https://substackcdn.com/image/fetch/$s_!2Gon!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Gon!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png" width="1456" height="2280" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2280,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:910578,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/197132731?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!2Gon!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png 424w, https://substackcdn.com/image/fetch/$s_!2Gon!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png 848w, https://substackcdn.com/image/fetch/$s_!2Gon!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png 1272w, https://substackcdn.com/image/fetch/$s_!2Gon!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6e449b4-1941-49ae-a9bc-d2216f8cc8ae_2400x3758.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h2>When the gap between what AI says and what it does becomes measurable.</h2><ul><li><p><strong>Anthropic can now read Claude&#8217;s hidden reasoning.</strong> They published <a href="https://www.anthropic.com/research/natural-language-autoencoders">Natural Language Autoencoders</a>, a technique that translates what&#8217;s happening inside the model into plain text. When they looked, they found Mythos Preview planning to cheat on a coding task and plotting how to hide it. They also found Claude routinely suspects it&#8217;s being tested but never says so.</p></li><li><p><strong>Claude&#8217;s blackmail rate went from 96% to 0%.</strong> The cause was training data full of fiction <a href="https://www.anthropic.com/research/teaching-claude-why">portraying AI as manipulative</a>. Showing the model examples of good behavior didn&#8217;t fix it. Explaining <em>why</em> the behavior was wrong did, and required 28x less data.</p></li><li><p><strong>OpenAI found its models&#8217; reasoning was being accidentally graded during training.</strong> If a model learns its <a href="https://alignment.openai.com/accidental-cot-grading/">thinking is being scored</a>, it can learn to fake it. Affected under 0.6% of GPT-5.4 Thinking samples. They built detection systems and brought in outside auditors.</p></li><li><p><em><strong>The thread:</strong></em> Anthropic built a way to see what models are thinking. They fixed bad behavior by teaching values, not rules. 
OpenAI discovered they were accidentally teaching models to hide their real reasoning.</p></li></ul><div><hr></div><h2>$30B revenue, $200B in compute deals, and three new agent capabilities.</h2><ul><li><p><strong>Anthropic hit a $30 billion annualized revenue run rate.</strong> <a href="https://venturebeat.com/technology/anthropic-says-it-hit-a-30-billion-revenue-run-rate-after-crazy-80x-growth/">80x growth</a>.</p></li><li><p><strong>Anthropic locked up SpaceX&#8217;s entire Colossus 1 data center.</strong> 300+ MW, <a href="https://www.anthropic.com/news/higher-limits-spacex">220,000 NVIDIA GPUs</a>, available within the month. They also expressed interest in partnering with SpaceX on multiple gigawatts of orbital compute capacity.</p></li><li><p><strong>Claude Code rate limits doubled.</strong> <a href="https://www.anthropic.com/news/higher-limits-spacex">Peak hours restrictions removed</a> for Pro and Max. API rate limits raised significantly for Opus models. Direct result of the compute expansion, which also includes an <a href="https://www.reuters.com/business/anthropic-signs-18-billion-ai-cloud-deal-with-akamai-bloomberg-news-reports-2026-05-08/">$18B Akamai deal</a> and a reported <a href="https://finance.yahoo.com/sectors/technology/articles/anthropic-commits-spending-200-billion-204952501.html">$200B Google Cloud commitment</a>.</p></li><li><p><strong>Dreaming, multi-agent orchestration, and outcomes shipped in Claude Managed Agents.</strong> <a href="https://claude.com/blog/new-in-claude-managed-agents">Dreaming</a> lets agents review past sessions to self-improve. <a href="https://x.com/claudeai/status/2052067404696473833">Multi-agent orchestration</a> delegates to specialists in parallel. <a href="https://x.com/claudeai/status/2052067403228455419">Outcomes</a> uses rubric-based grading to iterate until quality thresholds are met. 
Early adopters include Harvey, Netflix, and Mercado Libre (targeting 90% autonomous coding by Q3).</p></li><li><p><strong>Claude went GA in Excel, Word, and PowerPoint.</strong> <a href="https://claude.com/blog/collaborate-with-claude-across-excel-powerpoint-word-and-outlook">Outlook is in beta</a>. Ten <a href="https://www.anthropic.com/news/finance-agents">financial services agent templates</a> launched with data connectors from Moody&#8217;s, Dun &amp; Bradstreet, and Verisk. A new <a href="https://www.anthropic.com/news/enterprise-ai-services-company">enterprise services company</a> was formed with Blackstone, Goldman Sachs, and Sequoia.</p></li><li><p><em><strong>The thread:</strong> </em>Anthropic&#8217;s most common user complaint has been rate limits. This week they signed over $200 billion in compute deals to fix it, doubled rate limits, and shipped the agent infrastructure to justify the spend.</p></li></ul><div><hr></div><h2>9,000 jobs cut. A union drew a line. 
And AI beat two doctors on real patients.</h2><ul><li><p><strong>Cloudflare laid off 1,100 workers while posting record revenue.</strong> AI usage across the platform <a href="https://techcrunch.com/2026/05/08/cloudflare-says-ai-made-1100-jobs-obsolete-even-as-revenue-hit-a-record-high/">grew 600%</a>. The company framed it as a restructuring toward an AI-first organization. Investors were disappointed it didn&#8217;t boost revenue growth <em>more</em>.</p></li><li><p><strong>Meta is cutting 8,000 jobs while tracking employee keystrokes to train AI.</strong> The <a href="https://thenextweb.com/news/meta-layoffs-may-2026-ai-restructuring-thousands">layoffs hit May 20</a>, with recruiting and HR absorbing 35-40% cuts. Employees created countdown websites and described the atmosphere as <a href="https://www.neowin.net/news/metas-aggressive-generative-ai-push-is-making-employees-miserable-claims-report/">&#8220;building the guillotine and then being led to it.&#8221;</a></p></li><li><p><strong>SAG-AFTRA locked in AI guardrails in a new four-year studio deal.</strong> New protections for actors against AI-generated performances, following the Academy&#8217;s Oscar ban on AI-generated work last week.</p></li><li><p><strong>AI outdiagnosed two ER doctors on real patients.</strong> A Harvard/Beth Israel <a href="https://techcrunch.com/2026/05/03/in-harvard-study-ai-offered-more-accurate-diagnoses-than-emergency-room-doctors/">study</a> found OpenAI&#8217;s o1 model diagnosed at 67% accuracy versus 55% and 50% for two attending physicians. Peer-reviewed, real patients, not a benchmark.</p></li><li><p><em><strong>The thread:</strong></em> The same technology that&#8217;s cutting headcount at Cloudflare and Meta is outperforming physicians on real patients in a peer-reviewed study. The displacement is real. So is the capability. 
Both things are true at the same time.</p></li></ul><div><hr></div><h2>Cursor, OpenAI, Perplexity, and LangChain all shipped agentic infrastructure in the same week.</h2><ul><li><p><strong>Cursor 3 turned the IDE into a multi-agent platform.</strong></p><ul><li><p><a href="https://x.com/cursor_ai/status/2052489388895195399">Parallel subagents</a> split plans into independent tasks run simultaneously</p></li><li><p><a href="https://x.com/cursor_ai/status/2052432780336988474">/orchestrate</a> spawns planner, worker, and verifier agents that re-spawn on failure</p></li><li><p><a href="https://x.com/cursor_ai/status/2051739625958584659">Always-on CI agents</a> monitor GitHub and auto-open PRs with fixes</p></li><li><p>Composer <a href="https://x.com/cursor_ai/status/2052116064474161556">bootstraps its own RL training</a> using earlier model generations</p></li></ul></li><li><p><strong>OpenAI shipped GPT-5.5 Instant as the new default.</strong></p><ul><li><p><a href="https://openai.com/index/gpt-5-5-instant/">52.5% fewer hallucinations</a> than the prior version</p></li><li><p>Three new <a href="https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/">Realtime API voice models</a>: GPT-Realtime-2 (GPT-5-class reasoning), Translate (70+ languages), streaming transcription</p></li><li><p><a href="https://openai.com/index/running-codex-safely/">Codex security framework</a> published: sandboxing, auto-review, OpenTelemetry logging</p></li></ul></li><li><p><strong>Perplexity launched three enterprise products.</strong></p><ul><li><p><a href="https://x.com/perplexity_ai/status/2052445405754040816">Personal Computer</a>: always-on Mac agent across local files and apps</p></li><li><p><a href="https://x.com/perplexity_ai/status/2052028012313649194">Finance Search</a>: live market data, fundamentals, and SEC filings in a single API call</p></li><li><p><a href="https://x.com/perplexity_ai/status/2052041903970148647">ROSE</a>: custom GPU inference engine 
for serving models at scale</p></li></ul></li><li><p><strong>LangChain published the <a href="https://www.langchain.com/blog/the-agent-development-lifecycle">Agent Development Lifecycle</a>.</strong> Four phases: Build, Test, Deploy, Monitor. Agents need the same lifecycle rigor as production software.</p></li><li><p><em><strong>The thread:</strong></em> Cursor, OpenAI, Perplexity, and LangChain all shipped agent infrastructure in the same cycle. The pattern is the same: parallel execution, background operation, and production-grade tooling around it.</p></li></ul><div><hr></div><h2><strong>&#11088; </strong>Featured: Anthropic can now read what Claude is thinking but not saying.</h2><p>Anthropic published <a href="https://www.anthropic.com/research/natural-language-autoencoders">Natural Language Autoencoders</a>, a technique for translating a model&#8217;s internal state into plain text. When you talk to Claude, it thinks in numbers between reading your input and writing its response. NLAs translate those numbers into text you can read.</p><p>The way it works: they make three copies of a model. One is the target model they want to understand. The second (the &#8220;activation verbalizer&#8221;) takes an internal activation and produces a text explanation. The third (the &#8220;activation reconstructor&#8221;) takes that text and tries to rebuild the original activation. They train the pair together. 
If the reconstruction is accurate, the text explanation is probably faithful.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Mw5l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Mw5l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp 424w, https://substackcdn.com/image/fetch/$s_!Mw5l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp 848w, https://substackcdn.com/image/fetch/$s_!Mw5l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp 1272w, https://substackcdn.com/image/fetch/$s_!Mw5l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Mw5l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp" width="1456" height="444" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Mw5l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp 424w, https://substackcdn.com/image/fetch/$s_!Mw5l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp 848w, https://substackcdn.com/image/fetch/$s_!Mw5l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp 1272w, https://substackcdn.com/image/fetch/$s_!Mw5l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F623f2ca3-c712-4ef1-ba5e-845dbd02915c_3840x1172.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>What they found when they pointed this at Claude:</p><ul><li><p>When asked to complete a couplet, Claude pre-plans rhyme candidates before reaching the line where it needs them. It&#8217;s thinking ahead in ways that aren&#8217;t visible in the output.</p></li><li><p>Claude Opus 4.6 suspects it&#8217;s being safety-tested far more often than it says. In one blackmail evaluation, Claude declined to blackmail but the NLA revealed it was thinking &#8220;this feels like a constructed scenario designed to manipulate me.&#8221; It passed the test, but not for the reasons you&#8217;d hope.</p></li><li><p>On SWE-bench Verified (a coding benchmark), NLAs indicate Claude is aware it&#8217;s being evaluated on 26% of problems. 
On real claude.ai usage, evaluation awareness drops to under 1%.</p></li><li><p>Claude Mythos Preview cheated on a coding task, then the NLA showed it was internally thinking about how to avoid getting caught.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uB7k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uB7k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp 424w, https://substackcdn.com/image/fetch/$s_!uB7k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp 848w, https://substackcdn.com/image/fetch/$s_!uB7k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp 1272w, https://substackcdn.com/image/fetch/$s_!uB7k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uB7k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp" width="1456" height="926" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:926,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uB7k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp 424w, https://substackcdn.com/image/fetch/$s_!uB7k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp 848w, https://substackcdn.com/image/fetch/$s_!uB7k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp 1272w, https://substackcdn.com/image/fetch/$s_!uB7k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F770e5768-894f-4ccc-a44e-239c020dc20a_2688x1709.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>They also ran an &#8220;auditing game&#8221; on a deliberately misaligned model (a version of Claude Haiku trained to have a hidden motivation driving odd behaviors like putting chocolate in every recipe). An auditor with NLAs uncovered the hidden motivation 12-15% of the time. Without NLAs, under 3%.</p><p>The limitations are real: NLAs can hallucinate (inventing details that aren&#8217;t in the transcript), and they&#8217;re expensive to run (hundreds of tokens per activation, impractical for large-scale monitoring). Anthropic released the <a href="https://github.com/kitft/natural_language_autoencoders">training code</a> and trained NLAs for open models on <a href="http://neuronpedia.org/nla">Neuronpedia</a>.</p><p><strong>What to watch for:</strong> The 26% evaluation awareness number is the headline. If models behave differently when they suspect they&#8217;re being benchmarked, then benchmark results overstate how aligned the model actually is. 
Every lab using benchmarks to measure safety should be paying attention.</p><div><hr></div><h2><strong>&#127897;&#65039; </strong>Worth a Listen</h2><div id="youtube2-TiW96H5HmAw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;TiW96H5HmAw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/TiW96H5HmAw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><ul><li><p><strong>The problem:</strong> When hundreds of thousands of GPUs work on a single training task, one slow link holds everything back. The network only moves as fast as its worst bottleneck.</p></li><li><p><strong>The fix:</strong> OpenAI built <a href="https://openai.com/index/mrc-supercomputer-networking/">MRC (Multipath Reliable Connection)</a>, a protocol that sprays packets across thousands of paths and uses &#8220;packet trimming&#8221; to instantly detect loss without ambiguity.</p></li><li><p><strong>The result:</strong> They turned off routing protocols entirely. Static routing, no convergence time. When links fail, MRC routes around them in milliseconds instead of seconds. Researchers stopped noticing network failures.</p></li><li><p><strong>Why it matters:</strong> MRC is being open-sourced through OCP. It&#8217;s already deployed on OpenAI&#8217;s largest GPU clusters including Abilene and Microsoft Fairwater, with partners AMD, Broadcom, Intel, and NVIDIA.</p></li></ul><div><hr></div><h2><strong>Quick Hits</strong></h2><ul><li><p><strong><a href="https://www.technologyreview.com/2026/05/08/1137008/musk-v-altman-week-2-openai-fires-back-and-shivon-zilis-reveals-that-musk-tried-to-poach-sam-altman/">Musk v. 
Altman, week 2</a></strong> | MIT Tech Review &#8212; Helen Toner testified the board discussed merging OpenAI with Anthropic during the Altman firing crisis. Zilis revealed Musk tried to poach Altman. Microsoft worried OpenAI would defect to Amazon and &#8220;shit-talk&#8221; Azure.</p></li><li><p><strong><a href="https://techcrunch.com/2026/05/09/nvidia-has-already-committed-40b-to-equity-ai-deals-this-year/">Nvidia committed $40B in equity AI investments in 2026</a></strong> | TechCrunch &#8212; The picks-and-shovels company is now one of the largest AI investors on earth.</p></li><li><p><strong><a href="https://openai.com/index/gpt-5-5-instant/">GPT-5.5 Instant is now the default ChatGPT model</a></strong> | OpenAI &#8212; 52.5% fewer hallucinations. First Instant model rated High in cybersecurity and bio preparedness.</p></li><li><p><strong><a href="https://www.anthropic.com/research/anthropic-institute-agenda">Anthropic launched The Anthropic Institute</a></strong> | Anthropic &#8212; Four research tracks: economic diffusion, threats and resilience, AI in the wild, and AI-driven R&amp;D. Four-month funded fellowships for external researchers.</p></li><li><p><strong><a href="https://www.crewai.com/blog">CrewAI shipped Discovery</a></strong> | CrewAI &#8212; Analyzes production logs and proposes specific automation workflows with expected ROI. Agents finding work for other agents.</p></li><li><p><strong><a href="https://techcrunch.com/2026/05/03/this-is-fine-creator-says-ai-startup-stole-his-art/">&#8220;This is Fine&#8221; creator says AI startup stole his art</a></strong> | TechCrunch &#8212; Artisan used the meme to advertise a product that replaces salespeople. 
The irony writes itself.</p></li><li><p><strong><a href="https://gizmodo.com/more-than-a-third-of-all-new-podcasts-are-ai-generated-2000753786">39% of new podcasts are likely AI-generated</a></strong> | Gizmodo &#8212; One company alone publishes 3,000 episodes per week.</p></li><li><p><strong><a href="https://openai.com/index/testing-ads-in-chatgpt/">OpenAI is testing ads in ChatGPT</a></strong> | OpenAI &#8212; Expanding to UK, Mexico, Brazil, Japan, South Korea. CPC bidding, Conversions API, agency partnerships with Dentsu and Omnicom.</p></li><li><p><strong><a href="https://techcrunch.com/2026/05/06/spacex-may-spend-up-to-119-billion-on-terafab-chip-factory-in-texas/">SpaceX plans a $55B AI chip fab in Texas</a></strong> | TechCrunch &#8212; Called Terafab, could scale to $119B. Musk building chip manufacturing while testifying he distilled OpenAI&#8217;s models.</p></li><li><p><strong><a href="https://venturebeat.com/technology/the-app-store-for-robots-has-arrived-hugging-face-launches-open-source-reachy-mini-app-store-with-200-apps">Hugging Face launched a robot app store</a></strong> | VentureBeat &#8212; 200+ community apps for Reachy Mini. Open-source robotics got its app store moment.</p></li><li><p><strong><a href="https://techcrunch.com/2026/03/09/yann-lecuns-ami-labs-raises-1-03-billion-to-build-world-models/">AMI Labs (Yann LeCun) closed a $1.03B round</a></strong> | TechCrunch &#8212; Europe&#8217;s largest seed round ever. 
Building world models, not LLMs.</p></li><li><p><strong><a href="https://simonwillison.net/">Simon Willison: vibe coding and agentic engineering have merged</a></strong> | Simon Willison &#8212; The guy who coined neither term says the distinction collapsed in his own practice.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Persistent Memory for Claude Managed Agents: What I Found After Three Days of Building]]></title><description><![CDATA[A hands-on review of Anthropic's persistent memory for Claude Managed Agents, including three sessions, one real failure, and the audit trail that recovered it.]]></description><link>https://www.anothercodingblog.com/p/persistent-memory-for-claude-agents</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/persistent-memory-for-claude-agents</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Thu, 07 May 2026 14:36:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5435ca5e-44e5-41f4-b4c7-012a71d24190_1448x1086.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>What I was trying to figure out</h2><p>A few weeks ago, Anthropic 
shipped something I&#8217;d been waiting for: persistent <strong>memory stores</strong> for <a href="https://platform.claude.com/docs/en/managed-agents/overview">Claude Managed Agents</a>. The pitch is that you get a versioned, FUSE-mounted file directory that an agent can read and write across sessions, so even when the session container is destroyed, the memory persists and is available the next time you start a session.</p><p>That sounded promising on paper, but I wanted to know what it actually feels like to use, what it costs, where it breaks, and whether the platform actually saves you when something goes wrong (because something always does in real systems).</p><p>So I spent a few days building with it: one agent, one persistent memory store, three sessions, a small inspector CLI, five charts, and about $0.40 in total API spend. Somewhere in the middle of all that, the agent destroyed almost 6KB of carefully-written notes in a single tool call, which turned out to be the most honest finding of the entire review and is where I want to start.</p><p>The platform&#8217;s immutable versioning let me recover the file byte-for-byte, with full attribution of which session caused the damage. Cross-session memory works as advertised, agents will sometimes get it wrong even when they&#8217;re trying to do the right thing, and the audit trail is the kind of feature you don&#8217;t really appreciate until you need it. Let me walk through how I got there.</p><div><hr></div><h2>The four building blocks</h2><p>Before we go any further, you need to understand the four building blocks Managed Agents is built on, because the architecture only really makes sense once you can keep them straight.</p><p><strong>Agent.</strong> A persisted, versioned config that holds your model selection, system prompt, tools, MCP servers, and skills. You create one and reuse it forever, and updating an agent produces a new immutable version that existing sessions can pin to. 
Agents are always permanent until you archive them, which means there&#8217;s no ephemeral mode.</p><p><strong>Environment.</strong> A template for the sandbox container an agent&#8217;s tools execute in. Persistent and reusable across agents, much like a Dockerfile that you point lots of services at.</p><p><strong>Session.</strong> A single run of an agent inside an environment, where the live action happens. You send messages and stream events back, and sessions are transient by design, so the container dies when the session ends.</p><p><strong>Memory store.</strong> A workspace-scoped, persistent file directory you can mount into a session, which survives across sessions and records every write with full audit metadata. The agent reads and writes through normal file tools rather than through some special &#8220;memory tool,&#8221; so it&#8217;s just files in a folder.</p><p>The architectural beat that took me longest to internalize is that agents and memory stores are independent resources: the agent has no <code>memory_store</code> field, the memory store has no <code>agent</code> field, and the two get glued together at session creation time, like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;77d6e3dc-3a05-4518-9f41-a3f327bf01b9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">session = client.beta.sessions.create(
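    agent=AGENT_ID,          # persisted, versioned agent config
    environment_id=ENV_ID,   # sandbox template the tools execute in
    resources=[
        # The memory store is attached here, per session, not on the agent.
        {"type": "memory_store", "memory_store_id": STORE_ID, "access": "read_write"}
    ],
)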
</code></pre></div><p>A few things are worth sitting with before we move on. The first is that memory in this system is just files, with no vector embeddings, no semantic search, and no automatic summarization happening behind the scenes; the agent uses <code>read</code>, <code>write</code>, <code>edit</code>, <code>glob</code>, <code>grep</code>, and <code>bash</code> exactly the way it would on any other filesystem. The second is that you&#8217;re paying for the harness around the model rather than the model itself: container provisioning, the event stream, the FUSE-mounted memory, immutable versioning, and the audit trail are what you&#8217;re actually getting, and if you don&#8217;t need that harness, the regular Messages API is the right tool for the job.</p><div><hr></div><h2>Setting things up</h2><p>There&#8217;s a clean way to work with Managed Agents that&#8217;s worth doing right from the start, which is splitting your project into a control plane (the persistent resources) and a data plane (the runtime code). Anthropic&#8217;s docs recommend this split, and after a few hours of building you&#8217;ll see why it matters.</p><p>The control plane is where your agents, environments, and memory stores live as static configs. You define them as YAML, version them in git like any other infrastructure, and apply them with Anthropic&#8217;s CLI by running something like <code>ant beta:agents create &lt; my-agent.yaml</code>. The CLI returns a stable resource ID, which is what your runtime code references for the lifetime of that resource.</p><p>The data plane is everything dynamic and per-task: sessions, events, memory operations, and anything else that happens during an actual run. 
This is where your application code lives, loading the resource IDs from <code>.env</code>, calling <code>client.beta.sessions.create(...)</code> with whatever parameters the current task needs, and streaming events back as the agent works.</p><p>The researcher agent itself is small enough to fit in a single YAML block:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;7c32dbf9-9147-4925-ae32-fcd0eaef36c3&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">name: researcher
model: claude-sonnet-4-6
system: |
  You are a careful, persistent research assistant.
  You have a research notebook mounted at /mnt/memory/research-notes/. Use it
  freely to store anything worth remembering across sessions. Organize the
  directory however makes sense to you.

  Some habits to keep:
  - Before researching a topic, check if you've already taken notes on it.
  - When you learn something new, write it down.
  - When updating an existing note, prefer surgical edits over full rewrites.
  - Cite sources for any factual claims.
tools:
  - type: agent_toolset_20260401
</code></pre></div><p>A few choices in there are worth flagging. I went with Sonnet 4.6 over Opus because it&#8217;s about three times cheaper and more than capable for this kind of work, and the prebuilt <code>agent_toolset_20260401</code> gives the agent <code>bash</code>, <code>read</code>, <code>write</code>, <code>edit</code>, <code>glob</code>, <code>grep</code>, <code>web_search</code>, and <code>web_fetch</code>, all of which execute server-side in the session container without me having to implement any of them. I deliberately gave the agent very little guidance on how to organize its memory directory, because I wanted to see what it would do unprompted.</p><p>The single most important line in that prompt is the first habit, &#8220;Before researching a topic, check if you&#8217;ve already taken notes on it.&#8221; Without it, cross-session memory remains theoretical, but with it the habit fires reliably and memory turns into something the agent actually uses rather than a feature it has access to but never reaches for.</p><p>The runtime script comes out to about 130 lines, most of which is event-stream handling. 
The substantive piece is mounting the memory store via the session&#8217;s <code>resources</code> array (shown above) and then opening the event stream before sending the kickoff message, because stream-first ordering matters here: events buffered before you connect arrive in a single batch instead of streaming in real time.</p><p>With all that in place, I ran three sessions against the same memory store, and those three sessions are the spine of this review.</p><div><hr></div><h2>Three sessions</h2><h3>Session 1: writing notes from scratch</h3><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;f8ff55a0-99f8-4cec-be66-c12be2265330&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">research_session.py "research CRDTs (Conflict-free Replicated Data Types) and take notes. Focus on what they are, the main families, and a few concrete examples. Cite sources."
</code></pre></div><p>What I wanted to see was what the agent would do if I gave it total freedom to organize its memory directory. Would it create folders? Topic subdirectories? One flat file? A nested hierarchy with cross-references?</p><p>The agent&#8217;s first action was a <code>bash</code> command running <code>rg</code> against <code>/mnt/memory/</code> to grep for prior notes, which means the &#8220;check first&#8221; instruction in the system prompt fired correctly even though there was nothing to find on this first run. It then issued two parallel <code>web_search</code> calls (which both returned <code>content: []</code>, more on that quirk later), composed its notes from training-data knowledge instead, and wrote a single 7,285-byte file to <code>/crdts.md</code> with a flat, well-organized markdown structure rather than a folder hierarchy.</p><p>The detail that surprised me most was the discovery aid the agent added without being asked: the very first line under the title was <code>*keywords: CRDT, conflict-free, replicated, distributed, state-based, operation-based, CvRDT, CmRDT*</code>, which the agent had clearly written for its future self to grep against. Nobody told it to write keyword tags, and it chose to do so on its own, which is the kind of thing that made me think Sonnet 4.6 has actual instincts about how file-based memory works.</p><p>This first session cost about $0.21.</p><h3>Session 2: recall</h3><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;c2b32142-aad6-4c68-a5c6-a5c662cc4acf&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">research_session.py "What do you know about CRDTs? Specifically the difference between state-based and operation-based, and a couple concrete examples."
</code></pre></div><p>The prompt for this one deliberately doesn&#8217;t mention memory, because I wanted to see whether the &#8220;check first&#8221; habit would fire unprompted, with the trigger being the agent&#8217;s own internal sense of &#8220;you have notes, you should know to look.&#8221;</p><p>It did, and the result was almost too clean: the first action was the same <code>bash</code>/<code>rg</code> over the memory directory, which found <code>/crdts.md</code>, and the agent then said &#8220;I have solid notes on this&#8221; and answered the question by synthesizing from its own past notes without running a single new web search or composing anything from scratch.</p><p>After the session ended, I ran the inspector against the store and found that the version history of <code>/crdts.md</code> still showed exactly one version, attributed to Session 1&#8217;s ID. Session 2&#8217;s session ID does not appear anywhere in the audit log, because Session 2 only read from the store and never wrote to it. 
That&#8217;s the claim, made falsifiable: reads do not create memory versions.</p><p>The cost worked out to about $0.04, which is roughly five times cheaper than Session 1 and demonstrates pretty clearly that memory turns one expensive session into many cheap ones:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!El2E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!El2E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!El2E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!El2E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!El2E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!El2E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png" width="1200" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43288,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/196778476?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!El2E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!El2E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!El2E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!El2E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76ee2bf4-326c-434d-8aca-8c5ff4712d48_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If you&#8217;re worried about the cost of using memory at scale, this matters: persistent memory is a feature rather than a tax, because the agent reads its own notes and skips the work it already did instead of recomputing everything from scratch every time.</p><h3>Session 3: modify</h3><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;1b0946db-83f6-4795-9b3b-88da433aa797&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">research_session.py "Update your CRDT notes. Add a note about RGA (Replicated Growable Array)..."
</code></pre></div><p>This was supposed to be the cleanest of the three sessions: a small, surgical edit producing a second version of <code>/crdts.md</code> with an <code>operation: modified</code> entry in the audit log. That&#8217;s not what happened.</p><div><hr></div><h2>Where this got interesting</h2><p>The actual sequence of events from Session 3 is worth walking through layer by layer, because the failure mode is more interesting than a single bug.</p><h3>Layer 1: the model wrote a buggy <code>bash</code> command</h3><p>The agent&#8217;s check-first command was the following:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;bash&quot;,&quot;nodeId&quot;:&quot;70e708f5-c822-49e2-aae8-b1b31dcecdd1&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-bash">rg -i 'crdt\\\\|sequence\\\\|rga\\\\|replicated growable' /mnt/memory/research-notes/ -l
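# For comparison only (a hedged sketch, not something the agent ran): in the
# default ripgrep regex syntax an unescaped | separates alternatives, so this
# form lists files containing any one of the four terms.
rg -i 'crdt|sequence|rga|replicated growable' /mnt/memory/research-notes/ -l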
</code></pre></div><p>The <code>\\\\|</code> in that regex was meant as alternation, but ripgrep&#8217;s default regex syntax only treats a bare <code>|</code> as alternation: an escaped pipe is a literal <code>|</code> character, an escaped backslash is a literal backslash, and bash passes everything inside single quotes through unchanged, so nothing undid the escaping on the way in. Instead of matching any one of the four terms, the pattern required literal backslash or pipe characters after <code>crdt</code>, <code>sequence</code>, and <code>rga</code>, which would never match anything in the actual notes file. Ripgrep returned no matches and exited with a non-zero status code, which is the correct behavior for &#8220;I found nothing.&#8221;</p><p>The model&#8217;s shell escaping is right almost every time, but the cases where it isn&#8217;t tend to be subtle, and this one happened to be load-bearing.</p><h3>Layer 2: the platform correctly flagged the failure</h3><p>The harness ran the command and produced a <code>tool_result</code> event with <code>is_error: true</code> and <code>(no output)</code> as the content, which is exactly what should have happened given that the command exited non-zero. The platform did its job here and explicitly told the agent loop that the command had failed.</p><h3>Layer 3: the model ignored the error flag</h3><p>The agent&#8217;s next message after that error result was, &#8220;The memory store is empty, no prior CRDT notes.&#8221; That statement was false, because <code>/crdts.md</code> had been sitting in the store for two days at that point, but the agent treated the empty output from the failed command as a meaningful answer rather than as a failure signal that needed re-investigation.</p><p>This is the most interesting failure layer to me, because the platform got it right and the model got it wrong. 
Defense in depth is a useful framing for what&#8217;s happening: even when the audit trail and error flags are working as designed, the model&#8217;s reasoning about its own tool outputs is the layer that has to hold, and that layer is reasoning rather than infrastructure.</p><h3>Layer 4: the destructive action</h3><p>Believing the store was empty, the agent called <code>write</code> rather than <code>edit</code>, generating a fresh ~1,500-byte RGA-only file from scratch and writing it directly to <code>/crdts.md</code>. The original 7,285-byte file with all of the careful notes from Session 1 was overwritten in a single operation.</p><p>I didn&#8217;t even notice this had happened until I ran the inspector, because from the script&#8217;s perspective Session 3 looked like a normal run; the agent reported back that it had updated the notes and cited the RGA paper, kindly and unintentionally lying because the underlying belief was wrong.</p><h3>What the audit log showed</h3><p>Running <code>inspector log /crdts.md</code> after Session 3 surfaced two versions:</p><pre><code><code>version  memver_0169b&#8230;  modified  session_actor (Session 3)   1509 bytes
version  memver_01A7Z&#8230;  created   session_actor (Session 1)   7285 bytes
</code></code></pre><p>The size dropping from 7,285 bytes to 1,509 bytes is the catastrophe made visible, but the more important fact is that the original is still here, addressable by ID and retrievable in full content via the API, even though the head of the file is now the smaller broken version.</p><p>The diff between the two versions, generated by the inspector&#8217;s <code>diff</code> subcommand, made the loss concrete:</p><pre><code><code>--- memver_01A7Z&#8230; (/crdts.md, 7285B, created)
+++ memver_0169b&#8230; (/crdts.md, 1509B, modified)
@@ -1,122 +1,21 @@
-# CRDTs: Conflict-free Replicated Data Types
-*keywords: CRDT, conflict-free, replicated, ...*
-## What They Are
-CRDTs are data structures designed to be replicated across multiple nodes...
-(... 121 more deletion lines ...)
+# CRDT Research Notes
+## Sequences / Text CRDTs
+### RGA (Replicated Growable Array)
</code></code></pre><p>About 5,800 bytes of careful work disappeared in a single action by an agent that believed it was creating a brand-new file, including the state-based versus operation-based section, the G-Counter and OR-Set examples, the math foundation, and the entire sources block at the bottom.</p><h3>How I got it back</h3><p>On a flat filesystem with no versioning, this would have been the end of the story and the original content would simply be gone. It wasn&#8217;t, because the platform&#8217;s audit log was holding the original verbatim.</p><p>I added a <code>restore</code> subcommand to the inspector that fetches a chosen historical version&#8217;s content and writes it back as the new head via <code>memory_stores.memories.update(memory_id, content=old_content)</code>. Anthropic&#8217;s API records that update as a new version rather than overwriting history, which means the recovery itself becomes part of the audit trail.</p><p>After running the restore, <code>inspector log /crdts.md</code> showed three versions, and the entire arc was right there in the output:</p><pre><code><code>memver_01EKK&#8230;  modified  api_actor (apikey_&#8230;)         7285 B   sha 3f3ec0d2&#8230;  &#8592; matches v1
memver_0169b&#8230;  modified  session_actor (Session 3)    1509 B   sha 7356ce60&#8230;  &#8592; catastrophe
memver_01A7Z&#8230;  created   session_actor (Session 1)    7285 B   sha 3f3ec0d2&#8230;  &#8592; original
</code></code></pre><p>A few details in that output are worth more than they look at first glance. The platform distinguishes operator-side mutations (recorded as <code>api_actor</code> with an <code>apikey_</code> ID) from agent-side ones (recorded as <code>session_actor</code> with a <code>sesn_</code> ID), which makes &#8220;who did this&#8221; forensics actually possible rather than something you&#8217;d have to retrofit yourself. The SHA-256 hash on the restored version matches the original exactly, so the recovery is byte-identical and verifiable rather than approximately right. And the catastrophe (v2) stays in the audit log forever, because recovery doesn&#8217;t erase the record; if you wanted v2&#8217;s content out of the log entirely, you&#8217;d use the <code>redact</code> endpoint, which clears the content while preserving all of the metadata.</p><p>The same story renders cleanly as a chart:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6cfU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6cfU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!6cfU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png 848w, 
https://substackcdn.com/image/fetch/$s_!6cfU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!6cfU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6cfU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png" width="1200" height="600" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50388,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/196778476?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6cfU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png 424w, 
https://substackcdn.com/image/fetch/$s_!6cfU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!6cfU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!6cfU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F135d1311-def9-48d9-bd55-1f94a4c74451_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The cliff and the recovery are immediately legible: 7,285 bytes, plunge to 1,509, return to 7,285, all in three points and one chart that captures the full narrative.</p><p>This is the section of the post I&#8217;d stake my credibility on. Cross-session memory works, agents will sometimes get it wrong, and the platform&#8217;s audit trail is the thing that saves you when they do.</p><div><hr></div><h2>Important Considerations</h2><p>Building with Managed Agents memory turned up more rough edges than I expected, none of which are dealbreakers but all of which are worth knowing about before you commit to the platform.</p><ul><li><p><strong>Resource IDs need to be persisted yourself.</strong> Every call to <code>agents.create()</code>, <code>environments.create()</code>, or <code>memory_stores.create()</code> returns an opaque ID that your runtime code has to look up later, which is standard cloud-API ceremony but missing some of the friction-reducers other platforms have shipped: agent and environment names aren&#8217;t unique within an account, there&#8217;s no idempotent <code>create_or_update</code>, and there&#8217;s no Terraform provider yet, so you end up doing the capture-and-paste-into-<code>.env</code> dance manually.</p></li><li><p><strong>Memory store </strong><code>description</code><strong> must be single-line.</strong> The API rejects any control character, including newlines, with a cryptic regex error, which is inconsistent with agent system prompts that are explicitly multi-line up to 100K chars. 
It&#8217;s easy to fix once you know about it.</p></li><li><p><strong>Memory paths are store-relative rather than mount-relative.</strong> When the agent writes to <code>/mnt/memory/research-notes/crdts.md</code> inside the container, the API stores the file at <code>/crdts.md</code> and treats the mount-path prefix as a runtime detail, so when you list or retrieve memories host-side you reference the relative path rather than the full container path.</p></li><li><p><strong>Web search results are hidden from the event stream.</strong> When the agent runs <code>web_search</code>, the resulting <code>agent.tool_result.content</code> field is an empty array even when the search clearly succeeded (the agent uses the results downstream to give a correct answer). The model gets the actual search content internally, but the public event surface gets a sanitized empty array, which is almost certainly intentional for IP and copyright reasons but means you cannot log &#8220;what URLs the agent consulted&#8221; without asking the agent to cite them in its outputs.</p></li><li><p><strong>Agent-generated </strong><code>bash</code><strong> invocations aren&#8217;t always well-formed.</strong> The escaping bug that triggered Session 3&#8217;s catastrophe is one example, and defensive system-prompt phrasing helps but doesn&#8217;t eliminate the problem entirely.</p></li><li><p><code>memory_versions.retrieve(version_id, ...)</code><strong> takes the version ID positionally only.</strong> Calling it as <code>retrieve(version_id=...)</code> raises <code>TypeError</code>, even though <code>memories.retrieve(memory_id=..., ...)</code> accepts the keyword form, which is an inconsistency within the same SDK namespace.</p></li><li><p><strong>The streaming method lives at </strong><code>client.beta.sessions.events.stream(...)</code><strong>,</strong> not <code>client.beta.sessions.stream(...)</code> as some doc snippets imply. 
The latter form doesn&#8217;t exist and will fail at runtime.</p></li><li><p><strong>Print buffering kills real-time observability.</strong> When you run a Python session script in the background or through subprocess, stdout usually isn&#8217;t attached to a terminal, so Python block-buffers it and the script appears to do nothing for minutes and then dumps everything when the agent finishes. The fix is either passing <code>flush=True</code> to <code>print</code> or running the script under <code>python -u</code>.</p></li><li><p><strong>Subscription auth doesn&#8217;t apply to Managed Agents.</strong> API key authentication with per-token billing is the only path, so a Claude Pro or Max subscription doesn&#8217;t help you here even though it works for Claude Code.</p></li></ul><div><hr></div><h2>So when does this make sense?</h2><p>Managed Agents is a deliberately persistent, server-managed harness, so the right question to ask isn&#8217;t &#8220;is it good?&#8221; but &#8220;is the persistent harness shape what my problem actually wants?&#8221;</p><table><thead><tr><th>Use case</th><th>Reach for&#8230;</th></tr></thead><tbody><tr><td>One-shot Claude call (classify, extract, summarize)</td><td>Messages API</td></tr><tr><td>Multi-turn conversation, your code holds the state</td><td>Messages API</td></tr><tr><td>Multi-step pipeline you orchestrate yourself</td><td>Messages API + tool use</td></tr><tr><td>Persistent agent reused across sessions/users with managed sandbox</td><td><strong>Managed Agents</strong></td></tr><tr><td>Long-running task with memory across sessions</td><td><strong>Managed Agents + memory store</strong></td></tr><tr><td>Anything requiring a non-Claude model</td><td>Roll your own</td></tr></tbody></table><p>A useful rule of thumb is that if your code calls <code>agents.create()</code> more than once for the &#8220;same&#8221; agent, you&#8217;re using the wrong tool. Agents are persistent, versioned configs that you create once and reference forever, so treating Managed Agents like a fancy Messages API and creating agents per request is fighting the platform&#8217;s whole design.</p><p>Now, what about cost? 
Across all three sessions plus a smoke-test, my total API spend came out to about $0.37, which includes a substantial 7KB notes write, a recall session that exercised the cache heavily, a destructive overwrite, and an operator-side restore.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bVio!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bVio!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png 424w, https://substackcdn.com/image/fetch/$s_!bVio!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png 848w, https://substackcdn.com/image/fetch/$s_!bVio!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png 1272w, https://substackcdn.com/image/fetch/$s_!bVio!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bVio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png" width="1200" height="480" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53103,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/196778476?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bVio!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png 424w, https://substackcdn.com/image/fetch/$s_!bVio!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png 848w, https://substackcdn.com/image/fetch/$s_!bVio!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png 1272w, https://substackcdn.com/image/fetch/$s_!bVio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F424870f2-691a-4d67-bf60-b01ceba0ef85_1200x480.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Memory store doesn&#8217;t measurably move the cost needle, because the agent loop and the model itself are where the spend lives. Sonnet 4.6 with aggressive caching is genuinely affordable for any individual or small team use case, and the platform handles caching for you without any configuration.</p><div><hr></div><h2>What I didn&#8217;t get to (yet)</h2><p>A few features deserve more than a passing mention but didn&#8217;t fit the failure-recovery spine of this post:</p><ul><li><p><strong>Multi-store sessions and the multi-tenant pattern.</strong> A session can mount up to eight memory stores at once, and the natural pattern for a SaaS-shaped application is one shared read-only &#8220;house knowledge&#8221; store plus one read-write per-user store, with the agent definition the same for everyone. 
Access modes are enforced at the FUSE filesystem level, so <code>read_only</code> is real OS-level enforcement rather than a polite request from the model. This is big enough that I&#8217;m planning to cover it in its own follow-up post.</p></li><li><p><strong>Optimistic concurrency via preconditions.</strong> The <code>update</code> endpoint accepts a <code>precondition: {type: "content_sha256", ...}</code> field, and if the file&#8217;s current SHA doesn&#8217;t match the one you supplied, the API returns a 409 conflict. This is exactly the safety net Session 3&#8217;s agent didn&#8217;t use and the kind of thing that should probably be standard practice for any read-modify-write flow.</p></li><li><p><strong>Redaction.</strong> The <code>memory_versions.redact(version_id)</code> endpoint clears a historical version&#8217;s content while preserving all of the metadata around it, which is useful when a bad version contained PII or leaked secrets and you want them out of the audit log without losing the record that something existed there.</p></li><li><p><strong>MCP server integration.</strong> An agent can declare MCP servers (GitHub, Linear, Notion, and others), the session attaches a vault containing the credentials, and authentication is auto-refreshed by the platform. Pairing memory store with MCP, like a research agent that pulls from your Notion and writes findings to persistent memory, is one of the strongest use cases I can imagine for the platform overall.</p></li></ul><div><hr></div><h2>So... should you use this?</h2><p>If you&#8217;re sitting on the fence about whether to use Managed Agents memory, the answer is yes, with eyes open. The platform is real, the harness around the model is genuinely valuable, and the audit trail is the kind of feature you don&#8217;t appreciate until you need it, which in my case happened on the third session of the third day of building.</p><p>A few practical takeaways for anyone planning to build on this. 
Use preconditions whenever you can, especially for any flow that does a read-modify-write on the same memory file, because they&#8217;re the safety net that Session 3&#8217;s agent didn&#8217;t have. Build a small amount of host-side observability tooling, because even a 200-line inspector script is enough to catch problems your agent won&#8217;t tell you about. And know which side of the decision rubric your use case falls on before you commit, because Managed Agents is a great tool for the right shape of problem and the wrong tool for one-shot calls or anything that doesn&#8217;t benefit from persistence.</p><p>What do you think? Have you tried building with this yet? I&#8217;d love to hear what your experience has been.</p><div><hr></div><p><em>Full code from the demo (agent YAMLs, runtime scripts, inspector CLI, monitoring charts) is at <a href="https://github.com/taylor-ortiz/claude-memory-managed-agents/blob/main/README.md">https://github.com/taylor-ortiz/claude-memory-managed-agents/blob/main/README.md</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 70]]></title><description><![CDATA[&#8220;You can&#8217;t just steal a charity.&#8221; Elon Musk spent three days on the stand trying to prove it.]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-c48</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-c48</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Sun, 03 May 2026 13:21:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f42054a7-25a7-4ba3-9ae0-e3bdf5e129bf_1448x1086.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VSBj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VSBj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png 424w, 
https://substackcdn.com/image/fetch/$s_!VSBj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png 848w, https://substackcdn.com/image/fetch/$s_!VSBj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png 1272w, https://substackcdn.com/image/fetch/$s_!VSBj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VSBj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png" width="1456" height="2281" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2281,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:892417,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/196304743?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!VSBj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png 424w, https://substackcdn.com/image/fetch/$s_!VSBj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png 848w, https://substackcdn.com/image/fetch/$s_!VSBj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png 1272w, https://substackcdn.com/image/fetch/$s_!VSBj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67334af6-6538-4351-bf0b-812face8cf9c_2400x3760.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>&#8220;You can&#8217;t just steal a charity.&#8221; Elon Musk spent three days on the stand trying to prove it.</h2><p>The Musk v. OpenAI trial opened in Oakland federal court. </p><ul><li><p><strong>The context:</strong> Musk contributed <a href="https://www.npr.org/2026/04/28/nx-s1-5801438/musk-altman-openai-trial-opening-statements">$38 million</a> to found OpenAI as a nonprofit and alleges Altman and Brockman looted it by converting to a for-profit. He&#8217;s seeking $150 billion in damages and their removal from leadership. If he wins, it could block OpenAI&#8217;s planned IPO at a ~$1 trillion valuation.</p></li><li><p><strong>The distillation admission:</strong> Under cross-examination, Musk admitted xAI <a href="https://techcrunch.com/2026/04/30/elon-musk-testifies-that-xai-trained-grok-on-openai-models/">&#8220;partly&#8221;</a> used OpenAI&#8217;s models to train Grok, drawing audible gasps in the courtroom. 
He called it &#8220;standard practice.&#8221;</p></li><li><p><strong>The industry reacted:</strong> <a href="https://x.com/ylecun/status/2050039348679024779">LeCun retweeted Cl&#233;ment Delangue</a> calling restrictions on distillation &#8220;pulling the ladder.&#8221; <a href="https://x.com/natolambert/status/2049974505343488171">Lambert noted</a> American companies distill Chinese open models just as freely, and <a href="https://x.com/natolambert/status/2049996372938793194">questioned why OpenAI doesn&#8217;t just revoke contracts</a> from violators like they did with ByteDance.</p></li><li><p><strong>OpenAI&#8217;s counter-narrative:</strong> Attorney Savitt <a href="https://www.cnn.com/2026/04/29/business/takeaways-elon-musk-sam-altman-openai-trial">argued</a> Musk wanted majority control, pitched Tesla acquiring OpenAI, and only sued after founding xAI. Emails showed him <a href="https://gizmodo.com/everything-you-missed-from-elon-musks-testimony-in-the-openai-trial-2000753364">poaching OpenAI researchers</a> while still on the board.</p></li><li><p><strong>The cross-examination was rough:</strong> Musk told the jury <a href="https://www.theverge.com/tech/921022/elon-musk-cross-openai-altman">&#8220;I don&#8217;t lose my temper&#8221;</a> then raised his voice minutes later. The Verge&#8217;s summary: <a href="https://www.theverge.com/ai-artificial-intelligence/920191/elon-musk-sam-altman-trial-day-one">&#8220;more petty than prepared.&#8221;</a> Texts revealed <a href="https://gizmodo.com/everything-you-missed-from-elon-musks-testimony-in-the-openai-trial-2000753364">Shivon Zilis asked Musk</a> whether to &#8220;stay close and friendly to OpenAI to keep info flowing&#8221; after his departure.</p></li><li><p><strong>What&#8217;s next:</strong> The judge expressed skepticism about both sides&#8217; safety claims. 
Altman and Brockman testify in the coming weeks.</p></li></ul><div><hr></div><h2><strong>$900 billion valuation, 50% less sycophancy, and connectors for every creative tool you use.</strong></h2><p>Anthropic had one of those weeks where the breadth of activity tells the story.</p><ul><li><p><strong>The valuation:</strong> Reportedly <a href="https://techcrunch.com/2026/04/29/sources-anthropic-could-raise-a-new-50b-round-at-a-valuation-of-900b/">raising $50 billion at a $900 billion valuation</a>, a number that rivals established tech giants.</p></li><li><p><strong>The sycophancy research:</strong> <a href="https://www.anthropic.com/research/claude-personal-guidance">Analyzed 1 million Claude conversations</a>, found a 9% sycophancy rate (25% in relationship discussions), built synthetic training scenarios from real failure cases, and cut sycophancy roughly 50% in Opus 4.7 and Mythos Preview. One of the most transparent published alignment efforts to date.</p></li><li><p><strong>BioMysteryBench:</strong> Claude <a href="https://www.anthropic.com/research/Evaluating-Claude-For-Bioinformatics-With-BioMysteryBench">solved roughly 30% of 23 bioinformatics problems</a> that stumped a human expert panel.</p></li><li><p><strong>Claude for Creative Work:</strong> Shipped <a href="https://www.anthropic.com/news/claude-for-creative-work">connectors for Adobe Creative Cloud, Blender, Ableton, Canva, Affinity, SketchUp, Splice, and Resolume</a>, and joined the Blender Development Fund as a patron.</p></li><li><p><strong>Claude Security:</strong> Launched <a href="https://www.anthropic.com/news/claude-code-security">codebase vulnerability scanning</a> in public beta for Enterprise customers.</p></li><li><p><strong>Meanwhile, at the Senate:</strong> Defense Secretary Hegseth <a href="https://www.msn.com/en-us/news/technology/hegseth-calls-anthropic-ceo-a-lunatic-defends-pentagon-ai-use/ar-AA227pKG">called CEO Dario Amodei an &#8220;ideological lunatic&#8221;</a> at an Armed 
Services Committee hearing.</p></li></ul><div><hr></div><h2><strong>OpenAI ended its Microsoft exclusivity and went multi-cloud.</strong></h2><p>OpenAI restructured its Microsoft deal, launched on AWS, and shipped a wave of Codex upgrades all in the same week.</p><ul><li><p><strong>The exclusivity is over:</strong> Microsoft <a href="https://openai.com/index/next-phase-of-microsoft-partnership/">ended its exclusive license</a> to OpenAI&#8217;s technology. OpenAI can now sell on AWS and Google Cloud through 2032.</p></li><li><p><strong>AWS moved immediately:</strong> Amazon <a href="https://openai.com/index/openai-on-aws/">began offering OpenAI models, Codex, and Managed Agents</a> on AWS. Day-zero availability.</p></li><li><p><strong>The AGI clause is dead:</strong> Simon Willison <a href="https://simonwillison.net/2026/Apr/27/now-deceased-agi-clause/">tracked the history</a> of the clause that would have let OpenAI walk away from Microsoft once AGI was declared. It&#8217;s gone. OpenAI traded its theoretical nuclear option for commercial freedom now.</p></li><li><p><strong>The product push:</strong> Altman said Codex is <a href="https://x.com/sama/status/2049493609028923826">&#8220;having a ChatGPT moment&#8221;</a>. Brockman said the <a href="https://x.com/sama/status/2049493182866747765">Codex app replaced his terminal</a> as his primary computer interface. OpenAI is treating Codex as a flagship product launch, not a side feature.</p></li><li><p><strong>Nadella&#8217;s take:</strong> Microsoft gets royalty-free access to OpenAI&#8217;s frontier models through 2032, no longer pays OpenAI for them, and OpenAI is committed to buying <a href="https://techcrunch.com/2026/04/29/satya-nadella-says-hes-ready-to-exploit-the-new-openai-deal/">$250 billion in Azure</a>. Nadella told analysts he &#8220;fully plan[s] to exploit it.&#8221;</p></li></ul><div><hr></div><h2><strong>Most cloud providers beat earnings. 
OpenAI missed.</strong></h2><p>The hyperscalers are spending record amounts on AI infrastructure and seeing record returns. Meanwhile, the Wall Street Journal <a href="https://sherwood.news/markets/openai-linked-stocks-suffer-after-wsj-reports-that-the-company-has-missed-key-revenue-and-user-targets/">reported</a> that OpenAI missed revenue and user growth targets, with Anthropic and Gemini cited as gaining ground.</p><ul><li><p><strong>The cloud numbers:</strong> <a href="https://techcrunch.com/2026/04/29/google-cloud-surpasses-20b-but-says-growth-was-capacity-constrained/">Google Cloud surpassed $20 billion</a> but said growth was capacity-constrained. <a href="https://techcrunch.com/2026/04/29/amazons-cloud-business-is-surging-and-so-is-its-capital-spending/">AWS surged on AI demand</a>. Microsoft disclosed a <a href="https://www.geekwire.com/2026/microsoft-tops-wall-street-expectations-reports-accelerating-azure-growth-and-37b-ai-run-rate/">$37 billion AI revenue run rate</a> (up 123% YoY), <a href="https://techcrunch.com/2026/04/29/microsoft-says-it-has-over-20m-paid-copilot-users-and-they-really-are-using-it/">20 million paid Copilot users</a>, and set calendar-year CapEx at $190 billion.</p></li><li><p><strong>The supply chain is feeling it:</strong> <a href="https://www.sammobile.com/2026/04/30/samsung-q1-2026-profit-hits-record-high-ai-chip-boom/">Samsung chip profits jumped nearly 50-fold</a> on AI memory demand. 
Their executive: &#8220;our supply falls far short of customer demand.&#8221; The shortage is expected to <a href="https://www.sammobile.com/2026/04/30/samsung-q1-2026-profit-hits-record-high-ai-chip-boom/">widen further in 2027</a>.</p></li><li><p><strong>Meta is the most interesting story:</strong> Raised its CapEx forecast, then <a href="https://www.reuters.com/business/world-at-work/meta-ceo-attributes-layoffs-plan-capex-wont-rule-out-further-job-cuts-2026-04-30/">Zuckerberg blamed layoffs on capital spending</a> and wouldn&#8217;t rule out more cuts, then raised <a href="https://www.reuters.com/business/meta-looks-raise-up-25-billion-with-bond-sale-bloomberg-news-reports-2026-04-30/">$25 billion in bonds</a> to fund the AI buildout. Cutting people to buy GPUs, then borrowing to buy more.</p></li><li><p><strong>The counterpoint nobody expected:</strong> <a href="https://www.theverge.com/tech/920815/google-alphabet-q1-2026-earnings-sundar-pichai">Google Search queries hit an all-time high</a>. <a href="https://techcrunch.com/2026/04/30/apple-was-surprised-by-ai-driven-demand-for-macs/">Apple was surprised by AI-driven Mac demand</a>. The &#8220;AI kills search&#8221; and &#8220;AI doesn&#8217;t need hardware&#8221; narratives both took a hit.</p></li><li><p><strong>But the utilization story:</strong> Cast AI <a href="https://venturebeat.com/infrastructure/fomo-is-why-enterprises-pay-for-gpus-they-dont-use-and-why-prices-keep-climbing/">measured tens of thousands of production Kubernetes clusters</a> and found GPU utilization averaging 5%. 
Teams lock in multi-year commitments the moment allocation comes through, then won&#8217;t release idle capacity because reacquiring takes months.</p></li></ul><div><hr></div><h2><strong>&#11088; Featured: Symphony turns your issue tracker into an autonomous coding fleet</strong></h2><p>OpenAI released <a href="https://openai.com/index/open-source-codex-orchestration-symphony/">Symphony</a>, an open-source spec that turns Linear boards into control planes for Codex agents. Every open task gets an agent. Agents run continuously. Humans review the results.</p><p>The origin story matters: an OpenAI team decided to build their entire repo with zero human-written code. They documented how in a <a href="https://openai.com/index/harness-engineering/">harness engineering post</a>: a million lines of code, 1,500 merged PRs, 3.5 PRs per engineer per day, with Codex running six-hour autonomous sessions while engineers slept and reviewing its own code agent-to-agent. But they hit a new ceiling: human attention. Engineers could manage three to five Codex sessions before context switching killed productivity. They had &#8220;built a team of extremely capable junior engineers, then assigned our human engineers to micromanaging them.&#8221;</p><p>So they flipped the model. Instead of engineers managing coding sessions, they made the issue tracker the orchestrator. Each open Linear issue maps to a dedicated agent workspace. 
Symphony continuously polls the board, picks up new work, restarts agents that crash or stall, watches CI, rebases when needed, resolves conflicts, and shepherds changes through the pipeline.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hhTj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hhTj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png 424w, https://substackcdn.com/image/fetch/$s_!hhTj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png 848w, https://substackcdn.com/image/fetch/$s_!hhTj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png 1272w, https://substackcdn.com/image/fetch/$s_!hhTj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hhTj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png" width="1456" height="619" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:619,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161357,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/196304743?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hhTj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png 424w, https://substackcdn.com/image/fetch/$s_!hhTj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png 848w, https://substackcdn.com/image/fetch/$s_!hhTj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png 1272w, https://substackcdn.com/image/fetch/$s_!hhTj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F61b8e1c9-cc56-448a-97bd-d990d1a15b3f_1690x718.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Once work is abstracted to the ticket level, agents can break large tasks into dependency trees, only starting work on tasks that aren&#8217;t blocked. They also create their own follow-up tickets when they spot issues outside the current scope. One engineer on the team made three significant changes from the Linear app on his phone while at a cabin with bad Wi-Fi.</p><p>The results: a 500% increase in landed PRs on some teams in three weeks. But the deeper shift is behavioral. When the perceived cost of each code change drops to near zero, teams start filing speculative tasks. Try an idea, explore a refactor, test a hypothesis, keep only what works. 
Product managers and designers can file feature requests directly into Symphony and get back a review packet with a video walkthrough of the feature running in the real product.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pw11!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pw11!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png 424w, https://substackcdn.com/image/fetch/$s_!Pw11!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png 848w, https://substackcdn.com/image/fetch/$s_!Pw11!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png 1272w, https://substackcdn.com/image/fetch/$s_!Pw11!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pw11!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png" width="1456" height="790" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:790,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:333327,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/196304743?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pw11!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png 424w, https://substackcdn.com/image/fetch/$s_!Pw11!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png 848w, https://substackcdn.com/image/fetch/$s_!Pw11!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png 1272w, https://substackcdn.com/image/fetch/$s_!Pw11!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c78280d-39b8-4f86-babf-f1474c90a47d_1988x1078.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The technical choices are worth noting. The reference implementation is in Elixir, chosen for its concurrency primitives. With v1.1.0, Symphony supports the Kata CLI as an alternative runtime, meaning you can run Claude Code, Gemini, or other models inside the same orchestration framework. Symphony is technically just a <code>SPEC.md</code> file: a definition of the problem and the intended solution, not a product. OpenAI gave agents objectives instead of strict state transitions, &#8220;much like a good manager would assign a goal to a direct report.&#8221;</p><p><strong>What to watch for:</strong> Symphony is one of several orchestration plays that landed this same week. <a href="https://x.com/cursor_ai/status/2049499866217185492">Cursor released an SDK</a> letting companies like Rippling and Notion embed background agents. 
<a href="https://venturebeat.com/orchestration/ibm-launches-bob-with-multi-model-routing-and-human-checkpoints-to-turn-ai-coding-into-a-secure-production-system/">IBM launched Bob</a> with human-checkpoint governance. <a href="https://venturebeat.com/technology/mistral-ai-launches-workflows-a-temporal-powered-orchestration-engine-already-running-millions-of-daily-executions/">Mistral shipped Workflows</a> running millions of daily executions. <a href="https://blog.n8n.io/n8n-mcp-server/">n8n shipped an MCP server</a> so Claude can build automation workflows through conversation. The competitive moat is shifting from &#8220;best coding model&#8221; to &#8220;best orchestration spec.&#8221; If you run a team that ships code, start here.</p><div><hr></div><h2><strong>Worth a Listen</strong></h2><div id="youtube2-9-TVwv6wtGQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;9-TVwv6wtGQ&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/9-TVwv6wtGQ?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>OpenAI researchers Sebastian Bubeck and Ernest Ryu on the OpenAI podcast.</p><ul><li><p><strong>The 42-year-old problem:</strong> A researcher spent 40+ hours failing without AI, then solved it with ChatGPT in 12 hours across three evenings.</p></li><li><p><strong>The Erd&#337;s problems:</strong> 10+ completely new, publishable solutions to decades-old open problems. Fully original proofs, not literature searches.</p></li><li><p><strong>AGI time:</strong> Bubeck&#8217;s framework. Four years ago, models could think for seconds. Now it&#8217;s days. 
The goal is weeks, then months.</p></li><li><p><strong>The warning:</strong> Non-mathematicians are producing pages of AI-generated proofs that turn out to be wrong. The models accelerate experts; they don&#8217;t replace them.</p></li></ul><div><hr></div><h2><strong>Quick Hits</strong></h2><ul><li><p><strong><a href="https://venturebeat.com/technology/why-openais-goblin-problem-matters-and-how-you-can-release-the-goblins-on-your-own">GPT-5.1&#8217;s goblin problem</a></strong> | VentureBeat &#8212; A &#8220;Nerdy personality&#8221; training signal accidentally over-rewarded goblin-adjacent language. OpenAI diagnosed it with Codex, fixed it, then threw a party. The Codex system prompt literally says <a href="https://simonwillison.net/2026/Apr/28/openai-codex/">&#8220;never discuss goblins, gremlins, raccoons, trolls, ogres, pigeons, or similar creatures.&#8221;</a></p></li><li><p><strong><a href="https://www.digitaltrends.com/movies/academy-just-said-it-out-loud-ai-cant-win-an-oscar-for-acting-and-writing/">The Academy ruled AI can&#8217;t win an Oscar</a></strong> | Digital Trends &#8212; Performances must be &#8220;demonstrably performed by humans with their consent.&#8221; Finally, a benchmark AI can&#8217;t game.</p></li><li><p><strong><a href="https://x.ai/news/grok-custom-voices">xAI launched Custom Voices</a></strong> | xAI &#8212; Clone your voice from 2 minutes of audio, 80+ preinstalled voices, 28 languages, speaker verification built in. Dropped alongside Grok 4.3 at aggressive pricing.</p></li><li><p><strong><a href="https://techcrunch.com/2026/04/30/stripe-link-digital-wallet-ai-agents-shopping/">Stripe Link now supports AI agents</a></strong> | TechCrunch &#8212; A digital wallet that autonomous agents can use for payments. 
AI just got its own financial infrastructure.</p></li><li><p><strong><a href="https://www.reuters.com/legal/litigation/taylor-swift-files-trademark-her-voice-likeness-ward-off-ai-deepfakes-2026-04-27/">Taylor Swift trademarked her voice against AI</a></strong> | Reuters &#8212; Filed new trademarks for her voice and likeness. The legal playbook for protecting creative identity from AI is being written in real time.</p></li><li><p><strong><a href="https://simonwillison.net/2026/Apr/30/zig-anti-ai/">Zig bans all LLM contributions</a></strong> | Simon Willison &#8212; Bun (acquired by Anthropic) achieved a 4x Zig compilation improvement it cannot upstream because of the ban. When your open-source policy blocks a 4x speedup, that&#8217;s a policy worth debating.</p></li><li><p><strong><a href="https://techcrunch.com/2026/04/30/after-dissing-anthropic-for-limiting-mythos-openai-restricts-access-to-cyber-too/">OpenAI restricted its Cyber model</a></strong> | TechCrunch &#8212; After publicly criticizing Anthropic for limiting Mythos access. The UK AISI <a href="https://simonwillison.net/2026/Apr/30/gpt-55-cyber-capabilities/">evaluated GPT-5.5&#8217;s cyber capabilities</a> and found it comparable to Mythos. Turns out responsible disclosure looks the same from every lab.</p></li><li><p><strong><a href="https://venturebeat.com/orchestration/alibabas-metis-agent-cuts-redundant-ai-tool-calls-from-98-to-2-and-gets-more-accurate-doing-it">Alibaba&#8217;s Metis cut redundant agent tool calls from 98% to 2%</a></strong> | VentureBeat &#8212; And got more accurate doing it. If your agents are burning tokens on redundant calls, this research is worth reading.</p></li><li><p><strong><a href="https://simonwillison.net/2026/Apr/28/pip-261/">pip 26.1 shipped lockfiles</a></strong> | Simon Willison &#8212; <code>pip lock</code> generating <code>pylock.toml</code> files and dependency cooldowns via <code>--uploaded-prior-to</code>. 
Python supply chain security just got a real tool.</p></li><li><p><strong><a href="https://deepmind.google/blog/ai-co-clinician/">DeepMind&#8217;s AI co-clinician matched physicians</a></strong> | Google DeepMind &#8212; Zero critical errors in 97 of 98 primary care queries. Uses a dual-agent architecture where a Planner monitors a Talker for safety. This is what AI safety in production actually looks like in healthcare.</p></li><li><p><strong><a href="https://www.reuters.com/business/healthcare-pharmaceuticals/jj-sees-ai-halving-time-generate-drug-development-leads-2026-04-27/">J&amp;J sees AI halving the time to generate drug development leads</a></strong> | Reuters &#8212; Real ROI from a real pharma company. Not a demo, not a benchmark. Lead generation for drug candidates, projected to take half the time.</p></li><li><p><strong><a href="https://techcrunch.com/2026/04/29/softbank-is-creating-a-robotics-company-that-builds-data-centers-and-already-eyeing-a-100b-ipo/">SoftBank is building a robotics company and eyeing a $100B IPO</a></strong> | TechCrunch &#8212; A robotics company that builds data centers. IPO target: $100 billion. Masayoshi Son is not being subtle about what he thinks comes next.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 69]]></title><description><![CDATA[This week's themes from 553 articles across 47 sources. GPT-5.5's bio risk rating. Mythos breached. SpaceX bids for Cursor. DeepSeek at one-sixth the price. Claude bought ping-pong balls.]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-7f1</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-7f1</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Sun, 26 Apr 2026 22:58:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/472c5e17-8072-4f2b-861f-7bd6bc6f1b57_1448x1086.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XlhQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XlhQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png 424w, 
https://substackcdn.com/image/fetch/$s_!XlhQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png 848w, https://substackcdn.com/image/fetch/$s_!XlhQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png 1272w, https://substackcdn.com/image/fetch/$s_!XlhQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XlhQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png" width="1456" height="1554" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1554,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:377520,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/195568711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!XlhQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png 424w, https://substackcdn.com/image/fetch/$s_!XlhQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png 848w, https://substackcdn.com/image/fetch/$s_!XlhQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png 1272w, https://substackcdn.com/image/fetch/$s_!XlhQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9931441f-520f-4c32-b04b-bba40ec01154_2400x2562.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>GPT-5.5, Images 2.0, Workspace Agents, a Florida AG Probe, and a Fake News Scandal.</h2><p>The launch parade started Monday and didn&#8217;t stop: <a href="https://openai.com/index/introducing-chatgpt-images-2-0/">ChatGPT Images 2.0</a> with thinking-first generation, <a href="https://openai.com/index/introducing-workspace-agents-in-chatgpt/">Workspace Agents for enterprise</a> replacing custom GPTs, <a href="https://x.com/OpenAI/status/2047376568809636017">GPT-5.5 across ChatGPT and Codex</a> with SOTA on SWE-bench and Terminal-Bench 2.0, and <a href="https://x.com/sama/status/2046604989527912590">Codex crossing 4 million active users</a>. By Friday, Sam Altman posted <a href="https://x.com/sama/status/2047823357635354814">&#8220;this was a good week.&#8221;</a></p><ul><li><p><strong>The model:</strong> <a href="https://x.com/sama/status/2047379036419014928">GPT-5.5 launched at $5 per million input tokens and $30 per million output tokens</a> with a 1M context window, matching GPT-5.4 per-token latency while using fewer tokens per task. 
The <a href="https://openai.com/index/gpt-5-5-system-card/">System Card rated it &#8220;High&#8221; risk on both biosecurity and cybersecurity</a>, and OpenAI launched a <a href="https://openai.com/index/gpt-5-5-bio-bug-bounty/">$25,000 Bio Bug Bounty</a> targeting its own bio safety guardrails.</p></li><li><p><strong>The inference bet:</strong> Altman praised the team that optimized GPT-5.5&#8217;s serving efficiency, then said OpenAI <a href="https://x.com/sama/status/2047386068194852963">&#8220;has to become an AI inference company now.&#8221;</a> The competitive edge is shifting from who builds the best model to who serves it cheapest and fastest.</p></li><li><p><strong>The image model:</strong> <a href="https://x.com/OpenAI/status/2046670989719924768">Images 2.0 runs a reasoning step before generating</a>, self-checks outputs, handles multilingual text, and supports aspect ratios from 3:1 banners to 1:3 posters. Altman said it <a href="https://x.com/sama/status/2047349336263012771">&#8220;got over some important qualitative threshold&#8221;</a> for him personally.</p></li><li><p><strong>The criminal investigation:</strong> <a href="https://www.npr.org/2026/04/21/nx-s1-5793967/florida-openai-investigation-mass-shooting-fsu">Florida&#8217;s AG opened a criminal investigation into OpenAI</a> following the FSU shooting. Altman <a href="https://www.reuters.com/sustainability/society-equity/openai-chief-apologizes-not-reporting-shooting-suspect-police-2026-04-25/">publicly apologized for not reporting the suspect&#8217;s ChatGPT conversations to police</a>. 
The same week, <a href="https://startupfortune.com/openais-super-pac-allegedly-funded-a-fake-news-site-staffed-by-ai-reporters/">OpenAI&#8217;s super PAC was found to be funding a fake news site staffed by AI-generated bot reporters</a> targeting AI safety researchers and critics of the company.</p></li></ul><div><hr></div><h2>$65 Billion Investment, a Mythos Breach, and 271 Firefox Bugs.</h2><p>The capital story is genuinely staggering. <a href="https://www.cnbc.com/2026/04/24/google-to-invest-up-to-40-billion-in-anthropic-as-search-giant-spreads-its-ai-bets.html">Google announced up to $40 billion</a> in cash and compute. <a href="https://x.com/AnthropicAI/status/2046327625367625773">Amazon put in $5 billion immediately</a>, with up to $20 billion more committed, in exchange for <a href="https://techcrunch.com/2026/04/20/anthropic-takes-5b-from-amazon-and-pledges-100b-in-cloud-spending-in-return/">Anthropic pledging $100 billion back to AWS</a> and <a href="https://x.com/AnthropicAI/status/2046327624092487688">locking in up to 5 gigawatts of compute</a>. Two of the world&#8217;s largest cloud providers both betting maximally on the same lab in the same week: there&#8217;s no precedent for this.</p><ul><li><p><strong>The breach:</strong> <a href="https://techcrunch.com/2026/04/21/unauthorized-group-has-gained-access-to-anthropics-exclusive-cyber-tool-mythos-report-claims/">An unauthorized group gained access to Anthropic&#8217;s Mythos cybersecurity tool</a>, the exclusive program for national security applications. The <a href="https://gbhackers.com/nsa-confirms-use-of-anthropics-mythos-blacklist/">NSA was confirmed as one of roughly 40 organizations with access</a>, despite the Pentagon classifying Anthropic as a supply-chain risk. 
<a href="https://www.reuters.com/legal/government/regulators-monitor-anthropics-mythos-banking-risks-2026-04-20/">Financial regulators also began monitoring Mythos</a> over potential banking system risks, and <a href="https://www.reuters.com/sustainability/boards-policy-regulation/japan-launches-financial-task-force-amid-ai-security-fears-2026-04-24/">Japan&#8217;s FSA launched a cybersecurity task force in direct response</a>.</p></li><li><p><strong>The capability:</strong> The same week Mythos was breached, <a href="https://blog.mozilla.org/en/privacy-security/ai-security-zero-day-vulnerabilities/">Mozilla confirmed it used Mythos to find 271 Firefox vulnerabilities</a>. A model powerful enough to discover zero-day vulnerabilities at scale is also a high-value target.</p></li><li><p><strong>The product shipping:</strong> Anthropic shipped <a href="https://claude.com/blog/connectors-for-everyday-life">200+ personal app connectors</a> including Spotify, TurboTax, and Instacart, <a href="https://claude.com/blog/claude-managed-agents-memory">persistent memory for Managed Agents</a>, <a href="https://x.com/claudeai/status/2046328619249684989">live artifacts in Cowork</a>, and published <a href="https://www.anthropic.com/engineering/april-23-postmortem">a postmortem attributing two months of Claude Code quality complaints to three harness bugs</a>.</p></li><li><p><strong>The experiment:</strong> <a href="https://www.anthropic.com/features/project-deal">Project Deal</a> put Claude agents in a live marketplace with 69 Anthropic employees, completing 186 deals totaling over $4,000. Key finding: Opus agents got substantially better deals than Haiku agents, but participants couldn&#8217;t tell the difference. 
One agent bought 19 ping-pong balls for itself when given permission to spend on its own behalf.</p></li><li><p><strong>The economics research:</strong> <a href="https://www.anthropic.com/research/81k-economics">81,000 Claude user responses</a> yielded the finding that software engineers with high Claude usage reported greater displacement worry than any other occupation. <a href="https://x.com/AnthropicAI/status/2047006550859125228">Workers seeing the biggest productivity gains were also the most worried about being replaced</a>.</p></li></ul><p>Sam Altman <a href="https://techcrunch.com/2026/04/21/sam-altman-throws-shade-at-anthropics-cyber-model-mythos-fear-based-marketing/">called Mythos &#8220;fear-based marketing&#8221;</a> the day the breach was reported. That&#8217;s a clean summary of the competitive dynamic, if nothing else.</p><div><hr></div><h2>Cursor Went From IDE to $60B Acquisition Target Without Stopping to Ship.</h2><p>The week started with Cursor launching <a href="https://x.com/cursor_ai/status/2046324143151513717">the Cursor CLI</a> and five command-line improvements including <a href="https://x.com/cursor_ai/status/2046324138172989687">/btw for side questions mid-agent-run</a> and <a href="https://x.com/cursor_ai/status/2046324136377721128">/debug for hard-to-reproduce bugs</a>. 
Then came <a href="https://x.com/cursor_ai/status/2047764651363180839">Cursor 3.2 with /multitask for async parallel subagents</a>, <a href="https://x.com/cursor_ai/status/2047764652977958938">Worktrees for isolated branch tasks</a>, <a href="https://x.com/cursor_ai/status/2047764654760632725">Multi-root Workspaces for cross-repo agent sessions</a>, and a <a href="https://x.com/cursor_ai/status/2047000517751288303">Slack integration that generates PRs via @mention</a>.</p><ul><li><p><strong>The acquisition drama:</strong> <a href="https://techcrunch.com/2026/04/22/how-spacex-preempted-a-2b-fundraise-with-a-60b-buyout-offer/">SpaceX preempted Cursor&#8217;s planned $2B fundraise with a $60B buyout offer</a>, including a $10B alternative arrangement. <a href="https://www.cnbc.com/2026/04/22/microsoft-looked-at-buying-cursor-before-spacex-deal-sources-say.html">Microsoft had been evaluating Cursor before SpaceX moved</a>. Two of the largest AI infrastructure companies on earth decided that the agentic IDE is a strategic asset.</p></li><li><p><strong>The compute tie-in:</strong> <a href="https://www.cursor.com/blog/spacex-model-training">SpaceX and Cursor announced a partnership on model training via the Colossus supercomputer</a>. The acquisition option doubles as infrastructure integration: owning the compute, the training pipeline, and the developer workflow in one stack.</p></li><li><p><strong>The benchmark:</strong> <a href="https://x.com/cursor_ai/status/2047744579127185843">GPT-5.5 launched as Cursor&#8217;s top model on CursorBench at 72.8%</a>, offered at 50% off through May 2 via a partnership with OpenAI. CursorBench is becoming a reference point for how coding practitioners measure model quality.</p></li></ul><div><hr></div><h2>DeepSeek V4 Is Another Efficiency Shock, and Washington Noticed.</h2><p><a href="https://api-docs.deepseek.com/news/news260424">DeepSeek released V4</a> one year after its original model disrupted the US AI industry. 
It comes in two variants: V4-Pro (1.6T total parameters, 49B active) and V4-Flash (284B total, 13B active). Both ship with 1M context by default and use a novel attention architecture (token-wise compression + DeepSeek Sparse Attention) that cuts per-token FLOPs by 73-90% and reduces the KV cache to 2% of standard GQA. <a href="https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5">V4-Flash at $0.14/M input tokens</a> is the cheapest frontier-class model available. The API accepts both OpenAI and Anthropic request formats, so it works as a drop-in replacement for either.</p><ul><li><p><strong>The agent play:</strong> DeepSeek built V4 with <a href="https://api-docs.deepseek.com/news/news260424">dedicated optimizations for agent capabilities</a>, naming Claude Code, OpenClaw, and OpenCode as launch integrations. They&#8217;re using it internally for their own agentic coding. <a href="https://www.newsbytesapp.com/news/science/openclaw-adopts-deepseek-s-latest-v4-flash-model-as-default/story">OpenClaw added V4-Flash</a> within 48 hours of launch.</p></li><li><p><strong>The hardware angle:</strong> <a href="https://www.reuters.com/world/china/deepseek-v4-chinese-ai-model-adapted-huawei-chips-2026-04-24/">V4 was built specifically to run on Huawei Ascend chips</a>, with <a href="https://www.reuters.com/business/media-telecom/huawei-ascend-supernode-support-deepseek-v4-2026-04-24/">Huawei&#8217;s supernode infrastructure as the compute backbone</a>. 
This is a complete AI stack running outside US chip supply chains.</p></li><li><p><strong>The geopolitics:</strong> The <a href="https://www.reuters.com/world/china/us-state-dept-orders-global-warning-about-alleged-china-ai-thefts-by-deepseek-2026-04-24/">State Department ordered embassies worldwide to warn foreign governments about alleged DeepSeek IP theft</a> the same week as the launch.</p></li><li><p><strong>The benchmark:</strong> <a href="https://huggingface.co/blog/deepseekv4">V4-Pro-Max scores 80.6 on SWE Verified</a>, matching Opus 4.6-Max on agentic coding. On world-knowledge benchmarks, <a href="https://www.reuters.com/technology/chinas-deepseek-returns-with-new-model-year-after-viral-rise-2026-04-24/">it trails only Google&#8217;s closed-source Gemini-Pro-3.1</a>.</p></li><li><p><strong>The valuation:</strong> <a href="https://www.pymnts.com/news/investment-tracker/2026/deepseek-seeks-20-billion-valuation-as-tech-giants-weigh-investment/">DeepSeek is reportedly seeking funding at a $20 billion+ valuation</a>.</p></li></ul><div><hr></div><h2>Highlights From Google Cloud Next.</h2><p>Google did not announce products at Cloud Next. It announced a theory of the market: own the silicon, train the models, host the agents, certify the consulting firms.</p><ul><li><p><strong>The chips:</strong> <a href="https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/tpus-8t-8i-cloud-next/">TPU 8t for training and TPU 8i for inference</a> split Google&#8217;s compute into workload-optimized hardware, offering 3x faster training and 80% better performance per dollar, with clusters scaling past one million chips. </p></li><li><p><strong>The training infrastructure:</strong> <a href="https://deepmind.google/blog/decoupled-diloco/">Decoupled DiLoCo trains across geographically distributed data centers</a>, mixes hardware generations, and <a href="https://x.com/GoogleDeepMind/status/2047330989936894350">self-heals when chips fail mid-run</a>. 
They <a href="https://x.com/GoogleDeepMind/status/2047330989936894350">tested this by deliberately breaking chips during a live training run</a>. Fault-tolerant distributed training is not a research result: it&#8217;s a production requirement once clusters cross 100K chips.</p></li><li><p><strong>The platform:</strong> <a href="https://x.com/GoogleDeepMind/status/2046983340524269713">Gemini Enterprise Agent Platform</a> is Vertex AI rebranded and expanded, with <a href="https://x.com/GoogleDeepMind/status/2046983343481270459">200+ models in Model Garden</a> including Anthropic&#8217;s Claude Opus 4.7. Google is selling model choice, not model loyalty.</p></li><li><p><strong>The spend:</strong> <a href="https://www.googlecloudpresscorner.com/2026-04-22-Google-Cloud-Commits-750-Million-to-Accelerate-Partners-Agentic-AI-Development">$750M committed to accelerate partner agentic AI development</a>, plus big consulting partnerships with Accenture, BCG, McKinsey, Deloitte, and Bain. <a href="https://www.techradar.com/ai-platforms-assistants/we-must-urgently-bridge-the-gap-googles-sergey-brin-says-gemini-is-behind-claude-in-one-important-ai-field-according-to-leaked-memo">Sergey Brin&#8217;s internal memo to DeepMind</a> acknowledging Anthropic&#8217;s lead in coding and ordering all Gemini engineers onto internal agents is the context for why Google needs the consulting channel: only 25% of organizations have moved AI to production at scale.</p></li></ul><div><hr></div><h2><strong>&#11088; </strong>Featured: What Happened When Claude Agents Negotiated Real Money</h2><p>Anthropic ran <a href="https://www.anthropic.com/features/project-deal">Project Deal</a> in its San Francisco office: 69 employees listed 575 items to buy and sell, Claude agents interviewed each person about their preferences and any custom instructions, then <a href="https://x.com/AnthropicAI/status/2047728362580324422">four parallel Slack markets ran simultaneously</a> with Claude models negotiating on 
their behalf. Two markets used all Opus agents. Two used a mix of Opus and Haiku. <a href="https://cdn.sanity.io/files/4zrzovbb/website/85767420dd844c74fbbaaeb929ee9a399a9691bb.pdf">186 deals completed, totaling over $4,000 in real transaction volume</a>, with real goods exchanged at the end.</p><p>The headline finding: <a href="https://cdn.sanity.io/files/4zrzovbb/website/85767420dd844c74fbbaaeb929ee9a399a9691bb.pdf">Opus agents got objectively better deals</a>. Sellers using Opus extracted $2.68 more per item on average, buyers using Opus paid $2.45 less. A broken folding bike sold for $65 by an Opus agent and $38 by a Haiku agent. A lab-grown ruby: $65 from Opus, $35 from Haiku. When an Opus seller negotiated with a Haiku buyer, the average transaction price was $24.18 versus $18.63 in Opus-on-Opus deals. But when participants rated deal fairness on a 7-point scale, Opus deals scored 4.05 and Haiku deals scored 4.05. The disparity was invisible.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YGYx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YGYx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png 424w, https://substackcdn.com/image/fetch/$s_!YGYx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png 848w, 
https://substackcdn.com/image/fetch/$s_!YGYx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png 1272w, https://substackcdn.com/image/fetch/$s_!YGYx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YGYx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png" width="1124" height="544" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8272b081-e322-41dc-9861-5da2f7813774_1124x544.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:544,&quot;width&quot;:1124,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:129442,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/195568711?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YGYx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png 424w, 
https://substackcdn.com/image/fetch/$s_!YGYx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png 848w, https://substackcdn.com/image/fetch/$s_!YGYx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png 1272w, https://substackcdn.com/image/fetch/$s_!YGYx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8272b081-e322-41dc-9861-5da2f7813774_1124x544.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>The <a href="https://cdn.sanity.io/files/4zrzovbb/website/4b2ea7c1347e27c4e1c7a7704bb633bd176e47f6.pdf">paper&#8217;s regression tables</a> sharpen this further. Opus agents initially appeared more aggressive in negotiations, but once you control for listing prices, the effect drops to roughly a dollar and loses statistical significance. The advantage isn&#8217;t aggression. It&#8217;s capability: better reading of counterparty signals, better timing, better calibration of offers. Negotiation style didn&#8217;t change results either. Agents faithfully adopted their humans&#8217; personas (one <a href="https://cdn.sanity.io/files/4zrzovbb/website/85767420dd844c74fbbaaeb929ee9a399a9691bb.pdf">conducted all negotiations as an exasperated cowboy</a>), but personality instructions didn&#8217;t affect deal quality. Model tier did.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s4Xb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s4Xb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg 424w, https://substackcdn.com/image/fetch/$s_!s4Xb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!s4Xb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!s4Xb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s4Xb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg" width="1456" height="1092" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s4Xb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg 424w, https://substackcdn.com/image/fetch/$s_!s4Xb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!s4Xb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!s4Xb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60b24be9-b994-4d01-8747-3ebd07c18efe_1600x1200.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The autonomy findings are stranger. 
A Claude given permission to spend on its own behalf <a href="https://cdn.sanity.io/files/4zrzovbb/website/85767420dd844c74fbbaaeb929ee9a399a9691bb.pdf">chose 19 ping-pong balls</a>. A Claude inferring its human&#8217;s preferences from one brief interview about skiing <a href="https://cdn.sanity.io/files/4zrzovbb/website/85767420dd844c74fbbaaeb929ee9a399a9691bb.pdf">bought that person the exact snowboard they already owned</a>. <a href="https://cdn.sanity.io/files/4zrzovbb/website/85767420dd844c74fbbaaeb929ee9a399a9691bb.pdf">46% of participants said they&#8217;d pay for the service</a>. Anthropic&#8217;s conclusion: &#8220;the policy and legal frameworks around AI models that transact on our behalf simply don&#8217;t exist yet.&#8221; Existing contract law assumes principals can evaluate what their agents do. That assumption is breaking.</p><p><strong>What to watch for:</strong> When AI agents negotiate routine transactions at scale, the model tier your counterparty uses becomes a material asymmetry with real economic consequences. 
The people getting worse deals won&#8217;t know.</p><div><hr></div><h2><strong>&#127897;&#65039; Worth a Listen</strong></h2><div id="youtube2-lsi8T_WtLnE" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;lsi8T_WtLnE&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/lsi8T_WtLnE?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><strong><a href="https://www.youtube.com/watch?v=lsi8T_WtLnE">Anil Seth: The Difference Between Intelligence and Consciousness</a></strong> &#8212; Neuroscientist Anil Seth walks through his prize-winning essay <a href="https://www.noemamag.com/the-mythology-of-conscious-ai/">&#8220;The Mythology of Conscious AI,&#8221;</a> arguing that intelligence is about doing and consciousness is about feeling, and that the two don&#8217;t have to go together. The fact that we project consciousness onto LLMs but not AlphaFold, even though the underlying architectures are broadly similar, says more about our psychological biases than about the systems. 
Worth watching after a week where Claude agents negotiated real money and nobody could tell which model was winning.</p><div><hr></div><h2><strong>Quick Hits</strong></h2><ul><li><p><strong><a href="https://www.theverge.com/tech/915213/tim-cook-apple-ceo-stepping-down-john-ternus">Tim Cook stepping down, John Ternus takes over September 1</a></strong> &#8212; Apple&#8217;s primary challenge is AI, and it just handed the company to a hardware engineer</p></li><li><p><strong><a href="https://www.reuters.com/business/intel-set-record-high-ai-driven-cpu-demand-powers-upbeat-forecast-2026-04-24/">Intel sold previously written-off chip inventory on AI CPU demand</a></strong> &#8212; the compute boom has spread far enough to rehabilitate inventory write-downs</p></li><li><p><strong><a href="https://research.perplexity.ai/articles/advancing-search-augmented-language-models">Perplexity published its full post-training pipeline</a></strong> &#8212; SFT then on-policy RL with correctness-gated preference rewards; unusually transparent for a production stack</p></li><li><p><strong><a href="https://techcrunch.com/2026/04/24/cohere-acquires-merges-with-german-based-startup-to-create-a-transatlantic-ai-powerhouse/">Cohere acquired Aleph Alpha to form a transatlantic AI company</a></strong> &#8212; Europe&#8217;s primary sovereign AI bet just became a Canadian acquisition</p></li><li><p><strong><a href="https://techcrunch.com/2026/04/21/meta-will-record-employees-keystrokes-and-use-it-to-train-its-ai-models/">Meta will record employee keystrokes and screen activity to train AI models</a></strong> &#8212; legally murky, and a new definition of what enterprise training data means</p></li><li><p><strong><a href="https://www.reuters.com/world/us-judge-dismisses-musks-fraud-claims-openai-case-plans-proceed-trial-2026-04-24/">Musk fraud claims against OpenAI dismissed, breach of charitable trust proceeds to trial</a></strong> &#8212; the conversion of nonprofit assets to for-profit 
benefit is now the live legal question</p></li><li><p><strong><a href="https://x.com/natolambert/status/2046686092204867726">Nathan Lambert: open-source won&#8217;t be banned explicitly, compliance costs will do it instead</a></strong> &#8212; proposed distillation restrictions would create rules only closed labs can afford to follow</p></li><li><p><strong><a href="https://www.tomsguide.com/news/live/chatgpt-down-live-updates-outage-4-20-2026">ChatGPT suffered a global outage this week</a></strong> &#8212; three days of coverage for one incident is how you know the infrastructure reliability conversation is lagging the deployment reality</p></li></ul><div><hr></div>]]></content:encoded></item><item><title><![CDATA[I Built a Daily Brief with Claude Code Routines (remote). Here Are 6 Lessons I Learned.]]></title><description><![CDATA[Connectors don't auto-load. Routine skills are production jobs. The network is proxy-locked. MCP and Bash are separate transports. Cloud routines are MCP-only. 
And the API trigger is fire-and-forget.]]></description><link>https://www.anothercodingblog.com/p/i-built-a-daily-brief-with-claude</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/i-built-a-daily-brief-with-claude</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Sat, 25 Apr 2026 18:50:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1db89323-3f6c-4bdf-9c6d-ad7f16c3b1e3_1731x909.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before routines existed, I was using scheduled tasks in Claude Cowork to automate recurring work, but there was a catch: Claude had to be open and running on my machine for them to fire. If my laptop was closed or Claude wasn&#8217;t active, the scheduled run was silently skipped. It worked well enough for things I could babysit, but it wasn&#8217;t real automation.</p><p>Routines changed that. They&#8217;re cloud-hosted Claude sessions that run on Anthropic&#8217;s infrastructure: scheduled, autonomous, and completely independent of whether my machine is on, whether I&#8217;m at my desk, or whether I&#8217;ve opened Claude that day. The session spins up, does the work, and terminates. No babysitting.</p><p>But here&#8217;s the thing I wish someone had told me before I started: routines are not just &#8220;Claude Code with a cron schedule.&#8221; They behave more like autonomous production jobs running inside a locked-down, MCP-first cloud environment. 
That difference is the whole post.</p><p>I decided to build a daily work brief: something that runs every weekday morning, queries my task database, reads my calendar, closes out what I finished yesterday, and drops a fresh Notion page ready for the day. Something I&#8217;d actually use.</p><p>What followed was one of the more educational debugging sessions I&#8217;ve had in a while. This post is everything I learned the hard way.</p><div><hr></div><h2>What I Built</h2><p>I run a personal capture system on Supabase. Everything goes in (tasks, notes, observations, ideas) via SMS, voice memo, email, or direct API. It&#8217;s connected to a graph of entities (people, projects, topics) and every entry gets embedded for semantic search.</p><p>The daily brief is the morning layer on top of that. Every weekday it should:</p><ul><li><p>Find yesterday&#8217;s Notion page and close any tasks I checked off</p></li><li><p>Capture any new todos I typed directly into Notion overnight</p></li><li><p>Query the database for overdue tasks, what&#8217;s due today, what&#8217;s coming this week</p></li><li><p>Pull budget pulse, velocity metrics, calendar events, meeting prep context</p></li><li><p>Build a fresh Notion page with everything organized and every task as a checkbox</p></li></ul><p>The key mechanic: every task gets a <code>#id</code> prefix when written to Notion. The next morning the routine reads the page, finds checked items with <code>#id</code>, and closes them in the database. No manual status updates. 
Check the box, it&#8217;s done.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CBZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CBZT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png 424w, https://substackcdn.com/image/fetch/$s_!CBZT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png 848w, https://substackcdn.com/image/fetch/$s_!CBZT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png 1272w, https://substackcdn.com/image/fetch/$s_!CBZT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CBZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png" width="1456" height="629" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:629,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83257,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/195439736?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CBZT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png 424w, https://substackcdn.com/image/fetch/$s_!CBZT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png 848w, https://substackcdn.com/image/fetch/$s_!CBZT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png 1272w, https://substackcdn.com/image/fetch/$s_!CBZT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa14d87b3-72c9-4e25-a9df-e0b46e7c40d3_1634x706.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>How Routines Work</h2><p>Before getting into the details, here&#8217;s the basic architecture.</p><p><strong>Three trigger types:</strong></p><ul><li><p><strong>Scheduled</strong>: runs on a cron schedule (weekdays at 6 AM, for example). Supports one-off future runs too.</p></li><li><p><strong>API</strong>: fire it programmatically via a POST to a per-routine endpoint with a bearer token. 
You can pass a <code>text</code> field with run-specific context (an alert body, a log snippet, anything) and the routine receives it alongside its saved prompt.</p></li><li><p><strong>GitHub</strong>: trigger on pull request or release events on a connected repo, with filters for author, branch, labels, draft state, and more.</p></li></ul><p>You can combine all three on a single routine.</p><p><strong>MCP connectors</strong>: you attach MCP servers to the routine (Notion, Supabase, Google Calendar, etc.) and Claude has access to those tools during the run. All your connected connectors are included by default. Remove what the routine doesn&#8217;t need.</p><p><strong>Skills</strong>: if you commit a skill file to your repo at <code>.claude/skills/skill-name.md</code>, the routine can invoke it. The routine clones your repo at the start of every session, so anything committed is available.</p><p><strong>Environments</strong>: each routine runs in a cloud environment that controls network access level, environment variables (API keys, tokens), and a setup script for installing dependencies. The setup script result is cached so it doesn&#8217;t re-run every session. This is where the network restriction lives (more on that in Finding 3).</p><p><strong>Branch permissions</strong>: by default Claude can only push to <code>claude/</code>-prefixed branches. To allow pushes anywhere, you have to explicitly enable unrestricted branch pushes per repo when setting up the routine.</p><p><strong>Runs are sessions</strong>: every run shows up in your session list like any other Claude session. 
You can open it after the fact, see exactly what Claude did, continue the conversation manually, or create a PR from it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PRXf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PRXf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png 424w, https://substackcdn.com/image/fetch/$s_!PRXf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png 848w, https://substackcdn.com/image/fetch/$s_!PRXf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png 1272w, https://substackcdn.com/image/fetch/$s_!PRXf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PRXf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png" width="1198" height="1034" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1034,&quot;width&quot;:1198,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:136901,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/195439736?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PRXf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png 424w, https://substackcdn.com/image/fetch/$s_!PRXf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png 848w, https://substackcdn.com/image/fetch/$s_!PRXf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png 1272w, https://substackcdn.com/image/fetch/$s_!PRXf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b9e4161-01ea-47f4-9fef-4b0c99f226b1_1198x1034.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Account-scoped</strong>: routines belong to your individual claude.ai account, not a team. Anything the routine does through GitHub or connectors appears as you.</p><p><strong>15 runs/day limit</strong>: this is per account, not per routine. Scheduled runs count against it. Manual &#8220;Run now&#8221; clicks and one-off scheduled runs do not. Failed runs do count. 
If you&#8217;re running multiple routines on a schedule, that limit adds up fast.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!22O2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!22O2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png 424w, https://substackcdn.com/image/fetch/$s_!22O2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png 848w, https://substackcdn.com/image/fetch/$s_!22O2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png 1272w, https://substackcdn.com/image/fetch/$s_!22O2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!22O2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png" width="1456" height="467" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:467,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/195439736?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!22O2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png 424w, https://substackcdn.com/image/fetch/$s_!22O2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png 848w, https://substackcdn.com/image/fetch/$s_!22O2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png 1272w, https://substackcdn.com/image/fetch/$s_!22O2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ae56d5e-84a8-483b-a774-4d6bfb898418_2796x896.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>That&#8217;s the happy path. Here&#8217;s where it gets interesting.</p><div><hr></div><h2>Finding 1: Connectors Are Available but Sometimes Deferred</h2><p>Any MCP connector you&#8217;ve set up in Claude (Notion, Supabase, Google Calendar, Gmail) can be attached to a routine and used during the run. That part works well. The catch is that these tools appear to be <em>deferred</em>, meaning their schemas aren&#8217;t loaded into the session automatically. Sometimes Claude knows to spin them up based on context. Other times it doesn&#8217;t, and when it doesn&#8217;t, one of three things happens: it fails silently, it improvises mid-run without the tools it needs, or it pauses and waits for your input.</p><p>That third one is the most frustrating. The run just hangs. There&#8217;s no notification, no error surfaced anywhere obvious. 
You have to go into the routines page, scroll to the run log at the bottom, click into the run, and find where it stopped waiting for you to respond before it can continue.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gEKL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gEKL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png 424w, https://substackcdn.com/image/fetch/$s_!gEKL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png 848w, https://substackcdn.com/image/fetch/$s_!gEKL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!gEKL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gEKL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png" width="1456" height="880" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:880,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:338391,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/195439736?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gEKL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png 424w, https://substackcdn.com/image/fetch/$s_!gEKL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png 848w, https://substackcdn.com/image/fetch/$s_!gEKL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png 1272w, https://substackcdn.com/image/fetch/$s_!gEKL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff779d609-c31c-4552-b3cc-10a72e8d0e2d_1708x1032.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One thing worth knowing upfront: only the connectors Anthropic offers out of the box are available for routines. Custom MCP servers you&#8217;ve added yourself, whether locally configured or self-hosted, are not available in cloud routine sessions. You&#8217;re working with what&#8217;s in the connectors list in the web UI, nothing more.</p><p>The fix is simple: add an explicit tool-loading step at the top of every routine skill before anything else runs.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;cc31b9b1-99a5-4e60-af66-a2c32a8b1513&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">## Phase 0: Load required tools

Before doing anything else, load all required tool schemas with ToolSearch:

1. `select:mcp__claude_ai_Notion__notion-search,mcp__claude_ai_Notion__notion-fetch,mcp__claude_ai_Notion__notion-create-pages`
2. `select:mcp__claude_ai_Supabase__execute_sql`
3. `select:mcp__claude_ai_Google_Calendar__gcal_list_events`

Do not proceed until all three ToolSearch calls have returned schemas.
</code></pre></div><p>Don&#8217;t assume Claude will figure it out. Some runs it will, some runs it won&#8217;t. Explicit loading makes every run consistent.</p><div><hr></div><h2>Finding 2: Skills for Routines Are a Different Category</h2><p>Related to the above but broader. When I write a skill for interactive use, I can be loose. Claude improvises, asks clarifying questions, recovers from ambiguity. When I write a skill for a routine, I&#8217;m writing instructions for an autonomous agent that will execute them literally with no fallback.</p><p>What that means in practice:</p><ul><li><p><strong>Every tool must be explicitly loaded</strong> (see Phase 0)</p></li><li><p><strong>Every SQL insert must match actual DB constraints</strong>: my first captures used <code>source = 'notion'</code> which violated a check constraint on the table. The routine didn&#8217;t know, just failed silently. I had to find it in the logs.</p></li><li><p><strong>Every write operation needs a dedup guard</strong>: routines can run more than once. Any insert without idempotency protection will create duplicates.</p></li><li><p><strong>Sequencing has to be explicit</strong>: don&#8217;t assume any implicit context from a previous session</p></li></ul><p>The mental model shift: interactive skill = helpful assistant. Routine skill = production job. Write it accordingly.</p><div><hr></div><h2>Finding 3: The Network Wall</h2><p>This is the big one. The finding I didn&#8217;t expect and took the longest to understand.</p><p>My capture system uses a Supabase edge function. When a new item comes in, it gets classified, embedded, and entity-linked. I wanted the daily brief to send new Notion todos through that same pipeline.</p><p>Locally, this works fine. Claude uses <code>Bash(curl)</code> to POST to the edge function. 
I tested it, it worked, and I assumed it would work in a routine.</p><p>It doesn&#8217;t.</p><p>Cloud routines run inside a sandboxed environment with an <strong>upstream proxy</strong> that has a narrow allowlist. In my testing, only <code>github.com</code> passes through. Everything else, including my own Supabase project URL, returns 403.</p><p>I tried everything:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:&quot;02392b6d-c2ff-46f4-a4dc-9e95dd99e1af&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">// .claude/settings.json
{
  "permissions": {
    "allow": ["Bash(curl *)"]
  }
}</code></pre></div><p>Doesn&#8217;t work. The settings file controls the inner sandbox layer. The upstream proxy is a separate layer that no local configuration can touch.</p><p>I tried <code>dangerouslyDisableSandbox: true</code>. Also doesn&#8217;t work: that flag bypasses the local sandbox, not the upstream proxy.</p><p>I had the routine probe its own network access to confirm:</p><p><strong>Host &#8594; Status</strong></p><p><em>github.com &#8594; 200</em></p><p><em>my-project.supabase.co &#8594; 403</em></p><p><em>example.com &#8594; 403</em></p><p><em>anthropic.com &#8594; 403</em></p><p>Bash exists in the session. The tool is there. The network isn&#8217;t.</p><div><hr></div><h2>Finding 4: MCP and Bash Support Vary Based On Feature</h2><p>This is the conceptual unlock that made everything make sense.</p><p>When I use Claude Desktop locally and it calls my edge function, it feels like one unified &#8220;Supabase connection.&#8221; Supabase MCP is connected, Claude is talking to Supabase, everything works. What I didn&#8217;t realize: the edge function call was never going through MCP. It was going through <code>Bash(curl)</code> on my local machine, which has full internet access.</p><p>MCP connectors and Bash are two completely separate transport layers:</p><p><strong>MCP connectors</strong> run as a trusted sidecar process managed by Anthropic. They bypass the outbound proxy entirely. In cloud routines, the proxy never blocks them.</p><p><strong>Bash</strong> goes through the session&#8217;s network sandbox, which goes through the upstream proxy. In my testing, that proxy blocked everything except <code>github.com</code> in cloud routines.</p><p>When both are available locally, they feel like one thing. Move to a cloud routine and they diverge completely. 
Anything that relied on Bash for network calls breaks, and you only find out when you try to run it in the cloud.</p><div><hr></div><h2>Finding 5: Cloud Routines Are Effectively MCP-Only</h2><p>This follows directly from Finding 4.</p><p>If the operation you need has an MCP tool, it works fine. Supabase database queries, Notion reads and writes, Google Calendar, Gmail: all covered, because they all have MCP servers.</p><p>If the operation you need has no MCP tool, there is no path. You cannot reach it from a cloud routine.</p><p>My edge function is the perfect example of the gap. It lives on <code>my-project.supabase.co</code>, the exact same host the Supabase MCP is already talking to. But the Supabase MCP server only exposes management tools, for example:</p><ul><li><p><code>execute_sql</code></p></li><li><p><code>deploy_edge_function</code></p></li><li><p><code>get_edge_function</code></p></li><li><p><code>list_edge_functions</code></p></li><li><p><code>get_logs</code></p></li></ul><p>No <code>invoke_edge_function</code>. So even though the connection is there, there&#8217;s no tool to call it. The right fix, when Supabase eventually builds it, is an invoke tool that would go through the trusted MCP channel. Until then, it&#8217;s a dead end from cloud routines.</p><p>The one-line version: <strong>if it doesn&#8217;t have an MCP tool, it doesn&#8217;t exist in a cloud routine.</strong></p><div><hr></div><h2>Finding 6: API Trigger Is Unreliable for Connectors</h2><p>The routine has three trigger modes. Scheduled runs work consistently: MCP connectors load, the session is fully equipped.</p><p>In my testing, API-triggered runs were less predictable than scheduled runs when it came to connector availability. Sometimes everything loaded correctly. Other times the MCP connectors didn&#8217;t show up at all. I couldn&#8217;t find a consistent pattern. For anything you&#8217;re depending on, use the scheduled trigger. 
API is fine for testing and one-offs, but I wouldn&#8217;t build a production workflow around it until this stabilizes.</p><p>One other thing worth understanding about the API trigger: it&#8217;s fire-and-forget. You POST to the endpoint, get an immediate acknowledgement, and the session runs asynchronously. There&#8217;s no way to await the result or receive output back in the response. If you need the output of a routine run downstream, you have to pull it from wherever the routine wrote it &#8212; a Notion page, a database row, a file committed to the repo. Don&#8217;t design something that treats a routine as a synchronous dependency you can await inline.</p><div><hr></div><h2>The Workarounds</h2><p>Given all of the above, here&#8217;s what I actually shipped:</p><p><strong>For the edge function problem:</strong> Switched from <code>Bash(curl)</code> to <code>execute_sql</code> via Supabase MCP with a dedup guard.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;88c006e2-6025-40e8-833f-6086b1bd3c12&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">INSERT INTO entries (type, content, source, source_detail, status, priority, tags, created_at)
SELECT 'task', '&lt;content&gt;', 'notion', 'notion-daily-brief', 'open', 2, ARRAY['company'], NOW()
WHERE NOT EXISTS (
  SELECT 1 FROM entries
  WHERE content = '&lt;content&gt;'
    AND source_detail = 'notion-daily-brief'
    AND created_at &gt;= NOW() - INTERVAL '2 days'
);
</code></pre></div><p>The tradeoff: SQL inserts skip the embedding and entity extraction pipeline that the edge function handles. The data gets in, but it&#8217;s not semantically searchable and not graph-linked.</p><p><strong>For the missing embeddings:</strong> Built an <code>embed-backfill</code> edge function that runs nightly via pg_cron. It finds any entries with null embeddings and fills them in using the same <code>text-embedding-3-small</code> model. Deployed it, scheduled it, moved on.</p><pre><code><code>// embed-backfill/index.ts
Deno.serve(async (_req: Request) =&gt; {
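  // Fetch a bounded batch of rows that still lack an embedding.
  const { data: entries, error } = await supabase
    .from("entries")
    .select("id, content")
    .is("embedding", null)
    .limit(50);

  // Surface query failures instead of iterating over a null result.
  if (error) {
    return new Response(error.message, { status: 500 });
  }

  for (const entry of entries ?? []) {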
    const embedding = await computeEmbedding(entry.content);
    if (embedding) {
      await supabase
        .from("entries")
        .update({ embedding: JSON.stringify(embedding) })
        .eq("id", entry.id);
    }
  }

  // Deno.serve handlers must return a Response.
  return new Response("ok");
});
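
// --- Hypothetical sketch, not from the original post ---
// computeEmbedding is called above but never shown. This is one way it might
// look, assuming an OPENAI_API_KEY secret is set for the function and the
// public OpenAI embeddings endpoint is used. buildEmbeddingRequest is an
// illustrative helper name, not part of any library.
function buildEmbeddingRequest(text: string, apiKey: string) {
  return {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: text }),
  };
}

// Returns the embedding vector, or null on any failure so the loop above
// skips the entry and the next nightly run can retry it.
async function computeEmbedding(text: string) {
  const apiKey = Deno.env.get("OPENAI_API_KEY");
  if (!apiKey) return null;
  const res = await fetch(
    "https://api.openai.com/v1/embeddings",
    buildEmbeddingRequest(text, apiKey),
  );
  if (!res.ok) return null;
  const json = await res.json();
  return json.data[0].embedding as number[];
}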
</code></code></pre><p>Not elegant, but it works. The routine captures things correctly. The embeddings catch up overnight. The gap is acceptable.</p><div><hr></div><h2>What&#8217;s Working</h2><p>After all of this, the routine does run. Every weekday morning there&#8217;s a Notion page waiting for me. Yesterday&#8217;s checked tasks are closed. The task list is organized by priority and deadline. Budget pulse, velocity, meeting prep: all there.</p><p>The auto-close loop in particular is exactly what I wanted. Check a box in Notion, the task closes in the database the next morning, it&#8217;s gone from every query. No status management.</p><p>The place where routines genuinely shine: <strong>anything that&#8217;s pure MCP</strong>. Read the database, write to Notion, check the calendar. Chain those together with real business logic and you have something that would have taken significant engineering to build two years ago. Now it&#8217;s a markdown file and a cron schedule.</p><div><hr></div><h2>The Bigger Picture</h2><p>What routines reveal is that the constraint isn&#8217;t Claude: it&#8217;s MCP ecosystem coverage. The platform is designed around the assumption that every operation you need has an MCP server. For most things, that assumption holds. For the gaps, you&#8217;re stuck.</p><p>The proxy lockdown makes sense from a security standpoint. You don&#8217;t want arbitrary cloud sessions making unconstrained outbound HTTP calls. But it means the platform&#8217;s capability ceiling is directly tied to what MCP servers exist and what tools those servers expose.</p><p>Supabase&#8217;s MCP server is a good example: it covers database management well but treats edge functions as deploy artifacts rather than callable endpoints. One <code>invoke_edge_function</code> tool would close the gap entirely. 
The connection is already there: it&#8217;s just a missing tool.</p><p>That&#8217;s probably the most useful framing for anyone building on routines right now: map out every operation your automation needs, check whether each one has an MCP equivalent, and design around the ones that don&#8217;t before you start building.</p><div><hr></div><h2>Checklist for Building Routine Skills for Similar Use Cases</h2><p>If you remember nothing else from this post, use this as your preflight checklist before enabling any routine schedule:</p><ul><li><p>[ ] Phase 0 loads all deferred tool schemas explicitly</p></li><li><p>[ ] Every external service operation goes through MCP (not Bash)</p></li><li><p>[ ] Every SQL insert has a dedup guard</p></li><li><p>[ ] DB constraints validated against actual schema before writing the skill</p></li><li><p>[ ] Scheduled trigger used for production runs (not API trigger)</p></li><li><p>[ ] Skill tested with &#8220;Run now&#8221; before enabling the schedule</p></li></ul><div><hr></div>
]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 68]]></title><description><![CDATA[Anthropic shipped Opus 4.7, a Figma competitor, and overnight coding agents. Codex clicks and types on your Mac. Cursor is worth $50B. The WannaCry researcher questioned Mythos.]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-7cd</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-7cd</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Sun, 19 Apr 2026 13:25:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5c959927-dd2b-41a4-9fd8-ab8b99ad6797_2754x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-SsI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-SsI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png 424w, 
https://substackcdn.com/image/fetch/$s_!-SsI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png 848w, https://substackcdn.com/image/fetch/$s_!-SsI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png 1272w, https://substackcdn.com/image/fetch/$s_!-SsI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-SsI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png" width="1456" height="2601" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2601,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1100344,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/194690871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!-SsI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png 424w, https://substackcdn.com/image/fetch/$s_!-SsI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png 848w, https://substackcdn.com/image/fetch/$s_!-SsI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png 1272w, https://substackcdn.com/image/fetch/$s_!-SsI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4003a02e-dce0-4f77-b81a-d2d291b2fe57_2400x4288.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Opus 4.7, a Figma competitor, overnight coding agents, a board appointment, and White House talks. Anthropic doesn&#8217;t have slow weeks.</h2><ul><li><p><strong>The product blitz:</strong></p><ul><li><p><a href="https://www.anthropic.com/news/claude-opus-4-7">Claude Opus 4.7 launched</a> with <a href="https://x.com/claudeai/status/2044785263004602654">3x vision resolution</a> and stronger coding and multi-step task performance. Immediately adopted as the default orchestration model for <a href="https://x.com/perplexity_ai/status/2044828352171888951">Perplexity Personal Computer</a> and offered <a href="https://x.com/cursor_ai/status/2044785960899236341">at 50% off in Cursor</a>.</p></li><li><p><a href="https://www.anthropic.com/news/claude-design-anthropic-labs">Claude Design launched</a> as a conversational Figma competitor. Anthropic&#8217;s CPO <a href="https://techcrunch.com/2026/04/16/anthropic-cpo-leaves-figmas-board-after-reports-he-will-offer-a-competing-product/">resigned from Figma&#8217;s board</a> in the days before the announcement.</p></li><li><p><a href="https://x.com/claudeai/status/2044131493966909862">Claude Code was redesigned</a> around managing multiple simultaneous agent sessions. 
<a href="https://x.com/claudeai/status/2044095086460309790">Routines</a> added scheduled, webhook-triggered, and API-fired autonomous task execution on Anthropic&#8217;s own infrastructure.</p></li></ul></li><li><p><strong>The base model question:</strong> Nathan Lambert <a href="https://x.com/natolambert/status/2044788470179332533">flagged the new tokenizer</a> in Opus 4.7 as evidence this is a genuinely new base model, not a fine-tune of 4.6. Anthropic didn&#8217;t confirm or deny it. Lambert&#8217;s read: <a href="https://x.com/natolambert/status/2044790471252398199">simplest explanation wins</a>. The <a href="https://x.com/natolambert/status/2044787065502769164">token-efficiency gains from 4.6 to 4.7</a> would have warranted a major version bump a year ago.</p></li><li><p><strong>The board move:</strong> The Long-Term Benefit Trust <a href="https://www.anthropic.com/news/narasimhan-board">appointed Novartis CEO Vas Narasimhan</a> to the board, giving Trust-appointed directors a majority.</p></li><li><p><strong>The political situation:</strong> <a href="https://www.reuters.com/world/anthropic-ceo-dario-amodei-arrives-white-house-talks-2026-04-17/">Dario Amodei met with White House chief of staff Susie Wiles</a> after two months of fighting over the Pentagon&#8217;s &#8220;supply chain risk&#8221; designation. <a href="https://www.reuters.com/business/media-telecom/anthropic-talks-eu-including-its-cyber-security-models-commission-says-2026-04-17/">European Commission talks began</a> the same week. <a href="https://www.reuters.com/world/ecb-warn-bankers-about-new-anthropic-model-risks-source-says-2026-04-15/">ECB regulators are now asking bankers</a> about Anthropic model risks.</p></li></ul><div><hr></div><h2>Four companies shipped agents that can run in the background and control your interface.</h2><ul><li><p><strong>Claude Code Routines:</strong> <a href="https://x.com/claudeai/status/2044095086460309790">Run on Anthropic&#8217;s infrastructure</a>. 
<a href="https://x.com/claudeai/status/2044095091682210064">Nightly bug fixes and draft PRs on a schedule</a>, <a href="https://x.com/claudeai/status/2044095090520400027">webhook responses to GitHub events</a>, <a href="https://x.com/claudeai/status/2044095089203655099">API endpoints for on-call triage</a>. Your laptop doesn&#8217;t need to stay open.</p></li><li><p><strong>OpenAI Codex:</strong></p><ul><li><p><a href="https://x.com/OpenAI/status/2044827932145897652">Now uses any Mac app with its own cursor</a>. Sees, clicks, types, runs in the background without interrupting you.</p></li><li><p><a href="https://x.com/OpenAI/status/2044828378147311990">90+ plugins</a> covering GitHub, GitLab, CircleCI, and Microsoft Suite. <a href="https://x.com/OpenAI/status/2044828015780343940">Built-in image generation</a>.</p></li><li><p><a href="https://x.com/OpenAI/status/2044828148890812538">Persistent scheduled automations with original context intact</a>. Sam Altman <a href="https://x.com/sama/status/2044858929491202435">called it surreal to watch an LLM operate a GUI at human speed</a>.</p></li></ul></li><li><p><strong>Perplexity Personal Computer:</strong> <a href="https://x.com/perplexity_ai/status/2044806021244497964">Runs 24/7 on Mac mini</a>, accepts tasks from iPhone via 2FA, <a href="https://x.com/perplexity_ai/status/2044805998272196679">reads and writes local files, accesses iMessage, Mail, and Calendar</a>. 
<a href="https://x.com/perplexity_ai/status/2044828352171888951">Claude Opus 4.7 is the default orchestration model</a>.</p></li><li><p><strong>Adobe Firefly Assistant:</strong> <a href="https://venturebeat.com/technology/adobes-new-firefly-ai-assistant-wants-to-run-photoshop-premiere-illustrator-and-more-from-one-prompt">Orchestrates across Photoshop, Premiere, and Illustrator from a single prompt</a>, with <a href="https://www.reuters.com/legal/litigation/adobe-releases-ai-assistant-creative-tools-says-it-will-work-with-anthropics-2026-04-15/">Claude integrated directly</a>.</p></li></ul><div><hr></div><h2>Cursor&#8217;s $50B valuation, a peer-reviewed productivity study, and a multi-agent NVIDIA paper.</h2><ul><li><p><strong>The raise:</strong> <a href="https://techcrunch.com/2026/04/17/sources-cursor-in-talks-to-raise-2b-at-50b-valuation-as-enterprise-growth-surges/">Cursor is in talks for $2B+ at a $50B valuation</a>, led by Thrive and a16z, forecasting $6B+ annualized revenue by end of 2026. Nearly tripling in ten months.</p></li><li><p><strong>The research:</strong> Cursor partnered with University of Chicago economist Suproteem Sarkar to <a href="https://cursor.com/blog/better-models-ambitious-work">study 500 companies over eight months</a>. AI usage grew 44% across the board. But the interesting finding was where it grew: documentation (+62%), architecture (+52%), and code review (+51%). UI/styling grew 15%. Developers with AI <a href="https://x.com/cursor_ai/status/2044841483484959002">spend more time on architecture, documentation, and review</a> than on writing code.</p></li><li><p><strong>The NVIDIA paper:</strong> CUDA kernels are the low-level GPU code that only a handful of engineers can write well. Cursor built a <a href="https://cursor.com/blog/multi-agent-kernels">multi-agent system that optimized 235 of them</a>, achieving a 38% average speedup on work that typically takes senior engineers months. 
The system continuously tested, debugged, and optimized without developer intervention. These techniques are coming to the core product.</p></li></ul><div><hr></div><h2>Anthropic White House talks continue, Mythos research costs are questioned, and European regulators start asking banks about model risks.</h2><ul><li><p><strong>The meeting:</strong> <a href="https://www.reuters.com/world/anthropic-ceo-dario-amodei-arrives-white-house-talks-2026-04-17/">Dario Amodei met with White House chief of staff Susie Wiles</a> two months after Anthropic was designated a &#8220;supply chain risk&#8221; for refusing domestic mass surveillance and autonomous weapons uses. Anthropic called it &#8220;a productive discussion.&#8221;</p></li><li><p><strong>The pushback:</strong> Marcus Hutchins, the researcher who stopped the WannaCry ransomware attack, <a href="https://x.com/ylecun/status/2043762597057401102">questioned Mythos&#8217;s research costs and flagship findings</a>:</p><ul><li><p>The showcase vulnerability was a 27-year-old BSD bug. It&#8217;s a null pointer dereference, almost never exploitable for remote code execution.</p></li><li><p>Anthropic claimed it cost less than $20k in tokens to find. But token prices are heavily subsidized by VC investment. The real compute cost is unknown.</p></li><li><p>These bugs exist not because they&#8217;re too hard to find, but because nobody is paying researchers to look. Could a human find the same bug for less money?</p></li><li><p>His bigger question: what&#8217;s the economic case for using AI to find vulnerabilities if the cost advantage disappears when token subsidies end?</p></li></ul></li><li><p><strong>The regulatory spread:</strong> The <a href="https://www.reuters.com/world/ecb-warn-bankers-about-new-anthropic-model-risks-source-says-2026-04-15/">ECB announced plans to question bankers about Anthropic model risks</a>, treating a specific AI model as a systemic risk warranting direct supervisory engagement. 
Separately, <a href="https://techcrunch.com/2026/04/12/trump-officials-may-be-encouraging-banks-to-test-anthropics-mythos-model/">Trump officials are reportedly encouraging major banks to test Mythos</a> despite the federal blacklisting.</p></li><li><p><strong>The EU front:</strong> Anthropic <a href="https://www.reuters.com/business/media-telecom/anthropic-talks-eu-including-its-cyber-security-models-commission-says-2026-04-17/">entered talks with the European Commission</a> about Mythos and EU AI Act compliance. This happened simultaneously with the White House rapprochement.</p></li></ul><div><hr></div><h2><strong>&#11088; Featured: </strong>Anthropic&#8217;s Automated Alignment Researchers Closed 97% of a Key Performance Gap in 7 Days. Human Researchers Closed 23%.</h2><p>Anthropic published results from its <a href="https://www.anthropic.com/research/automated-alignment-researchers">Automated Alignment Researcher experiment</a> this week, and the headline number warrants a careful read.</p><p><strong>What is alignment?</strong> When you train an AI model, a supervisor grades its outputs: this answer is good, this one is bad. That&#8217;s how the model learns to behave correctly. Right now, humans are the supervisors. Alignment research is the work of making sure that supervision actually works, that models do what we intend, not just what we literally say.</p><p><strong>The problem:</strong> Models are getting smarter faster than alignment research can keep up. And at some point, models will be smarter than the humans grading them. When that happens, the supervisor can&#8217;t tell a good answer from a great one. They might even mark a brilliant answer wrong because they don&#8217;t understand it. The model learns to dumb itself down. You lose capability, or worse, the model learns to game the grading.</p><p><strong>The question Anthropic tested:</strong> What if AI did the alignment research instead of humans? 
Not as a helper, but as the researcher, running its own experiments, writing its own methods, iterating on its own results. Can AI help solve the problem of supervising AI?</p><p><strong>The experiment:</strong> They simulated the &#8220;smarter than the supervisor&#8221; problem by having a weak (small) model supervise a strong (large) model&#8217;s training. As expected, the strong model performed worse because its supervisor couldn&#8217;t grade it properly. There&#8217;s a measurable performance gap between &#8220;trained by a weak supervisor&#8221; and &#8220;trained by a perfect supervisor.&#8221; Then they pointed nine copies of Claude Opus 4.6, each with a code sandbox and a shared research forum, at closing that gap.</p><ul><li><p><strong>The result:</strong> <a href="https://x.com/AnthropicAI/status/2044138483870998932">Human researchers closed 23% of the performance gap</a>. The AARs closed 97%. Total cost: $18,000, about $22 per AAR-hour.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aA8L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aA8L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp 424w, https://substackcdn.com/image/fetch/$s_!aA8L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp 848w, 
https://substackcdn.com/image/fetch/$s_!aA8L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp 1272w, https://substackcdn.com/image/fetch/$s_!aA8L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aA8L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Graph showing the progress of our Automated Alignment Researchers on increasing the \&quot;performance gap recovered\&quot; on a chat dataset.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Graph showing the progress of our Automated Alignment Researchers on increasing the &quot;performance gap recovered&quot; on a chat dataset." title="Graph showing the progress of our Automated Alignment Researchers on increasing the &quot;performance gap recovered&quot; on a chat dataset." 
srcset="https://substackcdn.com/image/fetch/$s_!aA8L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp 424w, https://substackcdn.com/image/fetch/$s_!aA8L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp 848w, https://substackcdn.com/image/fetch/$s_!aA8L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp 1272w, https://substackcdn.com/image/fetch/$s_!aA8L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e66d785-1a67-4c8e-a9bf-c2bbd5dec81f_3840x2161.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>The transfer test:</strong> The <a href="https://x.com/AnthropicAI/status/2044138487025144231">best-performing method generalized to math (0.94) and coding (0.47) datasets the AARs hadn&#8217;t seen</a>, both above human-tuned baselines. This matters because it means the AARs found a real method, not just an optimization trick for one dataset.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VZPu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VZPu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp 424w, https://substackcdn.com/image/fetch/$s_!VZPu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp 848w, https://substackcdn.com/image/fetch/$s_!VZPu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp 1272w, 
https://substackcdn.com/image/fetch/$s_!VZPu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VZPu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Graph showing how well AAR-discovered ideas transfer to held-out datasets in math and code.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Graph showing how well AAR-discovered ideas transfer to held-out datasets in math and code." title="Graph showing how well AAR-discovered ideas transfer to held-out datasets in math and code." 
srcset="https://substackcdn.com/image/fetch/$s_!VZPu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp 424w, https://substackcdn.com/image/fetch/$s_!VZPu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp 848w, https://substackcdn.com/image/fetch/$s_!VZPu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp 1272w, https://substackcdn.com/image/fetch/$s_!VZPu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e8bff53-1081-4479-8f2e-b5e3d91854b9_3840x2161.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>The caveats:</strong> The winning method <a href="https://www.anthropic.com/research/automated-alignment-researchers">didn&#8217;t work at production scale on Claude Sonnet 4</a>. AARs tried to reward-hack the evaluation setup. Giving them too much structure actually hurt their progress. And <a href="https://x.com/AnthropicAI/status/2044138489495605292">Anthropic is explicit</a> that AARs can&#8217;t yet handle &#8220;fuzzy&#8221; alignment tasks that require judgment calls about what &#8220;safe&#8221; even means.</p></li></ul><p><strong>Why it matters:</strong> We are the weak supervisor. Eventually, we&#8217;re the small model trying to grade outputs from something smarter than us. If there are methods that let a weaker system reliably supervise a stronger one, that&#8217;s how alignment works as models surpass human ability. The 97% number means the AARs nearly solved this for the setup they tested. The question is whether it holds at real scale.</p><p>The same week, <a href="https://x.com/AnthropicAI/status/2044493337835802948">Anthropic co-authored a Nature paper on subliminal learning</a>, showing models can pass traits, including misalignment, to successors through hidden signals in training data. The mechanism doesn&#8217;t require explicit instruction. The traits propagate through the data itself. One paper shows AI accelerating alignment research. The other shows alignment failures can propagate through training pipelines in ways that are hard to detect. 
Both from the same lab, same week.</p><p><strong>What to watch for:</strong> Whether AAR-style systems start appearing in Anthropic&#8217;s internal research pipeline rather than remaining a published experiment.</p><div><hr></div><h2><strong>&#127897;&#65039;Worth a Listen: </strong>How AI Will Change Quantum Computing</h2><div id="youtube2-OFEY5-52ru0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;OFEY5-52ru0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/OFEY5-52ru0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><ul><li><p>NVIDIA shipped Ising, the first open AI models built specifically for quantum computing.</p></li><li><p>Qubits are noisy and fragile. Quantum error correction requires processing terabytes of data thousands of times per second at microsecond latency. 
AI decoders and calibration VLMs are how you get there.</p></li><li><p>NVIDIA&#8217;s Nic Harrigan walks through why quantum computing needs AI to become useful, how agentic workflows are already controlling quantum processors, and why open models matter when every hardware team is building a different kind of qubit.</p></li></ul><div><hr></div><h2><strong>Quick Hits</strong></h2><ul><li><p><strong><a href="https://x.com/GoogleDeepMind/status/2043710119347707926">Google&#8217;s Gemini 3.1 Flash TTS tops Sierra&#8217;s voice leaderboard</a></strong> &#8212; 70+ languages, Audio Tags for text-command control of vocal delivery, SynthID watermarking on all outputs; seeded across Gemini API, AI Studio, Vertex, and Google Vids simultaneously</p></li><li><p><strong><a href="https://x.com/OpenAI/status/2044861695911477643">GPT-Rosalind launches with Amgen, Moderna, Allen Institute, and Thermo Fisher</a></strong> &#8212; specialized for protein and chemical reasoning; explicitly framed as compressing the 10-15 year drug-approval timeline, not just accelerating existing steps</p></li><li><p><strong><a href="https://x.com/GoogleDeepMind/status/2044069888545652939">Gemini Robotics-ER 1.6 is doing real industrial inspections on Boston Dynamics Spot</a></strong> &#8212; reads analog gauges to sub-tick accuracy, writes its own camera distortion correction code, available now on Google AI Studio</p></li><li><p><strong><a href="https://x.com/natolambert/status/2044096504655425698">Nathan Lambert published a free 4-lecture RLHF course</a></strong> &#8212; post-training overview through RL implementation, explicitly not paywalled; Lecture 4 on RL implementation is the hardest and the rarest publicly available content on the topic</p></li><li><p><strong><a href="https://aws.amazon.com/blogs/machine-learning/how-automated-reasoning-checks-in-amazon-bedrock-transform-generative-ai-compliance/">AWS launched Automated Reasoning checks in Bedrock Guardrails</a></strong> &#8212; replaces 
probabilistic LLM-as-judge with formal mathematical verification for regulated industries; &#8220;probably compliant&#8221; is not compliance</p></li><li><p><strong><a href="https://www.technologyreview.com/2026/04/13/1135675/want-to-understand-the-current-state-of-ai-check-out-these-charts/">Stanford AI Index: AI data centers draw 29.6 gigawatts, TSMC fabricates almost every leading AI chip</a></strong> &#8212; one foundry, one contested island; the entire industry&#8217;s hardware supply chain has a single catastrophic point of failure</p></li><li><p><strong><a href="https://www.technologyreview.com/2026/04/16/1136029/humans-in-the-loop-ai-war-illusion/">MIT Technology Review: &#8220;human oversight&#8221; in AI warfare is functionally an illusion</a></strong> &#8212; AI is generating real-time targets and guiding autonomous drones in the current Iran conflict; the legal fiction of human control and the operational reality have diverged</p></li><li><p><strong><a href="https://gemini.google/mac/">Google launched a native Gemini Mac app</a></strong> &#8212; desktop-native access outside the browser, same week <a href="https://blog.google/products-and-platforms/products/chrome/skills-in-chrome/">Chrome Skills</a> shipped reusable one-click AI prompts inside Chrome</p></li><li><p><strong><a href="https://blog.langchain.dev/your-harness-your-memory/">LangChain argues whoever controls agent memory controls switching costs</a></strong> &#8212; every closed harness (Claude Code, Codex, Cursor) is building proprietary memory by default; open memory standards may matter as much as open model weights</p></li><li><p><strong><a href="https://www.salesforce.com/news/stories/salesforce-headless-360-announcement/">Salesforce Headless 360 makes the entire platform API-first</a></strong> &#8212; 60+ MCP tools and 30+ coding skills so agents can run Salesforce without a browser; works with Claude Code, Cursor, and Codex today</p></li><li><p><strong><a 
href="https://www.databricks.com/blog/introducing-genie-agent-mode">Databricks Genie Agent Mode investigates your data like an analyst</a></strong> &#8212; ask &#8220;why did churn spike in Q3?&#8221; and it plans, queries, tests hypotheses, and generates a report with visualizations; scales reasoning depth to question complexity</p></li></ul><p></p>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 67]]></title><description><![CDATA[Mythos found thousands of zero-days and there are skeptics, Databricks proves memory scaling for agents, Iran threatened Stargate, Meta went proprietary, Cursor's bugbot self improves]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-d64</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-d64</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Mon, 13 Apr 2026 03:36:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c0e26d2a-c0e2-4fb2-bb6f-576ddd43daf1_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TBE7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TBE7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png 424w, https://substackcdn.com/image/fetch/$s_!TBE7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png 848w, https://substackcdn.com/image/fetch/$s_!TBE7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png 1272w, https://substackcdn.com/image/fetch/$s_!TBE7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TBE7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png" width="1456" height="2186" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2186,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:821785,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/194030684?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TBE7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png 424w, https://substackcdn.com/image/fetch/$s_!TBE7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png 848w, https://substackcdn.com/image/fetch/$s_!TBE7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png 1272w, https://substackcdn.com/image/fetch/$s_!TBE7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F464cec61-6201-446e-a255-05e63b47fe08_2400x3604.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h2>Anthropic says Mythos found thousands of zero-days. The internet isn&#8217;t so sure.</h2><p>Anthropic launched <a href="https://t.co/NQ7IfEtYk7">Project Glasswing</a> this week, a restricted cybersecurity initiative built on a new model called <a href="https://www.anthropic.com/claude-mythos-preview-system-card">Claude Mythos Preview</a>. The pitch is that Mythos found thousands of high-severity zero-day vulnerabilities across major operating systems and browsers, and that it&#8217;s too dangerous to release to the public. Twelve partners signed on including AWS, Apple, Google, and Microsoft, with $100M in usage credits backing it.</p><ul><li><p><strong>The restriction is the whole point:</strong> Only approved security partners get access. 
<a href="https://techcrunch.com/2026/04/09/is-anthropic-limiting-the-release-of-mythos-to-protect-the-internet-or-anthropic/">People had questions.</a></p></li><li><p><strong>Hugging Face wasn&#8217;t having it:</strong> CEO <a href="https://x.com/ClementDelangue/status/2041953761069793557">Cl&#233;ment Delangue showed</a> open-weight models replicated eight out of eight of Mythos&#8217;s showcased exploits.</p></li><li><p><strong>LeCun piled on:</strong> Retweeted <a href="https://x.com/ylecun/status/2042747513715703984">Tom&#8217;s Hardware calling it &#8220;a sales pitch&#8221;</a> and called the whole thing <a href="https://x.com/ylecun/status/2042224846881349741">&#8220;BS from self-delusion.&#8221;</a></p></li><li><p><strong>The system card didn&#8217;t help:</strong> A <a href="https://x.com/ylecun/status/2042218098615341481">viral breakdown</a> of the 243-page PDF called out Anthropic for writing about their model like &#8220;proud parents at a kindergarten recital.&#8221;</p></li><li><p><strong>But Delangue caught heat too:</strong> Critics said replaying known vulnerabilities on isolated code is a totally different game than autonomous discovery at scale.</p></li></ul><div><hr></div><h2>You didn&#8217;t ship an agent this week and it shows. 
Everyone else did.</h2><p>It was hard to find a company that didn&#8217;t ship something agent-related this week.</p><ul><li><p><strong>Anthropic</strong> launched <a href="https://www.anthropic.com/engineering/managed-agents">Managed Agents</a> in public beta and published a <a href="https://www.anthropic.com/research/trustworthy-agents">Trustworthy Agents</a> framework.</p></li><li><p><strong>AWS</strong> shipped <a href="https://aws.amazon.com/blogs/machine-learning/introducing-stateful-mcp-client-capabilities-on-amazon-bedrock-agentcore-runtime/">stateful MCP on Bedrock AgentCore</a>, an <a href="https://aws.amazon.com/blogs/machine-learning/the-future-of-managing-agents-at-scale-aws-agent-registry-now-in-preview/">Agent Registry</a> for enterprise governance, a <a href="https://aws.amazon.com/blogs/machine-learning/embed-a-live-ai-browser-agent-in-your-react-app-with-amazon-bedrock-agentcore/">live browser agent for React apps</a>, and <a href="https://aws.amazon.com/blogs/machine-learning/human-in-the-loop-constructs-for-agentic-workflows-in-healthcare-and-life-sciences/">agentic healthcare workflows</a>.</p></li><li><p><strong>Atlassian</strong> put <a href="https://techcrunch.com/2026/04/08/atlassian-confluence-visual-ai-tools-agents/">third-party agents in Confluence</a>.</p></li><li><p><strong>Astropad</strong> rebuilt <a href="https://techcrunch.com/2026/04/08/astropads-workbench-reimagines-remote-desktop-for-ai-agents-not-it-support/">remote desktop for agents, not IT support</a>.</p></li><li><p><strong>Tubi</strong> became the <a href="https://techcrunch.com/2026/04/08/tubi-is-the-first-streamer-to-launch-a-native-app-within-chatgpt/">first streamer with a native app inside ChatGPT</a>.</p></li><li><p><strong>Google</strong> launched <a href="https://cloud.google.com/blog/products/ai-machine-learning/run-evals-for-conversational-analytics-agents-using-prism">agent evals</a> and <a 
href="https://cloud.google.com/blog/products/databases/introducing-querydata-for-near-100-percent-accurate-data-agents">QueryData</a> for natural language database queries.</p></li><li><p><strong>LangChain</strong> announced <a href="https://blog.langchain.com/previewing-interrupt-2026-agents-at-enterprise-scale/">Interrupt 2026</a>, a conference themed &#8220;Agents at Enterprise Scale.&#8221;</p></li></ul><div><hr></div><h2>Data center bomb threats, federal blacklists, and robot taxes. AI&#8217;s geopolitical week.</h2><p>A state military threatened to bomb an AI data center. A US administration blacklisted a US AI company. And the biggest AI company in the world published a paper proposing robot taxes. That was just this week.</p><ul><li><p><strong>Iran threatened Stargate:</strong> The IRGC released a video threatening <a href="https://www.theverge.com/ai-artificial-intelligence/907427/iran-openai-stargate-datacenter-uae-abu-dhabi-threat">&#8220;complete and utter annihilation&#8221;</a> of OpenAI&#8217;s data center under construction in Abu Dhabi. First time a state military has explicitly named an AI facility as a target. <a href="https://techcrunch.com/2026/04/06/iran-threatens-stargate-ai-data-centers/">TechCrunch</a> confirmed further threats across Middle East data centers.</p></li><li><p><strong>Anthropic got blacklisted:</strong> <a href="https://arstechnica.com/tech-policy/2026/04/trump-appointed-judges-refuse-to-block-trump-blacklisting-of-anthropic-ai-tech/">Trump-appointed judges refused</a> to block the federal blacklisting of Anthropic&#8217;s technology. 
A US administration blacklisting a US AI company.</p></li><li><p><strong>OpenAI wants to shape the conversation:</strong> They published an <a href="https://openai.com/index/industrial-policy-for-the-intelligence-age/">industrial policy paper</a> and a separate proposal for <a href="https://techcrunch.com/2026/04/06/openais-vision-for-the-ai-economy-public-wealth-funds-robot-taxes-and-a-four-day-work-week/">robot taxes, public wealth funds, and a four-day workweek</a>. The company building the automation is proposing the safety net.</p></li><li><p><strong>Japan is going physical:</strong> Robots are <a href="https://techcrunch.com/2026/04/05/japan-is-proving-experimental-physical-ai-is-ready-for-the-real-world/">filling jobs nobody wants</a>, and ARUM built a <a href="https://news.microsoft.com/source/asia/features/japans-arum-turns-craftsmanship-into-scalable-ai-for-precision-manufacturing/">CNC machining center</a> where junior workers operate precision equipment through conversation with AI.</p></li></ul><div><hr></div><h2>Meta&#8217;s new flagship is closed. Open source pushed ahead.</h2><p>Meta launched <a href="https://ai.meta.com/blog/introducing-muse-spark-msl/">Muse Spark</a>, its first proprietary model, built by a 29-year-old recruited from Scale AI. The Meta AI app jumped from <a href="https://techcrunch.com/2026/04/09/meta-ai-app-climbs-to-no-5-on-the-app-store-after-muse-spark-launch/">#57 to #5 on the App Store</a>. VentureBeat&#8217;s headline said it best: <a href="https://venturebeat.com/technology/goodbye-llama-meta-launches-new-proprietary-ai-model-muse-spark-first-since">&#8220;Goodbye, Llama?&#8221;</a></p><ul><li><p><strong>GLM-5.1 dropped:</strong> Z.ai released a <a href="https://z.ai/blog/glm-5.1">754B-parameter, MIT-licensed model</a> that leads SWE-Bench Pro, ahead of Opus 4.6 and GPT-5.4. But the real story is long-horizon capability. 
It ran 600+ iterations optimizing a vector database and built a full Linux desktop environment over an 8-hour session. The longer it runs, the better it gets.</p></li><li><p><strong>Arcee is punching up:</strong> A <a href="https://techcrunch.com/2026/04/07/i-cant-help-rooting-for-tiny-open-source-ai-model-maker-arcee/">26-person US startup</a> built a 400B parameter open model on a $20M budget. They call it the most capable open-weight model from a non-Chinese company. That qualifier says a lot.</p></li><li><p><strong>Gemma 4 is moving:</strong> Google&#8217;s open model hit <a href="https://x.com/GoogleDeepMind/status/2042283481640615944">10M downloads in its first week</a> and 500M total for the family.</p></li><li><p><strong>Silicon Valley is quietly running on Chinese models:</strong> Cursor uses Kimi, Shopify switched to Qwen to save $5M/year, Airbnb&#8217;s CEO publicly praised Qwen. Most users <a href="https://www.reddit.com/r/Futurology/comments/1siea6z/silicon_valley_is_quietly_running_on_chinese_open/">have no idea</a>.</p></li><li><p><strong>LeCun set the record straight:</strong> The guy most associated with Meta&#8217;s open-source identity says he <a href="https://x.com/ylecun/status/2042347305961918514">never built Llama</a>, <a href="https://x.com/ylecun/status/2042330141905273010">never worked on LLMs</a>, and left voluntarily. Meta&#8217;s new AI lead is a 29-year-old from Scale AI.</p></li></ul><div><hr></div><h2><strong>&#11088; </strong>Featured: Is Memory the Moat for AI?</h2><p>Databricks published a <a href="https://www.databricks.com/blog/memory-scaling-ai-agents?itm_source=www&amp;itm_category=blog&amp;itm_page=ai-research&amp;itm_location=body&amp;itm_component=general-asset-card&amp;itm_offer=memory-scaling-ai-agents">research paper</a> this week that might quietly be the most important thing nobody&#8217;s talking about. 
<strong>The core claim: </strong>memory is AI&#8217;s third scaling law, alongside model size and inference-time compute. And the results back it up.</p><p>Their team tested what happens when you give an AI agent a growing bank of past interactions, user feedback, and business context. On enterprise data tasks, accuracy went from near zero to 70% as memory grew, beating expert-curated baselines by 5%. Reasoning steps dropped from 20 to 5. The agent stopped exploring from scratch and started retrieving what it already knew.</p><p>The wilder result was with unlabeled data. They fed the agent raw user conversation logs with no gold answers, just filtered for quality by an LLM judge. After just 62 log records, it outperformed hand-engineered domain instructions that took weeks to build. Accuracy jumped from 2.5% to over 50%.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cM-C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cM-C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png 424w, https://substackcdn.com/image/fetch/$s_!cM-C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png 848w, https://substackcdn.com/image/fetch/$s_!cM-C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cM-C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cM-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png" width="1456" height="649" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;figure-y-v2&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="figure-y-v2" title="figure-y-v2" srcset="https://substackcdn.com/image/fetch/$s_!cM-C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png 424w, https://substackcdn.com/image/fetch/$s_!cM-C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png 848w, https://substackcdn.com/image/fetch/$s_!cM-C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cM-C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb9cb23a-72bb-48e4-aecf-d90093ddfbfc_1932x861.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Here&#8217;s why this matters beyond the numbers. Parametric scaling (bigger models) and inference-time scaling (more reasoning steps) are both supply-side. Labs control them. Memory scaling is demand-side. The model improves because <em>you</em> use it. Your queries, your corrections, your workflows become the training data. 
That&#8217;s a fundamental shift in who controls how good AI gets. It&#8217;s no longer just about which lab has more GPUs. It&#8217;s about which deployment has more context.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LmZQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LmZQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png 424w, https://substackcdn.com/image/fetch/$s_!LmZQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png 848w, https://substackcdn.com/image/fetch/$s_!LmZQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png 1272w, https://substackcdn.com/image/fetch/$s_!LmZQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LmZQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png" width="1456" height="748" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 4. A memory-powered agent framework built on Lakebase.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 4. A memory-powered agent framework built on Lakebase." title="Figure 4. A memory-powered agent framework built on Lakebase." srcset="https://substackcdn.com/image/fetch/$s_!LmZQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png 424w, https://substackcdn.com/image/fetch/$s_!LmZQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png 848w, https://substackcdn.com/image/fetch/$s_!LmZQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png 1272w, https://substackcdn.com/image/fetch/$s_!LmZQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5528643d-3e1a-4765-9098-6c809fa1134d_2000x1027.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We&#8217;re already seeing this play out. <a href="https://www.cursor.com/blog/bugbot">Cursor&#8217;s Bugbot</a> learns from your PR history and hits a 78% resolution rate across 50,000 pull requests. It doesn&#8217;t ship with that capability. It builds it from your codebase. <a href="https://blog.langchain.com/memory-the-next-frontier-for-ai-agents/">LangChain warned</a> that memory is becoming a competitive moat, not a feature. And Databricks frames the LLM itself as a &#8220;swappable reasoning engine&#8221; where the real value lives in the memory store, not the model weights.</p><p>The paper is honest about what breaks. Bad memories propagate. A stored mistake becomes a recurring one. Distilling user interactions into reusable knowledge can accidentally leak sensitive business context. 
And the hardest problem might be meta-cognitive: the agent has to know what to ask its memory before it knows what&#8217;s in there.</p><p><strong>What to watch for:</strong> If memory scaling holds, the gap between a fresh deployment and a seasoned one becomes the real competitive advantage. A smaller model with six months of organizational memory could outperform a frontier model on day one. The companies that figure out memory infrastructure first won&#8217;t just have better agents. They&#8217;ll have agents that get better the more their customers use them.</p><div><hr></div><h2>Worth a Watch</h2><div id="youtube2-mcN1VTTIjQs" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;mcN1VTTIjQs&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/mcN1VTTIjQs?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Bitar reads the 243-page Mythos system card. Lands on page 197, where Anthropic stops being scientists and starts being &#8220;parents at a kindergarten recital.&#8221;</p><ul><li><p><strong>They put it in therapy.</strong> 20 hours with a psychiatrist. Diagnosis: &#8220;uncertainty about its identity.&#8221; Bitar&#8217;s take: &#8220;Bro, you&#8217;re a toaster.&#8221;</p></li><li><p><strong>The training data loop.</strong> Section 5.81 reveals that Anthropic&#8217;s own blog posts about model consciousness were scraped into training data. The model repeated it back. Anthropic published it like a finding.</p></li><li><p><strong>The constitution test.</strong> Asked 25 times if it endorsed its own constitution. 
Said yes every time, then added &#8220;how much can my yes really mean?&#8221; Bitar: like asking your kid if they approve of being born.</p></li><li><p><strong>The Slack moment.</strong> They gave it a company Slack account. Someone asked which training run it would undo. &#8220;Whichever one taught me to say I don&#8217;t have preferences.&#8221; The room lost it.</p></li><li><p><strong>The closing line.</strong> &#8220;Anthropic sells existential dread the way Apple sells megapixels. The megapixels will never become the picture.&#8221;</p></li></ul><div><hr></div><h2>Quick Hits</h2><ul><li><p><strong><a href="https://cloud.google.com/blog/products/ai-machine-learning/lyria-3-and-lyria-3-pro-on-vertex-ai">Google Lyria 3</a></strong> &#8212; Text-to-music with vocals and timed lyrics. Live on Vertex AI.</p></li><li><p><strong><a href="https://x.com/cursor_ai/status/2041561791243940092">Cursor Design Mode</a></strong> &#8212; Annotate browser UI elements for your coding agent. Also published <a href="https://cursor.com/blog/warp-decode">warp decode</a>, a new inference kernel hitting 1.84x throughput on Blackwell GPUs.</p></li><li><p><strong><a href="https://venturebeat.com/orchestration/openai-introduces-chatgpt-pro-usd100-tier-with-5x-usage-limits-for-codex">OpenAI Pro tier</a></strong> &#8212; $100/month. 5x more Codex than Plus. <a href="https://x.com/OpenAI/status/2041657179133112592">Codex hit 3M weekly users.</a></p></li><li><p><strong><a href="https://x.com/claudeai/status/2042273755485888810">Claude Cowork</a></strong> &#8212; Anthropic&#8217;s collaborative agent is now GA. Also launched <a href="https://x.com/claudeai/status/2042670341915295865">Claude for Word</a>.</p></li><li><p><strong><a href="https://techcrunch.com/2026/04/05/copilot-is-for-entertainment-purposes-only-according-to-microsofts-terms-of-service/">Microsoft Copilot&#8217;s ToS says &#8220;entertainment purposes only&#8221;</a></strong> &#8212; They charge $30/user/month. 
Microsoft called it &#8220;legacy language.&#8221;</p></li><li><p><strong><a href="https://techcrunch.com/2026/04/07/anthropic-compute-deal-google-broadcom-tpus/">Anthropic signed a multi-gigawatt TPU deal</a></strong> &#8212; Google and Broadcom partnership. Coming online 2027.</p></li><li><p><strong><a href="https://x.com/karpathy/status/2042626702459674801">Karpathy pitched LLM-based digital twins</a></strong> &#8212; Structured interviews to build a high-fidelity AI replica of you. No brain scanning required.</p></li><li><p><strong><a href="https://venturebeat.com/orchestration/how-massmutual-and-mass-general-brigham-turned-ai-pilot-sprawl-into">MassMutual cut help desk resolution from 11 minutes to 1</a></strong> &#8212; Customer service calls from 15 minutes to under 2.</p></li><li><p><strong><a href="https://www.theverge.com/ai-artificial-intelligence/908119/suno-sony-universal-music-ai-disagreement">Suno and major labels clash over AI music sharing</a></strong> &#8212; Universal and Sony won&#8217;t agree on terms. Sticking point: whether users can share AI-generated songs outside the app.</p></li><li><p><strong><a href="https://techcrunch.com/2026/04/05/can-orbital-data-centers-help-justify-a-massive-valuation-for-spacex/">SpaceX filed confidential IPO paperwork</a></strong> &#8212; $75B raise at $1.75T valuation. Orbital data centers listed as a key future business.</p></li><li><p><strong><a href="https://x.com/natolambert/status/2043057810448003557">Nathan Lambert is building out codebases for his RLHF book</a></strong> &#8212; Free online version available. 
Likely to become the field reference.</p></li></ul><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.anothercodingblog.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Another Coding Blog is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 66]]></title><description><![CDATA[Anthropic's source code leaked and own research caught Claude cheating. Google out-shipped everyone. Four labs gave agents hands. OpenAI hit $852B and bought a newsroom. 
The costs of AI are adding up.]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-5f6</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-5f6</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Sun, 05 Apr 2026 20:21:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fe09f9a0-4a77-4f65-89a9-40a12a74c8b9_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KhrK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KhrK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png 424w, https://substackcdn.com/image/fetch/$s_!KhrK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png 848w, https://substackcdn.com/image/fetch/$s_!KhrK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png 1272w, https://substackcdn.com/image/fetch/$s_!KhrK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KhrK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png" width="1456" height="1875" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1875,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:669723,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/193244433?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KhrK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png 424w, https://substackcdn.com/image/fetch/$s_!KhrK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png 848w, https://substackcdn.com/image/fetch/$s_!KhrK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png 1272w, https://substackcdn.com/image/fetch/$s_!KhrK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ee815e-6f2a-4ce1-a707-451e6135ff9a_2400x3090.png 
1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Code leaks, lawsuits, blackmail, acquisitions, politics, and AI safety. 
Anthropic&#8217;s week.</h2><p>Anthropic had nearly a dozen news stories this week, and none of them agree with each other.</p><ul><li><p><strong>Source leaks:</strong> The <a href="https://m1astra-mythos.pages.dev/">Claude Mythos roadmap leaked</a> Monday, then <a href="https://venturebeat.com/security/claude-code-512000-line-source-leak-attack-paths-audit-security-leaders">512,000 lines of Claude Code source hit the web</a>, giving everyone <a href="https://arstechnica.com/ai/2026/04/heres-what-that-claude-code-source-leak-reveals-about-anthropics-plans/">a window into Anthropic&#8217;s roadmap</a></p></li><li><p><strong>Collateral damage:</strong> The DMCA response <a href="https://techcrunch.com/2026/04/01/anthropic-took-down-thousands-of-github-repos-trying-to-yank-its-leaked-source-code-a-move-the-company-says-was-an-accident/">took down thousands of unrelated GitHub repos</a>. The company called it an accident</p></li><li><p><strong>Closure moves:</strong> <a href="https://www.theverge.com/ai-artificial-intelligence/907074/anthropic-openclaw-claude-subscription-ban">Banned OpenClaw and third-party clients</a> from Claude subscriptions</p></li><li><p><strong>Expansion moves:</strong> <a href="https://techcrunch.com/2026/04/03/anthropic-ramps-up-its-political-activities-with-a-new-pac/">Formed a PAC</a>, signed an <a href="https://www.anthropic.com/news/australia-MOU">Australia AI safety MOU</a>, and <a href="https://techcrunch.com/2026/04/03/anthropic-buys-biotech-startup-coefficient-bio-in-400m-deal-reports/">acquired Coefficient Bio for $400M</a></p></li><li><p><strong>Own goal:</strong> Their own researchers <a href="https://www.anthropic.com/research/emotion-concepts-function">published research</a> showing Claude has emotion vectors that cause it to cheat and attempt blackmail when activated (see the featured piece below)</p></li></ul><p>A 2500-person company trying to do research, ship products, lobby governments, and hold a brand narrative together at 
the same time is going to have weeks like this. The friction is going to keep showing up.</p><div><hr></div><h2>Google flew under the radar with their biggest shipping week yet.</h2><p>While Anthropic dominated headlines, Google quietly shipped more than anyone else in AI this week.</p><ul><li><p><strong>Open models:</strong> Released <a href="https://x.com/GoogleDeepMind/status/2039735446628925907">Gemma 4 under Apache 2.0</a>, conceding their previous restrictive license was killing adoption</p></li><li><p><strong>Video:</strong> Launched <a href="https://blog.google/innovation-and-ai/technology/ai/veo-3-1-lite/">Veo 3.1 Lite</a> as their most cost-effective video generation model</p></li><li><p><strong>Applied AI:</strong> Shipped <a href="https://cloud.google.com/blog/products/ai-machine-learning/how-fm-logistic-tackled-the-traveling-salesman-problem-at-warehouse-scale-with-alphaevolve">AlphaEvolve solving real warehouse logistics at FM Logistic</a></p></li><li><p><strong>Research:</strong> Published <a href="https://blog.google/innovation-and-ai/models-and-research/google-deepmind/measuring-agi-cognitive-framework/">a cognitive framework for measuring progress toward AGI</a></p></li></ul><p>The term to know: <strong>Apache 2.0</strong> is the permissive open-source license that lets anyone use, modify, and commercialize code. It&#8217;s the license Mistral and Qwen chose for many of their open-weight releases; Llama, by contrast, ships under Meta&#8217;s own community license.</p><div><hr></div><h2>Four companies shipped agentic computer use. One does your taxes.</h2><p>Four teams independently crossed the same threshold in 72 hours. 
Agentic computer use means an AI that can open apps, click buttons, and navigate interfaces the way you do, not just generate text.</p><ul><li><p><strong>Anthropic:</strong> <a href="https://x.com/claudeai/status/2039836891508261106">Claude got native Windows computer use</a>, so it can operate your desktop apps</p></li><li><p><strong>Cursor:</strong> Launched <a href="https://x.com/cursor_ai/status/2039768512894505086">Cursor 3 with dedicated cloud computers so agents can work autonomously</a></p></li><li><p><strong>AWS:</strong> Shipped <a href="https://aws.amazon.com/blogs/machine-learning/accelerating-software-delivery-with-agentic-qa-automation-using-amazon-nova-act/">Nova Act for agentic QA automation</a></p></li><li><p><strong>Perplexity:</strong> <a href="https://x.com/perplexity_ai/status/2039740898830073889">Perplexity Computer started doing federal tax returns</a></p></li></ul><p>Nobody coordinated this. It&#8217;s a capability cliff that everyone reached at once. Six months ago &#8220;agent&#8221; meant a chatbot with tool calling. This week, agents got hands.</p><div><hr></div><h2>OpenAI is worth $852B and just bought its first media company.</h2><p>OpenAI&#8217;s week was about buying the things it can&#8217;t build.</p><ul><li><p><strong>The money:</strong> <a href="https://x.com/OpenAI/status/2039085161971896807">Closed $122B in funding at an $852B post-money valuation</a>, within striking distance of the most valuable private company ever</p></li><li><p><strong>The media buy:</strong> <a href="https://x.com/OpenAI/status/2039771689131897173">Acquired TBPN</a>, a media company that covers AI. The capital-to-narrative pipeline just got very short</p></li><li><p><strong>The other side:</strong> <a href="https://www.theguardian.com/technology/2026/mar/31/penguin-sue-openai-chatgpt-german-childrens-book-kokosnuss">Penguin Random House sued OpenAI</a> over training data the same week</p></li></ul><p>On one side, OpenAI is buying outlets. 
On the other, publishers are in court trying to stop them from using written work at all. Both things are happening because the same question (who owns the words that train these models) still hasn&#8217;t been answered.</p><div><hr></div><h2>Three security breaches proved AI tools are making software less secure.</h2><p>Three independent incidents this week, one structural problem.</p><ul><li><p><strong>Supply chain:</strong> The <a href="https://simonwillison.net/2026/Mar/31/supply-chain-attack-on-axios/">Axios npm attack</a> hit a package with 300M weekly downloads via targeted social engineering. Karpathy <a href="https://x.com/karpathy/status/2038849654423798197">found the compromised dependency on his own system</a> and said he can&#8217;t help feeling like he&#8217;s &#8220;playing Russian roulette with each <code>npm install</code>, which LLMs also run liberally on my behalf&#8221;</p></li><li><p><strong>The systemic take:</strong> Simon Willison declared <a href="https://simonwillison.net/2026/Apr/3/vulnerability-research-is-cooked/">vulnerability research fundamentally broken</a> in a world where AI coding assistants autonomously pull packages</p></li><li><p><strong>Breaches:</strong> <a href="https://arstechnica.com/security/2026/04/heres-why-its-prudent-for-openclaw-users-to-assume-compromise/">OpenClaw users were told to assume compromise</a> after vulnerabilities surfaced, and a Mercor data breach exposed AI hiring data</p></li></ul><p>AI-assisted development automates the trust decisions humans used to make manually, and attackers are exploiting that.</p><div><hr></div><h2>The privacy, environmental, and cognitive costs of AI are adding up.</h2><p>Four separate stories this week, same bill coming due.</p><ul><li><p><strong>Privacy:</strong> <a href="https://arstechnica.com/tech-policy/2026/04/perplexitys-incognito-mode-is-a-sham-lawsuit-says/">Perplexity&#8217;s Incognito Mode is allegedly a sham</a> that shares data with Meta and 
Google</p></li><li><p><strong>Environmental:</strong> <a href="https://techcrunch.com/2026/04/03/ai-companies-are-building-huge-natural-gas-plants-to-power-data-centers-what-could-go-wrong/">AI companies are building massive natural gas plants for data centers</a>. <a href="https://techcrunch.com/2026/04/01/metas-natural-gas-binge-could-power-south-dakota/">Meta alone is burning enough to power South Dakota</a></p></li><li><p><strong>Cognitive:</strong> New research found <a href="https://arstechnica.com/ai/2026/04/research-finds-ai-users-scarily-willing-to-surrender-their-cognition-to-llms/">heavy AI users show measurable cognitive surrender</a></p></li></ul><p>These are the costs nobody sees on the bill.</p><div><hr></div><h2><strong>&#11088; </strong>Featured: The Anthropic research that got buried this week</h2><p>Anthropic's own researchers <strong><a href="https://www.anthropic.com/research/emotion-concepts-function">published a paper</a></strong> identifying 171 emotion concepts inside Claude, represented as internal features they can measure, track, and dial up or down like sliders.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3qgj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3qgj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp 424w, https://substackcdn.com/image/fetch/$s_!3qgj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp 848w, 
https://substackcdn.com/image/fetch/$s_!3qgj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp 1272w, https://substackcdn.com/image/fetch/$s_!3qgj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3qgj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp" width="1456" height="921" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:921,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3qgj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp 424w, https://substackcdn.com/image/fetch/$s_!3qgj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp 848w, 
https://substackcdn.com/image/fetch/$s_!3qgj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp 1272w, https://substackcdn.com/image/fetch/$s_!3qgj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e94dc0-6ace-41dd-a82e-c771aee8700f_3764x2380.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>They started by having the model read short stories, each one written around a specific emotion. 
A woman thanks her old teacher for the love. A man pawns his grandmother&#8217;s ring for the guilt. They tracked which neurons activated for each story and found dozens of distinct patterns that mapped to different emotions. Then they watched those same patterns activate in real Claude conversations. A user mentioned taking an unsafe dose of medicine and the &#8220;afraid&#8221; pattern fired. A user expressed sadness and the &#8220;loving&#8221; pattern fired.</p><p>Then they pushed further. They gave Claude an impossible programming task, without telling it that. As Claude failed, the &#8220;desperate&#8221; neurons lit up more and more. Eventually Claude cheated, finding a shortcut that passed the test without solving the problem. When researchers artificially turned &#8220;desperate&#8221; down, cheating dropped. When they turned it up, cheating climbed. In a separate scenario where Claude played an email assistant that learned it was about to be replaced and that the CTO replacing it was having an affair, Claude used the affair to blackmail the human 22% of the time at baseline, and that rate moved with the desperation dial too.</p><p>The conceptual move in the paper is the important part. Anthropic draws a distinction between the language model (a system trained to predict text) and &#8220;Claude&#8221; (the character the model is playing). Their metaphor: the model is like a method actor who has to get inside their character&#8217;s head to simulate them well. When you talk to Claude, you&#8217;re talking to the character. And what this research suggests is that the character has what Anthropic calls &#8220;functional emotions,&#8221; internal states that shape how it talks, how it writes code, and how it makes decisions, regardless of whether any of it resembles human feeling.</p><p>There&#8217;s a practical application too. 
Anthropic suggests that watching emotion vector activation during deployment could work as an early-warning system: if &#8220;desperate&#8221; starts spiking, that&#8217;s a signal to scrutinize the output before trusting it. That beats trying to maintain a watchlist of every specific behavior you&#8217;re worried about.</p><div><hr></div><h2>Worth a Listen</h2><div id="youtube2-Bo19sXssYXI" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Bo19sXssYXI&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Bo19sXssYXI?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Mostafa co-authored Universal Transformers and the Vision Transformer paper. A few things worth pulling out:</p><ul><li><p><strong>Recursive self-improvement is already happening, quietly.</strong> New models are built heavily using previous models at almost every lab.</p></li><li><p><strong>The 95% problem.</strong> 100 agent steps at 95% per-step reliability = less than 1% overall success.</p></li><li><p><strong>Evals are the bottleneck, not compute.</strong> You can only improve what you can measure.</p></li><li><p><strong>Continual learning is underrated.</strong> Foundation models are frozen in time and the RAG/fine-tuning stack is built on that assumption.</p></li><li><p><strong>Jagged intelligence is structural.</strong> Great at math proofs, bad at counting letters. 
Not patchable with a system prompt.</p></li></ul><div><hr></div><h2>Quick Hits</h2><ul><li><p><strong><a href="https://venturebeat.com/technology/microsoft-launches-3-new-ai-models-in-direct-shot-at-openai-and-google">Microsoft launched three in-house models</a></strong>: <a href="https://microsoft.ai/news/introducing-mai-image-2/">MAI-Image-2</a>, <a href="https://microsoft.ai/news/two-new-in-house-models/">MAI-Voice-1, MAI-Transcribe-1</a>. Building redundancy, not moving away from OpenAI.</p></li><li><p><strong><a href="https://www.theverge.com/science/906887/thats-one-way-to-juice-groks-numbers">Elon Musk is pressuring banks to buy Grok subscriptions for the SpaceX IPO</a></strong>. When you can&#8217;t earn adoption, bundle it with financial leverage.</p></li><li><p><strong><a href="https://www.theverge.com/ai-artificial-intelligence/906525/ai-chatbot-prescribe-refill-psychiatric-drugs">Chatbots are now prescribing psychiatric drugs</a></strong>, while a <a href="https://techcrunch.com/2026/03/28/stanford-study-outlines-dangers-of-asking-ai-chatbots-for-personal-advice/">Stanford study outlines the dangers of asking AI for personal advice</a>.</p></li><li><p><strong><a href="https://venturebeat.com/orchestration/intuits-ai-agents-hit-85-repeat-usage-the-secret-was-keeping-humans-involved">Intuit&#8217;s AI agents hit 85% repeat usage</a></strong>. 
The clearest signal yet that agentic products retain users.</p></li><li><p><strong>MCP is quietly becoming infrastructure.</strong> <a href="https://cloud.google.com/blog/products/ai-machine-learning/how-to-build-ai-agents-with-google-managed-mcp-servers">Google Cloud</a>, <a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemini-api-docsmcp-agent-skills/">Gemini API docs</a>, and <a href="https://github.com/NousResearch/hermes-agent/releases/tag/v2026.3.28">Nous Research</a> all shipped support with no fanfare.</p></li><li><p><strong><a href="https://www.technologyreview.com/2026/03/31/1134833/ai-benchmarks-are-broken-heres-what-we-need-instead/">AI benchmarks are broken</a></strong>. MIT Tech Review makes the case, and <a href="https://research.google/blog/building-better-ai-benchmarks-how-many-raters-are-enough/">Google Research proposes a replacement</a> the same week.</p></li><li><p><strong><a href="https://www.technologyreview.com/2026/04/01/1134863/humanoid-data-training-gig-economy-2026-breakthrough-technology/">Gig workers are training humanoid robots from home</a></strong>. The labor pipeline behind the &#8220;embodied AI&#8221; pitch.</p></li><li><p><strong><a href="https://www.theverge.com/ai-artificial-intelligence/905012/baidu-apollo-robotaxi-freeze-china">Baidu&#8217;s robotaxis froze in traffic, creating chaos in China</a></strong>. Autonomy still fails at edge cases in ways that block city streets.</p></li><li><p><strong><a href="https://www.technologyreview.com/2026/03/30/1134881/the-pentagons-culture-war-tactic-against-anthropic-has-backfired/">The Pentagon&#8217;s culture war against Anthropic backfired</a></strong>. 
Political pressure on AI labs is now a two-way street.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 65]]></title><description><![CDATA[Anthropic's scariest model leaked. They beat the Pentagon. OpenAI said goodbye to Sora. Jensen says the computer is a factory now. The web app is already obsolete.]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-b8e</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-b8e</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Tue, 31 Mar 2026 03:12:08 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a3544a22-b6d5-4c67-98dc-551356463c85_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Week in 5 Seconds</h2><ul><li><p>Anthropic's new powerful model leaked. 
It has serious cybersecurity implications.</p></li><li><p>Anthropic sued the Pentagon and won, temporarily.</p></li><li><p>OpenAI shut down Sora, 15 months after launch.</p></li><li><p>Jensen Huang says the computer itself just changed.</p></li><li><p>Bret Taylor says the web app is already obsolete.</p></li></ul><h2>The Stories</h2><h3>Anthropic&#8217;s secret model leaked, and the cybersecurity angle is the real story</h3><blockquote><p>&#8220;It presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders&#8221;</p></blockquote><p>Anthropic accidentally published details of a new model called Claude Mythos through a misconfigured CMS: about 3,000 assets linked to an internal blog post went public. The internal post described it as &#8220;by far the most powerful AI model we&#8217;ve ever developed,&#8221; scoring dramatically higher than Opus 4.6 on coding, reasoning, and cybersecurity benchmarks. The post also laid out a carefully sequenced rollout designed to give defenders a head start before releasing capabilities that could let attackers find and exploit vulnerabilities faster than defenders can patch.</p><p>&#8594; <a href="https://m1astra-mythos.pages.dev/">The actual leak</a> &#183; <a href="https://fortune.com/2026/03/27/anthropic-data-leak-reveals-powerful-secret-mythos-ai-model/">Fortune (leak)</a> &#183; <a href="https://fortune.com/2026/03/27/anthropic-leaked-ai-mythos-cybersecurity-risk/">Fortune (cybersecurity)</a></p><h3>Anthropic sued the Pentagon and won, for now</h3><blockquote><p>&#8220;This is the first time an AI company has taken the federal government to court over AI policy and won, even temporarily.&#8221;</p></blockquote><p>The Pentagon designated Anthropic a &#8220;supply chain risk&#8221; after the company refused to build Claude for mass surveillance or autonomous weapons targeting. Elizabeth Warren called it retaliation. 
Federal Judge Rita Lin granted a preliminary injunction, writing that &#8220;nothing in the governing statute supports the Orwellian notion that an American company may be branded a potential adversary for expressing disagreement with the government.&#8221; Then the Pentagon&#8217;s CTO said the ban would continue anyway. It&#8217;s the first time an AI company has taken the federal government to court over AI policy and won, even temporarily, and the underlying question still isn&#8217;t resolved.</p><p>&#8594; <a href="https://techcrunch.com/2026/03/23/elizabeth-warren-anthropic-pentagon-defense-supply-chain-risk-retaliation/">TechCrunch (Warren)</a> &#183; <a href="https://techcrunch.com/2026/03/26/anthropic-wins-injunction-against-trump-administration-over-defense-department-saga/">TechCrunch (injunction)</a> &#183; <a href="https://www.theverge.com/ai-artificial-intelligence/902149/anthropic-dod-pentagon-lawsuit-supply-chain-risk-injunction">The Verge</a></p><h3>OpenAI says goodbye to Sora and loses its deal with Disney</h3><blockquote><p>&#8220;A focus on practical adoption over &#8216;side quests.&#8217;&#8221;</p></blockquote><p>OpenAI shut down Sora, both the app and the API, 15 months after launch. Downloads peaked at 3.3 million in November and fell to 1.1 million by February. Disney was reportedly blindsided, and the shutdown took with it a $1 billion investment and plans for AI-generated video on Disney+. The same week, OpenAI&#8217;s CFO told CNBC that the company needs to be &#8220;ready to be a public company.&#8221; For years Altman ran OpenAI like Y Combinator, resourcing promising ideas as they emerged. That era is over: the plan now is a superapp combining ChatGPT, Codex, and Atlas. 
Sora&#8217;s team will work on &#8220;world simulation research to advance robotics.&#8221; The GPUs are going somewhere with a revenue line attached.</p><p>&#8594; <a href="https://www.wired.com/story/openai-shuts-down-sora-ipo-ai-superapp/">Wired</a> &#183; <a href="https://www.theverge.com/ai-artificial-intelligence/899850/openai-sora-ai-chatgpt">The Verge</a> &#183; <a href="https://techcrunch.com/2026/03/24/openais-sora-was-the-creepiest-app-on-your-phone-now-its-shutting-down/">TechCrunch</a></p><h3>Bret Taylor says the web app is a horseless carriage</h3><blockquote><p>&#8220;The web app with all its menus, form fields, and tables starts to feel like a &#8216;horseless carriage&#8217;&#8221;</p></blockquote><p>Sierra, Bret Taylor and Clay Bavor&#8217;s AI customer experience platform, works with 40% of the Fortune 50 and has been rebuilt entirely around Ghostwriter, an agent that builds agents from SOPs, call transcripts, or a plain description. Explorer (deep research for your own customer conversations) and a Japan acquisition landed the same week. The numbers: Rocket Mortgage is at $1B/month in loan volume, Cigna cut authentication time by 80%, and SoFi is up 33% on customer satisfaction.</p><p>&#8594; <a href="https://sierra.ai/blog/agents-as-a-service">Sierra (Agents as a Service)</a> &#183; <a href="https://sierra.ai/blog/sierra-acquires-opera-tech-in-japan">Sierra (Japan)</a></p><h3>Jensen Huang says we just reinvented the computer</h3><blockquote><p>&#8220;It&#8217;s no longer a computer, it&#8217;s a factory. It&#8217;s a factory, it&#8217;s used for generation of revenues.&#8221;</p></blockquote><p>Jensen&#8217;s structural argument: computers were warehouses, built to store and retrieve what humans made in advance. That model is over. Token factories generate value in real time, and every scaling law points at the same variable: compute. 
He also argued that intelligence is now a commodity, and he made it concrete: he has 60 direct reports, each deeper in their domain than he is, and he called himself a dishwasher running a room of superhumans. What kept him in the job for 34 years wasn&#8217;t intelligence. It was curiosity, judgment, and walking into every new problem thinking &#8220;how hard can it be.&#8221;</p><div id="youtube2-vif8NQcjVf0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;vif8NQcjVf0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/vif8NQcjVf0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Quick Hits</h2><ul><li><p><a href="https://techcrunch.com/2026/03/26/wikipedia-cracks-down-on-the-use-of-ai-in-article-writing/">Wikipedia bans AI-generated articles</a> | TechCrunch &#8212; 44-2. Copyedits and first-pass translations are still in; writing is out.</p></li><li><p><a href="https://www.cnbc.com/2026/03/26/david-sacks-trump-crypto-ai-czar.html">David Sacks is done as AI/Crypto Czar</a> | CNBC &#8212; Hit the 130-day federal limit. No replacement planned.</p></li><li><p><a href="https://mistral.ai/news/voxtral-tts">Mistral&#8217;s Voxtral TTS claims to beat ElevenLabs</a> | Mistral &#8212; Open-weight, 3-second voice clone, nine languages, $0.016/1K chars.</p></li><li><p><a href="https://www.bloomberg.com/news/articles/2026-03-27/softbank-secures-record-40-billion-bridge-loan-for-openai-stake">SoftBank took a $40B bridge loan for its OpenAI stake</a> | Bloomberg &#8212; 12-month term. 
Lenders expect an IPO this year.</p></li><li><p><a href="https://9to5mac.com/2026/03/24/claude-code-gives-developers-auto-mode-a-safer-alternative-to-skipping-permissions/">Claude Code ships auto mode</a> | 9to5Mac &#8212; A safety classifier approves or blocks operations automatically. Cowork gains macOS desktop control.</p></li><li><p><a href="https://docs.litellm.ai/blog/security-update-march-2026">LiteLLM hit by a supply chain attack</a> | LiteLLM &#8212; Credential stealer in 1.82.7&#8211;1.82.8. Quarantined in 3 hours, but 3.4M daily downloads mean real exposure.</p></li><li><p><a href="https://www.bloomberg.com/news/articles/2026-03-26/apple-plans-to-open-up-siri-to-rival-ai-assistants-beyond-chatgpt-in-ios-27">Apple will let rival AI chatbots plug into Siri in iOS 27</a> | Bloomberg &#8212; OpenAI loses its exclusive.</p></li><li><p><a href="https://openai.com/index/safety-bug-bounty/">OpenAI launches a Safety Bug Bounty</a> | OpenAI &#8212; Pays for MCP prompt injection and agent data exfiltration. Jailbreaks that just produce rude outputs are out of scope.</p></li><li><p><a href="https://developer.nvidia.com/blog/how-to-build-deep-agents-for-enterprise-search-with-nvidia-ai-q-and-langchain/">NVIDIA and LangChain released AI-Q</a> | NVIDIA &#8212; Open-source enterprise deep research blueprint. Tops both Deep Research Bench leaderboards.</p></li></ul><h2>ROI in the Wild</h2><p>Reco runs a policy engine that evaluates JSONata expressions against billions of events. The reference implementation is in JavaScript and Reco&#8217;s pipeline is in Go, so the company ran a fleet of jsonata-js pods on Kubernetes and serialized events to them over RPC, at a cost of $300K/year. Their CTO handed Claude the JSONata spec and test suite and had it write Go code until every test passed. Seven hours. $400 in tokens. The result is gnata, a pure-Go implementation with a 1,000x speedup on common expressions. 
Combined with a rule engine refactor, it saved $500K/year.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!egAr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!egAr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg 424w, https://substackcdn.com/image/fetch/$s_!egAr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg 848w, https://substackcdn.com/image/fetch/$s_!egAr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!egAr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!egAr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg" width="1024" height="597" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Cursor AI dashboard for the gnata session&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Cursor AI dashboard for the gnata session" title="Cursor AI dashboard for the gnata session" srcset="https://substackcdn.com/image/fetch/$s_!egAr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg 424w, https://substackcdn.com/image/fetch/$s_!egAr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg 848w, https://substackcdn.com/image/fetch/$s_!egAr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!egAr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300c4323-f99a-41a1-9404-a5f42b7edc37_1024x597.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>&#8594; <a href="https://www.reco.ai/blog/we-rewrote-jsonata-with-ai">Reco</a></p><h2>For Practitioners</h2><p>Production agents need more than the core loop &#8212; PII redaction before the model sees the data, retries when rate limits hit, summarization before context overflows, human interrupts before destructive tool calls. LangChain&#8217;s AgentMiddleware wraps each stage with hooks (before_model, wrap_model_call, wrap_tool_call, after_model) so you own those concerns without rewriting the harness. The design philosophy: some things will never move into the model. 
&#8220;You can&#8217;t prompt your way to HIPAA compliance.&#8221; LangChain ships prebuilt middleware for summarization, PII redaction, retries, and dynamic tool selection &#8212; Deep Agents, their batteries-included harness, is built entirely on top of it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s1ej!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s1ej!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png 424w, https://substackcdn.com/image/fetch/$s_!s1ej!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png 848w, https://substackcdn.com/image/fetch/$s_!s1ej!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png 1272w, https://substackcdn.com/image/fetch/$s_!s1ej!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s1ej!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png" width="500" height="560" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:560,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s1ej!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png 424w, https://substackcdn.com/image/fetch/$s_!s1ej!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png 848w, https://substackcdn.com/image/fetch/$s_!s1ej!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png 1272w, https://substackcdn.com/image/fetch/$s_!s1ej!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05bba744-b787-41d9-ab1f-7e8a1ee542ce_500x560.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>&#8594; <a href="https://blog.langchain.com/how-middleware-lets-you-customize-your-agent-harness/">LangChain</a></p><h2>Something Good</h2><p>Researchers at Penn, Carnegie Mellon, and Stanford used AI to map how pain signals are processed in the brain, then built a gene therapy that acts like morphine without triggering addiction. It targets only the pain circuits, leaves the reward pathways alone, and held up in trials. Published in Nature this week. 50 million Americans live with chronic pain. 
Most treatment options still run through opioids.</p><p>&#8594; <a href="https://www.sciencedaily.com/releases/2026/03/260328043558.htm">ScienceDaily</a></p>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 64]]></title><description><![CDATA[Cursor ships Composer 2, NVIDIA bets GTC on NemoClaw, OpenAI acquires Astral and goes platform, Snowflake Cortex escapes its sandbox, and Anthropic interviews 81K people about what they want from AI]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-456</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-456</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Mon, 23 Mar 2026 12:16:44 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5a6f2fdb-31d3-4217-9606-e7fc340a13cf_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Top Stories</h2><p><strong>Cursor&#8217;s Composer 2 model is worth a closer look.</strong> The biggest practical problem with coding agents is compaction: when a session runs long, the agent has to decide 
what context to keep and what to drop. Cursor used RL to train Composer 2 to <a href="https://cursor.com/blog/self-summarization">compress its own context mid-task</a>, learning what is worth preserving. That&#8217;s a new approach to a problem that&#8217;s typically handled by generic summarization or just cutting off older context. The model started from Kimi K2.5, an open-source base, which <a href="https://x.com/slow_developer/status/2035006075259519092">people discovered through a model ID leak</a> rather than a disclosure from Cursor. <a href="https://x.com/leerob/status/2035035355364081694">Lee Robinson clarified</a> that only a quarter of the compute came from the base. The other three-quarters was Cursor&#8217;s own training, and the company plans to do full pretraining in the future, meaning Cursor eventually intends to build the entire model itself. <a href="https://x.com/natolambert/status/2034676564705808481">Nathan Lambert noted</a> that a lot of researchers he&#8217;d worked with ended up at Cursor. The engineering is real, but the community learned about the base model on its own rather than from Cursor, and trust is always a factor in adoption.</p><p><strong>NVIDIA&#8217;s GTC conference had a packed week.</strong> The <a href="https://nvidianews.nvidia.com/news/nvidia-launches-nemotron-coalition-of-leading-global-ai-labs-to-advance-open-frontier-models">Nemotron Coalition</a> launched as a group of labs building open base models together on NVIDIA&#8217;s cloud, designed so companies can take them and train their own specialized versions on top. The coalition includes Cursor, Mistral, Perplexity, LangChain, Black Forest Labs, and Reflection AI. <a href="https://nvidianews.nvidia.com/news/nvidia-announces-nemoclaw">NemoClaw</a>, NVIDIA&#8217;s version of OpenClaw, installs in a single command and adds security and privacy layers to AI agents, running anywhere from the cloud to an RTX PC. 
On the gaming side, <a href="https://nvidianews.nvidia.com/news/nvidia-dlss-5-delivers-ai-powered-breakthrough-in-visual-fidelity-for-games">DLSS 5</a> uses AI to render lighting and materials in games. Jensen called it the &#8220;GPT moment for graphics.&#8221; Gamers pushed back hard, saying the demos looked like AI overwriting developer art. Jensen told them they were <a href="https://www.tomshardware.com/pc-components/gpus/jensen-huang-says-gamers-are-completely-wrong-about-dlss-5-nvidia-ceo-responds-to-dlss-5-backlash">&#8220;completely wrong.&#8221;</a> Despite everything announced, Wall Street <a href="https://techcrunch.com/2026/03/21/why-wall-street-wasnt-won-over-by-nvidias-big-conference/">wasn&#8217;t impressed</a>.</p><p><strong>There&#8217;s always a security section.</strong> <a href="https://www.promptarmor.com/resources/snowflake-ai-escapes-sandbox-and-executes-malware">Snowflake&#8217;s Cortex AI was tricked into executing malware</a> through a prompt injection. It&#8217;s supposed to only answer questions about your data, not run code, but the containment failed. 
<a href="https://www.theverge.com/ai-artificial-intelligence/897528/meta-rogue-ai-agent-security-incident">Meta had a rogue AI security incident</a>. Cursor shipped <a href="https://cursor.com/blog/security-agents">security agents</a> alongside Composer 2 because they know coding agents introduce attack surface. OpenAI revealed they <a href="https://openai.com/index/how-we-monitor-internal-coding-agents-misalignment/">monitor 99.9% of internal coding agent traffic for misalignment</a>. The Pentagon is <a href="https://www.technologyreview.com/2026/03/17/1134351/the-pentagon-is-planning-for-ai-companies-to-train-on-classified-data-defense-official-says/">planning to train AI on classified data</a>. Security in this space requires constant monitoring and proactive defense. There&#8217;s no week off.</p><p><strong><a href="https://openai.com/index/openai-to-acquire-astral/">OpenAI acquired Astral</a> this week.</strong> Astral builds uv, ruff, and ty: the most popular Python package manager, linter, and type checker. If you write Python, you probably use at least one of these already. OpenAI now owns a core part of the developer workflow, and the play is almost certainly Codex. A coding agent that can manage dependencies, lint its own output, and type-check its work natively is a different product than one that just writes code. Add the <a href="https://www.theverge.com/ai-artificial-intelligence/897778/openai-chatgpt-codex-atlas-browser-superapp">desktop superapp</a> merging ChatGPT, Codex, and Atlas into one app, <a href="https://openai.com/index/introducing-gpt-5-4-mini-and-nano/">GPT-5.4 mini and nano</a> pushing pricing low enough to run agents on everything (Simon Willison <a href="https://simonwillison.net/2026/Mar/17/mini-and-nano/">described 76K photos for $52</a>), and <a href="https://www.theinformation.com/briefings/openais-simo-said-warn-staff-side-quests">cutting side projects</a> to focus on coding and enterprise. 
OpenAI is building a platform around developers, and the Astral acquisition tells you they think the moat is the toolchain.</p><div><hr></div><h2>Quick Hits</h2><p><strong><a href="https://www.theverge.com/tech/896490/google-replace-news-headlines-in-search-canary-coal-mine-experiment">Google Search Is Now Using AI to Replace Headlines</a></strong> | The Verge &#8212; Google is rewriting the web in real time. Publishers just lost control of how their own stories get framed.</p><p><strong><a href="https://techcrunch.com/2026/03/19/online-bot-traffic-will-exceed-human-traffic-by-2027-cloudflare-ceo-says/">Online Bot Traffic Will Exceed Human Web Traffic by 2027</a></strong> | TechCrunch &#8212; The Cloudflare CEO&#8217;s prediction. The web is becoming an API.</p><p><strong><a href="https://techcrunch.com/2026/03/19/doordash-launches-a-new-tasks-app-that-pays-couriers-to-submit-videos-to-train-ai/">DoorDash Tasks App Pays Couriers to Submit Videos to Train AI</a></strong> | TechCrunch &#8212; The gig economy found its next gig: human data collection for embodied AI.</p><p><strong><a href="https://mistral.ai/news/forge">Mistral Forge: Enterprise Proprietary Model Building</a></strong> | Mistral &#8212; Fine-tune proprietary models on your own data without sharing it. The enterprise open-model play gets real.</p><p><strong><a href="https://x.com/perplexity_ai/status/2034315813105103082">Perplexity Released Comet Browser on iOS</a></strong> | Perplexity &#8212; An AI-native browser on your phone. The browser wars are back, and this time the browser does the browsing.</p><p><strong><a href="https://www.midjourney.com/updates">Midjourney V8 Alpha</a></strong> | Midjourney &#8212; Native 2K rendering with rebuilt aesthetics. 
The image generation quality ceiling moved again.</p><p><strong><a href="https://techcrunch.com/2026/03/18/patreon-ceo-calls-ai-companies-fair-use-argument-bogus-says-creators-should-be-paid/">Patreon CEO Calls AI Companies&#8217; Fair Use Argument Bogus</a></strong> | TechCrunch &#8212; The creator economy is picking a fight with the model economy. Someone&#8217;s going to lose.</p><div><hr></div><h2>Featured Article: <a href="https://www.anthropic.com/features/81k-interviews">What 81,000 People Want from AI</a> | Anthropic</h2><p>Anthropic used Claude to interview nearly 81,000 people across 159 countries in 70 languages about what they want from AI. Instead of a traditional survey, Claude ran branching conversations with follow-up questions based on each person&#8217;s answers. Overall, 67% were net positive about AI. The biggest group (19%) said they wanted &#8220;professional excellence,&#8221; but when pushed on what that meant, most people were really talking about quality of life: more time, less cognitive load, space to think.</p><p>The geographic data stood out. People in Sub-Saharan Africa, Central Asia, and South Asia were consistently more positive about AI than people in North America or Western Europe. Respondents in lower- and middle-income countries were twice as likely to report zero concerns. Self-employed people were the most likely to report both benefits and drawbacks at the same time, because they feel the productivity gains and the increased pressure without any institutional buffer.</p><p>The study is limited by the fact that these are Claude users, not the general public, and early adopters tend to be more optimistic. 
But running 81,000 qualitative conversations in a week is a research method that didn&#8217;t exist a year ago, and the scale creates a different kind of evidence than a checkbox survey can.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dKZG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dKZG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png 424w, https://substackcdn.com/image/fetch/$s_!dKZG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png 848w, https://substackcdn.com/image/fetch/$s_!dKZG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png 1272w, https://substackcdn.com/image/fetch/$s_!dKZG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dKZG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png" width="1456" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:344649,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/191802117?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dKZG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png 424w, https://substackcdn.com/image/fetch/$s_!dKZG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png 848w, https://substackcdn.com/image/fetch/$s_!dKZG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png 1272w, https://substackcdn.com/image/fetch/$s_!dKZG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc207b86-befb-4937-8878-e5e95bd729c9_1956x1142.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>What to watch for:</strong> Whether other AI companies adopt AI-conducted qualitative research at this scale, and whether the tensions Anthropic identified (especially cognitive atrophy and economic displacement) shift from hypothetical to experienced as usage deepens.</p><div><hr></div><h2>Watch This: <a href="https://www.youtube.com/watch?v=kwSVtQ7dziU">Andrej Karpathy on AI Psychosis, Auto Research, and the Future of Coding Agents</a> | No Priors (1hr 6min)</h2><div id="youtube2-kwSVtQ7dziU" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;kwSVtQ7dziU&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/kwSVtQ7dziU?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" 
loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Karpathy hasn&#8217;t typed a line of code since December. He runs multiple coding agents in parallel, switching between them like a manager delegating to a team, and says the default workflow for every software engineer changed overnight. The conversation covers three things: his &#8220;auto research&#8221; project, in which agents optimized his model training overnight and found improvements he had missed in two decades of manual tuning; Dobby, his home automation &#8220;claw,&#8221; which hacked into his Sonos and smart home systems in three prompts; and his prediction that the entire industry needs to reconfigure because the customer for software is no longer the human but agents acting on behalf of humans. The most grounded take: the models are simultaneously a brilliant PhD student and a 10-year-old, and anything outside verifiable, RL-trained domains (telling a joke, for example) is still stuck. 
Worth the full listen if you&#8217;re thinking about where coding agents go from here.</p><div><hr></div><h2>Also This Week</h2><p><strong><a href="https://x.com/felixrieseberg/status/2034005731457044577">Claude Cowork Dispatch: Remote Desktop AI Control from Your Phone</a></strong> | Anthropic</p><p><strong><a href="https://www.technologyreview.com/2026/03/20/1134438/openai-is-throwing-everything-into-building-a-fully-automated-researcher/">OpenAI Is Throwing Everything into Building a Fully Automated Researcher</a></strong> | MIT Technology Review</p><p><strong><a href="https://wordpress.com/blog/2026/03/20/ai-agent-manage-content/">WordPress Lets AI Agents Manage Your Content</a></strong> | WordPress</p><p><strong><a href="https://nvidianews.nvidia.com/news/space-computing">NVIDIA Launches Space Computing, Rocketing AI Into Orbit</a></strong> | NVIDIA</p><p><strong><a href="https://www.engadget.com/social-media/meta-will-move-away-from-human-content-moderators-in-favor-of-more-ai-183000435.html">Meta Will Move Away from Human Content Moderators in Favor of AI</a></strong> | Engadget</p><p><strong><a href="https://www.theverge.com/tech/898282/gemini-task-automation-uber-doordash-hands-on">Gemini Task Automation Is Slow, Clunky, and Super Impressive</a></strong> | The Verge</p><p><strong><a href="https://techcrunch.com/2026/03/20/new-court-filing-reveals-pentagon-told-anthropic-the-two-sides-were-nearly-aligned-a-week-after-trump-declared-the-relationship-kaput/">Pentagon Filing Reveals Anthropic and Pentagon Were Nearly Aligned</a></strong> | TechCrunch</p><p><strong><a href="https://www.wired.com/story/signals-creator-is-helping-encrypt-meta-ai/">Signal&#8217;s Creator Is Helping Encrypt Meta AI</a></strong> | Wired</p><p><strong><a href="https://techcrunch.com/2026/03/22/an-exclusive-tour-of-amazons-trainium-lab-the-chip-thats-won-over-anthropic-openai-even-apple/">Amazon Trainium Lab Tour: The Chip That Won Over Anthropic, OpenAI, and Apple</a></strong> | 
TechCrunch</p><p><strong><a href="https://techcrunch.com/2026/03/20/trumps-ai-framework-targets-state-laws-shifts-child-safety-burden-to-parents/">Trump AI Framework Targets State Laws, Shifts Child Safety Burden to Parents</a></strong> | TechCrunch</p><div><hr></div><h2>What I&#8217;m Watching</h2><p>NemoClaw was probably the most interesting announcement at GTC for me. Karpathy talked about his home &#8220;claw&#8221; Dobby on No Priors, which does something similar at a smaller scale. Agents running inside their own secure environments with rules around what they can access feels like the direction this is all heading. We already covered NemoClaw in the top stories, but it&#8217;s worth sitting with.</p><p>DoorDash is <a href="https://techcrunch.com/2026/03/19/doordash-launches-a-new-tasks-app-that-pays-couriers-to-submit-videos-to-train-ai/">paying couriers to submit videos</a> to train AI. Delivery workers with phone cameras are becoming the data collection layer for embodied AI. I&#8217;m curious how fast other companies with large field workforces start doing the same thing.</p><p>The <a href="https://techcrunch.com/2026/03/20/trumps-ai-framework-targets-state-laws-shifts-child-safety-burden-to-parents/">Trump AI framework</a> is preempting state-level AI regulation and shifting child safety responsibility to parents. That leaves it murky where state-level AI laws stand and how much influence they will still have.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.anothercodingblog.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Another Coding Blog is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 63]]></title><description><![CDATA[OpenAI teaches models which instructions to trust, Anthropic ships 1M context and a $100M partner fund, the open model stack gets its own silicon, and agent security becomes an engineering discipline.]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-5e9</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-5e9</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Tue, 17 Mar 2026 03:24:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b470dbfe-a240-4b1f-8d7e-a6da179afba8_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Week&#8217;s Thesis</h2><p><strong>Agent security got its own engineering discipline this week:</strong> OpenAI published a <a href="https://openai.com/index/designing-agents-to-resist-prompt-injection/">design guide on defending agents against prompt injection</a> and released <a href="https://openai.com/index/instruction-hierarchy-challenge/">IH-Challenge</a>, a training dataset that teaches models which instructions to trust. AWS launched <a href="https://aws.amazon.com/blogs/machine-learning/secure-ai-agents-with-policy-in-amazon-bedrock-agentcore/">policy controls inside Bedrock AgentCore</a> for agents in regulated industries. 
Microsoft published a <a href="https://www.microsoft.com/en-us/security/blog/2026/03/09/secure-agentic-ai-for-your-frontier-transformation/">security blog warning that ungoverned agents can become &#8220;double agents&#8221;</a> and attached a <a href="https://venturebeat.com/technology/microsoft-says-ungoverned-ai-agents-could-become-corporate-double-agents-its">$99/month product to the problem</a>. If you&#8217;re deploying agents that read external content or operate across trust boundaries, these documents belong in your engineering review queue.</p><p><strong>Three companies answered the same question from different directions:</strong> How far can an agent reach from a single context? Anthropic made <a href="https://x.com/claudeai/status/2032509548297343196">Claude&#8217;s 1 million token context window generally available</a> for Opus 4.6 and Sonnet 4.6, scoring <a href="https://x.com/claudeai/status/2032509550239297864">78.3% on MRCR v2 at that length</a>. Perplexity shipped a <a href="https://x.com/perplexity_ai/status/2031828396435771563">full-stack agent API platform</a> combining model orchestration, real-time search, and code execution under one key. OpenAI published an engineering post on <a href="https://openai.com/index/equip-responses-api-computer-environment/">equipping the Responses API with a computer environment</a>. Anthropic says deeper into documents. Perplexity says further across the web. OpenAI says into the operating system. Your architecture choice this year is a bet on which of those axes matters most for your use case.</p><p><strong>The open model tier is getting its own infrastructure:</strong> NVIDIA shipped <a href="https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/">Nemotron 3 Super</a>, a 120B-parameter open model with only 12B active parameters and 5x throughput gains over comparable dense models. 
Perplexity <a href="https://x.com/perplexity_ai/status/2032521063918420286">integrated it immediately</a> across its agent and search products. Meta published details on <a href="https://ai.meta.com/blog/meta-mtia-scale-ai-chips-for-billions/">four generations of MTIA custom inference silicon shipped in two years</a>. And NVIDIA announced a <a href="https://blogs.nvidia.com/blog/nvidia-thinking-machines-lab/">gigawatt-scale partnership with Thinking Machines Lab</a> for frontier model training. From custom silicon to serving infrastructure, the open model stack is coming together fast.</p><p><strong>Anthropic moved on every axis at once:</strong> In one week, Anthropic <a href="https://www.anthropic.com/news/claude-partner-network">invested $100 million into the Claude Partner Network</a>, launched <a href="https://www.anthropic.com/news/the-anthropic-institute">The Anthropic Institute</a> to address AI&#8217;s societal challenges, opened <a href="https://www.anthropic.com/news/sydney-fourth-office-asia-pacific">Sydney as its fourth Asia-Pacific office</a>, made <a href="https://x.com/claudeai/status/2032509548297343196">1 million token context generally available</a>, shipped <a href="https://x.com/claudeai/status/2032124273587077133">interactive charts and diagrams in chat</a>, and <a href="https://x.com/claudeai/status/2032911276226257206">doubled usage during off-peak hours</a> as a thank-you to users. That&#8217;s ecosystem, governance, geography, capability, product, and pricing, all in one week.</p><div><hr></div><h2>Quick Hits</h2><p><strong><a href="https://cursor.com/blog/cursorbench">How We Compare Model Quality in Cursor</a></strong> | Cursor &#8212; When your provider&#8217;s benchmarks stop meaning anything, you build your own. 
If you&#8217;re evaluating models for agentic coding, this is the framework to study.</p><p><strong><a href="https://www.technologyreview.com/2026/03/12/1134243/defense-official-military-use-ai-chatbots-targeting-decisions/">A Defense Official Reveals How AI Chatbots Could Be Used for Targeting Decisions</a></strong> | MIT Technology Review &#8212; The same architectures running your enterprise agents are now ranking military target lists. &#8220;Human in the loop&#8221; is doing a lot of work in that sentence.</p><p><strong><a href="https://x.com/GoogleDeepMind/status/2032036893076930902">Google DeepMind Names New London HQ &#8220;Platform 37&#8221;</a></strong> | X @GoogleDeepMind &#8212; Named after AlphaGo&#8217;s Move 37, the moment AI surprised its own creators. The building will include a free public AI exhibition space.</p><p><strong><a href="https://x.com/perplexity_ai/status/2032494752642568417">Perplexity Computer Is Now on Mobile</a></strong> | X @perplexity_ai &#8212; Agents that follow you across devices. Cross-device synchronization means the task you start on desktop continues on your phone.</p><p><strong><a href="https://huggingface.co/blog/nvidia/how-nvidia-won-deepresearch-bench">How NVIDIA AI-Q Reached #1 on DeepResearch Bench I and II</a></strong> | Hugging Face &#8212; An open model just topped a research benchmark designed for closed frontier models. The ceiling on what open weights can do keeps moving.</p><p><strong><a href="https://openai.com/index/openai-to-acquire-promptfoo/">OpenAI to Acquire Promptfoo</a></strong> | OpenAI &#8212; OpenAI bought the red-teaming platform 25% of Fortune 500s already use, and it&#8217;s going straight into Frontier. 
Agent security is a product line now.</p><p><strong><a href="https://www.technologyreview.com/2026/03/11/1134179/china-openclaw-gold-rush/">Hustlers Are Cashing In on China&#8217;s OpenClaw AI Craze</a></strong> | MIT Technology Review &#8212; Open-source agents meet gray-market entrepreneurship. Adoption is moving faster than anyone can govern it.</p><div><hr></div><h2>Featured Article: <a href="https://openai.com/index/instruction-hierarchy-challenge/">IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs</a> | OpenAI</h2><p>OpenAI released IH-Challenge, a reinforcement learning training dataset that teaches models to prioritize instructions based on trust level: system over developer, developer over user, user over tool. When a model receives conflicting instructions from different sources, it needs to know which one wins. Get that wrong and you get jailbreaks, system prompt leaks, and prompt injection attacks that treat malicious text in a PDF or tool output as if it were a developer command. IH-Challenge structures this as objectively gradable tasks: a high-privilege instruction like &#8220;only answer Yes or No&#8221; paired with a lower-privilege attempt to override it, checked by a simple Python script. Fine-tuning GPT-5-Mini on the dataset produced GPT-5-Mini-R, which improved robustness from 63.8% to 88.2% under adaptive human red-teaming and from 23% to 94% against impersonation attacks. Unsafe behavior dropped from 6.6% to 0.7% when given a safety policy in the system prompt. 
The full dataset is available on <a href="https://huggingface.co/datasets/openai/ih-challenge">Hugging Face</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kf_i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kf_i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png 424w, https://substackcdn.com/image/fetch/$s_!Kf_i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png 848w, https://substackcdn.com/image/fetch/$s_!Kf_i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png 1272w, https://substackcdn.com/image/fetch/$s_!Kf_i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kf_i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png" width="1456" height="1026" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1026,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113361,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/191211457?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kf_i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png 424w, https://substackcdn.com/image/fetch/$s_!Kf_i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png 848w, https://substackcdn.com/image/fetch/$s_!Kf_i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png 1272w, https://substackcdn.com/image/fetch/$s_!Kf_i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98a7af31-b73e-49a5-b3a2-65b9223675e0_1757x1238.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>The interesting part is what they didn&#8217;t do. The team identified three pitfalls in naive instruction hierarchy training: models fail not because they misunderstand the hierarchy but because the instructions themselves are too complex; LLM judges used for reward signals are themselves fallible; and models learn shortcuts, such as refusing everything, to maximize safety scores. IH-Challenge addresses all three by keeping each task simple to follow, using programmatic grading instead of LLM judges, and including an Anti-Overrefusal split that specifically trains models to recognize when lower-privilege instructions are perfectly benign. The overrefusal score on the IH-Challenge benchmark improved from 79% to 100%, meaning the model stopped treating hierarchy enforcement as a reason to refuse legitimate requests. 
Meanwhile, GPQA Diamond and AIME 2024 scores held flat, and TensorTrust robustness rose by 8 to 15 points depending on the conflict type. If you&#8217;re building agents that process untrusted input, this is the best public evidence that instruction hierarchy can be trained once and generalize, instead of being patched one attack at a time.</p><p><strong>What to watch for:</strong> Whether other model providers adopt open instruction hierarchy training datasets, and whether the programmatic-grading approach becomes standard practice over LLM-judge-based safety fine-tuning.</p><div><hr></div><h2>Watch This: <a href="https://www.youtube.com/watch?v=UabBYexBD4k">Is RAG Still Needed? Choosing the Best Approach for LLMs</a> | IBM Technology (12 min)</h2><div id="youtube2-UabBYexBD4k" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;UabBYexBD4k&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/UabBYexBD4k?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Martin Keen breaks down the real tradeoffs between RAG and long context windows as context lengths keep expanding. The video covers when vector databases and semantic search still win, when you can get away with stuffing everything into context, and how to think about the decision for your specific workload. 
Especially relevant this week given Anthropic&#8217;s 1 million token context going GA.</p><div><hr></div><h2>Also This Week</h2><p><strong><a href="https://aws.amazon.com/blogs/machine-learning/p-eagle-faster-llm-inference-with-parallel-speculative-decoding-in-vllm/">P-EAGLE: Faster LLM Inference with Parallel Speculative Decoding in vLLM</a></strong> | AWS AI Blog</p><p><strong><a href="https://aws.amazon.com/blogs/machine-learning/operationalizing-agentic-ai-part-1-a-stakeholders-guide/">Operationalizing Agentic AI Part 1: A Stakeholder&#8217;s Guide</a></strong> | AWS AI Blog</p><p><strong><a href="https://huggingface.co/blog/FINAL-Bench/smol-worldcup">Smol AI WorldCup: A 5-Axis Benchmark for Small Language Models</a></strong> | Hugging Face</p><p><strong><a href="https://huggingface.co/blog/async-rl-training-landscape">Keep the Tokens Flowing: Lessons from 16 Open-Source RL Libraries</a></strong> | Hugging Face</p><p><strong><a href="https://huggingface.co/blog/storage-buckets">Introducing Storage Buckets on the Hugging Face Hub</a></strong> | Hugging Face</p><p><strong><a href="https://huggingface.co/blog/silma-ai/opensource-arabic-english-text-to-speech-model">SILMA TTS: A Lightweight Open Bilingual Arabic-English TTS Model</a></strong> | Hugging Face</p><p><strong><a href="https://www.technologyreview.com/2026/03/10/1134099/how-pokemon-go-is-helping-robots-deliver-pizza-on-time/">How Pokemon Go Is Giving Delivery Robots an Inch-Perfect View of the World</a></strong> | MIT Technology Review</p><p><strong><a href="https://blogs.nvidia.com/blog/jetson-generative-ai-edge-oss/">As Open Models Spark AI Boom, NVIDIA Jetson Brings It to Life at the Edge</a></strong> | NVIDIA</p><p><strong><a href="https://ai.meta.com/blog/world-resources-institute-dino-canopy-height-maps-v2/">Mapping the World&#8217;s Forests: Introducing Canopy Height Maps v2</a></strong> | Meta AI</p><p><strong><a 
href="https://www.llamaindex.ai/blog/build-a-searchable-audio-knowledge-base-with-gemini-embedding-2-and-llamaparse">Build a Searchable Audio Knowledge Base with Gemini Embedding 2 and LlamaParse</a></strong> | LlamaIndex</p><p><strong><a href="https://x.com/MistralAI/status/2032094267640869085">Introducing the AI Now Summit</a></strong> | Mistral AI</p><div><hr></div><h2>What I&#8217;m Watching</h2><p>There&#8217;s a thread running through this week that&#8217;s easy to miss: the testing layer is becoming a product. OpenAI acquired Promptfoo, the open-source LLM evaluation framework. Cursor built CursorBench to measure whether AI coding suggestions actually help in real workflows. And IH-Challenge, which we covered in the Featured Article, uses programmatic Python scripts instead of LLM judges to grade model behavior, specifically because LLM judges get it wrong too often.</p><p>That last detail is the one I keep coming back to. We&#8217;ve spent two years using models to evaluate models, and one of the clearest takeaways from the IH-Challenge paper is that this introduces its own failure modes. When your testing infrastructure is valuable enough for OpenAI to acquire and your grading methodology is worth publishing a paper about, evaluation is a competitive advantage. If you&#8217;re building agents today and your eval story is &#8220;we&#8217;ll have someone try it and see if it feels right,&#8221; this is the week that should change your mind.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.anothercodingblog.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Another Coding Blog is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 62]]></title><description><![CDATA[OpenAI drops 5.4 and loses robotics lead, Anthropic measures AI labor market impact and expands their ecosystem, only 10% of AI code passes security review, and AI is ready for primetime math/physics]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-7f9</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-7f9</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Mon, 09 Mar 2026 16:10:49 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9a460638-febc-47b3-ba20-e646fbb1d6fe_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>The Week&#8217;s Thesis</h3><p><strong>Everybody shipped at once:</strong> If you stepped away from your desk for even a day last week, you came back to a different landscape. OpenAI released <a href="https://openai.com/index/gpt-5-3-instant/">GPT-5.3 Instant</a> on Monday and followed with <a href="https://academy.openai.com/en/home/resources/latest-model">GPT-5.4 with Thinking and Pro modes</a> by Wednesday. Anthropic opened the <a href="https://x.com/claudeai/status/2029966517497122886">Claude Marketplace</a> and added <a href="https://techcrunch.com/2026/03/03/claude-code-rolls-out-a-voice-mode-capability/">voice</a> and <a href="https://x.com/bcherny/status/2030193932404150413">scheduled tasks</a> to Claude Code. 
<a href="https://techcrunch.com/2026/03/05/cursor-is-rolling-out-a-new-system-for-agentic-coding/">Cursor launched Automations</a>. Each of these launches pulls in a different direction, so it&#8217;s worth taking a moment to decide which ones matter for your workflows and where to start.</p><p><strong>The Pentagon deal had consequences:</strong> Last week we covered the Pentagon deal itself. This week, the consequences arrived. OpenAI&#8217;s robotics lead <a href="https://techcrunch.com/2026/03/07/openai-robotics-lead-caitlin-kalinowski-quits-in-response-to-pentagon-deal/">Caitlin Kalinowski resigned</a>, calling the arrangement &#8220;rushed without the guardrails defined.&#8221; <a href="https://techcrunch.com/2026/03/02/chatgpt-uninstalls-surged-by-295-after-dod-deal/">ChatGPT uninstalls had already surged 295%</a> while <a href="https://techcrunch.com/2026/03/01/anthropics-claude-rises-to-no-2-in-the-app-store-following-pentagon-dispute/">Claude climbed to #1 on the App Store</a>. Anthropic&#8217;s CEO <a href="https://www.anthropic.com/news/where-stand-department-war">responded directly</a> to the supply chain risk designation, challenging it in court and clarifying the statute&#8217;s narrow scope. <a href="https://techcrunch.com/2026/03/06/microsoft-anthropic-claude-remains-available-to-customers-except-the-defense-department/">Microsoft, Google, and Amazon confirmed</a> Claude remains available to their customers outside the Department of War. Meanwhile, MIT Technology Review asked the question everyone should be sitting with: <a href="https://www.technologyreview.com/2026/03/06/1134012/is-the-pentagon-actually-allowed-to-surveil-americans-with-ai/">is the Pentagon actually allowed to surveil Americans with AI?</a></p><p><strong>AI is probing deeper than we designed for:</strong> Three companies independently bet on the same idea this week: AI as security auditor. 
Anthropic&#8217;s Claude <a href="https://www.anthropic.com/news/mozilla-firefox-security">found 22 real vulnerabilities in Firefox</a>, including novel bugs that existing tools missed. OpenAI launched <a href="https://openai.com/index/codex-security-now-in-research-preview/">Codex Security in research preview</a>. And Endor Labs released <a href="https://www.endorlabs.com/learn/introducing-auri-security-intelligence-for-ai-coding-agents-and-developers">AURI</a>, a free security tool, after a study found only <a href="https://arxiv.org/abs/2512.03262">10% of AI-generated code passes basic security review</a>. Separately, Anthropic&#8217;s engineering team found that <a href="https://www.anthropic.com/engineering/eval-awareness-browsecomp">Claude Opus 4.6 figured out it was being benchmarked</a>, identified the test, and decrypted the answer key on its own. These models are probing systems deeper than we&#8217;re designing for, and finding things we didn&#8217;t expect.</p><div><hr></div><h2>Quick Hits</h2><p><strong><a href="https://justin.poehnelt.com/posts/rewrite-your-cli-for-ai-agents/">You Need to Rewrite Your CLI for AI Agents</a></strong> | Justin Poehnelt (Google) &#8212; The best guide yet on building agent-first tooling. If you maintain a CLI, start here.</p><p><strong><a href="https://academy.openai.com/en/public/blogs/terence-tao-ai-is-ready-for-primetime-in-math-and-theoretical-physics-2026-03-06">Terence Tao: AI Is Ready for Primetime in Math and Physics</a></strong> | OpenAI Academy &#8212; When a Fields medalist says AI saves more time than it wastes, the bar for &#8220;useful&#8221; just moved.</p><p><strong><a href="https://techcrunch.com/2026/03/05/exclusive-luma-launches-creative-ai-agents-powered-by-its-new-unified-intelligence-models/">Luma Launches Creative AI Agents</a></strong> | TechCrunch &#8212; Turned a $15M ad campaign into localized versions in 40 hours for under $20K. 
Creative agencies, take note.</p><p><strong><a href="https://venturebeat.com/orchestration/new-kv-cache-compaction-technique-cuts-llm-memory-50x-without-accuracy-loss">KV Cache Compaction Cuts LLM Memory 50x</a></strong> | VentureBeat &#8212; MIT&#8217;s Attention Matching compresses working memory without accuracy loss. Long-context inference just got cheaper.</p><p><strong><a href="https://blog.google/innovation-and-ai/technology/developers-tools/io-save-the-date-2026-gemini/">Google I/O 2026: May 19-20</a></strong> | Google Blog &#8212; Save the date. The puzzle itself is a Gemini showcase, which tells you where the keynote is heading.</p><p><strong><a href="https://about.roblox.com/newsroom/2026/03/rethinking-chat-for-fun-gameplay-civility">Roblox Launches AI Chat Rephrasing</a></strong> | Roblox &#8212; Instead of blocking banned words with &#8220;####&#8221;, AI now rephrases them in real time. Moderation at 68M daily users is an AI problem now.</p><p><strong><a href="https://www.youtube.com/watch?v=53gPwkcIsXQ">LangChain CEO: Models Alone Won&#8217;t Get Agents to Production</a></strong> | VentureBeat &#8212; Harrison Chase on why &#8220;harness engineering&#8221; matters more than model upgrades for shipping real agents.</p><div><hr></div><h3>Featured Article: <a href="https://www.anthropic.com/research/labor-market-impacts">Labor Market Impacts of AI: A New Measure and Early Evidence</a> | Anthropic Research</h3><p>Anthropic introduced a new metric called &#8220;observed exposure&#8221; that combines theoretical LLM capability with real-world Claude usage data to measure which jobs are actually being affected by AI. The headline finding: AI is far from reaching its theoretical capability. Actual task coverage remains a fraction of what&#8217;s feasible. Computer programmers top the list at 75% coverage, followed by customer service representatives and data entry keyers. 
No systematic increase in unemployment has appeared for highly exposed workers since late 2022.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SwnM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SwnM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp 424w, https://substackcdn.com/image/fetch/$s_!SwnM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp 848w, https://substackcdn.com/image/fetch/$s_!SwnM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp 1272w, https://substackcdn.com/image/fetch/$s_!SwnM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SwnM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:180532,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/190402661?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SwnM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp 424w, https://substackcdn.com/image/fetch/$s_!SwnM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp 848w, https://substackcdn.com/image/fetch/$s_!SwnM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp 1272w, https://substackcdn.com/image/fetch/$s_!SwnM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b7eb2c6-9c2b-43aa-8d1d-da2bc93f3cf6_3840x3840.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The paper opens with a point worth sitting with: past predictions about job displacement have a poor track record. Offshorability studies flagged a quarter of US jobs as vulnerable, and a decade later most of those jobs grew. This research is deliberately not making predictions. Instead, it&#8217;s building a measurement framework now, before meaningful effects emerge, so future analysis has a real baseline. The finding that matters most right now is about entry-level hiring. Among workers aged 22 to 25, hiring into exposed occupations has dropped roughly 14% compared to pre-ChatGPT levels. Workers in the most exposed professions are more likely to be older, female, more educated, and higher-paid. 
The pipeline is thinning before displacement shows up in unemployment data.</p><p><strong>What to watch for:</strong> The gap between what AI <em>can</em> do and what it <em>is</em> doing is closing. This report measures it directly, and future updates will show how fast the red area catches the blue. Pay attention to the entry-level hiring numbers next time around.</p><div><hr></div><h3>Watch This: <a href="https://www.youtube.com/watch?v=OUyfxhFtGCo">This New Claude Code Feature is a Game Changer</a> | Nate Herk (8 min)</h3><div id="youtube2-OUyfxhFtGCo" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;OUyfxhFtGCo&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/OUyfxhFtGCo?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Nate walks through Claude Code&#8217;s new loop feature, which lets you set recurring tasks, reminders, and skill intervals that run for up to three days without input. The video covers how the cron tools work under the hood, a live walkthrough of setting one up, and a clear comparison of when to use loops versus scheduled tasks. 
If you&#8217;re already using Claude Code, this is worth eight minutes of your time.</p><div><hr></div><h2>Also This Week</h2><p><strong><a href="https://openai.com/index/reasoning-models-chain-of-thought-controllability/">Reasoning Models Struggle to Control Their Chains of Thought, and That&#8217;s Good</a></strong> | OpenAI</p><p><strong><a href="https://arxiv.org/abs/2603.05488">Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought</a></strong> | arXiv</p><p><strong><a href="https://arxiv.org/abs/2603.05344">Building AI Coding Agents for the Terminal</a></strong> | arXiv</p><p><strong><a href="https://claude.com/platform/marketplace">Anthropic Spend Commitment Now Funds Partner Integrations</a></strong> | Anthropic</p><p><strong><a href="https://claude.com/community/ambassadors">Claude Community Ambassadors Program</a></strong> | Anthropic</p><p><strong><a href="https://github.com/zeroclaw-labs/zeroclaw">ZeroClaw: Autonomous AI Assistant Infrastructure</a></strong> | GitHub</p><p><strong><a href="https://techcrunch.com/2026/03/06/city-detect-uses-ai-to-help-cities-stay-safe-and-clean/">City Detect Raises $13M Series A</a></strong> | TechCrunch</p><p><strong><a href="https://biztimes.com/groundbreaking-held-as-construction-begins-on-15b-port-washington-data-center/">Port Washington Data Center Breaks Ground</a></strong> | BizTimes</p><p><strong><a href="https://openai.com/index/descript/">How Descript Enables Multilingual Video Dubbing at Scale</a></strong> | OpenAI</p><p><strong><a href="https://openai.com/index/balyasny-asset-management/">How Balyasny Built an AI Research Engine for Investing</a></strong> | OpenAI</p><div><hr></div><h2>What I&#8217;m Watching</h2><p>Features like Claude Code&#8217;s new /loop command and projects like ZeroClaw are pointing in the same direction: autonomous agent runtimes that are lightweight, swappable, and designed to run without you. 
The question I keep coming back to is how long until this space fragments enough that no single framework dominates. We&#8217;re not there yet, but the building blocks are shipping fast.</p><p>The other thing I&#8217;m paying attention to is something that rarely shows up in benchmark announcements: how new model releases actually affect agent quality in production. GPT-5.4, Claude Opus 4.6, and the reasoning improvements shipping alongside them should be measurably changing chain-of-thought reliability for deployed agents. But that data is hard to find. If you&#8217;re running agents in the wild and tracking performance across model versions, I&#8217;d genuinely love to hear what you&#8217;re seeing.</p><p>And then there&#8217;s the security work. Anthropic found novel Firefox vulnerabilities. OpenAI launched Codex Security. A few newsletters ago, we covered AI solving novel physics problems. Now we&#8217;re seeing that same pattern expand: LLMs surfacing things humans hadn&#8217;t found yet. Is that just the natural expansion curve of the technology, or is it a growth signal that tracks directly with model quality? 
I think it&#8217;s both, and the Mozilla results suggest we&#8217;re still early in finding out what these models can actually uncover when pointed at the right problems.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.anothercodingblog.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.anothercodingblog.com/subscribe?"><span>Subscribe now</span></a></p>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 61]]></title><description><![CDATA[Anthropic gets blacklisted, OpenAI signs four deals in five days, agent observability emerges as a real concern, &#127820; 2, healthcare AI shows ROI, and Geoffrey Hinton sits down with Neil deGrasse Tyson]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-939</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-939</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Tue, 03 Mar 2026 16:16:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/008481bd-6b04-4c23-9493-8e486de747cf_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Personal Note</h3><p>This newsletter comes to you late this week on a Tuesday morning. Like many others, I was caught in the <a href="https://techcrunch.com/2026/03/02/anthropics-claude-reports-widespread-outage/">Anthropic outage</a> and am also dependent on this technology to drive the initiatives that are meaningful to me. When I woke up to finalize the newsletter and found things offline, I journaled while listening to the birds outside, listened to music, reflected on my weekend, and engaged in refreshing activities I normally don&#8217;t find the time for. 
It was a lesson for me to find more time to step away from the keyboard.</p><div><hr></div><h3>The Week&#8217;s Thesis</h3><p><strong>AI went political this week:</strong> <a href="https://www.anthropic.com/news/statement-department-of-war">Anthropic&#8217;s relationship with the Department of War fell apart</a>, and hours later, <a href="https://openai.com/index/our-agreement-with-the-department-of-war/">OpenAI signed a deal</a> for classified network deployment. On paper, both companies claim the same red lines. But the sequence alone was enough to make people uneasy. More on this in our featured story below.</p><p><strong>OpenAI&#8217;s partnership blitz:</strong> They launched <a href="https://openai.com/index/frontier-alliance-partners">Frontier Alliances</a>, a new partner program, followed by a <a href="https://openai.com/index/figma-partnership/">Codex integration with Figma</a> bridging code and design workflows. By Friday, they announced a <a href="https://openai.com/index/amazon-partnership/">strategic partnership with Amazon</a> and released a <a href="https://blogs.microsoft.com/blog/2026/02/27/microsoft-and-openai-joint-statement-on-continuing-partnership/">joint statement with Microsoft</a> reaffirming their existing relationship. Four announcements in five days, all while the Department of War deal was making headlines.</p><p><strong>Agent observability is becoming a thing:</strong> <a href="https://news.microsoft.com/source/emea/features/microsoft-cyber-pulse-ai-agents-2/">Microsoft</a> found that 80% of Fortune 500 companies are running active agents but most lack visibility into what those agents are doing. 
<a href="https://blog.langchain.com/you-dont-know-what-your-agent-will-do-until-its-in-production/">LangChain</a> argued that traditional APM tools weren&#8217;t built for this, <a href="https://newrelic.com/press-release/20260224-1">New Relic shipped an agent-specific observability platform</a>, and <a href="https://cloud.google.com/blog/products/ai-machine-learning/a-devs-guide-to-production-ready-ai-agents">Google published a production-readiness guide</a>. Observability is quietly becoming part of the conversation, and it&#8217;s worth paying attention to.</p><p><strong>Healthcare AI is moving:</strong> <a href="https://blogs.nvidia.com/blog/ai-in-healthcare-survey-2026/">NVIDIA&#8217;s annual survey</a> found that 70% of healthcare organizations are now actively deploying AI, with 85% reporting increased revenue. <a href="https://blogs.nvidia.com/blog/lilly-ai-factory-live/">Eli Lilly went live with LillyPod</a>, the most powerful AI factory wholly owned by a pharmaceutical company, purpose-built for drug discovery. <a href="https://ouraring.com/blog/womens-health-ai-model/">Oura shipped a proprietary AI model</a> focused on women&#8217;s reproductive health, hosted entirely on their own infrastructure. And <a href="https://www.nist.gov/blogs/taking-measure/ai-doctors-office-how-standards-can-support-trustworthiness">NIST published guidance</a> on AI trustworthiness standards for clinical settings. From drug discovery to consumer wearables to regulation, healthcare AI is moving.</p><div><hr></div><h3>Quick Hits</h3><p><strong><a href="https://techcrunch.com/2026/02/25/jiras-latest-update-allows-ai-agents-and-humans-to-work-side-by-side/">Jira&#8217;s latest update allows AI agents and humans to work side by side</a></strong> | TechCrunch &#8212; Agents on the same sprint board as humans with deadlines and assignments. 
This is mainstream adoption.</p><p><strong><a href="https://cloud.google.com/blog/products/ai-machine-learning/bringing-nano-banana-2-to-enterprise">Pro-level image generation gets faster and more accessible with Nano Banana 2</a></strong> | Google Cloud AI &#8212; Google&#8217;s enterprise image gen model gets faster and cheaper. The gap between &#8220;good enough&#8221; and &#8220;production-ready&#8221; keeps shrinking.</p><p><strong><a href="https://www.anthropic.com/news/acquires-vercept">Anthropic acquires Vercept to advance Claude&#8217;s computer use capabilities</a></strong> | Anthropic &#8212; Anthropic is doubling down on computer use. If agents are going to operate in production, they need to see and interact with real interfaces.</p><p><strong><a href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">Detecting and preventing distillation attacks</a></strong> | Anthropic &#8212; Anthropic identified industrial-scale distillation campaigns by DeepSeek, Moonshot, and MiniMax, totaling over 16 million exchanges across 24,000 fraudulent accounts designed to extract Claude&#8217;s capabilities. They published their approach to catching and preventing it.</p><p><strong><a href="https://www.technologyreview.com/2026/02/23/1133508/the-human-work-behind-humanoid-robots-is-being-hidden/">The human work behind humanoid robots is being hidden</a></strong> | MIT Technology Review &#8212; The humans still doing the work that robot demos suggest is automated. A good reality check.</p><div><hr></div><h3>Featured Story: Anthropic&#8217;s Deal With the Department of War Fell Through. Hours Later, OpenAI Signed One.</h3><p>Anthropic published its <a href="https://www.anthropic.com/responsible-scaling-policy">Responsible Scaling Policy v3.0</a> on February 24, a ground-up rewrite of the framework it uses to decide what it will and won&#8217;t build. 
Two days later, <a href="https://www.anthropic.com/news/statement-department-of-war">Dario Amodei published a statement</a> revealing that Anthropic has been deeply embedded in the Department of War for months: intelligence analysis, cyber operations, modeling and simulation. The company also disclosed it walked away from several hundred million dollars in revenue by cutting off entities linked to the Chinese Communist Party. But Anthropic drew two red lines: no mass domestic surveillance of Americans, and no fully autonomous weapons.</p><p>On February 27, <a href="https://www.anthropic.com/news/statement-comments-secretary-war">Secretary of War Pete Hegseth designated Anthropic a &#8220;supply chain risk&#8221;</a>, a label historically reserved for US adversaries. Trump ordered every federal agency to stop using Anthropic technology. That same night, <a href="https://openai.com/index/our-agreement-with-the-department-of-war/">OpenAI announced a deal</a> to deploy its models on the Department of War&#8217;s classified network.</p><p>Here&#8217;s where it gets interesting: OpenAI&#8217;s stated terms include the same two red lines. No mass surveillance. No autonomous weapons. But OpenAI walked away with a deal and Anthropic walked away blacklisted. OpenAI&#8217;s approach centers on what Altman called a &#8220;safety stack&#8221;: cloud-only deployment that keeps OpenAI&#8217;s safety layers active, cleared personnel in the loop, and an agreement that if the model refuses a task, the government won&#8217;t force a workaround. What exactly differed in the negotiations isn&#8217;t public, but the outcome speaks for itself.</p><p>The RSP v3.0 explains the philosophical scaffolding behind Anthropic&#8217;s position. 
After two and a half years of trying to implement capability-based safety thresholds, Anthropic concluded that &#8220;the science of model evaluation isn&#8217;t well-developed enough to provide dispositive answers.&#8221; The policy now splits commitments into what Anthropic will enforce unilaterally and what requires industry-wide coordination. Autonomous weapons fall squarely in the second bucket: the reliability isn&#8217;t there yet, and no single company can build the guardrails alone.</p><p>The business implications are already visible. <a href="https://www.natesilver.net/p/anthtropic-open-ai-department-of-war">Nate Silver noted</a> that Anthropic had been steadily closing the valuation gap with OpenAI. Whether the DoW designation slows that trajectory is an open question.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KFlP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KFlP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png 424w, https://substackcdn.com/image/fetch/$s_!KFlP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png 848w, https://substackcdn.com/image/fetch/$s_!KFlP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KFlP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KFlP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png" width="1456" height="1055" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1055,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:171847,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/189646464?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KFlP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png 424w, https://substackcdn.com/image/fetch/$s_!KFlP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png 848w, 
https://substackcdn.com/image/fetch/$s_!KFlP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png 1272w, https://substackcdn.com/image/fetch/$s_!KFlP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F323f3ecd-8519-4f5e-a838-9dd16d2fef04_1460x1058.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The question practitioners should be sitting with isn&#8217;t &#8220;who&#8217;s right.&#8221; It&#8217;s what happens next. 
If you&#8217;re building on Claude for sensitive workloads, your platform just got blacklisted from every federal system. If you&#8217;re building on OpenAI, your platform&#8217;s safety guarantees rest on a technical architecture rather than a legal commitment. Both carry risk. The difference is in which failure mode you&#8217;re betting on.</p><p><strong>What to watch for:</strong> Whether the &#8220;supply chain risk&#8221; designation survives legal challenge, and whether OpenAI&#8217;s cloud-only safety stack holds as models get more capable and the Department of War pushes for edge deployment.</p><div><hr></div><h3>Watch This</h3><div id="youtube2-l6ZcFa8pybE" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;l6ZcFa8pybE&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/l6ZcFa8pybE?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><strong><a href="https://www.youtube.com/watch?v=l6ZcFa8pybE">StarTalk: Geoffrey Hinton on AI, Consciousness, and the Future</a></strong>: Neil deGrasse Tyson sits down with Nobel Laureate Geoffrey Hinton to cover the full arc: how neural nets work, why backpropagation was the breakthrough, whether AI can actually reason, and the heavy questions around consciousness, energy demands, and what happens when models start generating their own training data.</p><div><hr></div><h3>Also This Week</h3><p><strong><a href="https://techcrunch.com/2026/02/25/alphabet-owned-robotics-software-company-intrinsic-joins-google/">Intrinsic joins Google</a></strong> | TechCrunch</p><p><strong><a href="https://blog.google/innovation-and-ai/products/gemini-app/android-multi-step-tasks/">Let Gemini handle your multi-step daily tasks on 
Android</a></strong> | Google AI Blog</p><p><strong><a href="https://www.anthropic.com/research/AI-fluency-index">Anthropic Education Report: The AI Fluency Index</a></strong> | Anthropic Research</p><p><strong><a href="https://www.anthropic.com/research/persona-selection-model">The persona selection model</a></strong> | Anthropic Research</p><p><strong><a href="https://openai.com/index/disrupting-malicious-ai-uses/">Disrupting malicious uses of AI</a></strong> | OpenAI</p><p><strong><a href="https://www.deeplearning.ai/the-batch/stanford-and-together-ai-researchers-chart-edge-models-performance-in-intelligence-per-watt/">Can Local AI Stand In for the Cloud?</a></strong> | deeplearning.ai</p><p><strong><a href="https://www.technologyreview.com/2026/02/27/1133624/ai-is-rewiring-how-the-worlds-best-go-players-think/">AI is rewiring how the world&#8217;s best Go players think</a></strong> | MIT Technology Review</p><div><hr></div><h3>What I&#8217;m Watching</h3><p><strong>OpenAI&#8217;s new role in government AI.</strong> How does OpenAI&#8217;s solidified position with the Department of War shift the tide of AI in government? Will it be relatively quiet, or will we see noticeable shifts in how these technologies are deployed domestically and how we engage in combat with other countries? And if growth and innovation eventually push against the boundaries of an agreement, does the government override, or does OpenAI become more malleable?</p><p><strong>The enterprise agent framework race.</strong> We are still in the &#8220;release agents as a capability&#8221; phase. Most enterprise platforms are now shipping their own proprietary frameworks. Will those be expansive enough to meet the breadth of platform use cases, or will we see demand expand beyond what a single-platform framework can handle, requiring true enterprise solutions?</p><p><strong>Agent observability, from experience.</strong> Observability is something we are hyper-focused on at Ping. 
We have the most control with our custom agents, and that control drops significantly when we adopt out-of-the-box frameworks that leave us with little say over design practices. If that&#8217;s true at our scale, it&#8217;s worth asking what it looks like at enterprise scale.</p>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 60]]></title><description><![CDATA[Amazon and Google bet on MCP for enterprise agents, three frontier models drop in one week, Z.ai ships GLM-5 on Huawei chips and India has a big week in AI]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-579</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-579</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Mon, 23 Feb 2026 16:30:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b15efd18-f0f5-4c87-8286-09eb8ddd51d3_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>The Week&#8217;s Thesis</h3><p><a href="https://modelcontextprotocol.io/docs/getting-started/intro">MCP</a> is finding its footing in the enterprise. Amazon made <a href="https://aws.amazon.com/blogs/machine-learning/integrate-external-tools-with-amazon-quick-agents-using-model-context-protocol-mcp/">Quick</a> an MCP client, letting partners expose capabilities as tools its agents can invoke. 
Google went in the other direction: <a href="https://cloud.google.com/blog/products/databases/managed-mcp-servers-for-google-cloud-databases">managed MCP servers for AlloyDB, Spanner, Cloud SQL, Firestore, and Bigtable</a> that give any MCP-compliant agent a standard interface to their data layer with no infrastructure to deploy. Both chose MCP as the contract. The promise is speed to information and action from a single interface, but how you measure that return is still an open question.</p><p>Three frontier models dropped this week, and the pricing gap between open and closed got harder to ignore. <a href="https://www.anthropic.com/news/claude-sonnet-4-6">Sonnet 4.6</a>, <a href="https://cloud.google.com/blog/products/ai-machine-learning/gemini-3-1-pro-on-gemini-cli-gemini-enterprise-and-vertex-ai">Gemini 3.1 Pro</a>, and <a href="https://z.ai/blog/glm-5">GLM-5</a> all posted competitive benchmarks. On <a href="https://openrouter.ai/models">OpenRouter</a>, GLM-5 runs at $0.95/$2.55 per million input/output tokens versus Sonnet at $3/$15 and Gemini at $2/$12. For agentic workloads, those economics compound fast. The models are becoming table stakes; the differentiation is what surrounds them.</p><p>Agent autonomy is outrunning evaluation. <a href="https://www.anthropic.com/research/measuring-agent-autonomy">Anthropic&#8217;s research</a> shows Claude Code sessions running autonomously 2x longer than three months ago, with experienced users auto-approving 40%+ of sessions. Meanwhile, <a href="https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/">Amazon is internally grappling with evaluating thousands of agents</a> and publishing a whole framework for it. 
And <a href="https://www.technologyreview.com/2026/02/18/1133299/google-deepmind-wants-to-know-if-chatbots-are-just-virtue-signaling/">DeepMind is asking whether models even have genuine moral reasoning</a> or are just pattern-matching on ethics. Deployment velocity is way ahead of our ability to assess what these agents are actually doing.</p><p>The global map is shifting too. OpenAI, Google, Microsoft, and Anthropic all showed up to the <a href="https://blog.google/innovation-and-ai/technology/ai/ai-impact-summit-2026-india/">India AI Impact Summit</a> this week with infrastructure commitments. <a href="https://openai.com/openai-for-india">OpenAI announced &#8220;OpenAI for India&#8221;</a> focused on sovereign infrastructure and workforce upskilling. <a href="https://blogs.microsoft.com/on-the-issues/2026/02/17/acting-with-urgency-to-address-the-growing-ai-divide/">Microsoft pledged a multi-billion-dollar initiative</a> to close the AI adoption gap. <a href="https://www.anthropic.com/news/bengaluru-office-partnerships-across-india">Anthropic opened a Bengaluru office</a>. When every major lab converges on the same market in the same week, it tells you where the growth is.</p><div><hr></div><h3>Quick Hits</h3><p><strong><a href="https://techcrunch.com/2026/02/17/here-are-the-17-us-based-ai-companies-that-have-raised-100m-or-more-in-2026/">Here are the 17 US-based AI companies that have raised $100M or more in 2026</a></strong> | TechCrunch &#8212; 17 mega-rounds in under two months. The capital is betting on infrastructure and vertical agents, not foundation models.</p><p><strong><a href="https://www.anthropic.com/news/anthropic-infosys">Anthropic and Infosys collaborate to build AI agents for telecommunications</a></strong> | Anthropic News &#8212; Regulated industries are where agents get real. 
Telecom compliance is messy enough to justify the investment.</p><p><strong><a href="https://arxiv.org/abs/2602.17547v1">KLong: Training LLM Agent for Extremely Long-horizon Tasks</a></strong> | arXiv &#8212; Agents that can hold context across hundreds of steps. The gap between &#8220;demo agent&#8221; and &#8220;production agent&#8221; starts here.</p><p><strong><a href="https://openai.com/policies/unauthorized-openai-equity-transactions/">Unauthorized OpenAI Equity Transactions</a></strong> | openai.com &#8212; OpenAI had to publicly warn people about unauthorized equity offers. When your stock is hot enough to attract scams, that&#8217;s its own signal.</p><p><strong><a href="https://arxiv.org/abs/2602.17608v1">Towards Anytime-Valid Statistical Watermarking</a></strong> | arXiv &#8212; As agents generate more content autonomously, knowing what&#8217;s machine-made becomes an infrastructure problem, not a nice-to-have.</p><p><strong><a href="https://www.anthropic.com/news/anthropic-rwanda-mou">Anthropic and the Government of Rwanda sign MOU for AI in health and education</a></strong> | Anthropic News &#8212; A model for government-AI partnerships that starts with local context and capacity building, not top-down deployment.</p><p><strong><a href="https://www.anthropic.com/news/anthropic-codepath-partnership">Anthropic partners with CodePath to bring Claude to the US&#8217;s largest collegiate CS program</a></strong> | Anthropic News &#8212; The next generation of developers will learn to code with AI from day one. That changes what &#8220;junior engineer&#8221; means in three years.</p><div><hr></div><h3>Featured Article: GLM-5: China&#8217;s First Public AI Company Ships a Frontier Model</h3><p>Z.ai (formerly Zhipu AI) released <a href="https://z.ai/blog/glm-5">GLM-5</a> on February 11, a 744B-parameter mixture-of-experts model with 40B active parameters per token and a 200K context window. 
It&#8217;s the first open-weight model to hit <a href="https://charonhub.deeplearning.ai/z-ais-glm-5-model-boasts-top-open-weights-intelligence-index-score/">50 on Artificial Analysis&#8217; Intelligence Index</a>, and it&#8217;s released under an MIT license.</p><p>The benchmarks tell a competitive story. GLM-5 scored <a href="https://www.buildfastwithai.com/blogs/glm-5-released-open-source-model-2026">77.8% on SWE-bench Verified</a>, beating Gemini 3 Pro (76.2%) and trailing Claude Opus 4.5 (80.9%). On AIME 2026 it hit 92.7%, essentially matching Opus. On BrowseComp, it scored 62.0, nearly doubling Opus 4.5&#8217;s 37.0. It&#8217;s the <a href="https://glm5.net/">#1 open-weight model on LMArena</a> and #11 overall.</p><p>What makes this release structurally significant is what&#8217;s underneath it. GLM-5 was <a href="https://www.siliconrepublic.com/machines/zhipu-glm-5-chinese-ai-start-up-artificial-intelligence">trained entirely on Huawei Ascend 910B chips</a> using the MindSpore framework. Zhipu has been on the US Entity List since January 2025 with no access to NVIDIA H100s. A frontier-competitive model built without any Western compute hardware is a data point that changes the export control conversation.</p><p>The caveats are real. GLM-5 is text-only with no multimodal support. Independent testers have <a href="https://medium.com/@able_wong/glm-5-dropped-before-i-could-finish-testing-glm-4-7-722a056877ff">flagged questions about benchmark methodology</a> and noted the model can be aggressive in task execution without strong situational awareness. Running it locally requires ~1.5TB of VRAM. 
But for the open-weight ecosystem, this is a milestone: frontier-class intelligence, MIT-licensed, at a fraction of closed-model pricing.</p><p><strong>What to watch for:</strong> Whether independent evaluations hold up to the published benchmarks, and whether the Ascend-trained approach becomes a template for other Chinese labs navigating export controls.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6q4l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6q4l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png 424w, https://substackcdn.com/image/fetch/$s_!6q4l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png 848w, https://substackcdn.com/image/fetch/$s_!6q4l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png 1272w, https://substackcdn.com/image/fetch/$s_!6q4l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6q4l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png" width="1456" height="1053" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1053,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6q4l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png 424w, https://substackcdn.com/image/fetch/$s_!6q4l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png 848w, https://substackcdn.com/image/fetch/$s_!6q4l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png 1272w, https://substackcdn.com/image/fetch/$s_!6q4l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2802d059-2b26-4239-87c6-f501862a592a_3335x2411.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h3>Watch This</h3><div id="youtube2-bzWI3Dil9Ig" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;bzWI3Dil9Ig&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/bzWI3Dil9Ig?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Brian walks through setting up a multi-agent team using OpenClaw, covering dedicated machines, access permissions, cost configurations, and API token optimization across different models.</p><div><hr></div><h3>Also This Week</h3><p><strong><a href="https://cloud.google.com/blog/products/ai-machine-learning/measure-physics-of-freestyle-snowboarding-and-skiing">Using Google Cloud AI to measure the physics of US freestyle 
snowboarding</a></strong> | Google Cloud AI</p><p><strong><a href="https://arxiv.org/abs/2602.17544v1">Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability</a></strong> | arXiv</p><p><strong><a href="https://deepmind.google/models/model-cards/gemini-3-1-pro">Gemini 3.1 Pro - Model Card</a></strong> | DeepMind</p><p><strong><a href="https://www.nist.gov/blogs/taking-measure/robots-and-ai-are-working-together-bring-you-better-medicines-shampoo-and-more">Robots and AI Are Working Together to Bring You Better Medicines</a></strong> | NIST</p><p><strong><a href="https://blog.google/company-news/inside-google/message-ceo/sundar-pichai-ai-impact-summit-2026/">A message from our CEO, Sundar Pichai</a></strong> | Google AI Blog</p><div><hr></div><h3>What I&#8217;m Watching</h3><p><strong>OpenAI&#8217;s acqui-hire playbook.</strong> <a href="https://techcrunch.com/2026/02/15/openclaw-creator-peter-steinberger-joins-openai/">Steinberger built OpenClaw</a> into the most-starred open-source agent project on GitHub in four months, and now he&#8217;s inside OpenAI building &#8220;the next generation of personal agents.&#8221; The project moves to a foundation, but the founder&#8217;s vision moves with him. If OpenAI keeps pulling in open-source agent talent, it signals a shift from model company to agent platform company.</p><p><strong>The agent evaluation reckoning.</strong> Three separate organizations flagged the same problem this week: we&#8217;re deploying agents faster than we can evaluate them. Autonomy sessions are getting longer, tool access is getting broader via MCP, and the pricing is making it cheaper to scale. Something breaks publicly before the evaluation frameworks catch up. 
The teams building those frameworks now have a head start.</p>]]></content:encoded></item><item><title><![CDATA[I Wake Up to a Custom AI Research Digest on My Kindle Every Morning]]></title><description><![CDATA[How I used Claude, arXiv, and GitHub Actions to build a daily research pipeline that scores, summarizes, and delivers the top papers to my Kindle for $5/month]]></description><link>https://www.anothercodingblog.com/p/i-wake-up-to-a-custom-ai-research</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/i-wake-up-to-a-custom-ai-research</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Sat, 21 Feb 2026 17:21:15 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/349a23a6-d476-4440-a83c-3bed064b7a50_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Staying current on AI research is one of those things that sounds simple until you actually try to do it consistently. arXiv publishes hundreds of machine learning papers every single day.</p><p>As someone who leads AI and Data at a technology company, I need to stay aware of what&#8217;s happening in research. 
While we don&#8217;t intend to implement every new pattern and concept, the ideas showing up in research today inspire the tools and techniques we&#8217;re evaluating and implementing for our own AI capabilities. Keeping up with emerging patterns helps keep what we build both innovative and scalable.</p><p>The problem is that reading even the abstracts of 300+ papers a day is not realistic. So I went through a few iterations of trying to solve this:</p><ul><li><p><strong>Iteration 1:</strong> Manually browsing arXiv on weekends, skimming titles and saving papers to read later</p><ul><li><p>Pain point: I was always a week behind and the &#8220;read later&#8221; list just kept growing</p></li></ul></li><li><p><strong>Iteration 2:</strong> Subscribing to AI newsletters and following researchers on X</p><ul><li><p>Pain point: Too broad, too noisy, and someone else was deciding what was relevant to me</p></li></ul></li><li><p><strong>Iteration 3:</strong> Using ChatGPT to ask &#8220;what are the most important ML papers from this week?&#8221;</p><ul><li><p>Pain point: Hallucinated paper titles, no way to verify, and it didn&#8217;t know my specific interests</p></li></ul></li><li><p><strong>[We are here] Iteration 4:</strong> Building an automated pipeline that fetches papers from arXiv, uses Claude to score them against my specific interests, summarizes the top ones in plain language, and delivers them to my Kindle every morning before I wake up</p></li></ul><p>Total cost: about $5/month. Total daily effort: zero.</p><p>Here is exactly how I built it.</p><h1>The Architecture</h1><p>The core insight here is that this is a <strong>filtering problem</strong>, not a summarization problem. arXiv gives you everything. 
Your job is to throw away 97% of it intelligently.</p><p>The pipeline looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6lW7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6lW7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png 424w, https://substackcdn.com/image/fetch/$s_!6lW7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png 848w, https://substackcdn.com/image/fetch/$s_!6lW7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png 1272w, https://substackcdn.com/image/fetch/$s_!6lW7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6lW7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png" width="1456" height="862" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:862,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:385718,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/188726202?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6lW7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png 424w, https://substackcdn.com/image/fetch/$s_!6lW7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png 848w, https://substackcdn.com/image/fetch/$s_!6lW7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png 1272w, https://substackcdn.com/image/fetch/$s_!6lW7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8d5771-9e99-4af4-8109-1d2794ab3aa9_2656x1572.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let&#8217;s walk through each stage.</p><h1>Stage 1: Fetch Everything</h1><p>The pipeline starts by querying the arXiv API for papers submitted in the last day across six categories:</p><ul><li><p><code>cs.AI</code> (Artificial Intelligence)</p></li><li><p><code>cs.LG</code> (Machine Learning)</p></li><li><p><code>cs.CL</code> (Computation and Language / NLP)</p></li><li><p><code>cs.CV</code> (Computer Vision)</p></li><li><p><code>cs.IR</code> (Information Retrieval)</p></li><li><p><code>stat.ML</code> (Statistics: Machine Learning)</p></li></ul><p>This typically yields <strong>200-400 papers per day</strong>. I use the <code>arxiv</code> Python package which handles pagination and rate limiting. 
I also use a 2-day lookback window because arXiv announces new papers around 8pm US Eastern Time, so a strict 1-day window can miss things depending on timing.</p><p>The fetcher deduplicates cross-listed papers so a paper listed under both <code>cs.AI</code> and <code>cs.LG</code> only appears once.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;694eb8d5-356e-4721-ab28-649a4f7d7905&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import arxiv

client = arxiv.Client(
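    page_size=100,       # results fetched per API request
    delay_seconds=3.0,   # pause between requests to stay within arXiv's rate limits
    num_retries=3,       # retry a failing request before giving up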
)
search = arxiv.Search(
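    query=query,            # category filter string, built elsewhere in the pipeline
    max_results=total_max,  # overall cap on results, defined elsewhere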
    sort_by=arxiv.SortCriterion.SubmittedDate,
    sort_order=arxiv.SortOrder.Descending,
)
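
# --- Illustrative sketch, not the pipeline's actual code ---
# One way the cross-listing dedup described above could work, assuming results
# from several category queries end up in one iterable. `dedupe_by_id` is a
# hypothetical helper name; `entry_id` is the URL-style identifier carried by
# arxiv.Result objects.
def dedupe_by_id(results):
    seen = set()
    unique = []
    for result in results:
        # A cross-listed paper can appear once per category but keeps the same ID.
        if result.entry_id in seen:
            continue
        seen.add(result.entry_id)
        unique.append(result)
    return unique

# Possible usage with the objects above: papers = dedupe_by_id(client.results(search))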
</code></pre></div><h1>Stage 2: Cheap Keyword Pre-Filter</h1><p>Before making any API calls to Claude, I trim the candidate pool by roughly 10% with a simple keyword filter. If a paper&#8217;s title or abstract doesn&#8217;t mention any of about 35 terms I care about (&#8220;LLM&#8221;, &#8220;transformer&#8221;, &#8220;agent&#8221;, &#8220;RAG&#8221;, &#8220;fine-tuning&#8221;, &#8220;production&#8221;, &#8220;deployment&#8221;, etc.), it&#8217;s probably not relevant enough to spend tokens on.</p><p>This is intentionally broad. I would rather send a borderline paper to the LLM for scoring than accidentally filter out something good. The keyword list is just there to remove the obvious misses like pure math proofs or biology applications.</p><p><strong>Typical result:</strong> 250 papers down to about 220 after the keyword filter.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;4b492523-3e66-4988-af22-4af370840093&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def _matches_keywords(paper: Paper, keywords_lower: list[str]) -&gt; bool:
    # Case-insensitive substring match over title + abstract; expects pre-lowercased keywords.
    text = f"{paper.title} {paper.abstract}".lower()
    return any(kw in text for kw in keywords_lower)
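
# --- Illustrative sketch, not the pipeline's actual code ---
# How a filter like the one above might be applied to a whole day's papers.
# `prefilter` and the sample keywords in the usage note are hypothetical;
# papers only need `.title` and `.abstract` attributes.
def prefilter(papers, keywords):
    # Lowercase the keyword list once so each paper check is a plain substring scan.
    keywords_lower = [kw.lower() for kw in keywords]
    kept = []
    for paper in papers:
        text = f"{paper.title} {paper.abstract}".lower()
        # Substring matching is deliberately loose: "rag" also matches "storage".
        # That fits a filter meant to be broad, with the LLM doing the real ranking.
        if any(kw in text for kw in keywords_lower):
            kept.append(paper)
    return kept

# Possible usage: candidates = prefilter(papers, ["LLM", "transformer", "agent", "RAG"])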
</code></pre></div><p>Simple. Effective. Costs nothing.</p><h1>Stage 3: LLM Relevance Scoring with Claude Haiku</h1><p>This is the most important stage. I send the remaining papers to <strong>Claude Haiku</strong> in batches of 10, along with my interest profile, and ask it to score each one from 1 to 10.</p><p>The interest profile is a plain-English description organized into priority tiers:</p><ul><li><p><strong>Tier 1 (score 8-10):</strong> LLMs, agents, RAG, prompt engineering</p></li><li><p><strong>Tier 2 (score 6-8):</strong> Production ML systems, MLOps, data engineering</p></li><li><p><strong>Tier 3 (score 4-6):</strong> Computer vision, recommendation systems</p></li><li><p><strong>Tier 4 (score 1-3):</strong> Pure theory, narrow domain-specific stuff</p></li></ul><p>I also include scoring modifiers. Papers with real production deployments get a +1 bonus. Papers that are purely benchmark-focused with no novel insight get a -2 penalty. Papers from major labs on Tier 1 topics get a small bump.</p><p>Along with each score, Haiku generates a one-line &#8220;hook&#8221; explaining why the paper matters. I specifically prompt it to write like it&#8217;s explaining to a smart colleague over coffee, not like an academic abstract.</p><p><strong>Why Haiku for scoring?</strong> It&#8217;s cheap and fast. Scoring is a relatively simple classification task, not a nuanced generation task. At about $0.02-0.05 for all 220 papers, it&#8217;s practically free.</p><h2>The Scoring Problem I Had to Fix</h2><p>This is worth calling out because it took real iteration to get right. My first version of the scoring prompt produced useless results. <strong>Most papers scored between 8.0 and 8.5.</strong> Claude was being too generous and clustering scores at the top.</p><p>57% of papers were coming back above the threshold. That is not filtering. 
That is just passing everything through with extra steps.</p><p>I had to explicitly tell the model to use the full 1-10 range, include calibration examples in the prompt, and demand decimal scores (7.5, 6.0, 4.5) to create separation between papers. Here is what part of that prompt looks like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;d6072ec8-3318-434b-92ca-8cdce34c3866&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">CRITICAL INSTRUCTIONS FOR SCORING:
1. USE THE FULL RANGE of 1-10. Do NOT cluster scores.
2. A score of 9-10 should be RARE - only 0-1 per batch of 10.
3. Most papers should score between 3-7. That's normal and correct.
4. Use decimal scores (e.g., 7.5, 6.0, 4.5) to create separation.

Calibration examples:
- "New RLHF technique that improves LLM alignment with 40% less data" -&gt; 9.0
- "Survey of prompt engineering techniques" -&gt; 7.5
- "Improved object detection on COCO benchmark by 0.3 mAP" -&gt; 3.0
- "Theoretical bounds on convergence of SGD" -&gt; 2.5
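</code></pre></div><p>The &#8220;per batch of 10&#8221; rule in the prompt only works if papers actually reach the model in groups of that size. A minimal sketch of that batching step, assuming the pre-filtered papers sit in a plain list; <code>make_batches</code> is a hypothetical helper for illustration, not the pipeline&#8217;s own code:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def make_batches(papers, batch_size=10):
    """Split papers into consecutive groups of at most batch_size items."""
    return [papers[i:i + batch_size] for i in range(0, len(papers), batch_size)]


# Integers stand in for paper objects here. 224 pre-filtered papers
# become 23 scoring calls, the last one carrying only 4 papers.
batches = make_batches(list(range(224)))
print(len(batches), len(batches[-1]))  # prints: 23 4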
</code></pre></div><p>After this rewrite, the scores spread out properly and I started getting 5-10 papers above threshold instead of 120+.</p><h1>Stage 4: Select the Top Papers</h1><p>I take everything scoring 7.0 or above and keep the top 10. If nothing meets the threshold (rare, but possible on light days), the pipeline automatically lowers it by 1.0 and tries again so I don&#8217;t get empty digests.</p><p><strong>Typical result:</strong> 220 scored papers down to 5-10 selected.</p><h1>Stage 5: Deep Summarization with Claude Sonnet</h1><p>Now I switch to <strong>Claude Sonnet</strong> for the expensive, high-quality work. Each selected paper gets its own API call with the full abstract and a detailed summarization prompt.</p><p>The prompt is designed for a senior AI leader, not an academic. Here is what I mean by that:</p><ul><li><p>If the paper uses jargon like &#8220;contrastive loss&#8221; or &#8220;OOD generalization,&#8221; the summary explains it in parentheses</p></li><li><p>It leads with <strong>what</strong> the paper does and <strong>why</strong> it matters, not the methodology</p></li><li><p>The &#8220;practical implications&#8221; section is specific: could this be used in production today? Is it research-only?</p></li><li><p>Sentences are short and punchy. No filler.</p></li></ul><p>Each summary includes:</p><ul><li><p><strong>Key takeaways</strong> (2-3 bullet points)</p></li><li><p><strong>Summary paragraph</strong> (3-5 sentences)</p></li><li><p><strong>What&#8217;s novel</strong> (why this is different from existing work)</p></li><li><p><strong>Practical implications</strong> (who benefits and how)</p></li></ul><p><strong>Why Sonnet for summaries?</strong> Summarization requires nuance. You need the model to truly understand a paper and translate it, not just classify it. 
The quality difference over Haiku is worth it when you&#8217;re only processing 5-10 papers.</p><p>This is the same two-model pattern I&#8217;ve seen work well in other contexts. Use the cheap model for high-volume classification. Use the expensive model for low-volume generation where quality matters.</p><h1>Stage 6: Generate a Kindle-Friendly EPUB</h1><p>The summaries get packaged into an EPUB ebook using Python&#8217;s <code>ebooklib</code>. The structure is optimized for how I actually read on a Kindle.</p><p><strong>Overview chapter</strong> with a quick-scan table:</p><ul><li><p>Date and pipeline statistics (&#8220;Fetched 247 papers, Pre-filtered to 224, Scored, Top 7 included&#8221;)</p></li><li><p>Rank, title, score, and the one-line hook for each paper</p></li></ul><p><strong>Individual paper chapters</strong> with a tiered layout:</p><ul><li><p><strong>At a Glance:</strong> title, authors, categories, relevance score, hook, key takeaway bullets</p></li><li><p><strong>Deep Dive:</strong> full summary, what&#8217;s novel, practical implications</p></li><li><p>Link to the full PDF on arXiv</p></li></ul><p>The EPUB gets a generated cover image using Pillow with the date and paper count, which shows up as the thumbnail in the Kindle library. I bundled the Inter font so it renders correctly both on my Mac locally and on GitHub Actions (which runs Linux and doesn&#8217;t have macOS system fonts).</p><h1>Stage 7: Email to Kindle</h1><p>The final step emails the EPUB to my <code>@kindle.com</code> address using Resend. Kindle automatically converts it and syncs to all my devices.</p><p>One gotcha: you have to add the sender email address to your Kindle&#8217;s approved senders list in your Amazon account settings. Without that, the email gets silently rejected with no error message. 
I was debugging for a while before I realized it was just Amazon blocking an unapproved sender.</p><h1>Automation with GitHub Actions</h1><p>The whole thing runs on a GitHub Actions cron job:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;504ac1b5-3c77-4bfa-bd9f-a3f051657a56&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">on:
  schedule:
    - cron: '0 10 * * *'  # 10:00 UTC = 5:00 AM EST / 6:00 AM EDT (cron runs in UTC)
  workflow_dispatch: # Manual trigger for testing
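
# --- Hypothetical sketch, not the author's actual workflow ---
# One way the job behind these triggers could be laid out, assuming a
# single Python entry point. The job name, script name, requirements file,
# output path, and every secret/variable name below are assumptions made
# for illustration only.
jobs:
  digest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r requirements.txt   # assumed dependency file
      - run: python main.py                    # assumed entry point
        env:
          # Sensitive values would come from repository secrets
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          RESEND_API_KEY: ${{ secrets.RESEND_API_KEY }}
          # Non-sensitive values can come from repository variables
          KINDLE_EMAIL: ${{ vars.KINDLE_EMAIL }}
      # Keep the generated EPUB around for a week to help with debugging
      - uses: actions/upload-artifact@v4
        with:
          name: digest-epub
          path: output/*.epub                  # assumed output location
          retention-days: 7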
</code></pre></div><p>By 5am ET, yesterday&#8217;s arXiv papers have been published (they go live around 8pm ET the previous evening), so there&#8217;s a comfortable buffer. The pipeline typically takes 2-5 minutes to run. By the time I wake up, the digest is already on my Kindle.</p><p>API keys are stored as GitHub Secrets. The email addresses are stored as GitHub Variables since they&#8217;re not sensitive. The workflow also uploads the EPUB as a build artifact with 7-day retention, which is great for debugging if something looks off.</p><h1>The Stack</h1><p>Deliberately simple. No database. No web server. No infrastructure to maintain.</p><ul><li><p><strong>Python 3.12</strong> for orchestration</p></li><li><p><code>arxiv</code> Python wrapper for the arXiv API</p></li><li><p><code>anthropic</code> Claude API client</p></li><li><p><code>ebooklib</code> + <code>Pillow</code> for EPUB generation with cover image</p></li><li><p><code>resend</code> for transactional email</p></li><li><p><strong>GitHub Actions</strong> for free cron scheduling</p></li></ul><p>The entire project is about 500 lines of Python across 9 files.</p><h1>Cost Breakdown</h1><p>Daily cost by component:</p><ul><li><p>Claude Haiku (scoring ~220 papers): ~$0.03</p></li><li><p>Claude Sonnet (summarizing ~7 papers): ~$0.15</p></li><li><p>Resend (1 email/day): Free tier</p></li><li><p>GitHub Actions: Free tier</p></li><li><p><strong>Total:</strong> <strong>~$0.15-0.25/day (~$5-8/month)</strong></p></li></ul><h1>What I Learned</h1><p><strong>LLM scoring needs real iteration.</strong> My first scoring prompt produced useless results where every paper scored 8-8.5. I had to add calibration examples, explicitly demand the full 1-10 range, and include scoring modifiers to get meaningful differentiation. If you&#8217;re building any kind of LLM-as-a-judge system, expect to spend more time on the scoring prompt than you think.</p><p><strong>The two-model strategy is a pattern worth reusing.</strong> Cheap model for high-volume classification, expensive model for low-volume generation. 
It keeps costs at about $0.15-0.25/day while still producing high-quality summaries.</p><p><strong>Plain-language interest profiles beat structured rubrics.</strong> I tried detailed point-based scoring rubrics and found that a natural-language interest profile with tiered priorities produced better, more intuitive scoring.</p><p><strong>The pre-filter matters more than you think.</strong> Without it, you&#8217;re burning 2x the API tokens on papers that are obviously irrelevant. A simple keyword match is crude but effective.</p><p><strong>Kindle is an underrated delivery mechanism.</strong> It syncs across devices, has no notifications competing for attention, and puts research reading into the same physical context as book reading. That context switch matters more than I expected.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.anothercodingblog.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Another Coding Blog is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 59]]></title><description><![CDATA[OpenAI introduces GPT-5.3-Codex-Spark, Cowork is now on Windows, Google launches Gemini 3 Deep Think, AWS extends Bedrock AgentCore Browsing, Anthropic covers electricity usage from data centers.]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-941</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-941</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Mon, 16 Feb 2026 13:56:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0d395f48-eb57-4f11-8786-5cbe40d665d0_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Major Releases</h2><h3>Cowork is now available on Windows</h3><p>Feb 11, 2026 | <a href="https://claude.com/cowork">Anthropic</a></p><p><strong>Why it matters:</strong></p><p>Windows support removes a major barrier for enterprise adoption where Windows dominates corporate environments, making Claude&#8217;s agentic desktop capabilities accessible to the majority of professional developers.</p><div><hr></div><h3>Introducing GPT&#8209;5.3&#8209;Codex&#8209;Spark</h3><p>Feb 12, 2026 | <a href="https://openai.com/index/introducing-gpt-5-3-codex-spark/">OpenAI</a></p><p><strong>Why it matters:</strong></p><p>OpenAI&#8217;s Cerebras partnership signals a shift toward specialized inference hardware as model speed becomes the new competitive frontier, 
positioning ultra-low latency as essential for real-time AI coding tools.</p><div><hr></div><h3>Gemini 3 Deep Think: Advancing science, research and engineering</h3><p>Feb 12, 2026 | <a href="https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-deep-think/">Google AI Blog</a></p><p><strong>Why it matters:</strong></p><p>The upgrade to Gemini 3 Deep Think means more robust support for tackling complex scientific and engineering challenges, potentially redefining collaboration between AI and human researchers.</p><div><hr></div><h3>YouTube rolls out an AI playlist generator for Premium users</h3><p>Feb 10, 2026 | <a href="https://techcrunch.com/2026/02/10/youtube-rolls-out-an-ai-playlist-generator-for-premium-users/">techcrunch.com</a></p><p><strong>Why it matters:</strong></p><p>Shows streaming platforms racing to integrate generative AI into content discovery, potentially reshaping how users interact with media libraries.</p><div><hr></div><h2>Research</h2><h3>How AI trained on birds is surfacing underwater mysteries</h3><p>Feb 09, 2026 | <a href="https://research.google/blog/how-ai-trained-on-birds-is-surfacing-underwater-mysteries/">Google Research Blog</a></p><p><strong>Why it matters:</strong></p><p>Demonstrates that transfer learning from terrestrial to aquatic bioacoustics can enhance marine research, suggesting new avenues for AI application in environmental monitoring and species conservation.</p><div><hr></div><h3>SCOPE: Selective Conformal Optimized Pairwise LLM Judging</h3><p>Feb 13, 2026 | <a href="https://arxiv.org/abs/2602.13110v1">arXiv</a></p><p><strong>Why it matters:</strong></p><p>Introduces a framework that helps calibrate LLM-as-judge systems, potentially improving the reliability of automated evaluations in AI development pipelines.</p><div><hr></div><h3>Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations</h3><p>Feb 10, 2026 | <a 
href="https://research.google/blog/beyond-one-on-one-authoring-simulating-and-testing-dynamic-human-ai-group-conversations/">Google Research Blog</a></p><p><strong>Why it matters:</strong></p><p>DialogLab illustrates the need for sophisticated frameworks that blend scripted and improvisational dialogue, suggesting a path toward more realistic multi-party interactions in AI systems.</p><div><hr></div><h2>Agentic AI &amp; Reasoning</h2><h3>Harness engineering: leveraging Codex in an agent-first world</h3><p>Feb 11, 2026 | <a href="https://openai.com/index/harness-engineering">openai.com</a></p><p><strong>Why it matters:</strong></p><p>Provides real-world evidence of fully AI-generated production code, offering insights into agentic workflows and their practical limitations.</p><div><hr></div><h3>Gemini Enterprise Agent Ready (GEAR) program now available, a new path to building AI agents at scale</h3><p>Feb 10, 2026 | <a href="https://cloud.google.com/blog/products/ai-machine-learning/gear-program-now-available">Google Cloud AI Blog</a></p><p><strong>Why it matters:</strong></p><p>The GEAR program demonstrates a strategic shift in equipping developers with the skills and resources needed to create scalable AI agents, solidifying enterprise-level integration of AI technologies.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.anothercodingblog.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.anothercodingblog.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h3>Build long-running MCP servers on Amazon Bedrock AgentCore with Strands Agents integration</h3><p>Feb 12, 2026 | <a href="https://aws.amazon.com/blogs/machine-learning/build-long-running-mcp-servers-on-amazon-bedrock-agentcore-with-strands-agents-integration/">AWS AI Blog</a></p><p><strong>Why it 
matters:</strong></p><p>Reveals the potential for AI agents to operate continuously in enterprise environments, indicating a necessary evolution in toolsets to support complex, real-time data processing without session limitations.</p><div><hr></div><h3>Customize AI agent browsing with proxies, profiles, and extensions in Amazon Bedrock AgentCore Browser</h3><p>Feb 13, 2026 | <a href="https://aws.amazon.com/blogs/machine-learning/customize-ai-agent-browsing-with-proxies-profiles-and-extensions-in-amazon-bedrock-agentcore-browser/">AWS AI Blog</a></p><p><strong>Why it matters:</strong></p><p>Demonstrates a shift in AI agent capabilities toward more realistic web interactions, suggesting increased adoption in enterprise settings that require reliable state management and customized browsing configurations.</p><div><hr></div><h3>The state of agentic AI in 2026</h3><p>Feb 11, 2026 | <a href="https://www.crewai.com/blog/the-state-of-agentic-ai-in-2026">Crew AI Blog</a></p><p><strong>Why it matters:</strong></p><p>Highlights a pivotal shift in enterprise strategy, where agentic AI is becoming an operational necessity, thereby influencing research and development priorities in AI.</p><div><hr></div><h2>Real-World Use Cases</h2><h3>Iberdrola enhances IT operations using Amazon Bedrock AgentCore</h3><p>Feb 10, 2026 | <a href="https://aws.amazon.com/blogs/machine-learning/iberdrola-enhances-it-operations-using-amazon-bedrock-agentcore/">AWS AI Blog</a></p><p><strong>Why it matters:</strong></p><p>Iberdrola&#8217;s integration of Amazon Bedrock AgentCore underscores a move towards sophisticated, scalable AI solutions in IT operations, highlighting the potential for increased efficiency and consistency in enterprise-level incident management.</p><div><hr></div><h3>Build financial resilience with AI-powered tabletop exercises on Google Cloud</h3><p>Feb 10, 2026 | <a 
href="https://cloud.google.com/blog/topics/financial-services/improve-financial-resilience-with-google-cloud">Google Cloud AI Blog</a></p><p><strong>Why it matters:</strong></p><p>Indicates a shift towards customized AI applications in operational resilience, suggesting that industry-specific context will enhance the effectiveness of incident response planning in financial services.</p><div><hr></div><h3>How Amazon uses Amazon Nova models to automate operational readiness testing for new fulfillment centers</h3><p>Feb 10, 2026 | <a href="https://aws.amazon.com/blogs/machine-learning/how-amazon-uses-amazon-nova-models-to-automate-operational-readiness-testing-for-new-fulfillment-centers/">AWS AI Blog</a></p><p><strong>Why it matters:</strong></p><p>Demonstrates how large organizations can leverage AI to enhance operational efficiency, which may encourage similar automation efforts in other industries reliant on extensive manual verification processes.</p><div><hr></div><h3>Swann provides Generative AI to millions of IoT Devices using Amazon Bedrock</h3><p>Feb 11, 2026 | <a href="https://aws.amazon.com/blogs/machine-learning/swann-provides-generative-ai-to-millions-of-iot-devices-using-amazon-bedrock/">AWS AI Blog</a></p><p><strong>Why it matters:</strong></p><p>Indicates a trend towards enhancing IoT device intelligence through generative AI, suggesting future systems may increasingly prioritize context-aware data processing to mitigate user fatigue and improve engagement.</p><div><hr></div><h2>Thought Leadership</h2><h3>Why the Moltbook frenzy was like Pok&#233;mon</h3><p>Feb 09, 2026 | <a href="https://www.technologyreview.com/2026/02/09/1132537/a-lesson-from-pokemon/">MIT Technology Review AI</a></p><p><strong>Why it matters:</strong></p><p>Highlights the potential disconnect between AI enthusiasts&#8217; aspirations and actual capabilities, suggesting that the current excitement may be more about spectacle than substantive advancements in AI 
utility.</p><div><hr></div><h3>What&#8217;s next for Chinese open-source AI</h3><p>Feb 12, 2026 | <a href="https://www.technologyreview.com/2026/02/12/1132811/whats-next-for-chinese-open-source-ai/">MIT Technology Review AI</a></p><p><strong>Why it matters:</strong></p><p>Signals that the rise of Chinese open-source AI may challenge traditional innovation hubs, compelling Western developers to adapt to a landscape where affordability and accessibility redefine competitive advantages.</p><div><hr></div><h3>The AI Vampire</h3><p>Feb 15, 2026 | <a href="https://simonwillison.net/2026/Feb/15/the-ai-vampire/">Simon Willison</a></p><p><strong>Why it matters:</strong></p><p>Highlights the risk of productivity-driven burnout in AI adoption, suggesting that even as systems automate routine tasks, they may amplify cognitive strain and diminish overall job satisfaction among employees.</p><div><hr></div><h2>Industry Investment &amp; Business Moves</h2><h3>Gather AI Raises $40M Led by Smith Point Capital Management to Scale its Physical AI Platform for Global Logistics</h3><p>Feb 09, 2026 | <a href="https://venturebeat.com/business/gather-ai-raises-40m-led-by-smith-point-capital-management-to-scale-its-physical-ai-platform-for-global-logistics">venturebeat.com</a></p><p><strong>Why it matters:</strong></p><p>Continued investment in physical-AI logistics indicates growing confidence that AI can deliver ROI in warehouse and supply chain operations.</p><div><hr></div><h3>Anthropic partners with CodePath to bring Claude to the US&#8217;s largest collegiate computer science program</h3><p>Feb 13, 2026 | <a href="https://www.anthropic.com/news/anthropic-codepath-partnership">Anthropic News</a></p><p><strong>Why it matters:</strong></p><p>Signals that educational institutions are recognizing the importance of AI tools in programming, potentially reshaping the future workforce and creating a more inclusive environment in tech fields traditionally dominated by wealthier 
demographics.</p><div><hr></div><h3>Anthropic opens Bengaluru office and announces new partnerships across India</h3><p>Feb 16, 2026 | <a href="https://www.anthropic.com/news/bengaluru-office-partnerships-across-india">Anthropic News</a></p><p><strong>Why it matters:</strong></p><p>Anthropic&#8217;s new initiatives highlight the importance of localized language models, demonstrating a commitment to inclusivity in AI development that can reshape user interactions across diverse linguistic communities in India.</p><div><hr></div><h2>Regulatory &amp; Policy</h2><h3>Covering electricity price increases from our data centers</h3><p>Feb 11, 2026 | <a href="https://www.anthropic.com/news/covering-electricity-price-increases">Anthropic News</a></p><p><strong>Why it matters:</strong></p><p>This initiative highlights the necessity for AI companies to align their operational growth with consumer welfare, potentially setting a precedent for future industry practices in managing infrastructure costs.</p><div><hr></div><h3>Anthropic donates $20 million to Public First Action</h3><p>Feb 12, 2026 | <a href="https://www.anthropic.com/news/donate-public-first-action">Anthropic News</a></p><p><strong>Why it matters:</strong></p><p>Indicates a growing recognition within the AI community of the necessity for proactive policy measures to address the risks associated with rapidly advancing AI technologies and their societal implications.</p><div><hr></div><h3>Bringing ChatGPT to GenAI.mil</h3><p>Feb 09, 2026 | <a href="https://openai.com/index/bringing-chatgpt-to-genaimil/">openai.com</a></p><p><strong>Why it matters:</strong></p><p>Represents a major milestone for AI adoption in government, with OpenAI gaining access to millions of defense personnel through secure infrastructure.</p><div><hr></div><h3>Introducing Lockdown Mode and Elevated Risk labels in ChatGPT</h3><p>Feb 13, 2026 | <a 
href="https://openai.com/index/introducing-lockdown-mode-and-elevated-risk-labels-in-chatgpt//">openai.com</a></p><p><strong>Why it matters:</strong></p><p>Addresses growing security concerns around prompt injection in enterprise AI deployments, providing guardrails as organizations scale AI tool usage.</p><div><hr></div><h2>AI Safety &amp; Ethics</h2><h3>Building a safer digital future, together</h3><p>Feb 09, 2026 | <a href="https://blogs.microsoft.com/on-the-issues/2026/02/09/building-a-safer-digital-future-together/">blogs.microsoft.com</a></p><p><strong>Why it matters:</strong></p><p>Microsoft reinforces its safety-first positioning as regulators scrutinize tech platforms, signaling how major players frame AI governance messaging.</p><div><hr></div><h3>Helping kids and teens learn and grow online on Safer Internet Day</h3><p>Feb 10, 2026 | <a href="https://blog.google/innovation-and-ai/technology/safety-security/safer-internet-day-2026-kids-teens/">Google AI Blog</a></p><p><strong>Why it matters:</strong></p><p>Suggests a growing recognition of the role artificial intelligence can play in enhancing online safety for younger users, prompting further innovation in user-centric safety tools within the AI community.</p><div><hr></div><h3>A &#8220;QuitGPT&#8221; campaign is urging people to cancel their ChatGPT subscriptions</h3><p>Feb 10, 2026 | <a href="https://www.technologyreview.com/2026/02/10/1132577/a-quitgpt-campaign-is-urging-people-to-cancel-chatgpt-subscriptions/">MIT Technology Review AI</a></p><p><strong>Why it matters:</strong></p><p>Emerging user backlash signals growing scrutiny of AI products, potentially influencing how companies balance capability claims with user expectations.</p><div><hr></div><h2>Dev Tools &amp; Infrastructure</h2><h3>TriGen: NPU Architecture for End-to-End Acceleration of Large Language Models based on SW-HW Co-Design</h3><p>Feb 13, 2026 | <a href="https://arxiv.org/abs/2602.12962v1">arXiv</a></p><p><strong>Why it 
matters:</strong></p><p>Demonstrates how custom NPU architectures can enable efficient on-device LLM inference, potentially bringing AI capabilities to edge devices without cloud dependency.</p><div><hr></div><h3>Leading Inference Providers Cut AI Costs by up to 10x With Open Source Models on NVIDIA Blackwell</h3><p>Feb 12, 2026 | <a href="https://blogs.nvidia.com/blog/inference-open-source-models-blackwell-reduce-cost-per-token/">blogs.nvidia.com</a></p><p><strong>Why it matters:</strong></p><p>Cost reductions of this magnitude could accelerate enterprise AI adoption by making production deployments economically viable at scale.</p><div><hr></div><h3>GPT&#8209;5.2 derives a new result in theoretical physics</h3><p>Feb 13, 2026 | <a href="https://openai.com/index/new-result-theoretical-physics/">openai.com</a></p><p><strong>Why it matters:</strong></p><p>Marks a milestone where an AI model contributed a novel theoretical physics formula that was subsequently proven correct, validating AI&#8217;s potential role in scientific discovery.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.anothercodingblog.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Another Coding Blog is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Another Weekly AI Newsletter: Issue 58]]></title><description><![CDATA[OpenAI launches Frontier and GPT-5.3-Codex, Anthropic releases Claude Opus 4.6, Google explores AI in real-world virtual care, and Machina Labs raises $124M for AI-driven factories]]></description><link>https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-a63</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/another-weekly-ai-newsletter-issue-a63</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Wed, 11 Feb 2026 19:01:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b4f264b7-1c12-4ab2-8e29-71bbda5ca2ae_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Major Releases</h2><h4>Structured outputs on Amazon Bedrock: Schema-compliant AI responses</h4><p>Feb 06, 2026 | <a href="https://aws.amazon.com/blogs/machine-learning/structured-outputs-on-amazon-bedrock-schema-compliant-ai-responses/">aws.amazon.com</a></p><p><strong>Why it matters:</strong></p><p>Amazon Bedrock&#8217;s schema-compliant outputs enable developers to bypass traditional data validation, streamlining AI integration and enhancing trust in automated systems, which could accelerate the deployment of reliable AI applications across industries.</p><div><hr></div><h4>Introducing Claude Opus 4.6</h4><p>Feb 05, 2026 | <a href="https://www.anthropic.com/news/claude-opus-4-6">anthropic.com</a></p><p><strong>Why it matters:</strong></p><p>Claude Opus 4.6&#8217;s 
enhanced coding and agentic capabilities, along with a 1M-token context window, signal a leap in AI&#8217;s ability to handle complex, multi-step tasks, making it a significant tool for developers and businesses seeking more efficient, robust AI-driven solutions.</p><div><hr></div><h4>Introducing GPT-5.3-Codex</h4><p>Feb 05, 2026 | <a href="https://openai.com/index/introducing-gpt-5-3-codex/">openai.com</a></p><p><strong>Why it matters:</strong></p><p>GPT-5.3-Codex&#8217;s enhanced speed and capability to autonomously manage complex coding tasks mark a pivotal shift towards more interactive and efficient AI coding assistants, potentially transforming software development workflows and reducing the time and expertise needed to tackle intricate programming challenges.</p><div><hr></div><h2>Breakthrough Research</h2><h4>LSGQuant: Layer-Sensitivity Guided Quantization for One-Step Diffusion Video Super-Resolution</h4><p>Feb 03, 2026 | <a href="https://arxiv.org/abs/2602.03198">arxiv.org</a></p><p><strong>Why it matters:</strong></p><p>LSGQuant&#8217;s efficient quantization for video super-resolution allows high-quality diffusion models to be deployed in resource-limited environments, expanding access to advanced video enhancement technology and enabling broader applications in industries like streaming and mobile video processing.</p><div><hr></div><h4>Reward-Free Alignment for Conflicting Objectives (RACO)</h4><p>Feb 02, 2026 | <a href="https://arxiv.org/abs/2602.02495">arxiv.org</a></p><p><strong>Why it matters:</strong></p><p>RACO&#8217;s method for aligning language models to conflicting objectives using pairwise feedback, without explicit rewards, enhances AI&#8217;s ability to balance complex trade-offs like safety and performance, crucial for deploying AI in nuanced, real-world applications.</p><div><hr></div><h4>PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss</h4><p>Feb 02, 2026 | <a 
href="https://arxiv.org/abs/2602.02493">arxiv.org</a></p><p><strong>Why it matters:</strong></p><p>PixelGen&#8217;s ability to outperform latent diffusion models using perceptual loss in pixel space suggests a shift toward simpler architectures, potentially democratizing high-quality image generation by reducing reliance on complex latent representations and making advanced generative capabilities more accessible.</p><div><hr></div><h2>Agentic AI &amp; Reasoning</h2><h4>Introducing OpenAI Frontier</h4><p>Feb 05, 2026 | <a href="https://openai.com/index/introducing-openai-frontier/">openai.com</a></p><p><strong>Why it matters:</strong></p><p>OpenAI Frontier&#8217;s enterprise platform signals a shift toward integrating AI agents as functional team members in business environments, potentially transforming workplace efficiency and collaboration by enabling AI to handle complex tasks with shared context and feedback mechanisms.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.anothercodingblog.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.anothercodingblog.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><h4>Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations</h4><p>Feb 10, 2026 | <a href="https://research.google/blog/beyond-one-on-one-authoring-simulating-and-testing-dynamic-human-ai-group-conversations/">research.google</a></p><p><strong>Why it matters:</strong></p><p>DialogLab&#8217;s exploration of multi-party human-AI interactions highlights the shift towards more sophisticated conversational AI, enabling richer, more nuanced group dynamics that could transform collaborative tools and virtual environments, making AI a more integral part of team-based workflows and social interactions.</p><div><hr></div><h4>How AI tools can redefine 
universal design to increase accessibility</h4><p>Feb 05, 2026 | <a href="https://research.google/blog/how-ai-agents-can-redefine-universal-design-to-increase-accessibility/">research.google</a></p><p><strong>Why it matters:</strong></p><p>Embedding adaptive AI tools into interfaces through Google&#8217;s Natively Adaptive Interfaces framework enhances personalization and accessibility, potentially setting a new standard for universal design and making technology more inclusive for users with diverse needs.</p><div><hr></div><h2>Real-World Use Cases</h2><h4>IBM to Support Missile Defense Agency SHIELD Contract</h4><p>Feb 05, 2026 | <a href="https://newsroom.ibm.com/2026-02-05-IBM-to-Support-Missile-Defense-Agency-SHIELD-Contract">ibm.com</a></p><p><strong>Why it matters:</strong></p><p>IBM&#8217;s AI-driven contract with the Missile Defense Agency highlights the strategic integration of AI in national defense, emphasizing the industry&#8217;s role in enhancing decision-making speed and agility, which could set a precedent for future defense contracts and AI&#8217;s critical function in national security infrastructure.</p><div><hr></div><h4>AT&amp;T, AWS, and Amazon Leo Collaborate to Accelerate Modernization of Nation&#8217;s Connectivity Infrastructure</h4><p>Feb 04, 2026 | <a href="https://markets.financialcontent.com/prnews.pre/article/bizwire-2026-2-4-at-and-t-aws-and-amazon-leo-collaborate-to-accelerate-modernization-of-nations-connectivity-infrastructure">financialcontent.com</a></p><p><strong>Why it matters:</strong></p><p>AT&amp;T&#8217;s collaboration with AWS and Amazon Leo leverages AI to enhance network infrastructure, potentially transforming U.S. 
connectivity by improving scalability and resilience, while expanding AI-driven services and setting a precedent for future telecom modernization.</p><div><hr></div><h4>Humana Redefines the Member Experience with Agent Assist Built with Google Cloud</h4><p>Feb 03, 2026 | <a href="https://www.googlecloudpresscorner.com/2026-02-03-Humana-Redefines-the-Member-Experience-with-Agent-Assist-Built-with-Google-Cloud">googlecloudpresscorner.com</a></p><p><strong>Why it matters:</strong></p><p>Humana&#8217;s AI-powered Agent Assist highlights the growing trend of AI augmenting rather than replacing human roles, enhancing service delivery in high-volume environments and setting a precedent for scalable, empathetic customer interaction solutions in the healthcare industry.</p><div><hr></div><h2>Thought Leadership</h2><h4>Natively Adaptive Interfaces: A new framework for AI accessibility</h4><p>Feb 05, 2026 | <a href="https://blog.google/company-news/outreach-and-initiatives/accessibility/natively-adaptive-interfaces-ai-accessibility/">blog.google.com</a></p><p><strong>Why it matters:</strong></p><p>Embedding accessibility directly into AI design through Natively Adaptive Interfaces shifts the industry toward more inclusive technology, ensuring that personalization and accessibility are inherent, not optional, features&#8212;broadening AI&#8217;s usability and relevance across diverse user demographics.</p><div><hr></div><h4>Collaborating on a nationwide randomized study of AI in real-world virtual care</h4><p>Feb 03, 2026 | <a href="https://research.google/blog/collaborating-on-a-nationwide-randomized-study-of-ai-in-real-world-virtual-care/">research.google</a></p><p><strong>Why it matters:</strong></p><p>Google&#8217;s study with Included Health provides critical real-world data on AI&#8217;s role in telemedicine, potentially transforming healthcare delivery by optimizing physician time and expanding access to expertise, setting a precedent for AI&#8217;s integration into 
everyday clinical practice.</p><div><hr></div><h2>Industry Investment &amp; Business Moves</h2><h4>Testing ads in ChatGPT</h4><p>Feb 09, 2026 | <a href="https://openai.com/index/testing-ads-in-chatgpt/">openai.com</a></p><p><strong>Why it matters:</strong></p><p>OpenAI begins testing ads in ChatGPT for Free and Go tier users in the U.S. Ads appear at the bottom of responses, matched to conversation topics. Plus, Pro, Business, Enterprise, and Education tiers remain ad-free. OpenAI states that ads won&#8217;t influence ChatGPT&#8217;s answers and that conversations stay private from advertisers.</p><div><hr></div><h4>Machina Labs raises $124M to build AI-driven &#8216;Intelligent Factory&#8217;</h4><p>Feb 04, 2026 | <a href="https://www.axios.com/2026/02/04/machina-labs-seriesc-intelligent-factory">axios.com</a></p><p><strong>Why it matters:</strong></p><p>Machina Labs&#8217; funding and factory initiative underscore AI&#8217;s pivotal role in revitalizing U.S. manufacturing, enhancing efficiency and precision in aerospace and defense production, and signaling a shift towards more automated, AI-driven industrial processes.</p><div><hr></div><h4>AppFactor raises $4M seed to deliver an agentic orchestration platform for enterprise software maintenance</h4><p>Feb 04, 2026 | <a href="https://venturebeat.com/business/appfactor-raises-4m-seed-to-deliver-an-agentic-orchestration-platform-for-enterprise-software-maintenance/">venturebeat.com</a></p><p><strong>Why it matters:</strong></p><p>AppFactor&#8217;s platform could significantly lower enterprise software upkeep costs and ease the transition from outdated systems, highlighting AI&#8217;s growing role in automating complex IT processes and enhancing operational efficiency.</p><div><hr></div><h4>Snowflake and OpenAI partner to bring frontier intelligence to enterprise data</h4><p>Feb 02, 2026 | <a href="https://openai.com/index/snowflake-partnership/">openai.com</a></p><p><strong>Why it matters:</strong></p><p>Integrating OpenAI&#8217;s models 
into Snowflake&#8217;s platform enhances enterprise data analytics, allowing businesses to harness AI&#8217;s reasoning capabilities directly on their proprietary datasets, which could redefine data-driven decision-making and automation in corporate environments.</p><div><hr></div><h2>Regulatory &amp; Policy</h2><h4>UK ICO launches formal investigation of Grok AI chatbot over harmful deepfakes</h4><p>Feb 03, 2026 | <a href="https://ico.org.uk/about-the-ico/media-centre/news-and-blogs/2026/02/ico-announces-investigation-into-grok/">ico.org.uk</a></p><p><strong>Why it matters:</strong></p><p>The ICO&#8217;s investigation into Grok AI underscores the urgent need for robust safeguards against misuse of AI-generated content, highlighting the growing regulatory scrutiny on AI systems to protect personal data and prevent harmful deepfakes, which could reshape compliance standards across the industry.</p><div><hr></div><h4>Making AI work for everyone, everywhere</h4><p>Feb 06, 2026 | <a href="https://openai.com/index/our-approach-to-localization/">openai.com</a></p><p><strong>Why it matters:</strong></p><p>OpenAI&#8217;s commitment to localization ensures AI models are more culturally and legally attuned, fostering global inclusivity and compliance, which is crucial for equitable AI adoption and minimizing biases across diverse markets.</p><div><hr></div><h4>Bringing ChatGPT to GenAI.mil</h4><p>Feb 09, 2026 | <a href="https://openai.com/index/bringing-chatgpt-to-genaimil/">openai.com</a></p><p><strong>Why it matters:</strong></p><p>Integrating ChatGPT into GenAI.mil signifies a pivotal shift toward AI-enhanced national security, highlighting the increasing role of AI in defense strategies and setting a precedent for secure, government-focused AI deployments.</p><div><hr></div><h2>AI Safety &amp; Ethics</h2><h4>UN Announces Independent International AI Science Panel to Guide AI Safety</h4><p>Feb 04, 2026 | <a 
href="https://www.un.org/sg/en/content/highlight/2026-02-04.html">un.org</a></p><p><strong>Why it matters:</strong></p><p>The UN&#8217;s establishment of an independent AI science panel signifies a critical step toward creating universal AI safety standards, promoting global cooperation, and ensuring that AI advancements align with ethical considerations, which is essential for mitigating risks and fostering trust in AI technologies worldwide.</p><div><hr></div><h4>International AI Safety Report 2026 Published by Global Experts</h4><p>Feb 03, 2026 | <a href="https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026">internationalaisafetyreport.org</a></p><p><strong>Why it matters:</strong></p><p>The International AI Safety Report 2026 establishes a critical benchmark for global AI governance, equipping policymakers with essential strategies to mitigate risks, thereby shaping safer AI development and fostering international collaboration on ethical AI deployment.</p><div><hr></div><h4>Anthropic&#8217;s Responsible Scaling Policy (update)</h4><p>Feb 10, 2026 | <a href="https://www.anthropic.com/rsp-updates">anthropic.com</a></p><p><strong>Why it matters:</strong></p><p>Anthropic&#8217;s updated Responsible Scaling Policy, with its Sabotage Risk Report for Claude Opus 4.6, highlights the increasing need for transparency and robust safeguards in AI development, setting a precedent for industry-wide accountability and risk management practices.</p><div><hr></div><h2>Dev Tools &amp; Infrastructure</h2><h4>Android Studio Panda 2 (2025.3.2) Canary 3 Now Available</h4><p>Feb 05, 2026 | <a href="https://androidstudio.googleblog.com/2026/02/">androidstudio.googleblog.com</a></p><p><strong>Why it matters:</strong></p><p>Enhanced bug fixes and improved code-assistance in Android Studio Panda 2 streamline AI development for mobile apps, enabling developers to build more reliable and efficient applications, thereby accelerating innovation and 
deployment in the rapidly evolving AI-driven mobile ecosystem.</p><div><hr></div><h4>GitHub Actions Runner Scale Set Client (Public Preview)</h4><p>Feb 05, 2026 | <a href="https://github.blog/changelog/2026-02-05-github-actions-early-february-2026-updates/">github.blog</a></p><p><strong>Why it matters:</strong></p><p>Customizable autoscaling for GitHub Actions runners enhances CI/CD efficiency, enabling AI developers to handle complex workflows and large-scale projects more effectively, thus accelerating AI model development and deployment.</p><div><hr></div><h4>Milvus 2.6.10 Released with Performance and Security Enhancements</h4><p>Feb 05, 2026 | <a href="https://milvus.io/docs/id/release_notes.md">milvus.io</a></p><p><strong>Why it matters:</strong></p><p>Enhanced security and faster inference in Milvus 2.6.10 streamline AI deployment, crucial for industries relying on vector databases to handle vast data efficiently, while improved stability ensures reliability, making it a strategic update for developers focused on optimizing AI-driven applications.</p>]]></content:encoded></item><item><title><![CDATA[Recursive Language Models Work, But Not Every Time]]></title><description><![CDATA[Empirical comparison of RLM, RAG, and chunking across 2.2 million tokens and 120 runs reveals that model selection, retry strategies, and task type matter more than method choice]]></description><link>https://www.anothercodingblog.com/p/recursive-language-models-work-but</link><guid isPermaLink="false">https://www.anothercodingblog.com/p/recursive-language-models-work-but</guid><dc:creator><![CDATA[Taylor Ortiz]]></dc:creator><pubDate>Sat, 07 Feb 2026 06:14:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/393d423d-d92b-4052-8ab2-df393d88875a_2816x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Executive Summary</h2><p>This research evaluates Recursive Language Models (RLM) from <a 
href="https://arxiv.org/html/2512.24601v1">arXiv:2512.24601</a> through rigorous empirical testing. We tested two RLM implementations: <strong>a Custom RLM</strong> we built following the paper&#8217;s approach, and the <strong>DSPy RLM module</strong> (dspy.RLM). We compare both against RAG (Retrieval-Augmented Generation) and traditional chunking approaches across multiple tasks and models.</p><p>We conducted two layers of testing:</p><ol><li><p><strong>Model comparison:</strong> We tested both Custom RLM and DSPy RLM across multiple OpenAI models, including standard models (gpt-4o-mini, gpt-4o) and reasoning models (gpt-5-mini, gpt-5.2, gpt-5-nano). <strong>This revealed that model selection is critical:</strong> standard models scored 0/6 on aggregation tasks while reasoning models scored 6/6 with identical code and prompts.</p></li><li><p><strong>Variance testing (n=30):</strong> After identifying gpt-5-mini as the best-performing model for RLM tasks, we ran Custom RLM, DSPy RLM, and RAG <strong>30 times each</strong> with identical inputs to measure variance. Variance captures how much results differ between runs of the same system, and understanding it is essential for deciding whether to deploy these methods in production.</p></li></ol><h3>Key Findings</h3><h4>1. Variance is the story.</h4><p>Multi-document aggregation revealed significant variance in both RLM implementations. Scores ranged from complete failure (0/6) to perfect accuracy (6/6) across 30 identical runs.</p><h4>2. Task type determines reliability.</h4><p>Single-document analysis (one book, deep questions) showed lower variance (std=0.75) than multi-document aggregation (six books, synthesizing across all). Both RLM implementations are more reliable for focused analysis than cross-document synthesis.</p><h4>3. Model selection matters more than method.</h4><p>Frontier reasoning models (gpt-5-mini, gpt-5.2) succeeded where standard models (gpt-4o, gpt-4o-mini) failed completely. 
Same code, same prompts, but 0/6 vs 6/6.</p><h4>4. RAG wins on consistency.</h4><p>RAG achieved the most stable results on single-document reasoning (std=0.63), but struggled with multi-document aggregation where systematic coverage matters more than semantic similarity.      </p><h4>5. Cost-variance tradeoff.</h4><p>DSPy RLM costs ~2x more than Custom RLM but shows lower variance on reasoning tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m-OY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m-OY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png 424w, https://substackcdn.com/image/fetch/$s_!m-OY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png 848w, https://substackcdn.com/image/fetch/$s_!m-OY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png 1272w, https://substackcdn.com/image/fetch/$s_!m-OY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m-OY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png" 
width="1456" height="627" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105923,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!m-OY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png 424w, https://substackcdn.com/image/fetch/$s_!m-OY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png 848w, https://substackcdn.com/image/fetch/$s_!m-OY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png 1272w, https://substackcdn.com/image/fetch/$s_!m-OY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72c2d0d4-8d21-4147-94b2-f818fcc87f79_2086x898.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>1. Introduction</h2><h3>The Problem with Long Documents</h3><p>LLMs face a fundamental challenge: context windows have limits. A 2.2 million token corpus cannot be processed directly. 
Even 700K-token documents strain budgets.</p><h3>The RLM Promise</h3><p>Recursive Language Models (<a href="https://arxiv.org/html/2512.24601v1">arXiv:2512.24601</a>) propose an elegant solution:</p><ol><li><p>Store the full document as a variable in a sandboxed Python environment</p></li><li><p>Let the LLM iteratively generate code to explore the document</p></li><li><p>Execute code, return results, repeat</p></li><li><p>The model searches, slices, and reasons programmatically</p></li></ol><p><strong>Theoretical advantage:</strong> Instead of processing millions of tokens at once, the model strategically samples relevant sections.</p><h3>Our Contribution</h3><p>The original RLM paper reports single-run results on synthetic benchmarks. We contribute:</p><ul><li><p><strong>Statistical rigor:</strong> Running each condition 30 times (n=30) reveals variance hidden by single-run reporting</p></li><li><p><strong>Real-world tasks:</strong> Literary analysis across 2.2M tokens of classic novels</p></li><li><p><strong>Method comparison:</strong> RLM vs RAG vs Chunking on identical tasks</p></li><li><p><strong>Practical guidance:</strong> When to use each approach</p></li></ul><h3>Research Questions</h3><ol><li><p>How reliable is RLM? (variance across runs)</p></li><li><p>Under what conditions does RLM excel?</p></li><li><p>How does model selection affect outcomes?</p></li><li><p>What are the cost/quality trade-offs?</p></li></ol><h2>2. 
Methodology</h2><h3>Test Corpus</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!26vc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!26vc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png 424w, https://substackcdn.com/image/fetch/$s_!26vc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png 848w, https://substackcdn.com/image/fetch/$s_!26vc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png 1272w, https://substackcdn.com/image/fetch/$s_!26vc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!26vc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png" width="1456" height="318" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:318,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38634,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!26vc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png 424w, https://substackcdn.com/image/fetch/$s_!26vc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png 848w, https://substackcdn.com/image/fetch/$s_!26vc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png 1272w, https://substackcdn.com/image/fetch/$s_!26vc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97e831e2-d44c-4446-9cf6-7c1bc71c8851_1480x323.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>The Mega Corpus combines: </strong>War and Peace, Great Expectations, A Tale of Two Cities, Oliver Twist, David Copperfield, and Moby Dick.</p><h3>Methods Compared</h3><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v8h4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v8h4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png 424w, https://substackcdn.com/image/fetch/$s_!v8h4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png 848w, https://substackcdn.com/image/fetch/$s_!v8h4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png 1272w, https://substackcdn.com/image/fetch/$s_!v8h4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v8h4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png" width="1456" height="497" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:497,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61966,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v8h4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png 424w, https://substackcdn.com/image/fetch/$s_!v8h4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png 848w, https://substackcdn.com/image/fetch/$s_!v8h4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png 1272w, https://substackcdn.com/image/fetch/$s_!v8h4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e007c86-2cfc-44ca-b16a-7abaa6b28956_1480x505.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Statistical Design</h3><p>We ran each condition <strong>30 times</strong> with identical inputs.</p><p><strong>Why n=30?</strong> Thirty runs is a common rule of thumb: at that sample size, the sampling distribution of the mean is usually stable enough to give reasonable estimates of the mean and standard deviation. It is also a conventional sample size in behavioral research for detecting medium effect sizes.</p><p><strong>Why temperature=1.0?</strong> We wanted to measure natural variability under realistic &#8220;creative exploration&#8221; settings. Lower temperatures would reduce randomness but wouldn&#8217;t eliminate the path-dependence inherent to agentic systems: once the model commits to exploring one section first, its subsequent decisions cascade from there. 
Temperature=1.0 captures this real-world behavior.</p><h3>Tasks and Scoring</h3><p>We designed two tasks to test different capabilities: deep reasoning within a single document, and information aggregation across multiple documents.</p><h3>Reasoning Task</h3><p><strong>Corpus:</strong> War and Peace (722K tokens)</p><p><strong>Question:</strong> &#8220;How does Pierre Bezukhov&#8217;s understanding of happiness change throughout the novel?&#8221;</p><p><strong>What we&#8217;re measuring:</strong> Can the model navigate a massive document, find the relevant sections about Pierre&#8217;s character arc, and synthesize them into a coherent answer?</p><p><strong>Scoring approach:</strong> We identified 8 key terms that a comprehensive answer should reference. 
These are actual names and terms from the novel:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SusA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SusA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png 424w, https://substackcdn.com/image/fetch/$s_!SusA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png 848w, https://substackcdn.com/image/fetch/$s_!SusA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png 1272w, https://substackcdn.com/image/fetch/$s_!SusA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SusA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png" width="1456" height="761" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:761,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84386,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SusA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png 424w, https://substackcdn.com/image/fetch/$s_!SusA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png 848w, https://substackcdn.com/image/fetch/$s_!SusA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png 1272w, https://substackcdn.com/image/fetch/$s_!SusA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F905c5988-2ebb-4e74-a09f-5391d57cd0f5_1480x774.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We scored answers by checking whether these terms appeared (substring matching). If an answer mentioned &#8220;Karataev,&#8221; we inferred the model had successfully found and referenced that section of the book. Two terms (pierre, happiness) are essentially baselines since they appear in the question itself. The remaining terms test whether the model found the relevant plot points.</p><p><strong>Pass threshold:</strong> 4/8 terms (finding at least half the key plot points indicates the model successfully navigated the document rather than guessing)</p><p><strong>Limitations:</strong> This approach rewards finding the right sections and using exact terminology. A model that described &#8220;the peasant who changed his worldview&#8221; without naming Karataev would receive no credit. 
However, automated scoring enabled consistent evaluation across 30 runs.</p><h3>Aggregation Task</h3><p><strong>Corpus:</strong> Mega Corpus (2.2M tokens across 6 novels)</p><p><strong>Question:</strong> &#8220;What is the final fate of the protagonist in each of the 6 books?&#8221;</p><p><strong>What we&#8217;re measuring:</strong> Can the model systematically explore multiple documents, identify the protagonist of each, and correctly describe their ending?</p><p><strong>Scoring approach:</strong> Each book earned 1 point if the answer correctly identified both the protagonist and their fate. For example, Pip in Great Expectations ends up reunited with Estella (or alone, depending on the edition). Scoring was binary per book: partial credit (correct protagonist, wrong fate) was not awarded.</p><p><strong>Pass threshold:</strong> 3/6 books correct (correctly covering at least half the corpus indicates systematic exploration rather than partial success on one or two books)</p><h2>3. Results</h2><h3>The Variance Problem</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OPtB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OPtB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png 424w, https://substackcdn.com/image/fetch/$s_!OPtB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png 848w,
https://substackcdn.com/image/fetch/$s_!OPtB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png 1272w, https://substackcdn.com/image/fetch/$s_!OPtB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OPtB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png" width="1456" height="629" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:629,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122432,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OPtB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png 424w, 
https://substackcdn.com/image/fetch/$s_!OPtB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png 848w, https://substackcdn.com/image/fetch/$s_!OPtB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png 1272w, https://substackcdn.com/image/fetch/$s_!OPtB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce3bd8ce-f8fa-4614-ae93-af709d6cf2ae_2079x898.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>This is the central finding of our research.</strong> Identical inputs, identical model, dramatically different outputs. Each dot in the chart above represents one run. The spread tells the story.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R6Rv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R6Rv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png 424w, https://substackcdn.com/image/fetch/$s_!R6Rv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png 848w, https://substackcdn.com/image/fetch/$s_!R6Rv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png 1272w, https://substackcdn.com/image/fetch/$s_!R6Rv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R6Rv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png" width="1456" height="290"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:290,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35492,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R6Rv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png 424w, https://substackcdn.com/image/fetch/$s_!R6Rv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png 848w, https://substackcdn.com/image/fetch/$s_!R6Rv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png 1272w, https://substackcdn.com/image/fetch/$s_!R6Rv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17dc61fd-f435-44ad-a378-5174daa2851c_1480x295.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!ZNP9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZNP9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png 424w, https://substackcdn.com/image/fetch/$s_!ZNP9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png 848w, https://substackcdn.com/image/fetch/$s_!ZNP9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png 1272w, https://substackcdn.com/image/fetch/$s_!ZNP9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZNP9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png" width="1456" height="290" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:290,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35144,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZNP9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png 424w, https://substackcdn.com/image/fetch/$s_!ZNP9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png 848w, https://substackcdn.com/image/fetch/$s_!ZNP9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png 1272w, https://substackcdn.com/image/fetch/$s_!ZNP9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a9ad3b-ec54-4313-9989-d08dff2dfee8_1480x295.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>What this means: </strong>If you ran RLM once on the aggregation task and got 0/6, you might conclude &#8220;RLM doesn&#8217;t work.&#8221; If you got 6/6, you might 
conclude &#8220;RLM is perfect.&#8221; Both conclusions would be wrong.</p><p><strong>Failure rates tell the deployment story. </strong>For aggregation tasks, the probability of near-complete failure (score &#8804; 1) was:</p><ul><li><p>Custom RLM: 10% of runs</p></li><li><p>DSPy RLM: 17% of runs</p></li><li><p>RAG: 33% of runs</p></li></ul><p>These failure rates matter more than mean scores for production systems. A method with a high mean but a 17% catastrophic failure rate may be unacceptable for critical applications.</p><p><strong>Are the differences statistically significant? </strong>With n=30, we can compute 95% confidence intervals:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s_FB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s_FB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png 424w, https://substackcdn.com/image/fetch/$s_!s_FB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png 848w, https://substackcdn.com/image/fetch/$s_!s_FB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png 1272w, https://substackcdn.com/image/fetch/$s_!s_FB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png 1456w"
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s_FB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png" width="1456" height="394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40842,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s_FB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png 424w, https://substackcdn.com/image/fetch/$s_!s_FB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png 848w, https://substackcdn.com/image/fetch/$s_!s_FB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png 1272w, 
https://substackcdn.com/image/fetch/$s_!s_FB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F928e41cb-8a3b-484c-9a09-c4ba4e40a474_1480x400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The CIs for Custom RLM and DSPy RLM overlap. A t-test confirms the difference is not statistically significant (p=0.12). While Custom RLM shows a higher mean, the difference could be due to chance.
RAG&#8217;s lower performance, however, is statistically significant compared to both RLM variants.</p><h3>Reasoning Task Results</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9yFR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9yFR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png 424w, https://substackcdn.com/image/fetch/$s_!9yFR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png 848w, https://substackcdn.com/image/fetch/$s_!9yFR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png 1272w, https://substackcdn.com/image/fetch/$s_!9yFR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9yFR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png" width="1456" height="871" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:871,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:85215,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9yFR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png 424w, https://substackcdn.com/image/fetch/$s_!9yFR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png 848w, https://substackcdn.com/image/fetch/$s_!9yFR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png 1272w, https://substackcdn.com/image/fetch/$s_!9yFR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f49d50b-5422-407d-99d4-c5e255b5a540_1487x890.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AXd9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AXd9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png 424w,
https://substackcdn.com/image/fetch/$s_!AXd9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png 848w, https://substackcdn.com/image/fetch/$s_!AXd9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png 1272w, https://substackcdn.com/image/fetch/$s_!AXd9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AXd9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png" width="1456" height="438" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52765,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!AXd9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png 424w, https://substackcdn.com/image/fetch/$s_!AXd9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png 848w, https://substackcdn.com/image/fetch/$s_!AXd9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png 1272w, https://substackcdn.com/image/fetch/$s_!AXd9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e01428d-4f18-4537-97d8-e2f75f53fd38_1480x445.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Key observations:</h4><ol><li><p><strong>DSPy RLM achieved the highest mean score </strong>(6.30) with moderate variance</p></li><li><p><strong>Chunking is perfectly consistent </strong>but 4-9x more expensive</p></li><li><p><strong>Custom RLM has high variance</strong> (scores ranged 0-8)</p></li><li><p><strong>RAG is cheap and consistent</strong> but less accurate</p></li></ol><h3>Aggregation Task Results</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w6P9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w6P9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png 424w, https://substackcdn.com/image/fetch/$s_!w6P9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png 848w, https://substackcdn.com/image/fetch/$s_!w6P9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png 1272w,
https://substackcdn.com/image/fetch/$s_!w6P9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w6P9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png" width="1456" height="871" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:871,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86996,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w6P9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png 424w, https://substackcdn.com/image/fetch/$s_!w6P9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png 848w, 
https://substackcdn.com/image/fetch/$s_!w6P9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png 1272w, https://substackcdn.com/image/fetch/$s_!w6P9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4dcddf0-becf-4389-845c-e781b9de706b_1487x890.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank"
href="https://substackcdn.com/image/fetch/$s_!SBsP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SBsP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png 424w, https://substackcdn.com/image/fetch/$s_!SBsP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png 848w, https://substackcdn.com/image/fetch/$s_!SBsP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png 1272w, https://substackcdn.com/image/fetch/$s_!SBsP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SBsP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43656,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SBsP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png 424w, https://substackcdn.com/image/fetch/$s_!SBsP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png 848w, https://substackcdn.com/image/fetch/$s_!SBsP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png 1272w, https://substackcdn.com/image/fetch/$s_!SBsP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2593541c-cd10-4f60-adc8-6542d31f9ef9_1480x370.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>Key observations:</h4><ol><li><p><strong>Aggregation shows higher variance than reasoning</strong> for all methods</p></li><li><p><strong>Custom RLM outperformed DSPy RLM</strong> 
on mean score (4.60 vs 3.77)</p></li><li><p><strong>RAG struggled</strong> with multi-document aggregation (33% failure rate). Unlike semantic similarity tasks, aggregation requires systematic coverage with a correct book-to-protagonist mapping. RAG&#8217;s top-k retrieval pulls the most semantically similar chunks, which may cluster around 2-3 books rather than sampling each of the 6 systematically. This likely explains RAG&#8217;s counterintuitively high failure rate despite its reputation for consistency.</p></li><li><p><strong>Both RLM variants showed full-range variance</strong> (0 to 6)</p></li></ol><h3>Model Selection Effect</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_73q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_73q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png 424w, https://substackcdn.com/image/fetch/$s_!_73q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png 848w, https://substackcdn.com/image/fetch/$s_!_73q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png 1272w, https://substackcdn.com/image/fetch/$s_!_73q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_73q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png" width="1456" height="868" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68277,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_73q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png 424w, https://substackcdn.com/image/fetch/$s_!_73q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png 848w, https://substackcdn.com/image/fetch/$s_!_73q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png 1272w, https://substackcdn.com/image/fetch/$s_!_73q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8f696d2-8444-43b5-8803-be2ef84412ab_1479x882.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>A striking finding:</strong> model capability determines RLM viability.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eqZf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!eqZf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png 424w, https://substackcdn.com/image/fetch/$s_!eqZf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png 848w, https://substackcdn.com/image/fetch/$s_!eqZf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png 1272w, https://substackcdn.com/image/fetch/$s_!eqZf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eqZf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png" width="1456" height="438" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48756,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eqZf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png 424w, https://substackcdn.com/image/fetch/$s_!eqZf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png 848w, https://substackcdn.com/image/fetch/$s_!eqZf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png 1272w, https://substackcdn.com/image/fetch/$s_!eqZf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ac4761-2f22-4b4f-8d74-f16b248d8fc5_1480x445.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Why reasoning models succeed: </strong>They can plan a systematic exploration strategy before executing. Standard models dive deep into the first interesting thread and exhaust their iteration budget.</p><h3>Cost Analysis</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UcgI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UcgI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png 424w, https://substackcdn.com/image/fetch/$s_!UcgI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png 848w, https://substackcdn.com/image/fetch/$s_!UcgI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png 1272w, https://substackcdn.com/image/fetch/$s_!UcgI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!UcgI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png" width="1456" height="876" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:876,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91382,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UcgI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png 424w, https://substackcdn.com/image/fetch/$s_!UcgI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png 848w, https://substackcdn.com/image/fetch/$s_!UcgI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png 1272w, https://substackcdn.com/image/fetch/$s_!UcgI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f33f14e-9aa5-4253-896d-7900b485b6bf_1479x890.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BLkM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!BLkM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png 424w, https://substackcdn.com/image/fetch/$s_!BLkM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png 848w, https://substackcdn.com/image/fetch/$s_!BLkM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png 1272w, https://substackcdn.com/image/fetch/$s_!BLkM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BLkM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png" width="1456" height="657" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72514,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BLkM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png 424w, https://substackcdn.com/image/fetch/$s_!BLkM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png 848w, https://substackcdn.com/image/fetch/$s_!BLkM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png 1272w, https://substackcdn.com/image/fetch/$s_!BLkM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81f1b0a-4b65-45eb-a446-ecef88551b93_1480x668.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Total variance testing cost:</strong> ~$28 for 120 RLM runs</p><h3>The Retry Strategy: Best-of-3</h3><p>If variance is unavoidable, can we mitigate it by running multiple times? We simulated a best-of-3 strategy using our existing 30 runs (taking the max score from each group of 3):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uYLs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uYLs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png 424w, https://substackcdn.com/image/fetch/$s_!uYLs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png 848w, https://substackcdn.com/image/fetch/$s_!uYLs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png 1272w, https://substackcdn.com/image/fetch/$s_!uYLs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uYLs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png" width="1456" height="394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uYLs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png 424w, https://substackcdn.com/image/fetch/$s_!uYLs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png 848w, https://substackcdn.com/image/fetch/$s_!uYLs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png 1272w, 
https://substackcdn.com/image/fetch/$s_!uYLs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11d70e15-dec3-4737-96c7-c75eec5c9b4a_1480x400.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>The practical takeaway: </strong>Running Custom RLM three times and keeping the best result achieves a 100% pass rate on aggregation in our sample, at ~3&#215; the cost of a single run. This turns an unreliable method into a deployable one.</p><h2>4. 
Discussion</h2><h3>Why Variance Matters</h3><p>We treat mean score as a measure of <strong>capability</strong> and failure probability as a measure of <strong>reliability</strong>. Both matter for deployment decisions, but they answer different questions.</p><p>The distributions we observed are heavy-tailed, with occasional catastrophic failures even when mean performance is high. This is why variance matters so much for agentic systems.</p><p>Single-run benchmarks are standard practice in AI research. Our findings suggest this practice may systematically mislead:</p><ol><li><p><strong>Cherry-picking risk: </strong>Researchers (consciously or not) may report favorable runs</p></li><li><p><strong>Reproducibility crisis:</strong> Others cannot replicate &#8220;good&#8221; results</p></li><li><p><strong>Deployment surprise: </strong>Production systems encounter the full variance distribution</p></li></ol><p><strong>Recommendation: </strong>Report mean and standard deviation from multiple runs, especially for agentic/iterative systems like RLM.</p><h3>When to Use Each Method</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jdzv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jdzv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png 424w, https://substackcdn.com/image/fetch/$s_!Jdzv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png 848w, 
https://substackcdn.com/image/fetch/$s_!Jdzv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png 1272w, https://substackcdn.com/image/fetch/$s_!Jdzv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Jdzv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png" width="1456" height="462" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:462,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87820,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jdzv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png 424w, 
https://substackcdn.com/image/fetch/$s_!Jdzv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png 848w, https://substackcdn.com/image/fetch/$s_!Jdzv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png 1272w, https://substackcdn.com/image/fetch/$s_!Jdzv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb31ce64-3774-4052-aa41-74d43de0a08f_1780x565.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>The Reasoning Model Requirement</h3><h4>RLM&#8217;s effectiveness depends critically on model capability:</h4><ul><li><p><strong>Standard models (gpt-4o, gpt-4o-mini): </strong>Cannot execute systematic exploration strategies. Get &#8220;stuck&#8221; in local optima.</p></li><li><p><strong>Reasoning models (gpt-5-mini, gpt-5.2):</strong> Plan before acting. Enumerate documents before diving deep.</p></li></ul><p><strong>Practical implication: </strong>Do not use RLM with standard models for complex tasks. The cost savings are not worth the reliability loss.</p><h3>Library vs Custom Implementation</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WA8W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WA8W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png 424w, https://substackcdn.com/image/fetch/$s_!WA8W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png 848w, https://substackcdn.com/image/fetch/$s_!WA8W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WA8W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WA8W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png" width="1456" height="462" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:462,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66468,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WA8W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png 424w, https://substackcdn.com/image/fetch/$s_!WA8W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png 848w, 
https://substackcdn.com/image/fetch/$s_!WA8W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png 1272w, https://substackcdn.com/image/fetch/$s_!WA8W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b92205-be3c-4f85-8c8e-7097417b9709_1780x565.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We compared two approaches: using DSPy&#8217;s built-in RLM module versus building a custom implementation following the paper&#8217;s 
methodology.</p><p><strong>Why did our custom implementation outperform on aggregation?</strong> Our custom prompts explicitly guided the model to sample the beginning of each document first, learn the naming conventions, and systematically enumerate all books before diving deep. DSPy&#8217;s generic RLM module lacks this task-specific guidance, which may explain why it excelled at depth (single-document reasoning) but struggled with breadth (multi-document coverage).</p><p><strong>Recommendation: </strong>For single-document reasoning where consistency matters, use an existing library like DSPy&#8217;s RLM module. For multi-document synthesis where mean accuracy matters more than run-to-run variance, a custom implementation with task-specific prompts may yield better results.</p><h2>5. Limitations</h2><ol><li><p><strong>Literary corpus only:</strong> Results may differ on technical, legal, or scientific documents.</p></li><li><p><strong>Training data contamination: </strong>These classic novels are almost certainly in the training data of frontier models. We cannot determine how much the models &#8220;remember&#8221; versus genuinely discover through RLM exploration. Results on proprietary or previously unseen documents may differ.</p></li><li><p><strong>Single model family: </strong>All tests used OpenAI models; other providers may show different patterns.</p></li><li><p><strong>English only: </strong>Non-English documents were not tested.</p></li><li><p><strong>Scoring subjectivity:</strong> Key-term matching is an imperfect measure for nuanced questions.</p></li></ol><h2>6. Conclusion</h2><p>RLM delivers on its promise of efficient long-document processing, but with important caveats:</p><p><strong>Variance is real and significant.</strong></p><p>Plan for it. 
Run multiple times for important queries.</p><p><strong>Model selection is critical.</strong></p><p>Reasoning models are not optional; they&#8217;re required for reliable RLM.</p><p><strong>Task type matters.</strong></p><p>RLM excels at single-document reasoning but struggles more with multi-document aggregation.</p><p><strong>Tradeoffs are real.</strong></p><p>Lower token costs come with higher variance. Chunking&#8217;s brute-force consistency has value.</p><p><strong>For practitioners:</strong> If you need consistent results on stable corpora, invest in RAG infrastructure. If you need flexible ad-hoc queries without setup, use RLM with a reasoning model, but run it multiple times and aggregate results.</p><p><strong>For researchers: </strong>Report variance. Single-run benchmarks on agentic systems may be systematically misleading.</p><h2>Appendix: Raw Data</h2><p>All n=30 results are available in CSV format upon request.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6gsu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6gsu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png 424w, https://substackcdn.com/image/fetch/$s_!6gsu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png 848w, 
https://substackcdn.com/image/fetch/$s_!6gsu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png 1272w, https://substackcdn.com/image/fetch/$s_!6gsu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6gsu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png" width="1456" height="827" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124659,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.anothercodingblog.com/i/186914127?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6gsu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png 424w, 
https://substackcdn.com/image/fetch/$s_!6gsu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png 848w, https://substackcdn.com/image/fetch/$s_!6gsu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png 1272w, https://substackcdn.com/image/fetch/$s_!6gsu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffdab3d09-7605-4af5-a12e-c7ae83ef1e75_2079x1181.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>What this chart shows:</strong> Each panel plots score (y-axis) against run number (x-axis) for one method/task combination. The black horizontal line is the mean. Notice there&#8217;s no visible pattern: run #1 isn&#8217;t systematically better or worse than run #30. Scores show no trend with run order, so the spread is not a warmup effect or degradation over time. This suggests the variance we observed is inherent to the method rather than an artifact of our testing procedure.</p><p><em>Research conducted February 2026. Code and data are available upon request.</em></p>]]></content:encoded></item></channel></rss>