<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Congestion Control | 木叶吟</title><link>https://yezhisheng.me/tag/congestion-control/</link><atom:link href="https://yezhisheng.me/tag/congestion-control/index.xml" rel="self" type="application/rss+xml"/><description>Congestion Control</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright> 又拍云提供CDN服务
京ICP备16021535号-1</copyright><lastBuildDate>Sun, 17 May 2026 13:10:00 +0800</lastBuildDate><image><url>https://yezhisheng.me/media/icon_hu585778a5d9441f07b7d64e1beae1be58_320895_512x512_fill_lanczos_center_3.png</url><title>Congestion Control</title><link>https://yezhisheng.me/tag/congestion-control/</link></image><item><title>CONCUR: Controlling Mid-Phase Thrashing in Agentic Batch Inference</title><link>https://yezhisheng.me/post/concur/</link><pubDate>Sun, 17 May 2026 13:10:00 +0800</pubDate><guid>https://yezhisheng.me/post/concur/</guid><description>&lt;p>Batch inference for LLMs used to be shaped by requests. A request arrives, the server schedules prefill and decode, the KV cache grows for that sequence, and the request eventually leaves.&lt;/p>
&lt;p>Agentic workloads are different. An agent is not one request. It is a long-running loop of planning, tool calls, observations, and follow-up generations. Many agents can stay alive at the same time, and each one gradually accumulates KV state. The server may still see individual requests, but the resource pressure is created by agent lifetimes.&lt;/p>
&lt;p>&lt;a href="https://yezhisheng.me/publication/concur/">CONCUR&lt;/a> focuses on the pathology that appears in this setting: mid-phase thrashing.&lt;/p>
&lt;h2 id="what-is-mid-phase-thrashing">What Is Mid-Phase Thrashing?&lt;/h2>
&lt;p>A long-running batch of agents does not fail immediately. At the beginning, most agents have short histories, KV cache demand is modest, and throughput looks healthy. Near the end, many agents have already completed, so pressure drops again.&lt;/p>
&lt;p>The hard part is the middle.&lt;/p>
&lt;p>In the mid phase, many agents are still active and their histories have grown. The aggregate KV cache footprint becomes large, but GPU memory may not be completely exhausted yet. This is what makes the problem subtle: the system can pass a capacity check while the cache is already becoming inefficient.&lt;/p>
&lt;p>When KV pressure crosses a threshold, request-level cache management starts fighting itself. A serving system may evict old KV blocks to make room for new ones. But agentic workloads soon return to those evicted histories, because the same agents keep generating, calling tools, and continuing. The server then has to recompute or reload context, which consumes GPU time and causes more cache churn. More churn leads to more eviction. More eviction leads to more recomputation. Throughput collapses before memory capacity is formally exhausted.&lt;/p>
&lt;p>That collapse is mid-phase thrashing.&lt;/p>
&lt;h2 id="why-request-level-control-is-too-late">Why Request-Level Control Is Too Late&lt;/h2>
&lt;p>The root cause is a mismatch of control granularity. The serving runtime manages individual requests, but the pressure source is the number of active agents.&lt;/p>
&lt;p>If too many agents are admitted together, each agent continues to grow its own history. A reactive cache policy can only respond after the KV cache is already congested. LRU-style eviction may be locally reasonable for a single request stream, but it is a poor global signal for agentic workloads. It does not know that an evicted block belongs to a still-living agent that will likely need it again soon.&lt;/p>
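&lt;p>A toy simulation can make this failure mode concrete. The sketch below is not CONCUR's code or any real engine's cache manager: it assumes each agent's whole history is a single KV block, that agents take turns in round-robin order, and that the cache is a plain LRU. The function name and parameters are hypothetical.&lt;/p>

```python
from collections import OrderedDict


def count_recomputes(num_agents: int, cache_blocks: int, rounds: int) -> int:
    """Count cache misses when agents take turns and each reuses its own history.

    Simplifying assumption: one agent's entire history is one KV block.
    A miss stands in for recomputing or reloading that agent's context.
    """
    cache = OrderedDict()  # keys are agent ids, ordered least- to most-recently used
    misses = 0
    for _ in range(rounds):
        for agent in range(num_agents):
            if agent in cache:
                # Hit: the agent's history is still resident, so refresh its recency.
                cache.move_to_end(agent)
            else:
                # Miss: the history must be rebuilt before the agent can continue.
                misses += 1
                if len(cache) >= cache_blocks:
                    # LRU evicts the oldest block, which belongs to a still-living agent.
                    cache.popitem(last=False)
                cache[agent] = True
    return misses
```

&lt;p>With these toy parameters, 4 agents sharing 4 blocks incur 4 misses over 10 rounds (the cold start only), while 5 agents sharing the same 4 blocks incur 50: every single access misses, because LRU always evicts the block that will be needed soonest.&lt;/p>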
&lt;p>In other words, the system is not just running out of memory. It is admitting too many long-lived state machines into a shared cache.&lt;/p>
&lt;p>CONCUR changes the question from &amp;ldquo;which KV block should we evict now?&amp;rdquo; to &amp;ldquo;how many agents should be active at the same time?&amp;rdquo;&lt;/p>
&lt;h2 id="agent-level-admission-control">Agent-Level Admission Control&lt;/h2>
&lt;p>CONCUR adds a lightweight control layer above the LLM serving engine. It does not replace the backend cache manager. Instead, it regulates agent admission so the aggregate active-agent pressure stays below the point where cache efficiency collapses.&lt;/p>
&lt;p>The design borrows the spirit of congestion control. The KV cache is treated as a shared bottleneck resource, and the number of concurrently active agents becomes the control window. When runtime cache signals indicate the system is healthy, CONCUR increases concurrency to use more capacity. When the signals show congestion, it backs off before thrashing takes over.&lt;/p>
&lt;p>This is closer to AIMD-style (additive increase, multiplicative decrease) control than to static batching. Additive increase lets the system cautiously probe for more parallelism. Multiplicative decrease reacts quickly when cache pressure becomes dangerous. The important detail is that the control unit is an agent, not a request. Pausing admission of new agents preserves execution continuity for already-admitted agents and avoids repeatedly evicting the histories they will soon reuse.&lt;/p>
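&lt;p>The window update itself can be sketched in a few lines. This is a generic AIMD rule applied to an agent count, not the controller from the paper: the class name, the default step sizes, and the boolean &lt;code>congested&lt;/code> input (standing in for whatever runtime cache signal is used) are all assumptions made for illustration.&lt;/p>

```python
from dataclasses import dataclass


@dataclass
class AimdAgentWindow:
    """Toy AIMD controller over the number of concurrently active agents.

    All names and default values are hypothetical, not taken from CONCUR.
    """

    window: float = 4.0           # current cap on active agents
    min_window: float = 1.0       # never drop below one active agent
    max_window: float = 256.0     # upper bound on the cap
    add_step: float = 1.0         # additive increase per healthy interval
    decrease_factor: float = 0.5  # multiplicative decrease on congestion

    def update(self, congested: bool) -> int:
        """Apply one control step and return the new integer window."""
        if congested:
            # Back off quickly when the cache signal reports pressure.
            self.window = max(self.min_window, self.window * self.decrease_factor)
        else:
            # Probe cautiously for more parallelism while the cache is healthy.
            self.window = min(self.max_window, self.window + self.add_step)
        return int(self.window)

    def can_admit(self, active_agents: int) -> bool:
        """Gate new agents only; already-admitted agents are never preempted here."""
        return int(self.window) > active_agents
```

&lt;p>Starting from a window of 4, two healthy intervals raise it to 6, and one congested interval then halves it to 3. The asymmetry is the point: growth is slow and reversible, while backoff is fast enough to get out of the thrashing regime.&lt;/p>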
&lt;p>This proactive control also preserves compatibility. Existing LLM serving systems can continue to manage request scheduling and KV placement internally. CONCUR only decides how many agents should be allowed into the active set based on cache-aware feedback.&lt;/p>
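&lt;p>One way to picture the layering is an admission gate that sits in front of an unmodified engine. The helper below is a hypothetical sketch: the function name, the pending queue, and the active set are illustrative, and the engine itself is left out entirely because the gate never touches request scheduling or KV placement.&lt;/p>

```python
from collections import deque


def admit_up_to_window(pending: deque, active: set, window: int) -> list:
    """Move agents from the pending queue into the active set until the window is full.

    Nothing is ever removed from `active` here: a smaller window only pauses
    new admissions, and running agents leave on their own when they finish.
    """
    admitted = []
    while pending and window > len(active):
        agent_id = pending.popleft()
        active.add(agent_id)
        admitted.append(agent_id)
    return admitted
```

&lt;p>If the window shrinks below the current number of active agents, the call simply admits nobody. The active set then drains as agents complete, which is how the sketch mirrors the idea of preserving continuity for agents that are already running.&lt;/p>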
&lt;h2 id="why-it-works">Why It Works&lt;/h2>
&lt;p>Mid-phase thrashing is caused by cumulative state pressure, so the solution has to act before the cache reaches the thrashing regime. By bounding active agents, CONCUR reduces the number of long-lived contexts competing for KV cache at once. The system may run fewer agents concurrently, but each active agent experiences less eviction and recomputation, so useful generation throughput improves.&lt;/p>
&lt;p>The paper reports that CONCUR prevents mid-phase thrashing across large models and real-world agent workloads, improving batch inference throughput by up to 4.09x on Qwen3-32B and 1.9x on DeepSeek-V3.&lt;/p>
&lt;p>The lesson is simple but easy to miss: for agentic inference, the right scheduling object is not always the request. Sometimes the request is only a symptom. The agent is the entity accumulating state, consuming cache over time, and returning to the same history again and again. CONCUR makes that entity visible to the serving system.&lt;/p>
&lt;p>Paper: &lt;a href="https://yezhisheng.me/publication/concur/">CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control&lt;/a>&lt;br>
Preprint: &lt;a href="https://arxiv.org/abs/2601.22705" target="_blank" rel="noopener">arXiv:2601.22705&lt;/a>&lt;/p></description></item><item><title>CONCUR: Steering Agent Batch Inference Away from Mid-Phase Congestion</title><link>https://yezhisheng.me/zh/post/concur/</link><pubDate>Sun, 17 May 2026 13:10:00 +0800</pubDate><guid>https://yezhisheng.me/zh/post/concur/</guid><description>&lt;p>LLM batch inference used to be organized around requests. After a request arrives, the server schedules prefill and decode, maintains the KV cache for that sequence, and releases the associated state once the request finishes.&lt;/p>
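&lt;p>Agentic workloads are different. An agent is not a single request but a long-running loop: planning, tool call, observation, and then further generation. Many agents can be alive at once, each gradually accumulating its own KV state. The server still sees individual requests, but what actually creates resource pressure is the agent lifecycle.&lt;/p>
&lt;p>&lt;a href="https://yezhisheng.me/publication/concur/">CONCUR&lt;/a> focuses on a pathology that arises in exactly this setting: mid-phase thrashing.&lt;/p>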
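&lt;h2 id="什么是中期抖动">What Is Mid-Phase Thrashing&lt;/h2>
&lt;p>A batch of long-running agents does not fail right away. At the start, most agents have short contexts, KV cache demand is limited, and throughput looks healthy. Near the end, many agents have already finished, so pressure drops as well.&lt;/p>
&lt;p>The hard part is the middle stage.&lt;/p>
&lt;p>In the mid phase, a large number of agents are still active and their contexts have grown substantially. The overall KV cache footprint becomes large, yet GPU memory may not be fully exhausted. That makes the problem hard to spot: judged by a capacity check alone, the system still seems able to run; judged by cache efficiency, it is already close to losing control.&lt;/p>
&lt;p>Once KV pressure crosses a certain threshold, request-level cache management starts fighting itself. The serving system may evict old KV blocks to make room for new content. But an agentic workload soon returns to that evicted history, because the same agent keeps generating, calling tools, and reasoning further. The server then has to recompute or reload the context, which consumes GPU time and causes more cache churn. More churn leads to more eviction, and more eviction leads to more recomputation. Throughput collapses before memory capacity is formally exhausted.&lt;/p>
&lt;p>That is mid-phase thrashing.&lt;/p>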
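&lt;h2 id="为什么只管请求已经太晚">Why Managing Only Requests Is Already Too Late&lt;/h2>
&lt;p>The root cause is a mismatch in control granularity. The serving runtime manages individual requests, but the pressure comes from the number of active agents.&lt;/p>
&lt;p>If too many agents are admitted at the start, each one keeps growing its own context. A reactive cache policy can respond only after the KV cache is already congested. LRU-style eviction may be locally reasonable for a single request stream, but it is not a good global signal for agentic workloads. It does not know that an evicted block belongs to a still-living agent that will very likely need it again soon.&lt;/p>
&lt;p>In other words, the system is not simply “out of memory”. It has put too many long-lived state machines into one shared cache at the same time.&lt;/p>
&lt;p>CONCUR rewrites the question from “which KV block should be evicted now” to “how many agents should be allowed to stay active at the same time”.&lt;/p>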
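&lt;h2 id="在-agent-层做准入控制">Admission Control at the Agent Level&lt;/h2>
&lt;p>CONCUR adds a lightweight control layer above the LLM serving engine. It does not replace the backend cache manager; instead it regulates agent admission so that the aggregate pressure from active agents stays short of the point where cache efficiency collapses.&lt;/p>
&lt;p>The design borrows from congestion control. The KV cache is treated as a shared bottleneck resource, and the number of concurrently active agents is the control window. When runtime cache signals indicate the system is healthy, CONCUR raises concurrency to use more capacity; when the signals show the system is becoming congested, it backs off before thrashing gets out of hand.&lt;/p>
&lt;p>This is closer to AIMD-style control than to static batching. Additive increase lets the system cautiously explore more parallelism, while multiplicative decrease reacts quickly when cache pressure becomes dangerous. The key detail is that the unit of control is the agent, not the request. Pausing the admission of new agents preserves execution continuity for agents already in the system and avoids repeatedly evicting history they will soon reuse.&lt;/p>
&lt;p>This proactive control also preserves compatibility. Existing LLM serving systems can still manage request scheduling and KV placement internally. CONCUR only decides, based on cache-aware feedback, how many agents are allowed in the active set.&lt;/p>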
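&lt;h2 id="为什么有效">Why It Works&lt;/h2>
&lt;p>Mid-phase thrashing stems from cumulative state pressure, so the solution must step in before the cache enters the thrashing regime. By bounding the number of active agents, CONCUR reduces the long-lived contexts competing for KV cache at any one time. The system may run fewer agents concurrently, but each active agent sees less eviction and recomputation, so effective generation throughput goes up.&lt;/p>
&lt;p>The paper reports that CONCUR avoids mid-phase thrashing on large models and real agent workloads, improving batch inference throughput by up to 4.09x on Qwen3-32B and 1.9x on DeepSeek-V3.&lt;/p>
&lt;p>The lesson is simple but easy to overlook: for agentic inference, the right scheduling object is not always the request. Sometimes the request is only the surface. The entity that actually accumulates state, keeps consuming cache, and returns to the same history over and over is the agent. CONCUR lets the serving system see that entity.&lt;/p>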
&lt;p>Paper: &lt;a href="https://yezhisheng.me/publication/concur/">CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control&lt;/a>&lt;br>
Preprint: &lt;a href="https://arxiv.org/abs/2601.22705" target="_blank" rel="noopener">arXiv:2601.22705&lt;/a>&lt;/p></description></item></channel></rss>