Posts

Helix: Automating Communication-Computation Overlap with Graph Scheduling

A technical note on Helix, a compiler-based graph scheduling system that overlaps communication and computation for n-D model parallel training and inference.

Zhisheng YE

May 18, 2026 9 min read

ResiHP: Surviving LLM Training Failures with Dynamic Hybrid Parallelism

A technical report on ResiHP, a resilient training system that detects fail-slow devices under noisy sequence-length variation and dynamically adapts 3D parallelism.

Zhisheng YE

May 17, 2026 3 min read

CONCUR: Controlling Mid-Phase Thrashing in Agentic Batch Inference

A technical note on CONCUR, an agent-level admission control layer that prevents KV cache collapse during long-running agentic LLM inference.

Zhisheng YE

May 17, 2026 5 min read

ASTRAEA: Fairness Is More Than Counting GPUs

A technical note on ASTRAEA, a multi-tenant GPU scheduler that measures fairness by long-term GPU-time instead of instantaneous allocation or finish time alone.

Zhisheng YE

May 17, 2026 5 min read

Hydro: Squeezing Hyperparameter Tuning into Pipeline Bubbles

A technical story behind Hydro’s Bubble Squeezer, which runs surrogate hyperparameter tuning trials inside the idle bubbles of pipeline-parallel large-model training.

Zhisheng YE

May 17, 2026 8 min read

GPU Cluster Scheduling: A Map for Deep Learning Workloads

A technical guide to GPU datacenter scheduling based on our ACM Computing Surveys paper, covering training, inference, HPO, mixed workloads, and future scheduler design.

Zhisheng YE

May 16, 2026 7 min read

GPU Pause, Resume, and Migration: The Missing Primitive in Cluster Scheduling

A technical note on GPU checkpoint/restore for schedulers, using FlowGPU as the main reference and my cudaw prototype as the first version of the codebase.

Zhisheng YE

May 15, 2026 8 min read

Optimizations and Services

Several optimizations and current services related to this site.

Zhisheng YE

Aug 30, 2021 2 min read