木叶吟
Posts
GPU Pause, Resume, and Migration: The Missing Primitive in Cluster Scheduling
A technical note on GPU checkpoint/restore for schedulers, using FlowGPU as the main reference and my cudaw prototype as the first version of the codebase.
Zhisheng YE
May 17, 2026
8 min read
GPU Cluster Scheduling: A Map for Deep Learning Workloads
A technical guide to GPU datacenter scheduling based on our ACM Computing Surveys paper, covering training, inference, HPO, mixed workloads, and future scheduler design.
Zhisheng YE
May 17, 2026
7 min read
ResiHP: Surviving LLM Training Failures with Dynamic Hybrid Parallelism
A technical report on ResiHP, a resilient training system that detects fail-slow devices under noisy sequence-length variation and dynamically adapts 3D parallelism.
Zhisheng YE
May 17, 2026
3 min read
CONCUR: Controlling Mid-Phase Thrashing in Agentic Batch Inference
A technical note on CONCUR, an agent-level admission control layer that prevents KV cache collapse during long-running agentic LLM inference.
Zhisheng YE
May 17, 2026
4 min read
ASTRAEA: Fairness Is More Than Counting GPUs
A technical note on ASTRAEA, a multi-tenant GPU scheduler that measures fairness by long-term GPU-time instead of instantaneous allocation or finish time alone.
Zhisheng YE
May 17, 2026
4 min read
Hydro: Squeezing Hyperparameter Tuning into Pipeline Bubbles
A technical story behind Hydro’s Bubble Squeezer, which runs surrogate hyperparameter tuning trials inside the idle bubbles of pipeline-parallel large-model training.
Zhisheng YE
May 17, 2026
7 min read
Optimizations and Services
Several optimizations and current services related to this site.
Zhisheng YE
Aug 30, 2021
2 min read