GPU Scheduling | 木叶吟

Based on our ACM Computing Surveys paper, an overview of training, inference, HPO, mixed workloads, and future scheduler design in GPU datacenters.

Zhisheng YE

May 17, 2026

ASTRAEA: Fairness Is More Than Counting GPUs

A technical note on ASTRAEA, a multi-tenant GPU scheduler that measures fairness by long-term GPU-time instead of instantaneous allocation or finish time alone.

Zhisheng YE

May 17, 2026 5 min read

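ASTRAEA: Fairness in a GPU Cluster Is More Than How Many Cards You Get

A technical note on ASTRAEA: it targets multi-tenant GPU clusters and measures fairness by long-term GPU-time, rather than looking only at instantaneous allocation or job completion time.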

Zhisheng YE

May 17, 2026

Hydro: Squeezing Hyperparameter Tuning into Pipeline Bubbles

A technical story behind Hydro’s Bubble Squeezer, which runs surrogate hyperparameter tuning trials inside the idle bubbles of pipeline-parallel large-model training.

Zhisheng YE

May 17, 2026 8 min read

GPU Cluster Scheduling: A Map for Deep Learning Workloads

A technical guide to GPU datacenter scheduling based on our ACM Computing Surveys paper, covering training, inference, HPO, mixed workloads, and future scheduler design.

Zhisheng YE

May 16, 2026 7 min read

GPU Pause, Resume, and Migration: The Missing Primitive in Cluster Scheduling

A technical note on GPU checkpoint/restore for schedulers, using FlowGPU as the main reference and my cudaw prototype as the first version of the codebase.

Zhisheng YE

May 15, 2026 8 min read

FlowGPU: Transparent and Efficient GPU Checkpointing and Restore

GPU checkpointing and restore promises to enable emerging tasks, such as deep learning, to benefit from functionalities like task …

Zehua Yang, Xiao Zheng, Yonghao Zou, Junyang Zhang, Zhisheng YE, Feng Xie, Xiaolin Wang, Yingwei Luo, Zhenlin Wang, Diyu Zhou

Characterization of Large Language Model Development in the Datacenter

Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to …

Qinghao Hu, Zhisheng YE, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang

UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands

We present UniSched, a unified scheduler to optimize different types of scheduling objectives (e.g., guaranteeing the deadlines of SLO jobs, minimizing the latency of best-effort jobs).

Wei Gao, Zhisheng YE, Peng Sun, Tianwei Zhang, Yonggang Wen

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and …

Zhisheng YE, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen
