木叶吟
GPU Scheduling
ASTRAEA: Fairness Is More Than Counting GPUs
A technical note on ASTRAEA, a multi-tenant GPU scheduler that measures fairness by long-term GPU-time instead of instantaneous allocation or finish time alone.
Zhisheng YE
May 17, 2026
4 min read
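ASTRAEA: Fairness in a GPU Cluster Is More Than How Many Cards You Get
A technical note on ASTRAEA: designed for multi-tenant GPU clusters, it measures fairness by long-term GPU-time rather than by instantaneous allocation or job completion time alone.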
Zhisheng YE
May 17, 2026
Hydro: Squeezing Hyperparameter Tuning into Pipeline Bubbles
A technical story behind Hydro’s Bubble Squeezer, which runs surrogate hyperparameter tuning trials inside the idle bubbles of pipeline-parallel large-model training.
Zhisheng YE
May 17, 2026
7 min read
GPU Cluster Scheduling: A Map for Deep Learning Workloads
A technical guide to GPU datacenter scheduling based on our ACM Computing Surveys paper, covering training, inference, HPO, mixed workloads, and future scheduler design.
Zhisheng YE
May 16, 2026
7 min read
GPU Pause, Resume, and Migration: The Missing Primitive in Cluster Scheduling
A technical note on GPU checkpoint/restore for schedulers, using FlowGPU as the main reference and my cudaw prototype as the first version of the codebase.
Zhisheng YE
May 15, 2026
8 min read
FlowGPU: Transparent and Efficient GPU Checkpointing and Restore
GPU checkpointing and restore promises to enable emerging tasks, such as deep learning, to benefit from functionalities like task …
Zehua Yang, Xiao Zheng, Yonghao Zou, Junyang Zhang, Zhisheng YE, Feng Xie, Xiaolin Wang, Yingwei Luo, Zhenlin Wang, Diyu Zhou
Characterization of Large Language Model Development in the Datacenter
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to …
Qinghao Hu, Zhisheng YE, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang
UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands
We present UniSched, a unified scheduler to optimize different types of scheduling objectives (e.g., guaranteeing the deadlines of SLO jobs, minimizing the latency of best-effort jobs).
Wei Gao, Zhisheng YE, Peng Sun, Tianwei Zhang, Yonggang Wen
Deep Learning Workload Scheduling in GPU Datacenters: A Survey
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and …
Zhisheng YE, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen
Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
Hyperparameter tuning is an essential step in deep learning model development that provides better model performance at the cost of …
Qinghao Hu, Zhisheng YE, Meng Zhang, Qiaoling Chen, Peng Sun, Yonggang Wen, Tianwei Zhang