木叶吟
Deep Learning Systems
GPU Pause, Resume, and Migration: The Missing Primitive in Cluster Scheduling
A technical note on GPU checkpoint/restore for schedulers, using FlowGPU as the main reference and my cudaw prototype as the first version of the codebase.
Zhisheng YE
May 17, 2026
8 min read
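Pausing, Resuming, and Migrating GPU Jobs: The Puzzle Piece Schedulers Have Always Been Missing (Chinese edition)
A technical note on GPU checkpoint/restore: with FlowGPU as the main thread, it describes how the cudaw prototype explores transparent pause, resume, and migration.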
Zhisheng YE
May 17, 2026
GPU Cluster Scheduling: A Map for Deep Learning Workloads
A technical guide to GPU datacenter scheduling based on our ACM Computing Surveys paper, covering training, inference, HPO, mixed workloads, and future scheduler design.
Zhisheng YE
May 17, 2026
7 min read
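GPU Cluster Scheduling: How Deep Learning Jobs Should Be Queued, Placed, and Shared (Chinese edition)
Based on our ACM Computing Surveys paper, this post covers training, inference, HPO, mixed workloads, and future scheduler design in GPU datacenters.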
Zhisheng YE
May 17, 2026
ASTRAEA: Fairness Is More Than Counting GPUs
A technical note on ASTRAEA, a multi-tenant GPU scheduler that measures fairness by long-term GPU-time instead of instantaneous allocation or finish time alone.
Zhisheng YE
May 17, 2026
4 min read