Zhisheng YE

Machine Learning Systems Researcher

ByteDance

Peking University

Biography

Hi there! This is Zhisheng Ye. I am a machine learning systems researcher on the Applied Machine Learning team at ByteDance, where I build efficient, practical systems for emerging recommendation models and LLM workloads.

I received my Ph.D. from the TELOS Systems Lab, NEEC, at Peking University in 2024, under the joint supervision of Prof. Yingwei Luo (director of NEEC) and Prof. Xiaolin Wang. Previously, I received a B.S. degree in Computer Science and Technology from EECS at Peking University, China, in 2019.

My research focuses on building efficient systems for ML and emerging LLM workloads across the entire stack, spanning training frameworks, resource scheduling, and GPU/HPC optimization. I am a former member of PKUSC. I have received mentorship from Prof. Tianwei Zhang of NTU and Peng Sun, and have collaborated closely with Prof. Zhang’s students, including Wei Gao, Qinghao Hu, Meng Zhang, and Qiaoling Chen.

Download my CV.

Interests
  • AI Infrastructure for LLMs
  • Machine Learning Systems
  • Resource Management
Education
  • Ph.D. in Computer Architecture, 2024

    Peking University

  • B.S. in Computer Science and Technology, 2019

    Peking University

Experience

ByteDance
Machine Learning Systems Researcher
Jul 2024 – Present · Beijing, China
Shanghai AI Laboratory
Research Intern
Jul 2022 – Jan 2024 · Beijing, China
  • Large-scale model (e.g., LLM, MoE) training infrastructure optimization.
  • Deeply involved in the development of InternLM.
SenseTime Research
Research Intern
Sep 2019 – Jun 2022 · Beijing, China
  • Supercomputing cluster scheduling and optimization for deep learning training workloads at SenseTime Research (now SenseCore).
  • Design and implementation of a fair scheduler for deep learning training (DLT) jobs, as first author.
Peng Cheng Laboratory
Research Intern
Jul 2018 – Sep 2021 · Shenzhen, China
  • Contributed to the development of OpenI-Octopus, an open-source Kubernetes-based scheduler for deep learning training workloads.
  • Safe GPU sharing and efficient migration mechanisms on Kubernetes.
  • Monitoring and logging systems.
Peking University Cluster Competition Team
Team member
Sep 2018 – Jun 2019 · Beijing, China
  • Analyzed, compiled, profiled, and optimized general HPC workloads, improving their parallelizability.
  • First Prize (Team), ASC19 Student Supercomputer Challenge.

Recent Publications

(2026). CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control. In ICML.

(2026). FlowGPU: Transparent and Efficient GPU Checkpointing and Restore. In Euro-Par.

(2026). Latency-SLO-Aware Memory Offloading for Large Language Model Inference. In ICS.

(2026). ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism. In HPDC.

(2025). LEMUR: Large Scale End-to-End Multimodal Recommendation. arXiv.
