Biography

Hi there! I am Zhisheng Ye. I received my Ph.D. from the Institute of Networking and Energy-efficient Computing (NEEC) at Peking University in 2024, under the joint supervision of Prof. Yingwei Luo, the director of NEEC, and Prof. Xiaolin Wang. Before that, I received a B.S. degree in Computer Science and Technology from the School of Electronics Engineering and Computer Science (EECS) at Peking University, China, in 2019.

My research interests include resource management in machine learning systems and building efficient, practical systems for training and serving next-generation DNNs. As a former member of PKUSC, I am also interested in high-performance computing and GPU systems. I received mentorship from Prof. Tianwei Zhang of NTU and collaborated closely with his students, including Wei Gao, Dr. Qinghao Hu, Meng Zhang, and Qiaoling Chen. I was also mentored by, and collaborated with, Dr. Peng Sun of the NDS Group at Shanghai AI Lab.

Download my CV.

Interests
  • Distributed Systems
  • Machine Learning Systems
  • Resource Management
Education
  • DSc in Computer Architecture, 2024

    Peking University

  • BSc in Computer Science and Technology, 2019

    Peking University

Experience

Shanghai AI Laboratory
Research Intern
Jul 2022 – Jan 2024 Beijing, China
  • Large-scale model (e.g., LLM, MoE) training infrastructure optimization.
  • Deeply involved in the development of InternLM.

SenseTime Research
Research Intern
Sep 2019 – Jun 2022 Beijing, China
  • Supercomputing cluster scheduling and optimization for deep learning training (DLT) workloads at SenseTime Research (now SenseCore).
  • Design and implementation of a fair scheduler for DLT jobs as first author.

Peng Cheng Laboratory
Research Intern
Jul 2018 – Sep 2021 Shenzhen, China
  • Contributed to the development of OpenI-Octopus, an open-source scheduler for deep learning training workloads based on Kubernetes.
  • Safe GPU sharing and efficient migration mechanisms on Kubernetes.
  • Monitoring and logging systems.

Peking University Cluster Competition Team
Team member
Sep 2018 – Jun 2019 Beijing, China
  • Participated in analyzing, compiling, profiling, and optimizing general HPC tasks and improving their parallelizability.
  • First Prize (Team), ASC19 Student Supercomputer Challenge

Recent Publications

(2024). Characterization of Large Language Model Development in the Datacenter. In NSDI.

(2023). Deep Learning Workload Scheduling in GPU Datacenters: A Survey. In CSUR.

(2023). Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters. In OSDI.

(2022). Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster. In ICCD.
