ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

Abstract

Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual devices create performance skew across devices, diminishing overall training efficiency. Existing resilient systems overlook sequence-length variability in datasets and device performance skew under hybrid parallelism. As a result, they (1) misidentify iteration-time fluctuations as failures and (2) apply isolated adaptations within hybrid parallelism to mitigate failures, leading to inaccurate failure detection and inefficient resilient training. In response, this paper presents ResiHP, a resilient system that enables accurate failure detection and fine-grained adaptation for hybrid parallel training. First, we develop a Detector that accurately identifies failures. In particular, it employs a FLOPs-aware normalization that disentangles failures from iteration-time fluctuations while remaining lightweight enough for online detection. Second, we design a Scheduler that dynamically adapts parallelism group sizes, model partitioning, workload scheduling policies, and workload allocation to improve training efficiency under failures. Third, we implement an Executor that optimizes the communication strategy of the dynamically adapted parallelism configurations, further enhancing overall efficiency. Experiments on a 256-GPU cluster show that ResiHP achieves near-optimal failure detection accuracy and improves training throughput by 1.13–2.22× over state-of-the-art resilient training systems across diverse failure scenarios.
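The core idea behind the Detector, FLOPs-aware normalization, can be illustrated with a small sketch: instead of comparing raw iteration times (which swing with the sequence lengths in each batch), compare iteration time divided by an estimate of that iteration's FLOPs. The code below is a minimal illustration of this idea, not ResiHP's implementation; the FLOPs formula, thresholds, and function names (`flops_per_iter`, `check_iteration`, `slow_factor`) are all illustrative assumptions.

```python
import statistics

def flops_per_iter(seq_lens, hidden=4096, layers=32):
    # Rough per-iteration transformer FLOPs estimate: an MLP term linear in
    # seq_len plus an attention term quadratic in seq_len. Constant factors
    # cancel out, since only relative comparison matters for detection.
    return sum(12 * layers * hidden * (hidden * s + s * s) for s in seq_lens)

def check_iteration(iter_time, seq_lens, history, slow_factor=1.5):
    """Return True if this iteration's FLOPs-normalized time exceeds the
    historical median by slow_factor -- i.e., a genuine device slowdown
    rather than a batch that simply contained longer sequences."""
    norm_t = iter_time / flops_per_iter(seq_lens)
    failed = len(history) >= 4 and norm_t > slow_factor * statistics.median(history)
    if not failed:
        history.append(norm_t)  # only healthy iterations update the baseline
    return failed
```

Under this scheme, a batch of long sequences takes proportionally longer but yields the same normalized time, so it is not flagged; a degraded device inflates the normalized time and is flagged regardless of batch composition.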

Publication
Preprint
Zhisheng YE
Machine Learning Systems Researcher

My research interests include algorithm–system co-design for machine learning systems and resource management.
