ResiHP: Surviving LLM Training Failures with Dynamic Hybrid Parallelism

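Reference reading (Zhihu, in Chinese; title translated): "What do you do when large-model training hits a GPU failure? Our approach is to dynamically adjust 3D parallelism."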

Large-scale LLM training is not a single distributed-systems problem. It is several, stacked on top of each other.

At the scale of hundreds or thousands of GPUs, failures are no longer rare events. Some devices disappear completely. Others stay alive but become slower. The second case is especially unpleasant: a fail-slow GPU does not crash the job, but it drags the whole synchronous training iteration behind it. In hybrid parallel training, that delay can propagate through tensor parallelism, pipeline parallelism, and data parallelism until one weak device quietly dictates the speed of the entire job.

ResiHP is built for this setting. Its central idea is to make hybrid parallelism dynamic. Instead of treating the 3D parallel layout as fixed after launch, ResiHP detects unhealthy devices and reshapes the training plan around them.

Why Failure Detection Is Hard

The naive signal is iteration time. If one iteration becomes much slower, maybe a device is failing.

That logic is too brittle for LLM training.

Modern LLM workloads often use variable-length sequences. Even when the token budget is controlled by sequence packing, the true attention cost still depends on sequence lengths inside each micro-batch. A packed batch with many long sequences can naturally take longer than a packed batch with shorter ones. Pipeline scheduling adds another layer of noise: the observed iteration time is not just one micro-batch cost, but the critical path of forward, backward, and weight-update chunks across pipeline stages.
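To make the packing point concrete, here is a minimal sketch of a per-layer forward FLOPs estimate for one packed micro-batch. It is not ResiHP's cost model. It assumes a standard dense transformer layer with a 4x MLP expansion (about 24 * hidden^2 FLOPs per token for the linear parts) and unmasked attention within each sequence (about 4 * hidden * L^2 FLOPs for a sequence of length L). The function name and these constants are illustrative assumptions.

```python
from typing import Sequence


def packed_microbatch_flops(seq_lens: Sequence[int], hidden: int) -> int:
    """Rough per-layer forward FLOPs for one packed micro-batch.

    Assumptions (not taken from the ResiHP paper):
      - linear layers cost about 24 * hidden^2 FLOPs per token
        (QKV + output projections, plus an MLP with 4x expansion);
      - attention costs about 4 * hidden * L^2 FLOPs per sequence
        (QK^T and the attention-weighted sum over V), with attention
        restricted to tokens of the same sequence and no causal discount.
    """
    tokens = sum(seq_lens)
    linear = 24 * hidden * hidden * tokens
    attention = sum(4 * hidden * length * length for length in seq_lens)
    return linear + attention


if __name__ == "__main__":
    # Same token budget, different sequence structure.
    one_long = packed_microbatch_flops([8192], hidden=4096)
    many_short = packed_microbatch_flops([1024] * 8, hidden=4096)
    print(one_long / many_short)
```

Both micro-batches hold 8192 tokens, yet the one with a single long sequence costs more because the quadratic term grows with the square of each sequence length, not with the token count.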

This is the point emphasized in the Zhihu writeup: the detector cannot stare at raw iteration time and call every spike a failure. It first needs to ask what the iteration should have cost if all devices were healthy.

FLOPs-Aware Detection

ResiHP’s Detector normalizes iteration time by expected computation.

At the micro-batch level, it estimates the work from the packed sequence structure. Attention is not linear in sequence length, so the model considers the quadratic attention cost rather than only counting tokens. At the pipeline level, ResiHP simulates the schedule of forward, backward, and weight-update chunks to predict the critical path for a healthy iteration.
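The pipeline-level step can be sketched as a small schedule simulation. The version below is a deliberate simplification, not ResiHP's simulator: it assumes a GPipe-style schedule (all forward chunks, then all backward chunks), folds the weight-update work into the backward chunk, and takes per-stage, per-micro-batch times as given inputs. The function name is hypothetical.

```python
from typing import List


def healthy_iteration_time(fwd: List[List[float]], bwd: List[List[float]]) -> float:
    """Critical-path length of a simplified GPipe-style pipeline iteration.

    fwd[s][m] and bwd[s][m] are the expected times of micro-batch m on
    stage s. Each stage runs one chunk at a time. A forward chunk waits for
    the previous stage; a backward chunk waits for the next stage.
    """
    num_stages, num_mb = len(fwd), len(fwd[0])
    stage_free = [0.0] * num_stages  # time at which each stage becomes idle
    f_done = [[0.0] * num_mb for _ in range(num_stages)]
    for m in range(num_mb):
        for s in range(num_stages):
            ready = f_done[s - 1][m] if s > 0 else 0.0
            start = max(ready, stage_free[s])
            f_done[s][m] = start + fwd[s][m]
            stage_free[s] = f_done[s][m]

    b_done = [[0.0] * num_mb for _ in range(num_stages)]
    for m in range(num_mb):
        for s in reversed(range(num_stages)):
            # The last stage starts backward from its own forward output.
            ready = b_done[s + 1][m] if s < num_stages - 1 else f_done[s][m]
            start = max(ready, stage_free[s])
            b_done[s][m] = start + bwd[s][m]
            stage_free[s] = b_done[s][m]
    return max(stage_free)


if __name__ == "__main__":
    # 2 stages, 2 micro-batches, forward = 1, backward = 2 time units.
    print(healthy_iteration_time([[1, 1], [1, 1]], [[2, 2], [2, 2]]))
```

With uniform chunk times this reproduces the familiar (micro-batches + stages - 1) * (forward + backward) estimate. With non-uniform times it gives the per-iteration baseline that an observed time could be compared against.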

Only after this normalization does ResiHP compare observed time with expected time. If the gap remains abnormal, the system treats it as a fail-slow signal rather than ordinary sequence-length variation. Fail-stop cases are handled separately through missing heartbeats.
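A minimal sketch of that comparison, under stated assumptions: the threshold of 1.2, the requirement of three consecutive abnormal iterations, and the 30-second heartbeat timeout are illustrative values, not numbers from the paper, and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class IterationSample:
    observed_s: float  # measured iteration time
    expected_s: float  # predicted time for a healthy iteration


def slowdown_ratio(sample: IterationSample) -> float:
    """Observed time normalized by the expected healthy time."""
    if sample.expected_s <= 0:
        raise ValueError("expected_s must be positive")
    return sample.observed_s / sample.expected_s


def is_fail_slow(samples: Sequence[IterationSample],
                 threshold: float = 1.2,
                 min_consecutive: int = 3) -> bool:
    """Flag fail-slow only if the most recent samples are all abnormal."""
    if len(samples) < min_consecutive:
        return False
    recent = samples[-min_consecutive:]
    return all(slowdown_ratio(s) > threshold for s in recent)


def is_fail_stop(last_heartbeat_s: float, now_s: float,
                 timeout_s: float = 30.0) -> bool:
    """Fail-stop is handled separately: a heartbeat that is overdue."""
    return now_s - last_heartbeat_s > timeout_s


if __name__ == "__main__":
    # Long sequences: slow in absolute terms, but exactly as expected.
    heavy_but_healthy = [IterationSample(2.0, 2.0)] * 3
    # Ordinary workload, yet consistently 50% slower than predicted.
    degraded = [IterationSample(1.5, 1.0)] * 3
    print(is_fail_slow(heavy_but_healthy), is_fail_slow(degraded))
```

Requiring several consecutive abnormal samples is one simple way to trade detection latency for fewer false positives; the real system may use a different rule.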

This distinction matters because false positives are costly. A resilient training system that constantly misidentifies normal workload skew as hardware failure will keep reshaping the job for no reason. ResiHP tries to make detection lightweight enough for online use, but accurate enough that adaptation is reserved for real trouble.

Why Hybrid Parallelism Makes Recovery Tricky

Once a device is identified as unhealthy, the simple response is to remove it.

That is rarely enough.

In pure data parallelism, losing one worker mostly reduces replica count. In hybrid parallelism, a device participates in a structure. It may be one rank of a tensor-parallel group, one stage of a pipeline, and one member of a data-parallel replica at the same time. If a tensor-parallel rank fails, the whole TP group is affected. If one pipeline stage slows down, upstream and downstream stages wait. If one data-parallel replica lags, synchronization suffers.
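The sketch below shows how one rank sits in three groups at once. It assumes a hypothetical rank ordering in which the TP index varies fastest, then PP, then DP. Real frameworks use their own orderings, so treat the mapping as illustrative only.

```python
from typing import Dict, List, Tuple


def coords(rank: int, tp: int, pp: int) -> Tuple[int, int, int]:
    """Map a global rank to (dp_idx, pp_idx, tp_idx); TP varies fastest."""
    return rank // (tp * pp), (rank // tp) % pp, rank % tp


def to_rank(dp_idx: int, pp_idx: int, tp_idx: int, tp: int, pp: int) -> int:
    """Inverse of coords()."""
    return dp_idx * tp * pp + pp_idx * tp + tp_idx


def groups_of(rank: int, tp: int, pp: int, dp: int) -> Dict[str, List[int]]:
    """All ranks that share a TP, PP, or DP group with the given rank."""
    d, p, t = coords(rank, tp, pp)
    return {
        "tp": [to_rank(d, p, i, tp, pp) for i in range(tp)],
        "pp": [to_rank(d, i, t, tp, pp) for i in range(pp)],
        "dp": [to_rank(i, p, t, tp, pp) for i in range(dp)],
    }


if __name__ == "__main__":
    # 8 GPUs laid out as TP=2, PP=2, DP=2; suppose rank 5 becomes slow.
    print(groups_of(5, tp=2, pp=2, dp=2))
```

Even in this tiny layout, a problem on rank 5 directly touches ranks 4, 7, and 1 through three different communication patterns.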

The failure is local, but the performance damage is global.

ResiHP therefore adapts at multiple levels instead of applying one generic workaround. It changes parallelism group sizes, repartitions model layers across pipeline stages, adjusts workload scheduling, and reallocates work among replicas.

Dynamic Hybrid Parallelism

The Scheduler is the part of ResiHP that turns detection into a new training plan.

For tensor parallelism, ResiHP can shrink or re-form TP groups around healthy devices. The goal is not simply to drop every device in the affected group, because that may waste too many healthy GPUs. Instead, the scheduler searches for a better group size and membership that preserves useful computation while avoiding the slow or failed rank.
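As a rough illustration of re-forming a TP group, the sketch below keeps as many healthy devices as possible subject to one constraint: the TP size must divide the number of attention heads. That constraint and the greedy choice are assumptions made for this example. The actual ResiHP search considers more than this, and the function name is hypothetical.

```python
from typing import List, Set


def reform_tp_group(devices: List[int], unhealthy: Set[int],
                    num_heads: int) -> List[int]:
    """Pick the largest valid TP group from the healthy devices.

    A size is treated as valid if it divides num_heads, so that attention
    heads can be split evenly. Returns an empty list if no device is healthy.
    """
    healthy = [d for d in devices if d not in unhealthy]
    for size in range(len(healthy), 0, -1):
        if num_heads % size == 0:
            return healthy[:size]
    return []


if __name__ == "__main__":
    # One of eight GPUs in the group is slow; the model has 32 heads.
    print(reform_tp_group(list(range(8)), unhealthy={3}, num_heads=32))
```

Here seven devices remain healthy, but only four can form an evenly sharded group. The three leftover healthy GPUs are exactly the kind of capacity a coordinated scheduler would try to reuse elsewhere in the layout.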

For pipeline parallelism, ResiHP can rebalance model partitioning. A slow stage should not keep the same layer load as healthy stages. If one stage becomes slower, the scheduler can assign it fewer layers and shift work to healthier stages, reducing the pipeline bottleneck.
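One simple way to express this rebalancing is to assign layers in proportion to each stage's relative speed. The sketch below does that with largest-remainder rounding. It assumes all layers cost the same and ignores memory limits, both of which a real scheduler would have to handle; the function name and the speed values are hypothetical.

```python
from typing import List


def repartition_layers(num_layers: int, stage_speeds: List[float]) -> List[int]:
    """Split num_layers across stages proportionally to stage speed.

    stage_speeds are relative throughputs (1.0 = healthy). Fractional
    shares are rounded down, then leftover layers go to the stages with
    the largest fractional remainders.
    """
    total = sum(stage_speeds)
    ideal = [num_layers * s / total for s in stage_speeds]
    layers = [int(x) for x in ideal]
    leftover = num_layers - sum(layers)
    by_remainder = sorted(range(len(ideal)),
                          key=lambda i: ideal[i] - layers[i], reverse=True)
    for i in by_remainder[:leftover]:
        layers[i] += 1
    return layers


if __name__ == "__main__":
    # 32 layers, 4 stages; stage 2 runs at half speed.
    print(repartition_layers(32, [1.0, 1.0, 0.5, 1.0]))
```

The slow stage ends up with roughly half the layers of its neighbors, so each stage's time per micro-batch is closer to equal.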

For data parallelism, ResiHP uses workload migration. If one replica is falling behind while another has capacity, the scheduler can move work so progress becomes more balanced. This is especially useful because data-parallel replicas are logically symmetric, but their actual speed may diverge after a device failure or performance degradation.
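Workload migration across replicas can be sketched as a greedy assignment: take micro-batches from most to least expensive and give each one to the replica that would finish it earliest. This is a textbook heuristic used here for illustration, not the policy described in the paper; costs and speeds are hypothetical inputs.

```python
from typing import List, Tuple


def assign_microbatches(costs: List[float],
                        replica_speeds: List[float]
                        ) -> Tuple[List[List[int]], List[float]]:
    """Greedily spread micro-batches over replicas of unequal speed.

    Returns (assignment, finish) where assignment[r] lists the micro-batch
    indices given to replica r and finish[r] is its estimated busy time.
    """
    finish = [0.0] * len(replica_speeds)
    assignment: List[List[int]] = [[] for _ in replica_speeds]
    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for idx in order:
        # Choose the replica with the earliest completion time for this item.
        best = min(range(len(replica_speeds)),
                   key=lambda r: finish[r] + costs[idx] / replica_speeds[r])
        assignment[best].append(idx)
        finish[best] += costs[idx] / replica_speeds[best]
    return assignment, finish


if __name__ == "__main__":
    # Six equal micro-batches; the second replica runs at half speed.
    assignment, finish = assign_microbatches([1.0] * 6, [1.0, 0.5])
    print(assignment, finish)
```

In this example the healthy replica takes four micro-batches and the degraded one takes two, so both finish at the same time instead of the slow replica holding up synchronization.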

The important engineering point is that these adaptations are coordinated. Adjusting TP alone may create pipeline imbalance. Adjusting PP alone may leave healthy GPUs underused. Adjusting DP alone may not remove the original bottleneck. ResiHP treats the layout as a connected 3D object.

Executor Support

A new plan is only useful if the runtime can execute it without turning recovery into a second failure.

ResiHP’s Executor handles the mechanics of dynamic reconfiguration. It reconstructs model and optimizer states under the new parallel layout, updates communication strategies, and supports efficient data movement for the adapted groups. This is where the system moves from scheduling policy to actual fault-tolerant training.
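To illustrate what reconstructing state under a new layout means, here is a toy resharding routine for a parameter that is split evenly along one dimension. It gathers the old shards and re-splits them, which is the simplest correct approach. An efficient executor would presumably move only the slices that change owners, and nothing here reflects ResiHP's actual implementation.

```python
from typing import List, Sequence


def reshard(shards: Sequence[Sequence[float]], new_tp: int) -> List[List[float]]:
    """Re-split an evenly sharded 1-D parameter for a new TP size.

    shards holds the old per-rank slices in rank order. The same logic
    would apply to optimizer state that is sharded alongside the parameter.
    """
    full = [x for shard in shards for x in shard]
    if new_tp <= 0 or len(full) % new_tp != 0:
        raise ValueError("parameter cannot be split evenly across new_tp ranks")
    step = len(full) // new_tp
    return [full[i * step:(i + 1) * step] for i in range(new_tp)]


if __name__ == "__main__":
    old = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]  # TP = 4
    print(reshard(old, new_tp=2))                            # TP = 2
```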

The Executor also matters for fail-stop recovery. If a GPU disappears, the system must preserve training continuity while redistributing the affected model shards and workloads. If a GPU merely slows down, the system must avoid overreacting while still reducing its influence on the global critical path.

What ResiHP Buys

ResiHP was evaluated on a 256-GPU cluster under diverse failure scenarios. The paper reports near-optimal failure detection accuracy and a training throughput improvement of 1.13x to 2.22x compared with state-of-the-art resilient training systems.

The broader lesson is that resilience for LLM training cannot be bolted on as a checkpoint-and-restart loop. Hybrid parallelism is already the structure that makes training possible at scale, so resilience has to understand that structure. ResiHP does this by separating three questions:

  • Is this slowdown a real failure or just sequence-length variation?
  • Which part of the 3D parallel layout is actually damaged?
  • How should TP, PP, and DP change together so the job keeps making progress?

That is the shift I like in ResiHP: it treats failure handling as a dynamic parallelism problem, not merely as a device replacement problem.

Paper: ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
Preprint: arXiv:2605.06374

Zhisheng YE
Machine Learning Systems Researcher

My research interests include AI Infra for LLMs, algorithm–system co-design for machine learning systems and resource management.
