ETTR, aka, how good is the training goodput
Effective training time ratio (ETTR) is the ratio of time spent in productive training to the total wall-clock time. In intuitive terms, it is the goodput of the training job. Any time spent doing non-productive (i.e., non-training) work takes away from the goodput. Non-productive time includes reading and writing checkpoints. It also includes time that was spent training but has to be repeated because it was lost to a failure before being captured in a checkpoint.
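As a quick, made-up illustration of the definition: a week-long run (168 hours of wall-clock time) that spends 6 hours reading and writing checkpoints and loses 4 hours of training to failures has

$$\mathrm{ETTR} = \frac{168 - 6 - 4}{168} \approx 0.94$$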
Formulating ETTR
The figure above shows how a training job is split up over a duration of one MTTF (mean time to failure). MTTF, as the name implies, is the expected time between two failures of the training job. The entire training run can be thought of as being split into MTTF-sized runs. Each MTTF-sized run starts with reading the last good checkpoint (except for the first run, but we can ignore that for now). The MTTF-sized run is further split up into consecutive checkpoint periods. At the end of each period, the training job writes out a checkpoint to capture the training done during that period. The failure (marking the end of the MTTF-sized run) can happen at any point during a checkpoint period, and the training work from this last, partial checkpoint period has to be repeated in the next MTTF-sized run.
By definition of ETTR, we have:

$$\mathrm{ETTR} = \frac{\mathrm{MTTF} - T_{np}}{\mathrm{MTTF}} = 1 - \frac{T_{np}}{\mathrm{MTTF}}$$

where $T_{np}$ is the total non-training time within one MTTF-sized run.

Let's consider each of the non-training time components. The first component is the time to read the checkpoint, $T_r$. Next, each checkpoint period includes the time to write the checkpoint, $T_w$. If the length of a checkpoint period is $T_{cp}$, then there are $\mathrm{MTTF}/T_{cp}$ checkpoint periods in the MTTF-sized run. This leads to $(\mathrm{MTTF}/T_{cp}) \cdot T_w$ time allocated to writing checkpoints. Lastly, there is the time spent in unproductive training in the last checkpoint period, $T_{lost}$, which would be repeated in the next MTTF-sized run. So, we get:

$$T_{np} = T_r + \frac{\mathrm{MTTF}}{T_{cp}} \cdot T_w + T_{lost}$$

Now let's consider $T_{lost}$. Given that a failure can happen at any point within a checkpoint period, the expected value of $T_{lost}$ is $T_{cp}/2$. This gives:

$$\mathrm{ETTR} = 1 - \frac{T_r}{\mathrm{MTTF}} - \frac{T_w}{T_{cp}} - \frac{T_{cp}}{2\,\mathrm{MTTF}}$$
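To get a feel for the numbers, here is a small Python sketch that plugs some made-up but plausible values into the formula above (all times are in minutes; none of these numbers come from a real training run):

def ettr(mttf, t_read, t_write, t_cp):
    # 1 - read cost - per-period write cost - expected lost work on failure
    return 1 - t_read / mttf - t_write / t_cp - t_cp / (2 * mttf)

# MTTF of 24 hours, 5-minute checkpoint read, 2-minute write, 30-minute period.
print(ettr(mttf=24 * 60, t_read=5, t_write=2, t_cp=30))  # ~0.92

Even with a 24-hour MTTF, the synchronous 2-minute write every 30 minutes is the dominant cost in this example.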
This formulation leads to three conclusions:
(i) Large MTTFs will lead to higher ETTR.
(ii) Faster checkpoint reads and writes will increase ETTR.
(iii) The effect of the checkpoint period is more interesting. On one hand, larger checkpoint periods mean less time spent writing checkpoints. But on the other hand, they also increase the amount of training time that is wasted when a failure happens. A paper from Meta derives the optimal checkpointing period duration based on the above formulation. The maths in that paper is beyond what I can comprehend though.
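That said, a much simpler back-of-the-envelope version falls out of the simplified formula above (this is my sketch, not the paper's derivation): the only terms that depend on $T_{cp}$ are $T_w/T_{cp}$ and $T_{cp}/(2\,\mathrm{MTTF})$, so minimizing their sum gives

$$\frac{d}{dT_{cp}}\left(\frac{T_w}{T_{cp}} + \frac{T_{cp}}{2\,\mathrm{MTTF}}\right) = -\frac{T_w}{T_{cp}^2} + \frac{1}{2\,\mathrm{MTTF}} = 0 \quad\Rightarrow\quad T_{cp}^{*} = \sqrt{2 \cdot \mathrm{MTTF} \cdot T_w}$$

With the illustrative numbers from earlier (MTTF of 24 hours, $T_w$ of 2 minutes), this works out to $T_{cp}^{*} \approx 76$ minutes.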
Effect of asynchronous checkpoint writes
One interesting aspect is how this ETTR formulation changes when we consider the asynchronous checkpoint writes discussed previously. The most obvious change is that $T_w$ becomes smaller, which is good for ETTR. However, asynchronous writes can also change the wasted training time $T_{lost}$. With asynchronous checkpointing, training in a new checkpoint period starts as soon as the previous period's checkpoint is written to CPU memory. However, this in-memory checkpoint is not persistent yet. The CPU persists the checkpoint from the previous period in parallel with the current period's training running on the GPUs.
If a failure happens after the checkpoint is persisted, only the most recent period's training is wasted. However, if a failure happens before the previous period's checkpoint is persisted, even the previous period's training is wasted.
This changes the expected value of $T_{lost}$. Let's say that the time to persist the checkpoint is $T_p$ and a failure happens after time $t_f$ measured from the start of the checkpoint period. Then:

$$T_{lost} = \begin{cases} t_f & \text{if } t_f \ge T_p \\ T_{cp} + t_f & \text{if } t_f < T_p \end{cases}$$

Since $t_f$ is equally likely to fall anywhere in the period, the second case happens with probability $T_p/T_{cp}$. Replacing $t_f$ with its expected value of $T_{cp}/2$:

$$\mathbb{E}[T_{lost}] = \frac{T_{cp}}{2} + \frac{T_p}{T_{cp}} \cdot T_{cp} = \frac{T_{cp}}{2} + T_p$$

With this, the ETTR becomes (where $T_w$ is now just the time to snapshot the checkpoint to CPU memory):

$$\mathrm{ETTR} = 1 - \frac{T_r}{\mathrm{MTTF}} - \frac{T_w}{T_{cp}} - \frac{T_{cp}}{2\,\mathrm{MTTF}} - \frac{T_p}{\mathrm{MTTF}}$$
To conclude, asynchronous checkpointing helps increase ETTR by reducing the time spent writing checkpoints repeatedly (i.e., reducing $T_w$). However, it also introduces an additional cost, $T_p$, that is incurred once for each MTTF-sized run.
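Putting the two variants side by side, here is a minimal Python sketch of the formulas above. The numbers are again made up for illustration and do not come from any real training run:

def ettr_sync(mttf, t_read, t_write, t_cp):
    # Synchronous writes: read cost + per-period write cost + expected lost work.
    return 1 - t_read / mttf - t_write / t_cp - t_cp / (2 * mttf)

def ettr_async(mttf, t_read, t_snapshot, t_cp, t_persist):
    # Asynchronous writes: training only blocks for the CPU-memory snapshot; the
    # risk of failing before the persist completes adds t_persist of expected
    # lost work once per MTTF-sized run.
    return 1 - t_read / mttf - t_snapshot / t_cp - t_cp / (2 * mttf) - t_persist / mttf

mttf = 24 * 60  # all times in minutes
print(ettr_sync(mttf, t_read=5, t_write=2, t_cp=30))                     # ~0.92
print(ettr_async(mttf, t_read=5, t_snapshot=0.2, t_cp=30, t_persist=2))  # ~0.98

In this toy setting, the blocking write shrinks from 2 minutes to a 0.2-minute snapshot, while the 2-minute persist is paid once per MTTF-sized run, so ETTR improves from roughly 0.92 to 0.98.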
Checkpoint reads and the challenge with resharding
A lot of research work has focused on speeding up checkpoint writes, while checkpoint reads have received less attention. However, ByteDance recently published an interesting paper about checkpoint reads. In particular, they brought attention to the fact that a checkpoint is often read by a different number of GPUs than the number that wrote it. There are multiple possible reasons for this, with the most intuitive one being that if the failure brought down some GPUs, those won't be used for the resumed training.
Regardless of the reason, a change in the number of GPUs has interesting implications for checkpoint read performance. Typically, each GPU writes its part of the checkpoint state to a separate file. If the number of GPUs remains the same upon restarting after a failure, each GPU reads back the file it wrote in its entirety. However, if the number of GPUs changes, the parallelism configuration (e.g., data parallelism, tensor parallelism) used for training changes. This leads to multiple GPUs reading a single checkpoint file and/or a GPU reading disjoint parts of a file. Such reads lose performance because of contention and/or small reads, respectively. This shows up as a higher $T_r$ and a lower ETTR. The paper presented a somewhat obvious solution to the problem and I won't go into it here, but it was good to see this problem being discussed.
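To make the resharding problem concrete, here is a toy Python sketch (my illustration, not the paper's approach). Assume each writer rank saved an equal-sized, contiguous shard of a flat parameter buffer into its own file; on restart with a different number of GPUs, each reader rank has to gather its byte range from pieces of multiple files:

def shard_reads(total_bytes, n_writers, n_readers, reader_rank):
    # Returns the (file_index, offset, length) segments this reader must fetch,
    # assuming equal contiguous shards on both the write and the read side.
    write_shard = total_bytes // n_writers
    read_shard = total_bytes // n_readers
    start, end = reader_rank * read_shard, (reader_rank + 1) * read_shard
    segments, pos = [], start
    while pos < end:
        file_idx, file_off = pos // write_shard, pos % write_shard
        length = min(end - pos, write_shard - file_off)
        segments.append((file_idx, file_off, length))
        pos += length
    return segments

# A checkpoint written by 8 GPUs, read back by 6: each reader now touches
# parts of two files, and some files are read by two different readers.
print(shard_reads(total_bytes=96, n_writers=8, n_readers=6, reader_rank=1))
# -> [(1, 4, 8), (2, 0, 8)]

Those partial, overlapping reads are exactly where the contention and small-read penalties come from.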