After Hours Academic

ETTR, a.k.a. how good is the training goodput

Effective training time ratio (ETTR) is the ratio of time spent in productive training to the total wall clock time. In intuitive terms, it is the goodput of the training job. Any time spent doing non-productive (i.e., non-training) work takes away from the goodput. Non-productive time includes reading and writing checkpoints. It also includes training time that has to be repeated because it was not captured in a checkpoint and was lost due to a failure.

Formulating ETTR

The figure above shows how a training job is split across an interval of length MTTF (mean time to failure). MTTF, as the name implies, is the expected time between two failures of the training job. The entire training run can be thought of as being split into MTTF-sized runs. Each MTTF-sized run starts with reading the last good checkpoint (except for the first run, but we can ignore that for now). Each MTTF-sized run is further split into consecutive checkpoint periods, and at the end of each period the training job writes out a checkpoint to capture the training done during that period. The failure (marking the end of the MTTF-sized run) can happen at any point during a checkpoint period, and the training work from this last, partial checkpoint period has to be repeated in the next MTTF-sized run.

By definition of ETTR, we have:

$$\text{ETTR} = \frac{\text{training time}}{\text{MTTF}} = 1 - \frac{\text{non-training time}}{\text{MTTF}}$$

Let's consider each of the non-training time components. The first component is the time to read the checkpoint, $T_{cp\_read}$. Next, each checkpoint period includes the time to write a checkpoint, $T_{cp\_write}$. If the length of a checkpoint period is $T_{cp\_period}$, then there are $\frac{\text{MTTF}}{T_{cp\_period}}$ checkpoint periods in the MTTF-sized run, which adds up to $\frac{\text{MTTF}}{T_{cp\_period}} \cdot T_{cp\_write}$ time spent writing checkpoints. Lastly, there is $T_{wasted\_training}$, the training time from the last, partial checkpoint period that is lost and has to be repeated in the next MTTF-sized run. So, we get:

$$\text{ETTR} = 1 - \frac{T_{cp\_read} + \frac{\text{MTTF}}{T_{cp\_period}} \cdot T_{cp\_write} + T_{wasted\_training}}{\text{MTTF}}$$

Now let's consider $T_{wasted\_training}$. Given that a failure is equally likely to happen at any point within a checkpoint period, the expected value of $T_{wasted\_training}$ is $\frac{T_{cp\_period}}{2}$. This gives:

$$\text{ETTR} = 1 - \frac{T_{cp\_read} + \frac{\text{MTTF}}{T_{cp\_period}} \cdot T_{cp\_write} + \frac{T_{cp\_period}}{2}}{\text{MTTF}}$$
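To make the formula concrete, here is a minimal Python sketch that evaluates it with some assumed, purely illustrative numbers (24-hour MTTF, 5-minute checkpoint read, 2-minute write, 30-minute checkpoint period).

```python
# Minimal sketch of the ETTR formula above (synchronous checkpointing).
# All numbers are illustrative assumptions, not measurements.

def ettr_sync(mttf, t_cp_read, t_cp_write, t_cp_period):
    """ETTR = 1 - (read + (MTTF / period) * write + period / 2) / MTTF."""
    non_training = (
        t_cp_read
        + (mttf / t_cp_period) * t_cp_write
        + t_cp_period / 2
    )
    return 1 - non_training / mttf

# Example: 24-hour MTTF, 5-minute checkpoint read, 2-minute write, 30-minute period.
print(ettr_sync(mttf=24 * 3600, t_cp_read=300, t_cp_write=120, t_cp_period=1800))
# -> ~0.92, i.e. roughly 8% of wall clock time goes to checkpointing and lost work.
```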

The formulation above leads to three conclusions:
(i) A larger MTTF leads to a higher ETTR.
(ii) Faster checkpoint reads and writes increase ETTR.
(iii) The effect of the checkpoint period is more interesting. On one hand, a larger checkpoint period means less time spent writing checkpoints. On the other hand, it also increases the amount of training time that is wasted when a failure happens (see the sketch after this list). A paper from Meta derives the optimal checkpointing period duration based on the above formulation. The maths in that paper is beyond what I can comprehend though.
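As a rough illustration of the tradeoff in (iii), the sketch below sweeps the checkpoint period in the simplified formula above, again with made-up numbers. Minimizing the non-training time in that formula with basic calculus also gives a closed-form optimum of $\sqrt{2 \cdot \text{MTTF} \cdot T_{cp\_write}}$, which the sweep matches; this applies only to the simplified formulation here, not to the Meta paper's derivation.

```python
import math

# Sketch of the checkpoint-period tradeoff in conclusion (iii), using the
# simplified ETTR formula above. Numbers are illustrative assumptions.

MTTF = 24 * 3600      # assumed mean time to failure: 24 hours
T_CP_READ = 300       # assumed checkpoint read time: 5 minutes
T_CP_WRITE = 120      # assumed checkpoint write time: 2 minutes

def ettr_sync(t_cp_period):
    non_training = T_CP_READ + (MTTF / t_cp_period) * T_CP_WRITE + t_cp_period / 2
    return 1 - non_training / MTTF

# Numeric sweep: short periods pay for frequent writes, long periods pay
# for more wasted (repeated) training on failure.
best_period = max(range(60, 4 * 3600, 60), key=ettr_sync)

# Closed-form optimum from minimizing the same expression with calculus.
analytic_period = math.sqrt(2 * MTTF * T_CP_WRITE)

print(f"best period from sweep: {best_period / 60:.1f} min, ETTR={ettr_sync(best_period):.4f}")
print(f"closed-form optimum:    {analytic_period / 60:.1f} min")
```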

Effect of asynchronous checkpoint writes

One interesting aspect is how this ETTR formulation changes when we consider the asynchronous checkpoint writes discussed previously. The most obvious change is that $T_{cp\_write}$ becomes smaller, which is good for ETTR. However, asynchronous writes can also change the wasted training time $T_{wasted\_training}$. With asynchronous checkpointing, training in a new checkpoint period starts as soon as the previous period's checkpoint has been copied to CPU memory. However, this in-memory checkpoint is not persistent yet. The CPU persists the checkpoint from the previous period in parallel with the current period's training running on the GPUs.

If a failure happens after the checkpoint is persisted, only the most recent period's training is wasted. However, if a failure happens before the previous period's checkpoint is persisted, even the previous period's training is wasted.
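As a rough illustration of this flow (a toy sketch, not any particular framework's API), the snippet below snapshots the training state into host memory at the end of each period, immediately resumes training, and persists the snapshot on a background thread.

```python
import pickle
import threading
import time

# Toy sketch of asynchronous checkpointing (not any particular framework's
# API): the training state is snapshotted into host memory at the end of a
# checkpoint period, training resumes immediately, and a background thread
# persists the snapshot to disk.

def persist(snapshot, path):
    # Runs on a background thread; the checkpoint only becomes durable
    # (i.e., safe against a failure) once this finishes.
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)

def train_step(state):
    time.sleep(0.01)                         # stand-in for a real training step
    state["step"] += 1

state = {"step": 0, "weights": [0.0] * 1000}
persist_thread = None

for period in range(3):
    for _ in range(10):                      # this checkpoint period's training
        train_step(state)

    # End of the period: wait for the previous snapshot to finish persisting,
    # then take a new in-memory snapshot (the short, blocking T_cp_write part).
    if persist_thread is not None:
        persist_thread.join()
    snapshot = {"step": state["step"], "weights": list(state["weights"])}

    # Persist in the background; the next period's training overlaps with this
    # write (the T_cp_persist part).
    persist_thread = threading.Thread(target=persist, args=(snapshot, f"ckpt_{period}.pkl"))
    persist_thread.start()

if persist_thread is not None:
    persist_thread.join()
```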

This changes the expected value of $T_{wasted\_training}$. Let's say that the time to persist the checkpoint is $T_{cp\_persist}$ and that the failure happens at time $T_{failure}$ measured from the start of the checkpoint period.

$$E(T_{wasted\_training}) = P(T_{failure} > T_{cp\_persist}) \cdot T_{failure} + P(T_{failure} \le T_{cp\_persist}) \cdot (T_{failure} + T_{cp\_period})$$

Since the failure is equally likely to happen at any point in the checkpoint period, $P(T_{failure} \le T_{cp\_persist}) = \frac{T_{cp\_persist}}{T_{cp\_period}}$. Substituting:

$$E(T_{wasted\_training}) = \frac{T_{cp\_period} - T_{cp\_persist}}{T_{cp\_period}} \cdot T_{failure} + \frac{T_{cp\_persist}}{T_{cp\_period}} \cdot (T_{failure} + T_{cp\_period})$$

which simplifies to:

$$E(T_{wasted\_training}) = T_{failure} + T_{cp\_persist}$$

Replacing $T_{failure}$ with its expected value of $\frac{T_{cp\_period}}{2}$:

$$E(T_{wasted\_training}) = \frac{T_{cp\_period}}{2} + T_{cp\_persist}$$
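A quick Monte Carlo sanity check of this expectation (a sketch with assumed numbers, and a failure time drawn uniformly over the checkpoint period):

```python
import random

# Monte Carlo check of E(T_wasted_training) = T_cp_period / 2 + T_cp_persist.
# Assumes the failure time is uniform over the checkpoint period; numbers are
# made up for illustration.

T_CP_PERIOD = 1800.0   # assumed checkpoint period: 30 minutes
T_CP_PERSIST = 200.0   # assumed time to persist the in-memory checkpoint

def wasted_training(t_failure):
    if t_failure > T_CP_PERSIST:
        # Previous checkpoint is already durable: only this period's work so far is lost.
        return t_failure
    # Failure before the previous checkpoint finished persisting: the whole
    # previous period is lost as well.
    return t_failure + T_CP_PERIOD

samples = [wasted_training(random.uniform(0, T_CP_PERIOD)) for _ in range(1_000_000)]
print(sum(samples) / len(samples))      # ~= 1100
print(T_CP_PERIOD / 2 + T_CP_PERSIST)   # = 1100.0
```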

With this, the ETTR becomes:

$$\text{ETTR} = 1 - \frac{T_{cp\_read} + \frac{\text{MTTF}}{T_{cp\_period}} \cdot T_{cp\_write} + \frac{T_{cp\_period}}{2} + T_{cp\_persist}}{\text{MTTF}}$$

To conclude, asynchronous checkpointing helps increase ETTR by reducing the time spent repeatedly writing checkpoints (i.e., reducing $T_{cp\_write}$). However, it also introduces an additional $T_{cp\_persist}$ cost, incurred once per MTTF-sized run.
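Putting the two formulas side by side with assumed, illustrative numbers (a 2-minute blocking write in the synchronous case versus a roughly 5-second device-to-host copy plus a 200-second background persist in the asynchronous case):

```python
# Sketch comparing the two ETTR formulas above with assumed (illustrative)
# numbers: async checkpointing shrinks T_cp_write but adds T_cp_persist to
# the expected wasted training time.

MTTF = 24 * 3600       # assumed mean time to failure: 24 hours
T_CP_READ = 300        # assumed checkpoint read time
T_CP_PERIOD = 1800     # assumed checkpoint period: 30 minutes

def ettr_sync(t_cp_write):
    non_training = T_CP_READ + (MTTF / T_CP_PERIOD) * t_cp_write + T_CP_PERIOD / 2
    return 1 - non_training / MTTF

def ettr_async(t_cp_write, t_cp_persist):
    non_training = (T_CP_READ + (MTTF / T_CP_PERIOD) * t_cp_write
                    + T_CP_PERIOD / 2 + t_cp_persist)
    return 1 - non_training / MTTF

# Synchronous: the full 2-minute write blocks training every period.
print(f"sync : {ettr_sync(t_cp_write=120):.4f}")
# Asynchronous: only a short device-to-host copy blocks training; persisting
# happens in the background but is paid once per MTTF-sized run.
print(f"async: {ettr_async(t_cp_write=5, t_cp_persist=200):.4f}")
```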

Checkpoint reads and the challenge with resharding

A lot of research work has focused on speeding up checkpoint writes, while checkpoint reads have received less attention. However, ByteDance recently published an interesting paper about checkpoint reads. In particular, they brought attention to the fact that a checkpoint is often read by a different number of GPUs than wrote it. There are multiple possible reasons for this, with the most intuitive one being that if the failure brought down some GPUs, those won't be used for the resumed training.

Regardless of the reason, a change in the number of GPUs has interesting implications for checkpoint read performance. Typically, each GPU writes its part of the checkpoint state to a separate file. If the number of GPUs remains the same upon restarting after a failure, each GPU reads back, in its entirety, the file it wrote. However, if the number of GPUs changes, the parallelism configuration (e.g., data parallelism, tensor parallelism) used for training changes too. This leads to multiple GPUs reading a single checkpoint file and/or a GPU reading disjoint parts of a file. Such reads lose performance because of contention and small reads, respectively (see the sketch below). This shows up as a higher $T_{cp\_read}$ and a lower ETTR. The paper presents a somewhat obvious solution to the problem and I won't go into it here, but it was good to see this problem being discussed.
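To make the resharding effect concrete, here is a toy model of my own (not the paper's approach): assume the checkpoint is one flat array sharded evenly across writer ranks, one file per rank, and compute which (file, offset, length) ranges each reader rank needs after the world size changes.

```python
# Toy model of checkpoint resharding (illustrative assumption: one flat,
# evenly sharded array, one file per writer rank).

def read_plan(numel: int, old_world: int, new_world: int):
    """Return, for each reader rank, the (file_idx, offset, length) ranges it must read."""
    assert numel % old_world == 0 and numel % new_world == 0
    old_shard = numel // old_world   # elements per writer file
    new_shard = numel // new_world   # elements per reader rank
    plans = {}
    for rank in range(new_world):
        start, end = rank * new_shard, (rank + 1) * new_shard
        segments, pos = [], start
        while pos < end:
            file_idx = pos // old_shard                        # which writer's file holds this element
            offset = pos % old_shard                           # offset within that file
            length = min(end, (file_idx + 1) * old_shard) - pos
            segments.append((file_idx, offset, length))
            pos += length
        plans[rank] = segments
    return plans

# Same GPU count: every rank reads exactly the one whole file it wrote.
print(read_plan(numel=1440, old_world=8, new_world=8))
# Fewer GPUs after a failure: ranks read partial files, and some files are
# read by two different ranks -- small reads and contention.
print(read_plan(numel=1440, old_world=8, new_world=6))
```

In the second plan, files 1, 2, 5, and 6 each end up split across two readers, which is exactly the small-read and contention pattern described above.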

#checkpoints #computer-science #llm #machine-learning