Summary:
I recently discovered that the checkpointing of training stats (introduced in D15977931) had a few issues:
- it could run into race conditions (different trainers appending to the file at the same time), which left the file in a corrupted state, crashing all subsequent runs that tried to resume from the checkpoint (and even if we avoided the crash, we couldn't trust the data in there).
- it could be "ahead" of the rest of the checkpoint, meaning that it could contain stats for training steps whose results weren't checkpointed yet (we checkpoint only at the end of a pass, but we log stats for each bucket within a pass), thus after resuming we would end up with duplicate stats in the checkpoint, as some were recomputed without the old ones being purged.
There were several solutions to this:
- Locking the file (using low-level fcntl calls) before appending to it (see the flock sketch after this list). Although this is typically supported by filesystems, even distributed ones, it's a bit of an advanced feature, so I was wary of using it. Also, the rest of our filesystem code avoids race conditions through higher-level logic (assigning writes to different files to different trainers), so it would be dissonant to solve these race conditions at such a low level. Additionally, it would mean that each new storage backend needs to reinvent its own low-level way to lock files.
- Having each trainer write stats to a different file. This would make the checkpoint format depend on the number of trainers, which isn't currently the case. Philosophically, the number of trainers is a detail of the execution, not intrinsic to the data, so it shouldn't leak into the checkpoint. Practically, if we started a run with 10 trainers and then resumed it with only 9, we would not know that there is one extra stats file (the last one), and thus we wouldn't load it.
- Finding a serialization format that produces blobs of the same size for all stats, and writing them to the file at an offset determined by the stats index. Each stats entry would thus get its own region of the file, disjoint from all other entries and determined "intrinsically", so even if two writes occurred at the same time they would not touch the same region of the file (see the fixed-size-record sketch after this list). This again is an ad-hoc solution for this particular storage, and moreover it makes the format rather hard to extend (e.g., if we want to add more metrics).
- Finally, the solution I went with here is to collect the stats on a single trainer and have it checkpoint them, ideally at the end of the training pass (at the same time as all the rest of the data); a sketch follows this list. The natural place to do this is the lock server, as stats can be reported by trainers when they release the bucket they are working on. The lock server then also knows which bucket these stats are for (information that currently gets lost). This may in the future allow fancier representations of these stats (imagine, for each pass, a 2D matrix of the loss for each bucket, to see whether one row/column learns more poorly than the others).
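
For reference, a minimal sketch of the rejected fcntl approach: holding an exclusive lock for the duration of the append serializes concurrent trainers. The file path, the JSON line encoding, and the helper name are illustrative, not the actual code.

```python
import fcntl
import json

STATS_PATH = "training_stats.json"  # hypothetical path

def append_stats(stats: dict) -> None:
    with open(STATS_PATH, "a") as f:
        # Block until we hold an exclusive lock on the whole file.
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.write(json.dumps(stats) + "\n")
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```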
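And a sketch of the fixed-size-record idea: every stats entry is packed into the same number of bytes, so entry i can be written at offset i * RECORD_SIZE without ever touching another entry's region. The field layout (pass index, bucket index, loss, violator count) is an assumption made for illustration.

```python
import os
import struct

RECORD_FMT = "<qqdq"  # pass index, bucket index, loss, violator count
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # same size for every entry

def write_stats_record(
    fd: int, index: int, pass_: int, bucket: int, loss: float, violators: int
) -> None:
    blob = struct.pack(RECORD_FMT, pass_, bucket, loss, violators)
    # Positioned write into this entry's disjoint region of the file.
    os.pwrite(fd, blob, index * RECORD_SIZE)
```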
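The adopted design, sketched below under the assumption that stats ride along with the "release bucket" message: the lock server accumulates them keyed by (pass, bucket), so a single process can checkpoint them at the end of the pass. Class and method names here are hypothetical, not the actual PyTorch-BigGraph API.

```python
import json
from typing import Dict, Tuple

class LockServer:  # hypothetical name, for illustration only
    def __init__(self) -> None:
        # (pass index, lhs partition, rhs partition) -> stats dict
        self.stats: Dict[Tuple[int, int, int], dict] = {}

    def release_bucket(self, pass_: int, lhs: int, rhs: int, stats: dict) -> None:
        # The server knows which bucket the stats belong to, information
        # that was previously lost when trainers appended on their own.
        self.stats[(pass_, lhs, rhs)] = stats

    def checkpoint_stats(self, path: str) -> None:
        # Written by one process only, at the end of the pass, so there is
        # no concurrent access and no "ahead of checkpoint" state.
        with open(path, "w") as f:
            json.dump({str(k): v for k, v in self.stats.items()}, f)
        self.stats.clear()
```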
Reviewed By: adamlerer, chandlerzuo
Differential Revision: D17605390
fbshipit-source-id: 9720ddbd0a30624fe101a7146acfacf868a10429