This repository was archived by the owner on Mar 14, 2024. It is now read-only.

Commit 5d43050

lw authored and facebook-github-bot committed
Fix race condition when checkpointing stats
Summary:
I recently discovered that the checkpointing of training stats (introduced in D15977931) had a few issues:
- It could run into race conditions (different trainers appending to the file at the same time), which put the file in a corrupted state, crashing all subsequent runs that tried to resume from the checkpoint (and even if we avoided the crash, we couldn't trust the data in there).
- It could be "ahead" of the rest of the checkpoint, meaning that it could contain stats for training steps whose results weren't checkpointed yet (we checkpoint only at the end of a pass, but we log stats for each bucket within a pass). Thus, after resuming, we would have duplicate stats in the checkpoint, as some were recomputed without the old ones being purged.

There were several possible solutions:
- Locking the file (using low-level fcntl calls) before appending to it. Although this is typically supported by filesystems, even distributed ones, it's a bit of an advanced feature, so I was wary of using it. Also, the rest of our filesystem code avoids race conditions through higher-level logic (assigning writes to different files to different trainers), so it would be dissonant to solve these race conditions at such a low level. Additionally, it would mean that each new storage backend needs to reinvent a different low-level way to lock files.
- Having each trainer write stats to a different file. This would mean that the checkpoint format depends on the number of trainers, which currently isn't the case. Philosophically speaking, the number of trainers is a detail of the execution, not intrinsic to the data, so it shouldn't show up in the checkpoint. Practically, it means that if we started a run with 10 trainers and then resumed it with only 9, we would not know that there is one extra stats checkpoint (the last one), and thus we wouldn't load it.
- Finding a serialization format that produces blobs of the same size for all stats, and writing them to the file at an offset determined by the stats index. Each stats entry would then get its own region of the file, disjoint from all others, determined "intrinsically", so even if two writes occurred at the same time they would not touch the same region of the file. This again seems like an ad-hoc solution for this particular storage, and moreover it makes the format rather hard to extend (if we want to add more metrics).
- Finally, the solution I went with here: collect the stats on a single trainer and have it checkpoint them, ideally at the end of the training pass (at the same time as all the rest of the data). The natural place to do this is the lock server, as stats can be reported by trainers when they release the bucket they are working on. The lock server then also knows which bucket these stats are for (information that currently gets lost). This may in the future allow fancier representations of these stats (imagine, for each pass, a 2D matrix of the loss for each bucket, to see whether one row/column is learning more poorly than the others).

Reviewed By: adamlerer, chandlerzuo

Differential Revision: D17605390

fbshipit-source-id: 9720ddbd0a30624fe101a7146acfacf868a10429
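Of the rejected options, the fcntl-based one is the easiest to make concrete. A minimal sketch of what such a locked appender could have looked like (the function `append_stats_locked` is hypothetical, not part of the codebase), assuming a local POSIX filesystem:

```python
import fcntl
import json


def append_stats_locked(path: str, stats_dict: dict) -> None:
    # Hypothetical sketch of the rejected option: take an exclusive advisory
    # lock before appending, so concurrent trainers cannot interleave partial
    # writes and corrupt the JSON-lines stats file.
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other process holds it
        try:
            f.write(json.dumps(stats_dict) + "\n")
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Advisory locks only coordinate processes that all take the lock, and their semantics on networked filesystems vary, which is part of why this option was discarded in favor of higher-level coordination.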
1 parent 53c9ce2 commit 5d43050

5 files changed

Lines changed: 94 additions & 35 deletions

test/test_functional.py

Lines changed: 5 additions & 2 deletions
```diff
@@ -230,15 +230,18 @@ def assertIsStatsDict(self, stats: Mapping[str, Union[int, SerializedStats]]) ->
         self.assertIsInstance(stats, dict)
         self.assertIn("index", stats)
         for k, v in stats.items():
-            if k == "index":
+            if k in ("epoch_idx", "edge_path_idx", "edge_chunk_idx",
+                     "lhs_partition", "rhs_partition", "index"):
                 self.assertIsInstance(v, int)
-            else:
+            elif k in ("stats", "eval_stats_before", "eval_stats_after"):
                 self.assertIsInstance(v, dict)
                 self.assertCountEqual(v.keys(), ["count", "metrics"])
                 self.assertIsInstance(v["count"], int)
                 self.assertIsInstance(v["metrics"], dict)
                 for m in v["metrics"].values():
                     self.assertIsInstance(m, float)
+            else:
+                self.fail(f"Unknown stats key: {k}")
 
     def assertCheckpointWritten(self, config: ConfigSchema, *, version: int) -> None:
         with open(os.path.join(config.checkpoint_path, "checkpoint_version.txt"), "rt") as tf:
```

torchbiggraph/bucket_scheduling.py

Lines changed: 42 additions & 10 deletions
```diff
@@ -9,12 +9,13 @@
 import logging
 import random
 from abc import ABC, abstractmethod
-from typing import Dict, List, Optional, Set, Tuple
+from typing import Dict, List, NamedTuple, Optional, Set, Tuple
 
 from torch_extensions.rpc.rpc import Client, Server
 
 from torchbiggraph.config import BucketOrder
 from torchbiggraph.distributed import Startable
+from torchbiggraph.stats import Stats
 from torchbiggraph.types import Bucket, EntityName, Partition, Rank, Side
 
 
@@ -25,6 +26,16 @@
 ### Bucket scheduling interface.
 ###
 
+class BucketStats(NamedTuple):
+    lhs_partition: int
+    rhs_partition: int
+    # A global sequence number, tracking the order in which buckets are trained.
+    index: int
+    train: Stats
+    eval_before: Optional[Stats] = None
+    eval_after: Optional[Stats] = None
+
+
 class AbstractBucketScheduler(ABC):
 
     @abstractmethod
@@ -36,7 +47,7 @@ def acquire_bucket(self) -> Tuple[Optional[Bucket], int]:
         pass
 
     @abstractmethod
-    def release_bucket(self, bucket: Bucket) -> None:
+    def release_bucket(self, bucket: Bucket, stats: BucketStats) -> None:
         pass
 
     @abstractmethod
@@ -47,6 +58,10 @@ def check_and_set_dirty(self, entity: EntityName, part: Partition) -> bool:
     def peek(self) -> Optional[Bucket]:
         pass
 
+    @abstractmethod
+    def get_stats_for_pass(self) -> List[BucketStats]:
+        pass
+
 
 ###
 ### Implementation for single-machine mode.
@@ -259,6 +274,7 @@ def __init__(self, nparts_lhs: int, nparts_rhs: int, order: BucketOrder) -> None
         self.order = order
 
         self.buckets: List[Bucket] = []
+        self.stats: List[BucketStats] = []
 
     def new_pass(self, is_first: bool) -> None:
         self.buckets = create_ordered_buckets(
@@ -267,6 +283,7 @@ def new_pass(self, is_first: bool) -> None:
             order=self.order,
             generator=random.Random(),
         )
+        self.stats = []
 
         # Print buckets
         logger.debug("Partition pairs:")
@@ -282,8 +299,10 @@ def acquire_bucket(self) -> Tuple[Optional[Bucket], int]:
         remaining = len(self.buckets)
         return bucket, remaining
 
-    def release_bucket(self, bucket: Bucket) -> None:
-        pass
+    def release_bucket(self, bucket: Bucket, stats: BucketStats) -> None:
+        if stats.lhs_partition != bucket.lhs or stats.rhs_partition != bucket.rhs:
+            raise ValueError(f"Bucket and stats don't match: {bucket}, {stats}")
+        self.stats.append(stats)
 
     def check_and_set_dirty(self, entity: EntityName, part: Partition) -> bool:
         return False
@@ -294,6 +313,9 @@ def peek(self) -> Optional[Bucket]:
         except IndexError:
             return None
 
+    def get_stats_for_pass(self) -> List[BucketStats]:
+        return self.stats.copy()
+
 
 ###
 ### Implementation for distributed training mode.
@@ -325,13 +347,15 @@ def __init__(
         self.active: Dict[Bucket, Rank] = {}
         self.done: Set[Bucket] = set()
         self.dirty: Set[Tuple[EntityName, Partition]] = set()
+        self.stats: List[BucketStats] = []
         self.initialized_partitions: Optional[Set[Partition]] = None
 
     def new_pass(self, is_first: bool = False) -> None:
         """Start a new epoch of training."""
         self.active = {}
         self.done = set()
         self.dirty = set()
+        self.stats = []
         if self.init_tree and is_first:
             self.initialized_partitions = {Partition(0)}
         else:
@@ -404,13 +428,15 @@ def acquire_bucket(
 
         return None, remaining
 
-    def release_bucket(self, bucket: Bucket) -> None:
+    def release_bucket(self, bucket: Bucket, stats: BucketStats) -> None:
         """
         Releases the lock on lhs and rhs, and marks this pair as done.
        """
-        if bucket.lhs is not None:
-            self.active.pop(bucket)
-            logger.info(f"Bucket {bucket} released: active= {self.active}")
+        if stats.lhs_partition != bucket.lhs or stats.rhs_partition != bucket.rhs:
+            raise ValueError(f"Bucket and stats don't match: {bucket}, {stats}")
+        self.active.pop(bucket)
+        self.stats.append(stats)
+        logger.info(f"Bucket {bucket} released: active= {self.active}")
 
     def check_and_set_dirty(self, entity: EntityName, part: Partition) -> bool:
         """
@@ -424,6 +450,9 @@ def check_and_set_dirty(self, entity: EntityName, part: Partition) -> bool:
             self.dirty.add(key)
         return res
 
+    def get_stats_for_pass(self) -> List[BucketStats]:
+        return sorted(self.stats, key=lambda s: s.index)
+
 
 class LockClient(Client):
 
@@ -450,11 +479,14 @@ def acquire_bucket(self) -> Tuple[Optional[Bucket], int]:
         self.old_b = bucket
         return bucket, remaining
 
-    def release_bucket(self, bucket: Bucket) -> None:
-        self.client.release_bucket(bucket)
+    def release_bucket(self, bucket: Bucket, stats: BucketStats) -> None:
+        self.client.release_bucket(bucket, stats)
 
     def check_and_set_dirty(self, entity: EntityName, part: Partition) -> bool:
         return self.client.check_and_set_dirty(entity, part)
 
     def peek(self) -> Optional[Bucket]:
         return None
+
+    def get_stats_for_pass(self) -> List[BucketStats]:
+        return self.client.get_stats_for_pass()
```
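The record introduced by this diff can be exercised standalone. Below is a sketch that mirrors the `BucketStats` NamedTuple and the sorting done by `get_stats_for_pass`, under the simplifying assumption that the `Stats` fields are replaced by a bare float loss for illustration:

```python
from typing import List, NamedTuple, Optional


class BucketStats(NamedTuple):
    # Same shape as the NamedTuple added in the diff, except that the
    # train/eval fields hold a plain float loss instead of a Stats object.
    lhs_partition: int
    rhs_partition: int
    index: int  # global sequence number of the bucket within the pass
    train: float
    eval_before: Optional[float] = None
    eval_after: Optional[float] = None


def get_stats_for_pass(collected: List[BucketStats]) -> List[BucketStats]:
    # Trainers may release buckets out of order, so the lock server re-sorts
    # by the global index before the stats are checkpointed.
    return sorted(collected, key=lambda s: s.index)


collected = [
    BucketStats(lhs_partition=1, rhs_partition=0, index=2, train=0.31),
    BucketStats(lhs_partition=0, rhs_partition=0, index=0, train=0.52),
    BucketStats(lhs_partition=0, rhs_partition=1, index=1, train=0.44),
]
ordered = get_stats_for_pass(collected)
print([s.index for s in ordered])  # [0, 1, 2]
```

Defaulting `eval_before`/`eval_after` to `None` lets buckets that skipped evaluation omit those fields, matching the optional keys in the serialized stats dicts.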

torchbiggraph/checkpoint_manager.py

Lines changed: 2 additions & 3 deletions
```diff
@@ -20,7 +20,6 @@
     Dict,
     Generator,
     List,
-    Mapping,
     Optional,
     Set,
     Tuple,
@@ -432,9 +431,9 @@ def read_config(self) -> ConfigSchema:
 
     def append_stats(
         self,
-        stats: Mapping[str, Union[int, SerializedStats]],
+        stats: List[Dict[str, Union[int, SerializedStats]]],
     ) -> None:
-        self.storage.append_stats(json.dumps(stats))
+        self.storage.append_stats([json.dumps(s) for s in stats])
 
     def read_stats(self) -> Generator[Dict[str, Union[int, SerializedStats]], None, None]:
         for line in self.storage.load_stats():
```
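The stats file remains a JSON-lines log: `append_stats` serializes each dict to its own line, and `read_stats` decodes them line by line. A sketch of that round trip (the helper names `dump_stats` and `parse_stats` are illustrative, not the real API):

```python
import json
from typing import Dict, Iterable, Iterator, List


def dump_stats(stats: List[dict]) -> List[str]:
    # Serialize each stats dict to its own JSON line, as append_stats does.
    return [json.dumps(s) for s in stats]


def parse_stats(lines: Iterable[str]) -> Iterator[dict]:
    # Mirror of read_stats: each line decodes independently, so a truncated
    # tail only loses the last entries instead of the whole file.
    for line in lines:
        yield json.loads(line)


serialized = dump_stats(
    [{"index": 0, "stats": {"count": 10, "metrics": {"loss": 0.5}}}]
)
restored = list(parse_stats(serialized))
print(restored[0]["stats"]["metrics"]["loss"])  # 0.5
```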

torchbiggraph/checkpoint_storage.py

Lines changed: 4 additions & 4 deletions
```diff
@@ -11,7 +11,7 @@
 import os
 from abc import ABC, abstractmethod
 from pathlib import Path
-from typing import Any, Dict, Generator, NamedTuple, Optional, Tuple
+from typing import Any, Dict, Generator, List, NamedTuple, Optional, Tuple
 
 import h5py
 import numpy as np
@@ -117,7 +117,7 @@ def load_config(self) -> str:
         pass
 
     @abstractmethod
-    def append_stats(self, stats_json: str) -> None:
+    def append_stats(self, stats_json: List[str]) -> None:
         pass
 
     @abstractmethod
@@ -430,9 +430,9 @@ def load_config(self) -> str:
         except FileNotFoundError as err:
             raise CouldNotLoadData() from err
 
-    def append_stats(self, stats_json: str) -> None:
+    def append_stats(self, stats_json: List[str]) -> None:
         with self.get_stats_file().open("at") as tf:
-            tf.write(f"{stats_json}\n")
+            tf.write("".join(f"{s}\n" for s in stats_json))
 
     def load_stats(self) -> Generator[str, None, None]:
         try:
```
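The updated storage method can be tried in isolation: all of a pass's serialized stats are joined and handed to a single write call, instead of one append per bucket. A sketch assuming a plain local file (the path is made up):

```python
import json
from pathlib import Path
from typing import List


def append_stats(stats_file: Path, stats_json: List[str]) -> None:
    # Join the whole pass's stats lines and append them with one write call,
    # mirroring the updated FileCheckpointStorage logic above.
    with stats_file.open("at") as tf:
        tf.write("".join(f"{s}\n" for s in stats_json))
```

Since only one trainer now calls this, once per pass, there is no longer a window where concurrent appends can interleave within the file.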

torchbiggraph/train.py

Lines changed: 41 additions & 16 deletions
```diff
@@ -26,6 +26,7 @@
 )
 from torchbiggraph.bucket_scheduling import (
     AbstractBucketScheduler,
+    BucketStats,
     DistributedBucketScheduler,
     LockServer,
     SingleMachineBucketScheduler,
@@ -557,6 +558,7 @@ def load_embeddings(
     def swap_partitioned_embeddings(
         old_b: Optional[Bucket],
         new_b: Optional[Bucket],
+        old_stats: Optional[BucketStats],
     ):
         # 0. given the old and new buckets, construct data structures to keep
         # track of old and new embedding (entity, part) tuples
@@ -577,6 +579,8 @@ def swap_partitioned_embeddings(
         # 1. checkpoint embeddings that will not be used in the next pair
         #
         if old_b is not None:  # there are previous embeddings to checkpoint
+            if old_stats is None:
+                raise TypeError("Got old bucket but not its stats")
             logger.info("Writing partitioned embeddings")
             for entity, part in to_checkpoint:
                 side = old_parts[(entity, part)]
@@ -593,7 +597,7 @@ def swap_partitioned_embeddings(
                 del embs
                 del optim_state
 
-            bucket_scheduler.release_bucket(old_b)
+            bucket_scheduler.release_bucket(old_b, old_stats)
 
         # 2. copy old embeddings that will be used in the next pair
         # into a temporary dictionary
@@ -669,19 +673,22 @@ def swap_partitioned_embeddings(
         sync.barrier()
 
         remaining = total_buckets
-        cur_b = None
+        cur_b: Optional[Bucket] = None
+        cur_stats: Optional[BucketStats] = None
         while remaining > 0:
-            old_b = cur_b
+            old_b: Optional[Bucket] = cur_b
+            old_stats: Optional[BucketStats] = cur_stats
             io_time = 0.
             io_bytes = 0
             cur_b, remaining = bucket_scheduler.acquire_bucket()
             logger.info(f"still in queue: {remaining}")
             if cur_b is None:
+                cur_stats = None
                 if old_b is not None:
                     # if you couldn't get a new pair, release the lock
                     # to prevent a deadlock!
                     tic = time.time()
-                    io_bytes += swap_partitioned_embeddings(old_b, None)
+                    io_bytes += swap_partitioned_embeddings(old_b, None, old_stats)
                     io_time += time.time() - tic
                 time.sleep(1)  # don't hammer td
                 continue
@@ -690,7 +697,7 @@ def swap_partitioned_embeddings(
 
             tic = time.time()
 
-            io_bytes += swap_partitioned_embeddings(old_b, cur_b)
+            io_bytes += swap_partitioned_embeddings(old_b, cur_b, old_stats)
 
             current_index = \
                 (iteration_manager.iteration_idx + 1) * total_buckets - remaining
@@ -803,19 +810,18 @@ def swap_partitioned_embeddings(
                 eval_stats_after = Stats.sum(all_eval_stats_after).average()
                 bucket_logger.info(f"Stats after training: {eval_stats_after}")
 
-            # Add train/eval metrics to queue
-            stats_dict = {
-                "index": current_index,
-                "stats": stats.to_dict(),
-            }
-            if eval_stats_before is not None:
-                stats_dict["eval_stats_before"] = eval_stats_before.to_dict()
-            if eval_stats_after is not None:
-                stats_dict["eval_stats_after"] = eval_stats_after.to_dict()
-            checkpoint_manager.append_stats(stats_dict)
             yield current_index, eval_stats_before, stats, eval_stats_after
 
-        swap_partitioned_embeddings(cur_b, None)
+            cur_stats = BucketStats(
+                lhs_partition=cur_b.lhs,
+                rhs_partition=cur_b.rhs,
+                index=current_index,
+                train=stats,
+                eval_before=eval_stats_before,
+                eval_after=eval_stats_after,
+            )
+
+        swap_partitioned_embeddings(cur_b, None, cur_stats)
 
         # Distributed Processing: all machines can leave the barrier now.
         sync.barrier()
@@ -858,6 +864,25 @@ def swap_partitioned_embeddings(
             OptimizerStateDict(trainer.global_optimizer.state_dict()),
         )
 
+        logger.info("Writing the training stats")
+        all_stats_dicts: List[Dict[...]] = []
+        for stats in bucket_scheduler.get_stats_for_pass():
+            stats_dict = {
+                "epoch_idx": epoch_idx,
+                "edge_path_idx": edge_path_idx,
+                "edge_chunk_idx": edge_chunk_idx,
+                "lhs_partition": stats.lhs_partition,
+                "rhs_partition": stats.rhs_partition,
+                "index": stats.index,
+                "stats": stats.train.to_dict(),
+            }
+            if stats.eval_before is not None:
+                stats_dict["eval_stats_before"] = stats.eval_before.to_dict()
+            if stats.eval_after is not None:
+                stats_dict["eval_stats_after"] = stats.eval_after.to_dict()
+            all_stats_dicts.append(stats_dict)
+        checkpoint_manager.append_stats(all_stats_dicts)
+
         logger.info("Writing the checkpoint")
         checkpoint_manager.write_new_version(config)
 
```
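Putting it together: each entry appended at the end of a pass is a flat JSON object keyed exactly as the updated test in test/test_functional.py expects. A sketch of one such entry (all values are made up):

```python
import json

# Illustrative values only; in the real code these come from the iteration
# manager and from the BucketStats collected by the bucket scheduler.
stats_dict = {
    "epoch_idx": 0,
    "edge_path_idx": 0,
    "edge_chunk_idx": 0,
    "lhs_partition": 1,
    "rhs_partition": 2,
    "index": 5,
    "stats": {"count": 1000, "metrics": {"loss": 0.42}},
}
line = json.dumps(stats_dict)
print(json.loads(line)["index"])  # 5
```

The integer keys identify which bucket of which pass the entry belongs to, which is what makes it safe to purge or deduplicate stats after resuming from a checkpoint.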
