Distributed tensorflow: the difference between In-graph replication and Between-graph replication

First of all, for some historical context, “in-graph replication” is the first approach that we tried in TensorFlow, and it did not achieve the performance that many users required, so the more complicated “between-graph” approach is the current recommended way to perform distributed training. Higher-level libraries such as tf.learn use the “between-graph” approach for distributed training.

To answer your specific questions:

  1. Does this mean there are multiple tf.Graphs in the between-graph
    replication approach? If yes, where are the corresponding codes in the provided examples?

    Yes. The typical between-graph replication setup will use a separate TensorFlow process for each worker replica, and each of this will build a separate tf.Graph for the model. Usually each process uses the global default graph (accessible through tf.get_default_graph()) and it is not created explicitly.

    (In principle, you could use a single TensorFlow process with the same tf.Graph and multiple tf.Session objects that share the same underlying graph, as long as you configured the tf.ConfigProto.device_filters option for each session differently, but this is an uncommon setup.)

  2. While there is already a between-graph replication example in above link, could anyone provide an in-graph replication implementation (pseudocode is fine) and highlight its main differences from between-graph replication?

    For historical reasons, there are not many examples of in-graph replication (Yaroslav’s gist is one exception). A program using in-graph replication will typically include a loop that creates the same graph structure for each worker (e.g. the loop on line 74 of the gist), and use variable sharing between the workers.

    The one place where in-graph replication persists is for using multiple devices in a single process (e.g. multiple GPUs). The CIFAR-10 example model for multiple GPUs is an example of this pattern (see the loop over GPU devices here).

(In my opinion, the inconsistency between how multiple workers and multiple devices in a single worker are treated is unfortunate. In-graph replication is simpler to understand than between-graph replication, because it doesn’t rely on implicit sharing between the replicas. Higher-level libraries, such as tf.learn and TF-Slim, hide some of these issues, and offer hope that we can offer a better replication scheme in the future.)

  1. Why do we say each client builds a similar graph, but not the same graph?

    Because they aren’t required to be identical (and there is no integrity check that enforces this). In particular, each worker might create a graph with different explicit device assignments ("/job:worker/task:0", "/job:worker/task:1", etc.). The chief worker might create additional operations that are not created on (or used by) the non-chief workers. However, in most cases, the graphs are logically (i.e. modulo device assignments) the same.

    Shouldn’t it be multiple copies of the compute-intensive part of the model, since we have multiple workers?

    Typically, each worker has a separate graph that contains a single copy of the compute-intensive part of the model. The graph for worker i does not contain the nodes for worker j (assuming i ≠ j). (An exception would be the case where you’re using between-graph replication for distributed training, and in-graph replication for using multiple GPUs in each worker. In that case, the graph for a worker would typically contain N copies of the compute-intensive part of the graph, where N is the number of GPUs in that worker.)

  2. Does the example in Replicated training support training on multiple machines, each of which has multiple GPUs?

    The example code only covers training on multiple machines, and says nothing about how to train on multiple GPUs in each machine. However, the techniques compose easily. In this part of the example:

    # Build model...
    loss = ...
    

    …you could add a loop over the GPUs in the local machine, to achieve distributed training multiple workers each with multiple GPUs.

Leave a Comment