Library for running a computation across multiple devices.
See the guide for overview and examples: TensorFlow v2.x, TensorFlow v1.x.
The intent of this library is that you can write an algorithm in a stylized way
and it will be usable with a variety of different tf.distribute.Strategy
implementations. Each descendant will implement a different strategy for
distributing the algorithm across multiple devices/machines. Furthermore, these
changes can be hidden inside the specific layers and other library classes that
need special treatment to run in a distributed setting, so that most users'
model definition code can run unchanged. The tf.distribute.Strategy API works
the same way with eager and graph execution.
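For example, here is a minimal sketch of the "write the model once, run it under any strategy" idea. It assumes a Keras model and at least one local device (MirroredStrategy falls back to the CPU if no GPU is found); the data is purely illustrative.

```python
import numpy as np
import tensorflow as tf

# Pick a strategy; the model code below does not change if you swap this
# out for another tf.distribute.Strategy implementation.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
  # Variables created inside the scope are placed/mirrored by the strategy.
  model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
  model.compile(optimizer="sgd", loss="mse")

# Illustrative random data only.
x = np.random.random((64, 10)).astype("float32")
y = np.random.random((64, 1)).astype("float32")
model.fit(x, y, batch_size=16, epochs=1, verbose=0)
```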
Glossary

Device: A CPU or accelerator (e.g. a GPU or TPU) on some machine that TensorFlow can run operations on (see tf.device). You may have multiple devices on a single machine, or be connected to devices on multiple machines. Devices used to run computations are called worker devices. Devices used to store variables are parameter devices. For some strategies, such as tf.distribute.MirroredStrategy, the worker and parameter devices will be the same (see mirrored variables below). For others they will be different. For example, tf.distribute.experimental.CentralStorageStrategy puts the variables on a single device (which may be a worker device or may be the CPU), and tf.distribute.experimental.ParameterServerStrategy puts the variables on separate machines called parameter servers (see below).

Parameter servers: Machines that hold a single copy of parameters/variables, used by some strategies (currently just tf.distribute.experimental.ParameterServerStrategy). All replicas that want to operate on a variable retrieve it at the beginning of a step and send an update to be applied at the end of the step. These can in principle support either sync or async training, but right now we only have support for async training with parameter servers. Compare to tf.distribute.experimental.CentralStorageStrategy, which puts all variables on a single device on the same machine (and does sync training), and tf.distribute.MirroredStrategy, which mirrors variables to multiple devices (see below).

Note that we provide a default version of tf.distribute.Strategy that is used when no other strategy is in scope, which provides the same API with reasonable default behavior.
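To illustrate the default strategy, here is a small sketch (no particular hardware assumed) using the query functions listed further below:

```python
import tensorflow as tf

# With no strategy in scope, get_strategy() still returns a usable default.
default = tf.distribute.get_strategy()
print(tf.distribute.has_strategy())   # False: only the default strategy is active

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
  print(tf.distribute.has_strategy())               # True
  print(tf.distribute.get_strategy() is strategy)   # True
```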
Modules

cluster_resolver module: Library imports for ClusterResolvers.
experimental module: Experimental Distribution Strategy library.
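As a sketch of how the cluster_resolver module is typically used, the snippet below builds a TFConfigClusterResolver from a hand-written TF_CONFIG. The host names and ports are placeholders; in practice a cluster manager normally sets this environment variable for you.

```python
import json
import os

import tensorflow as tf

# Hypothetical two-worker cluster description; replace hosts/ports with real ones.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["worker0.example.com:2222", "worker1.example.com:2222"]},
    "task": {"type": "worker", "index": 0},
})

resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
print(resolver.cluster_spec())               # parsed cluster layout
print(resolver.task_type, resolver.task_id)  # this task's role within the cluster
```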
Classes

class CrossDeviceOps: Base class for cross-device reduction and broadcasting algorithms.
class HierarchicalCopyAllReduce: Reduction using hierarchical copy all-reduce.
class InputContext: A class wrapping information needed by an input function.
class InputReplicationMode: Replication mode for input function.
class MirroredStrategy: Mirrors vars to distribute across multiple devices and machines.
class NcclAllReduce: Reduction using NCCL all-reduce.
class OneDeviceStrategy: A distribution strategy for running on a single device.
class ReduceOp: Indicates how a set of values should be reduced.
class ReductionToOneDevice: Always do reduction to one device first and then do broadcasting.
class ReplicaContext: tf.distribute.Strategy API when in a replica context.
class Server: An in-process TensorFlow server, for use in distributed training.
class Strategy: A state & compute distribution policy on a list of devices.
class StrategyExtended: Additional APIs for algorithms that need to be distribution-aware.
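A hedged sketch of how several of these classes fit together. It assumes local GPUs for NcclAllReduce (otherwise substitute HierarchicalCopyAllReduce or ReductionToOneDevice), and note that the per-replica entry point is named experimental_run_v2 in some releases and Strategy.run in later ones.

```python
import tensorflow as tf

# Choose explicit cross-device communication (NCCL needs GPUs; see note above).
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce())

def replica_fn(x):
  ctx = tf.distribute.get_replica_context()  # a tf.distribute.ReplicaContext
  # Sum the value across all replicas while still inside the replica context.
  return ctx.all_reduce(tf.distribute.ReduceOp.SUM, x)

per_replica = strategy.experimental_run_v2(replica_fn, args=(tf.constant(1.0),))
# Combine the per-replica results back in the cross-replica context.
print(strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None))
```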
Functions

experimental_set_strategy(...): Set a tf.distribute.Strategy as current without with strategy.scope().
get_replica_context(...): Returns the current tf.distribute.ReplicaContext or None.
get_strategy(...): Returns the current tf.distribute.Strategy object.
has_strategy(...): Returns whether there is a current non-default tf.distribute.Strategy.
in_cross_replica_context(...): Returns True if in a cross-replica context.
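Finally, a small sketch of the functions above, using OneDeviceStrategy so it runs on any machine:

```python
import tensorflow as tf

strategy = tf.distribute.OneDeviceStrategy("/cpu:0")

# Activate the strategy without an explicit `with strategy.scope():` block.
tf.distribute.experimental_set_strategy(strategy)
print(tf.distribute.has_strategy())              # True
print(tf.distribute.in_cross_replica_context())  # True: not inside a replica fn
print(tf.distribute.get_replica_context())       # None while in cross-replica context

# Passing None restores the default strategy.
tf.distribute.experimental_set_strategy(None)
```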