Class RunConfig
This class specifies the configurations for an Estimator run.
__init__
__init__(
model_dir=None,
tf_random_seed=None,
save_summary_steps=100,
save_checkpoints_steps=_USE_DEFAULT,
save_checkpoints_secs=_USE_DEFAULT,
session_config=None,
keep_checkpoint_max=5,
keep_checkpoint_every_n_hours=10000,
log_step_count_steps=100,
train_distribute=None,
device_fn=None,
protocol=None,
eval_distribute=None,
experimental_distribute=None
)
Constructs a RunConfig.
All properties related to distributed training (cluster_spec, is_chief,
master, num_worker_replicas, num_ps_replicas, task_id, and task_type) are
set from the TF_CONFIG environment variable when it contains the pertinent
information. The TF_CONFIG environment variable is a JSON object with two
attributes: cluster and task.
cluster is a JSON serialized version of ClusterSpec's Python dict from
server_lib.py, mapping task types (usually one of the TaskType enums) to
a list of task addresses.
task has two attributes: type and index, where type can be any of
the task types in cluster. When TF_CONFIG contains said information,
the following properties are set on this class:
cluster_spec is parsed from TF_CONFIG['cluster']. Defaults to {}. If present, must have one and only one node in the chief attribute of cluster_spec.
task_type is set to TF_CONFIG['task']['type']. Must be set if cluster_spec is present; must be worker (the default value) if cluster_spec is not set.
task_id is set to TF_CONFIG['task']['index']. Must be set if cluster_spec is present; must be 0 (the default value) if cluster_spec is not set.
master is determined by looking up task_type and task_id in the cluster_spec. Defaults to ''.
num_ps_replicas is set by counting the number of nodes listed in the ps attribute of cluster_spec. Defaults to 0.
num_worker_replicas is set by counting the number of nodes listed in the worker and chief attributes of cluster_spec. Defaults to 1.
is_chief is determined based on task_type and cluster.
There is a special node with task_type as evaluator, which is not part
of the (training) cluster_spec. It handles the distributed evaluation job.
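The derivation rules above can be sketched in plain Python, without TensorFlow. This is an illustrative sketch only, not the library's implementation: the function name describe_tf_config is hypothetical, the returned master is the bare address as in the examples in this section, and treating a process without a cluster as chief is an assumption.

```python
import json


def describe_tf_config(tf_config_json):
    """Derive RunConfig-style properties from a TF_CONFIG JSON string."""
    tf_config = json.loads(tf_config_json or '{}')
    cluster = tf_config.get('cluster', {})
    task = tf_config.get('task', {})
    task_type = task.get('type', 'worker')  # documented default
    task_id = int(task.get('index', 0))     # documented default

    if task_type == 'evaluator':
        # The evaluator is not part of the training cluster.
        return {'cluster_spec': {}, 'task_type': task_type, 'task_id': task_id,
                'master': '', 'num_ps_replicas': 0,
                'num_worker_replicas': 0, 'is_chief': False}

    if not cluster:
        # No cluster: documented defaults. Treating a lone process as chief
        # is an assumption of this sketch.
        return {'cluster_spec': {}, 'task_type': task_type, 'task_id': task_id,
                'master': '', 'num_ps_replicas': 0,
                'num_worker_replicas': 1, 'is_chief': True}

    if len(cluster.get('chief', [])) != 1:
        raise ValueError('cluster must list exactly one chief node')

    return {
        'cluster_spec': cluster,
        'task_type': task_type,
        'task_id': task_id,
        # Look up this task's address by (task_type, task_id).
        'master': cluster[task_type][task_id],
        'num_ps_replicas': len(cluster.get('ps', [])),
        # Workers plus the chief count as worker replicas.
        'num_worker_replicas': len(cluster.get('worker', [])) + 1,
        'is_chief': task_type == 'chief',
    }
```

The sketch takes the JSON string as an argument rather than reading os.environ, which keeps it easy to exercise with different cluster layouts.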
Example of non-chief node:
cluster = {'chief': ['host0:2222'],
'ps': ['host1:2222', 'host2:2222'],
'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
{'cluster': cluster,
'task': {'type': 'worker', 'index': 1}})
config = RunConfig()
assert config.master == 'host4:2222'
assert config.task_id == 1
assert config.num_ps_replicas == 2
assert config.num_worker_replicas == 4
assert config.cluster_spec == server_lib.ClusterSpec(cluster)
assert config.task_type == 'worker'
assert not config.is_chief
Example of chief node:
cluster = {'chief': ['host0:2222'],
'ps': ['host1:2222', 'host2:2222'],
'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
{'cluster': cluster,
'task': {'type': 'chief', 'index': 0}})
config = RunConfig()
assert config.master == 'host0:2222'
assert config.task_id == 0
assert config.num_ps_replicas == 2
assert config.num_worker_replicas == 4
assert config.cluster_spec == server_lib.ClusterSpec(cluster)
assert config.task_type == 'chief'
assert config.is_chief
Example of evaluator node (evaluator is not part of training cluster):
cluster = {'chief': ['host0:2222'],
'ps': ['host1:2222', 'host2:2222'],
'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
os.environ['TF_CONFIG'] = json.dumps(
{'cluster': cluster,
'task': {'type': 'evaluator', 'index': 0}})
config = RunConfig()
assert config.master == ''
assert config.evaluation_master == ''
assert config.task_id == 0
assert config.num_ps_replicas == 0
assert config.num_worker_replicas == 0
assert config.cluster_spec == {}
assert config.task_type == 'evaluator'
assert not config.is_chief
N.B.: If save_checkpoints_steps or save_checkpoints_secs is set,
keep_checkpoint_max might need to be adjusted accordingly, especially in
distributed training. For example, setting save_checkpoints_secs to 60
without adjusting keep_checkpoint_max (which defaults to 5) means that each
checkpoint is garbage collected after 5 minutes. In distributed training, the
evaluation job starts asynchronously and might fail to find or load the
checkpoint because of this race condition.
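The arithmetic behind this note can be sketched as follows. Both helper names are hypothetical, and the result is only an approximation that assumes checkpoints are written at a steady cadence.

```python
import math


def checkpoint_retention_secs(save_checkpoints_secs, keep_checkpoint_max):
    """Approximate lifetime of a checkpoint before it is garbage collected."""
    if not keep_checkpoint_max:
        # None or 0 means every checkpoint is kept.
        return float('inf')
    return save_checkpoints_secs * keep_checkpoint_max


def min_keep_checkpoint_max(save_checkpoints_secs, required_lifetime_secs):
    """Smallest keep_checkpoint_max that keeps a checkpoint long enough."""
    return math.ceil(required_lifetime_secs / save_checkpoints_secs)


# Saving every 60 seconds with the default of 5 kept checkpoints gives each
# checkpoint a lifetime of about 300 seconds (5 minutes).
retention = checkpoint_retention_secs(60, 5)
```

If the evaluation job may need up to 30 minutes to pick up a checkpoint that is written every 60 seconds, min_keep_checkpoint_max(60, 1800) suggests keeping at least 30 checkpoints.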
Args:
model_dir: Directory where model parameters, graph, etc. are saved. If a PathLike object, the path will be resolved. If None, a default value set by the Estimator is used.
tf_random_seed: Random seed for TensorFlow initializers. Setting this value allows consistency between reruns.
save_summary_steps: Save summaries every this many steps.
save_checkpoints_steps: Save checkpoints every this many steps. Cannot be specified with save_checkpoints_secs.
save_checkpoints_secs: Save checkpoints every this many seconds. Cannot be specified with save_checkpoints_steps. Defaults to 600 seconds if neither save_checkpoints_steps nor save_checkpoints_secs is set in the constructor. If both save_checkpoints_steps and save_checkpoints_secs are None, checkpoints are disabled.
session_config: A ConfigProto used to set session parameters, or None.
keep_checkpoint_max: The maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent checkpoint files are kept).
keep_checkpoint_every_n_hours: Number of hours between each checkpoint to be saved. The default value of 10,000 hours effectively disables the feature.
log_step_count_steps: The frequency, in number of global steps, at which the global step and the loss are logged during training. Also controls the frequency at which global steps / s is logged (and written to summary) during training.
train_distribute: An optional instance of tf.contrib.distribute.DistributionStrategy. If specified, Estimator distributes the user's model during training according to the policy specified by that strategy. Setting experimental_distribute.train_distribute is preferred.
device_fn: A callable invoked for every Operation that takes the Operation and returns the device string. If None, defaults to the device function returned by tf.train.replica_device_setter with round-robin strategy.
protocol: An optional argument which specifies the protocol used when starting the server. None means default to grpc.
eval_distribute: An optional instance of tf.contrib.distribute.DistributionStrategy. If specified, Estimator distributes the user's model during evaluation according to the policy specified by that strategy. Setting experimental_distribute.eval_distribute is preferred.
experimental_distribute: An optional tf.contrib.distribute.DistributeConfig object specifying DistributionStrategy-related configuration. train_distribute and eval_distribute can be passed as parameters to RunConfig or set in experimental_distribute, but not both.
Raises:
ValueError: If both save_checkpoints_steps and save_checkpoints_secs are set.
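The interaction between the two checkpoint arguments can be sketched in plain Python. This is a sketch under stated assumptions, not the library's code: resolve_checkpoint_settings is a hypothetical helper, the _USE_DEFAULT object here is a local stand-in sentinel meaning "argument not passed", and treating a single unset argument as None is an assumption.

```python
_USE_DEFAULT = object()  # local stand-in sentinel: "argument not passed"


def resolve_checkpoint_settings(save_checkpoints_steps=_USE_DEFAULT,
                                save_checkpoints_secs=_USE_DEFAULT):
    """Return the effective (steps, secs) pair for checkpoint saving."""
    if (save_checkpoints_steps is _USE_DEFAULT and
            save_checkpoints_secs is _USE_DEFAULT):
        # Neither argument was passed: save every 600 seconds.
        return None, 600

    # Assumption: an argument left at the sentinel counts as unset.
    if save_checkpoints_steps is _USE_DEFAULT:
        save_checkpoints_steps = None
    if save_checkpoints_secs is _USE_DEFAULT:
        save_checkpoints_secs = None

    if save_checkpoints_steps is not None and save_checkpoints_secs is not None:
        raise ValueError(
            'save_checkpoints_steps and save_checkpoints_secs cannot both be set')

    # (None, None) means checkpoints are disabled.
    return save_checkpoints_steps, save_checkpoints_secs
```

A sentinel distinct from None is what lets the constructor tell "not passed" (use the 600 second default) apart from an explicit None (disable checkpoints).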
Properties
cluster_spec
device_fn
Returns the device_fn.
If device_fn is not None, it overrides the default
device function used in Estimator.
Otherwise the default one is used.
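As an illustration of the callable's shape, here is a minimal sketch of a device function. The function name variables_on_cpu and the placement policy are hypothetical; the stand-in object below only mimics the type attribute of an Operation so the sketch runs without TensorFlow.

```python
from types import SimpleNamespace


def variables_on_cpu(op):
    """Hypothetical device_fn: variable ops go to the CPU, the rest to GPU 0."""
    if op.type in ('Variable', 'VariableV2', 'VarHandleOp'):
        return '/cpu:0'
    return '/gpu:0'


# Stand-in objects exposing only the `type` attribute read by the function.
fake_variable_op = SimpleNamespace(type='VariableV2')
fake_matmul_op = SimpleNamespace(type='MatMul')
placements = [variables_on_cpu(fake_variable_op),
              variables_on_cpu(fake_matmul_op)]
```

Such a function would be passed as device_fn=variables_on_cpu when constructing the RunConfig.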
eval_distribute
Optional tf.contrib.distribute.DistributionStrategy for evaluation.
evaluation_master
global_id_in_cluster
The global id in the training cluster.
All global ids in the training cluster are assigned from an increasing sequence of consecutive integers. The first id is 0.
cluster = {'chief': ['host0:2222'],
'ps': ['host1:2222', 'host2:2222'],
'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
Nodes with task type worker can have id 0, 1, 2. Nodes with task type
ps can have id 0, 1. So, task_id is not unique, but the pair
(task_type, task_id) can uniquely determine a node in the cluster.
The global id (this field) tracks the index of the node among ALL nodes in the cluster and is uniquely assigned. For example, for the cluster spec given above, the global ids are assigned as:
task_type | task_id | global_id
--------------------------------
chief | 0 | 0
worker | 0 | 1
worker | 1 | 2
worker | 2 | 3
ps | 0 | 4
ps | 1 | 5
Returns:
An integer id.
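The assignment shown in the table can be reproduced with a short sketch. The function below is hypothetical and the ordering it uses (chief first, then other task types sorted by name, then ps last) is an assumption chosen to match the table; the actual ordering is an implementation detail and should not be relied on.

```python
def global_id_in_cluster(cluster, task_type, task_id):
    """Index of (task_type, task_id) among all nodes of the cluster."""
    # Assumed ordering: chief, then non-ps task types by name, then ps.
    ordered_types = []
    if 'chief' in cluster:
        ordered_types.append('chief')
    ordered_types.extend(
        t for t in sorted(cluster) if t not in ('chief', 'ps'))
    if 'ps' in cluster:
        ordered_types.append('ps')

    offset = 0
    for current_type in ordered_types:
        if current_type == task_type:
            if not 0 <= task_id < len(cluster[current_type]):
                raise ValueError('task_id out of range for %s' % task_type)
            return offset + task_id
        # Skip past all nodes of the preceding task type.
        offset += len(cluster[current_type])
    raise ValueError('unknown task_type: %s' % task_type)


cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}
ids = [global_id_in_cluster(cluster, t, i)
       for t, i in [('chief', 0), ('worker', 0), ('worker', 1),
                    ('worker', 2), ('ps', 0), ('ps', 1)]]
```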
is_chief
keep_checkpoint_every_n_hours
keep_checkpoint_max
log_step_count_steps
master
model_dir
num_ps_replicas
num_worker_replicas
protocol
Returns the optional protocol value.
save_checkpoints_secs
save_checkpoints_steps
save_summary_steps
service
Returns the platform defined (in TF_CONFIG) service dict.
session_config
task_id
task_type
tf_random_seed
train_distribute
Optional tf.contrib.distribute.DistributionStrategy for training.
Methods
tf.estimator.RunConfig.replace
replace(**kwargs)
Returns a new instance of RunConfig replacing specified properties.
Only the properties in the following list are allowed to be replaced:
model_dir, tf_random_seed, save_summary_steps, save_checkpoints_steps, save_checkpoints_secs, session_config, keep_checkpoint_max, keep_checkpoint_every_n_hours, log_step_count_steps, train_distribute, device_fn, protocol, eval_distribute, experimental_distribute.
In addition, either save_checkpoints_steps or save_checkpoints_secs
can be set, but not both.
Args:
**kwargs: keyword named properties with new values.
Raises:
ValueError: If any property name in kwargs does not exist or is not allowed to be replaced, or both save_checkpoints_steps and save_checkpoints_secs are set.
Returns:
a new instance of RunConfig.
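The copy-and-replace behavior can be sketched with a small stand-in class. MiniRunConfig is hypothetical and holds only a few of the properties; clearing the other checkpoint cadence when one of them is replaced is an assumption of this sketch, not something stated above.

```python
import copy

REPLACEABLE = frozenset([
    'model_dir', 'tf_random_seed', 'save_summary_steps',
    'save_checkpoints_steps', 'save_checkpoints_secs', 'session_config',
    'keep_checkpoint_max', 'keep_checkpoint_every_n_hours',
    'log_step_count_steps', 'train_distribute', 'device_fn', 'protocol',
    'eval_distribute', 'experimental_distribute',
])


class MiniRunConfig:
    """Hypothetical stand-in for RunConfig with a handful of properties."""

    def __init__(self):
        self.model_dir = None
        self.save_summary_steps = 100
        self.save_checkpoints_steps = None
        self.save_checkpoints_secs = 600
        self.keep_checkpoint_max = 5

    def replace(self, **kwargs):
        for name in kwargs:
            if name not in REPLACEABLE:
                raise ValueError('Replacing %r is not supported' % name)
        steps = kwargs.get('save_checkpoints_steps')
        secs = kwargs.get('save_checkpoints_secs')
        if steps is not None and secs is not None:
            raise ValueError('Set save_checkpoints_steps or '
                             'save_checkpoints_secs, not both')

        new_config = copy.deepcopy(self)  # the original stays untouched
        for name, value in kwargs.items():
            setattr(new_config, name, value)
        # Assumption: setting one cadence clears the other.
        if steps is not None:
            new_config.save_checkpoints_secs = None
        elif secs is not None:
            new_config.save_checkpoints_steps = None
        return new_config


base = MiniRunConfig()
updated = base.replace(model_dir='/tmp/model', save_checkpoints_steps=500)
```

Because replace returns a copy, base keeps its original values, so one base configuration can be shared and specialized per run.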