tf.estimator.RunConfig

This class specifies the configurations for an Estimator run.

View aliases

Compat aliases for migration

tf.compat.v1.estimator.RunConfig, tf.compat.v2.estimator.RunConfig

tf.estimator.RunConfig(
    model_dir=None, tf_random_seed=None, save_summary_steps=100,
    save_checkpoints_steps=_USE_DEFAULT, save_checkpoints_secs=_USE_DEFAULT,
    session_config=None, keep_checkpoint_max=5, keep_checkpoint_every_n_hours=10000,
    log_step_count_steps=100, train_distribute=None, device_fn=None, protocol=None,
    eval_distribute=None, experimental_distribute=None,
    experimental_max_worker_delay_secs=None, session_creation_timeout_secs=7200
)

Args:

model_dir: directory where model parameters, graph, etc are saved. If PathLike object, the path will be resolved. If None, will use a default value set by the Estimator.

tf_random_seed: Random seed for TensorFlow initializers. Setting this value allows consistency between reruns.

save_summary_steps: Save summaries every this many steps.

save_checkpoints_steps: Save checkpoints every this many steps. Can not be specified with save_checkpoints_secs.

save_checkpoints_secs: Save checkpoints every this many seconds. Can not be specified with save_checkpoints_steps. Defaults to 600 seconds if both save_checkpoints_steps and save_checkpoints_secs are not set in constructor. If both save_checkpoints_steps and save_checkpoints_secs are None, then checkpoints are disabled.

session_config: a ConfigProto used to set session parameters, or None.

keep_checkpoint_max: The maximum number of recent checkpoint files to keep. As new files are created, older files are deleted. If None or 0, all checkpoint files are kept. Defaults to 5 (that is, the 5 most recent checkpoint files are kept.)

keep_checkpoint_every_n_hours: Number of hours between each checkpoint to be saved. The default value of 10,000 hours effectively disables the feature.

log_step_count_steps: The frequency, in number of global steps, that the global step and the loss will be logged during training. Also controls the frequency that the global steps / s will be logged (and written to summary) during training.

train_distribute: An optional instance of tf.distribute.Strategy. If specified, then Estimator will distribute the user's model during training, according to the policy specified by that strategy. Setting experimental_distribute.train_distribute is preferred.

device_fn: A callable invoked for every Operation that takes the Operation and returns the device string. If None, defaults to the device function returned by tf.train.replica_device_setter with round-robin strategy.

protocol: An optional argument which specifies the protocol used when starting server. None means default to grpc.

eval_distribute: An optional instance of tf.distribute.Strategy. If specified, then Estimator will distribute the user's model during evaluation, according to the policy specified by that strategy. Setting experimental_distribute.eval_distribute is preferred.

experimental_distribute: An optional tf.contrib.distribute.DistributeConfig object specifying DistributionStrategy-related configuration. The train_distribute and eval_distribute can be passed as parameters to RunConfig or set in experimental_distribute but not both.

experimental_max_worker_delay_secs: An optional integer specifying the maximum time a worker should wait before starting. By default, workers are started at staggered times, with each worker being delayed by up to 60 seconds. This is intended to reduce the risk of divergence, which can occur when many workers simultaneously update the weights of a randomly initialized model. Users who warm-start their models and train them for short durations (a few minutes or less) should consider reducing this default to improve training times.

session_creation_timeout_secs: Max time workers should wait for a session to become available (on initialization or when recovering a session) with MonitoredTrainingSession. Defaults to 7200 seconds, but users may want to set a lower value to detect problems with variable / session (re)-initialization more quickly.

Attributes:

cluster_spec

device_fn: Returns the device_fn.

If device_fn is not None, it overrides the default device function used in Estimator. Otherwise the default one is used.

eval_distribute: Optional tf.distribute.Strategy for evaluation.

evaluation_master

experimental_max_worker_delay_secs

global_id_in_cluster: The global id in the training cluster.

All global ids in the training cluster are assigned from an increasing sequence of consecutive integers. The first id is 0.

Note: Task id (the property field task_id) is tracking the index of the node among all nodes with the SAME task type. For example, given the cluster definition as follows:

cluster = {'chief': ['host0:2222'],
           'ps': ['host1:2222', 'host2:2222'],
           'worker': ['host3:2222', 'host4:2222', 'host5:2222']}

Nodes with task type worker can have id 0, 1, 2. Nodes with task type ps can have id, 0, 1. So, task_id is not unique, but the pair (task_type, task_id) can uniquely determine a node in the cluster.

Global id, i.e., this field, is tracking the index of the node among ALL nodes in the cluster. It is uniquely assigned. For example, for the cluster spec given above, the global ids are assigned as: ```

task_type | task_id | global_id

chief | 0 | 0 worker | 0 | 1 worker | 1 | 2 worker | 2 | 3 ps | 0 | 4 ps | 1 | 5 ```

is_chief

keep_checkpoint_every_n_hours

keep_checkpoint_max

log_step_count_steps

master

model_dir

num_ps_replicas

num_worker_replicas

protocol: Returns the optional protocol value.

save_checkpoints_secs

save_checkpoints_steps

save_summary_steps

service: Returns the platform defined (in TF_CONFIG) service dict.

session_config

session_creation_timeout_secs

task_id

task_type

tf_random_seed

train_distribute: Optional tf.distribute.Strategy for training.

Raises:

ValueError: If both save_checkpoints_steps and save_checkpoints_secs are set.

Methods

replace

replace(
    **kwargs
)

In addition, either save_checkpoints_steps or save_checkpoints_secs can be set (should not be both).

tf.estimator.RunConfig

View aliases

Args:

Attributes:

task_type | task_id | global_id

Raises:

Methods

`replace`

Args:

Raises:

Returns: