Multi-GPU training in TensorFlow with Horovod
To use Horovod, add the following to your program. This example uses TensorFlow.
Run hvd.init().
Pin a server GPU to this process via config.gpu_options.visible_device_list. With the typical setup of one GPU per process, you can set this to the local rank. In that case, the first process on the server is allocated the first GPU, the second process the second GPU, and so on.
Scale the learning rate by the number of workers. The effective batch size in synchronous distributed training scales with the number of workers; an increased learning rate compensates for the increased batch size.
Wrap the optimizer in hvd.DistributedOptimizer. The distributed optimizer delegates gradient computation to the original optimizer, averages the gradients using allreduce or allgather, and then applies those averaged gradients.
Add hvd.BroadcastGlobalVariablesHook(0) to broadcast the initial variable state from rank 0 to all other processes. This is necessary to ensure consistent initialization of all workers when training starts with random weights or is restored from a checkpoint. Alternatively, if you are not using MonitoredTrainingSession, you can execute the hvd.broadcast_global_variables op after global variables have been initialized.
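When a plain tf.Session is used instead of MonitoredTrainingSession, the broadcast can be issued explicitly. A minimal sketch of that alternative (it assumes a Horovod installation and a TF1-style graph; the model itself is omitted):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Build your model here, then create the op that broadcasts
# rank 0's variable values to every other process.
bcast_op = hvd.broadcast_global_variables(root_rank=0)

with tf.Session() as sess:
    # Initialize variables locally first, then overwrite them with
    # rank 0's values so every worker starts from identical state.
    sess.run(tf.global_variables_initializer())
    sess.run(bcast_op)
    # ... training loop ...
```

Running the broadcast once, immediately after initialization, is sufficient; subsequent synchronization is handled by the averaged gradient updates.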
Modify your code to save checkpoints only on worker 0, to prevent other workers from corrupting them. This can be done by passing checkpoint_dir=None to tf.train.MonitoredTrainingSession if hvd.rank() != 0.
Simple example code
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Make training operation
train_op = opt.minimize(loss)
# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None
# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
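To actually start multiple processes, Horovod ships the horovodrun launcher. Assuming the script above is saved as train.py (the filename is illustrative), a single machine with 4 GPUs can be used like this:

```shell
# Launch 4 training processes on the local machine (one per GPU).
horovodrun -np 4 python train.py

# Across two machines with 4 GPUs each, -H lists host:slot pairs.
horovodrun -np 8 -H server1:4,server2:4 python train.py
```

Each launched process gets its own rank and local rank, which is what hvd.local_rank() and hvd.rank() in the script rely on.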