Horovod-Usage
2024-09-08 04:08:19
Usage
Your training code needs to include the following six steps:
- Initialize Horovod.
Run hvd.init() to initialize Horovod.
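- Pin each GPU to a single process to avoid resource contention.
With the usual setup of one GPU per process, the GPU is selected by local rank: the first process on a server is assigned the first GPU, the second process the second GPU, and so on, so that each TensorFlow process gets exactly one GPU.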
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
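- Scale the learning rate by the number of workers. In synchronous distributed training the effective batch size grows with the number of workers, and the larger learning rate compensates for it.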
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
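- Wrap each regular TensorFlow optimizer in the Horovod optimizer, hvd.DistributedOptimizer. The Horovod optimizer averages the gradients across workers using ring-allreduce.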
opt = hvd.DistributedOptimizer(opt)
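- Broadcast the initial variable states from the first process (rank 0) to all other processes, so that every worker starts from the same initialization.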
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
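- Save checkpoints only on worker 0, so that the other workers cannot corrupt them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    ...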
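To make the gradient-averaging step in the list above more concrete, here is a minimal pure-Python simulation of ring-allreduce. It is a sketch for illustration only, not Horovod's implementation: the function name ring_allreduce_average and the list-of-lists representation of per-worker gradients are assumptions made for this example, and real Horovod exchanges data between separate processes through a communication backend such as MPI or NCCL rather than inside a single loop.

```python
def ring_allreduce_average(worker_grads):
    """Simulate ring-allreduce averaging (hypothetical helper, not a Horovod API).

    worker_grads: one equally sized list of numbers per simulated worker.
    Returns one averaged list per worker; all of them end up identical.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Work on float copies so the caller's data is left untouched.
    bufs = [[float(v) for v in g] for g in worker_grads]
    # Split every buffer into n contiguous chunks (some may be empty).
    bounds = [(k * length) // n for k in range(n + 1)]

    def chunk(k):
        return range(bounds[k], bounds[k + 1])

    # Phase 1 (scatter-reduce): in each step, worker i passes one chunk to
    # worker i + 1, which adds it to its own copy. After n - 1 steps,
    # worker i holds the complete sum of chunk (i + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so the sends in a step are simultaneous.
        msgs = []
        for i in range(n):
            k = (i - step) % n
            msgs.append(((i + 1) % n, k, [bufs[i][j] for j in chunk(k)]))
        for dst, k, data in msgs:
            for j, v in zip(chunk(k), data):
                bufs[dst][j] += v

    # Phase 2 (allgather): the completed chunks travel around the ring and
    # overwrite the partial sums until every worker has every chunk.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            k = (i + 1 - step) % n
            msgs.append(((i + 1) % n, k, [bufs[i][j] for j in chunk(k)]))
        for dst, k, data in msgs:
            for j, v in zip(chunk(k), data):
                bufs[dst][j] = v

    # Turn the sums into averages.
    return [[v / n for v in b] for b in bufs]


# Three simulated workers, each with its own gradient vector.
grads = [[1, 2, 3, 4], [3, 4, 5, 6], [5, 6, 7, 8]]
result = ring_allreduce_average(grads)
print(result[0])
```

Each worker sends and receives only one chunk per step, so the amount of data a worker transfers stays roughly constant as more workers join the ring.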
A complete example that puts the six steps together (TensorFlow 1.x API):
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())
# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Make training operation
train_op = opt.minimize(loss)
# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None
# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
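The script above derives three per-process settings from hvd.size(), hvd.rank(), and hvd.local_rank(): the visible GPU, the scaled learning rate, and the checkpoint directory. The following sketch reproduces that arithmetic in plain Python for a hypothetical cluster, without requiring Horovod or TensorFlow. The helper simulate_layout and the assumption that ranks are numbered host by host are illustrative only; the actual rank assignment is decided by the launcher.

```python
def simulate_layout(num_hosts, gpus_per_host, base_lr=0.01,
                    ckpt_dir='/tmp/train_logs'):
    """Hypothetical helper: list the settings each process would derive."""
    size = num_hosts * gpus_per_host          # stands in for hvd.size()
    layout = []
    for rank in range(size):                  # stands in for hvd.rank()
        # Assumption: ranks are assigned host by host, so the position
        # within a host (hvd.local_rank()) is the remainder.
        local_rank = rank % gpus_per_host
        layout.append({
            'rank': rank,
            'local_rank': local_rank,
            # Step 2: each process sees only the GPU matching its local rank.
            'visible_device_list': str(local_rank),
            # Step 3: the base learning rate is scaled by the worker count.
            'learning_rate': base_lr * size,
            # Step 6: only rank 0 gets a checkpoint directory.
            'checkpoint_dir': ckpt_dir if rank == 0 else None,
        })
    return layout


# Two hosts with four GPUs each, i.e. eight processes in total.
layout = simulate_layout(num_hosts=2, gpus_per_host=4)
for entry in layout:
    print(entry)
```

In this layout, ranks 0 and 4 both use GPU 0, but on different hosts, and only rank 0 writes checkpoints.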