https://horovod.readthedocs.io/en/stable/docker.html

Step 1: Build the image

GPU

$ mkdir horovod-docker-gpu
$ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
$ docker build -t horovod:latest horovod-docker-gpu

CPU

$ mkdir horovod-docker-cpu
$ wget -O horovod-docker-cpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
$ docker build -t horovod:latest horovod-docker-cpu
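
The GPU and CPU builds differ only in the variant name, so keeping the directory name and the Dockerfile suffix in one variable helps avoid a mismatch between them. The following is a minimal sketch: the function name `horovod_build_cmds` is hypothetical (not part of Horovod), and it only prints the commands instead of running them, so the output can be reviewed first.

```shell
# Hypothetical helper (not part of Horovod): print the build commands
# for one image variant, "gpu" or "cpu".
horovod_build_cmds() {
  variant="$1"
  case "$variant" in
    gpu|cpu) ;;
    *) echo "usage: horovod_build_cmds gpu|cpu" >&2; return 1 ;;
  esac
  # One variable drives both the directory and the Dockerfile suffix.
  dir="horovod-docker-$variant"
  echo "mkdir -p $dir"
  echo "wget -O $dir/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.$variant"
  echo "docker build -t horovod:latest $dir"
}
```

Running `horovod_build_cmds cpu | sh` would execute the printed commands, assuming wget and docker are installed.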

Running on a single machine

On a machine with GPUs, you can use nvidia-docker.

$ nvidia-docker run -it horovod:latest
root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py

Running on multiple machines

(1) Prerequisite for multi-machine runs: passwordless SSH login

http://www.linuxproblem.org/art_9.html

  1. First log in on A as user a and generate a pair of authentication keys. Do not enter a passphrase:
a@A:~> ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/a/.ssh/id_rsa):
Created directory '/home/a/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/a/.ssh/id_rsa.
Your public key has been saved in /home/a/.ssh/id_rsa.pub.
The key fingerprint is:
3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A
  2. Now use ssh to create a directory ~/.ssh as user b on B. (The directory may already exist, which is fine):
a@A:~> ssh b@B mkdir -p .ssh
b@B's password:
  3. Finally append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last time:
a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
b@B's password:
  4. From now on you can log into B as b from A as a without password:
a@A:~> ssh b@B
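
Before starting a multi-machine job, it can help to confirm that the login really works without a prompt. The following is a minimal sketch: the function name `check_passwordless` is hypothetical, and the optional port argument is an assumption added for setups where sshd listens on a non-default port. It relies on the OpenSSH option `BatchMode=yes`, which makes ssh fail rather than ask for a password.

```shell
# Hypothetical helper: succeed only if ssh can log in without prompting.
# $1 = target such as b@B, $2 = optional port (defaults to 22).
check_passwordless() {
  target="$1"
  port="${2:-22}"
  # BatchMode=yes disables password prompts, so a missing key is an error.
  # ConnectTimeout=5 keeps an unreachable host from hanging the check.
  ssh -o BatchMode=yes -o ConnectTimeout=5 -p "$port" "$target" true
}

# Example (not executed here):
#   check_passwordless b@B && echo "passwordless login to B works"
```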

(2) Primary worker

host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py
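
In the command above, the value of -np (16) is the sum of the slots listed after -H (four hosts with 4 slots each). The following is a minimal sketch of deriving both arguments from one host list so they stay consistent. The function name `build_horovodrun_args` is hypothetical, and it assumes every host has the same number of slots.

```shell
# Hypothetical helper: build the -np and -H arguments for horovodrun.
# $1 = slots per host (assumed equal on every host), remaining args = host names.
build_horovodrun_args() {
  slots="$1"; shift
  hosts=""
  total=0
  for h in "$@"; do
    # Append "host:slots", adding a comma only after the first entry.
    hosts="${hosts:+$hosts,}$h:$slots"
    total=$((total + slots))
  done
  echo "-np $total -H $hosts"
}

# Example (not executed here):
#   horovodrun $(build_horovodrun_args 4 host1 host2 host3 host4) -p 12345 python keras_mnist_advanced.py
```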

(3) Secondary workers:

host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

RDMA (remote direct memory access) support

$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
root@c278c88dd552:/examples# ...
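
The two extra flags only make sense on hosts that actually expose an InfiniBand device. The following is a minimal sketch of adding them conditionally. The function name `rdma_flags` is hypothetical, and checking for the existence of /dev/infiniband is an assumption about how to detect such a host, not something the Horovod guide prescribes.

```shell
# Hypothetical helper: print the extra docker flags only when the
# device path exists. $1 = device path (defaults to /dev/infiniband).
rdma_flags() {
  dev="${1:-/dev/infiniband}"
  if [ -e "$dev" ]; then
    echo "--cap-add=IPC_LOCK --device=$dev"
  fi
}

# Example (not executed here):
#   nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh $(rdma_flags) horovod:latest
```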
