Horovod in Docker
2024-08-27 10:48:53
https://horovod.readthedocs.io/en/stable/docker.html
Step 1: Build the image
GPU
$ mkdir horovod-docker-gpu
$ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
$ docker build -t horovod:latest horovod-docker-gpu
CPU
$ mkdir horovod-docker-cpu
$ wget -O horovod-docker-cpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
$ docker build -t horovod:latest horovod-docker-cpu
Running on a single machine
On a machine with GPUs, you can use nvidia-docker.
$ nvidia-docker run -it horovod:latest
root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py
Running on multiple machines
(1) Prerequisite for multi-machine runs: passwordless SSH login
http://www.linuxproblem.org/art_9.html
- First log in on A as user a and generate a pair of authentication keys. Do not enter a passphrase:
a@A:~> ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/a/.ssh/id_rsa):
Created directory '/home/a/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/a/.ssh/id_rsa.
Your public key has been saved in /home/a/.ssh/id_rsa.pub.
The key fingerprint is:
3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A
- Now use ssh to create a directory ~/.ssh as user b on B. (The directory may already exist, which is fine):
a@A:~> ssh b@B mkdir -p .ssh
b@B's password:
- Finally append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last time:
a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
b@B's password:
- From now on you can log into B as b from A as a without password:
a@A:~> ssh b@B
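(2) Primary worker (/mnt/share/ssh is a shared directory holding the id_rsa and authorized_keys pair used for passwordless login):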
host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py
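The -np value in the command above is the slot count multiplied by the number of hosts, and the -H value is a comma-separated list of host:slots pairs. As a minimal sketch, assuming every host has the same number of slots, both can be derived from one host list. The function names build_hosts and total_procs are hypothetical helpers, not part of Horovod.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Horovod): derive the horovodrun
# -H and -np arguments from a per-host slot count and a list of hosts.

# build_hosts SLOTS HOST... -> "HOST:SLOTS,HOST:SLOTS,..."
build_hosts() (
    slots="$1"; shift
    out=""
    for host in "$@"; do
        # Prefix a comma only when out already holds an entry.
        out="${out:+$out,}${host}:${slots}"
    done
    printf '%s\n' "$out"
)

# total_procs SLOTS HOST... -> SLOTS multiplied by the number of hosts
total_procs() (
    slots="$1"; shift
    echo $(( slots * $# ))
)

HOSTS=$(build_hosts 4 host1 host2 host3 host4)
NP=$(total_procs 4 host1 host2 host3 host4)

# Print the command instead of running it, so it can be checked first.
echo "horovodrun -np ${NP} -H ${HOSTS} -p 12345 python keras_mnist_advanced.py"
```

With four hosts and four slots each, this prints the same horovodrun command as the one shown above.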
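(3) Secondary workers (each one runs sshd on port 12345, matching the -p 12345 passed to horovodrun on the primary worker):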
host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
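The three commands above differ only in the host they run on. A small sketch, assuming the same image, mount, and port on every host, prints the command for each secondary worker from a single template so the sshd port stays consistent. The function name worker_cmd is a hypothetical helper; the script only prints the commands and does not start any containers.

```shell
#!/bin/sh
# Hypothetical helper: print the container command a secondary worker runs.
# $1 is the sshd port; it must match the -p value given to horovodrun.
worker_cmd() {
    printf '%s\n' "nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest bash -c \"/usr/sbin/sshd -p $1; sleep infinity\""
}

PORT=12345
for host in host2 host3 host4; do
    # Printed only; each line is meant to be run on the named host.
    echo "${host}\$ $(worker_cmd "$PORT")"
done
```

Changing PORT in one place then updates every worker command; the -p value on the primary worker has to be changed to match.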
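Adding RDMA (Remote Direct Memory Access) support
To use RDMA-capable NICs, mount /dev/infiniband into the container and add the IPC_LOCK capability, which is needed for memory registration: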
$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
root@c278c88dd552:/examples# ...