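Without further ado, here is the combination that finally worked: OS => CentOS 7, cuda => 10.0, cudnn => 7.5, nccl => built from source, tensorflow => latest version, built from source.

First attempt: cuda=>10.1 cudnn=>7.5 nccl=>2.4.2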

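1. CUDA ships as a *.run package. Run sh ./*.run and follow the prompts to install it. Choosing the default path /usr/local/cuda makes the later steps easier.

To configure the environment, append the following to /etc/profile: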

export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
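
After re-sourcing /etc/profile, it can be worth confirming that both variables really contain the CUDA directories. The following is only a minimal sketch, assuming the default /usr/local/cuda prefix from step 1; the helper name missing_entries is made up for illustration.

```python
import os

# Assumed install prefix (the default chosen in step 1).
CUDA_HOME = "/usr/local/cuda"


def missing_entries(value, required):
    """Return the required directories absent from a ':'-separated path list."""
    # Normalise entries so doubled or trailing slashes do not hide a match.
    present = {os.path.normpath(p) for p in value.split(":") if p}
    return [d for d in required if os.path.normpath(d) not in present]


if __name__ == "__main__":
    checks = {
        "PATH": [CUDA_HOME + "/bin"],
        "LD_LIBRARY_PATH": [CUDA_HOME + "/lib64"],
    }
    for var, required in checks.items():
        for d in missing_entries(os.environ.get(var, ""), required):
            print("%s is missing %s" % (var, d))
```

If the script prints nothing, both directories are present in the current shell's environment.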

2. cuDNN unpacks into a folder named cuda. Copy its header and library files into the matching CUDA directories:

sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64

Make the files readable by all users:

sudo chmod a+r /usr/local/cuda/include/cudnn.h
sudo chmod a+r /usr/local/cuda/lib64/libcudnn*

Check that nvcc is installed correctly:

nvcc --version
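
nvcc only confirms the CUDA toolkit, not cuDNN. cuDNN 7.x records its version in cudnn.h as the macros CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, so a small script can read it back. This is a sketch that assumes the header was copied to /usr/local/cuda/include as in step 2; the function name parse_cudnn_version is illustrative.

```python
import re

# Assumed header location: where step 2 copied cudnn.h.
CUDNN_HEADER = "/usr/local/cuda/include/cudnn.h"


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h, or None."""
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        pattern = r"^\s*#\s*define\s+%s\s+(\d+)" % macro
        m = re.search(pattern, header_text, re.M)
        if m is None:
            return None
        parts.append(int(m.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    try:
        with open(CUDNN_HEADER) as f:
            print("cuDNN version:", parse_cudnn_version(f.read()))
    except IOError as e:
        print("could not read header:", e)
```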

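3. Install NCCL

The official site currently offers only the *.rpm format. I could not find the deb format mentioned online, so I could not test whether it works and installed from the rpm: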

rpm -ivh nccl*.rpm

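This step only unpacks, though: the contents land in /var/nccl*, where there are three more rpm files. Install each of them with rpm in turn.

4. Install Bazel

Building TensorFlow requires Google's Bazel. Online tutorials say to download bazel-0.24.1-dist.zip, unpack it, and compile: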

./compile.sh 

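This failed because cmake was needed (see step 5).

The build then failed again. I no longer remember the error, and searching turned up nothing. I downloaded the bazel-0.24.1-installer-linux-x86_64.sh installer instead and ran it directly, which worked.

5. Install CMake

Download a cmake release newer than 3.4, unpack it, then build and install: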

./configure
gmake
make install

Configure the environment variable:

PATH=/usr/local/cmake/bin:$PATH
export PATH

6. Build TensorFlow

Follow the prompts from ./configure to choose paths and optional components:

Please specify the location of python. [Default is /usr/bin/python]:
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
Do you wish to build TensorFlow with GDR support? [y/N]: N
Do you wish to build TensorFlow with VERBS support? [y/N]: N
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 10.0]:10.1
Please specify the location where CUDA 10.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]:
Please specify the location where cuDNN library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.1]:
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 2]: 2.4.2
Please specify the location where NCCL library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]:
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Do you wish to build TensorFlow with MPI support? [y/N]: N
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:N

Run the build command:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package 

It fails with:

Cuda Configuration Error: No library found under: /usr/local/cuda-10.1/lib64/libcublas.so.10.1, /usr/local/cuda-10.1/lib64/stubs/libcublas.so.10.1, /usr/local/cuda-10.1/lib/powerpc64le-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x86_64-linux-gnu/libcublas.so.10.1, /usr/local/cuda-10.1/lib/x64/libcublas.so.10.1, /usr/local/cuda-10.1/lib/libcublas.so.10.1, /usr/local/cuda-10.1/libcublas.so.10.1

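Searching showed that most people consider CUDA 10.1 not yet usable with TensorFlow, so I had to give up on it. Along the way I tried adding a symlink (https://github.com/tensorflow/tensorflow/issues/26289):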

sudo ln -s /usr/local/cuda-10.1/targets/x86_64-linux/lib/libcublas.so.10.1.0.105 /usr/lib64/libcublas.so.10.0

Rebuilding then produced a new error:

Cuda Configuration Error: None of the libraries match their SONAME: /home/bernard/opt/cuda_test/cuda/lib64/libcublas.so.10.1
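
The message suggests that the configure step compares each candidate file against the SONAME embedded in the library itself, which a symlink under a different version name cannot change. To see what SONAME a library actually carries, the output of readelf -d can be inspected. This is a rough sketch, assuming readelf (from binutils) is installed; both function names are made up for illustration.

```python
import re
import subprocess


def parse_soname(readelf_output):
    """Pull the SONAME out of `readelf -d` output, or return None."""
    m = re.search(r"\(SONAME\)\s+Library soname:\s*\[([^\]]+)\]", readelf_output)
    return m.group(1) if m else None


def soname_of(path):
    """Run readelf on a shared library and return its embedded SONAME."""
    out = subprocess.check_output(["readelf", "-d", path])
    return parse_soname(out.decode("utf-8", "replace"))
```

Calling soname_of() on the real library file behind the symlink shows the name the dynamic linker will look for.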

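I decided to uninstall 10.1 and install 10.0 instead.

Second attempt: cuda=>10.0 cudnn=>7.5 nccl=>2.4.2

1. Download the CUDA 10.0 installer; everything else stays the same.

2. Building TensorFlow now fails with a new error: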

fatal error: nccl.h: No such file or directory

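nccl.h cannot be found, which means the rpm install above did not work.

Searching showed that libnccl2, libnccl-dev, and libnccl-static need to be installed, but the online tutorials all target Ubuntu and use apt-get. CentOS only has yum; trying it gives: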

No package "libnccl" available

3. Uninstall NCCL with rpm, then build and install it from source

Clone the nccl project from GitHub, then build and install:

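cd nccl
# Build the library; output goes to build/ (headers in build/include, libraries in build/lib)
make -j src.build
# build-essential, devscripts and debhelper are Debian/Ubuntu package names, and
# pkg.debian.build produces a .deb; on CentOS the NCCL README uses the RPM equivalents:
yum install rpm-build rpmdevtools
make pkg.redhat.build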
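
Since the earlier failure was a missing nccl.h, it helps to confirm where the header ended up before rebuilding TensorFlow. Below is a small sketch; the candidate directories are assumptions (build/include is where the NCCL source build places its headers, the others are common system locations), and find_header is a hypothetical helper.

```python
import os


def find_header(name, include_dirs):
    """Return the first path in include_dirs that contains `name`, else None."""
    for d in include_dirs:
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Candidate directories are assumptions; adjust them to your layout.
    dirs = [
        "build/include",
        "/usr/local/cuda/include",
        "/usr/include",
        "/usr/local/include",
    ]
    print(find_header("nccl.h", dirs) or "nccl.h not found")
```

The directory that contains the header is the one to keep in mind when ./configure asks for the NCCL install location.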

4. Rebuild TensorFlow

Please specify the location of python. [Default is /usr/bin/python]:
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: n
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
Do you wish to build TensorFlow with GDR support? [y/N]: N
Do you wish to build TensorFlow with VERBS support? [y/N]: N
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 10.0]:
Please specify the location where CUDA 10.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]:
Please specify the location where cuDNN library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]:
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 2]:
Please specify the location where NCCL library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-10.0]:
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 6.1]
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Do you wish to build TensorFlow with MPI support? [y/N]: N
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]:N

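Compared with the first attempt, only the CUDA and NCCL version answers changed (both left empty to accept the defaults); everything else is the same. The build finished after about an hour.

Package the result as a whl file: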

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Install with pip:

pip install /tmp/tensorflow_pkg/*.whl

[Screenshot: successful installation]

5. Test whether TensorFlow can use the GPU

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
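
The session above logs device placement, which has to be read by eye. For a more direct answer, the devices TensorFlow sees can be listed and filtered. This is a sketch for TF 1.x: device_lib lives in TensorFlow's internal client module and is commonly used for this check, though it is not part of the documented public API; the helper names are illustrative.

```python
def gpu_device_names(devices):
    """Return the names of entries whose device_type is "GPU"."""
    return [d.name for d in devices if d.device_type == "GPU"]


def list_gpus():
    """List the GPUs visible to TensorFlow (TensorFlow is imported lazily)."""
    # Internal module; an assumption that it is present in the installed build.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())
```

An empty list from list_gpus() means TensorFlow only found CPU devices.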

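This raised a very strange error.

At first I thought the tensorboard dependency had not been built, but the source shows it does not need a separate download. Checking the timestamps of the tensorboard files finally revealed that an earlier installation had not been removed cleanly. After removing it with pip uninstall and reinstalling, everything worked.

Summary

In fact, once CUDA and cuDNN are installed you can simply pip install tensorflow-gpu without building it yourself (so cmake and bazel are not needed either). I built from source because I assumed the latest version was not available that way, and only later found that the prebuilt package is itself built against CUDA 10.0. Still, a build tailored to your own system is always nice to have!

Below is a screenshot of the direct install: AVX2 is not used, so building from source is still the better option.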
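
Whether a source build can gain anything here depends on what the CPU supports, since the default -march=native flag in the configure prompts targets the build machine. On Linux the feature flags are listed in /proc/cpuinfo. A quick sketch to check them; cpu_flags is a made-up helper name.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            flags = cpu_flags(f.read())
    except IOError:
        flags = set()
    for feature in ("avx", "avx2", "fma"):
        print(feature, "yes" if feature in flags else "no")
```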
