TVM Performance Evaluation and Analysis (Part 5)

Figure 3. A further speedup with operator fusion

Table 1. Performance issue of cuBLAS’ batch matmul

Table 2. Finding the best combination of number_thread. The results were obtained on an NVIDIA M40 GPU with CUDA 8.0.

Figure 4. DLPack provides an intermediate wrapper that is shared between frameworks and TVM

Figure 5. The OpenGL/WebGL Backend

Figure 6. TVM utilizes a unified AST to define kernels, and compiles it to code on different platforms.

Figure 7. The benchmark is run in 4 different settings

Figure 8. Inference Speed of Different Backends on ImageNet

Figure 9. Mali T860 and T880

Figure 10. Inference Speed of Different Backends on ImageNet

Table 3. Inference Speed of FP16 on ImageNet
