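▶ How custom functions created with the routine directive differ when called in parallel

● Code: define a custom function sqab that uses the built-in functions fabsf and sqrtf to take the square root of the absolute value of every element of a matrix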

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <openacc.h>

#define ROW 8
#define COL 64

#pragma acc routine vector // sqab is callable from device code and contains vector-level parallelism
void sqab(float *a, const int m)
{
#pragma acc loop
    for (int idx = 0; idx < m; idx++)
        a[idx] = sqrtf(fabsf(a[idx]));
}

int main()
{
    float x[ROW][COL];
    int row, col;
    for (row = 0; row < ROW; row++)
    {
        for (col = 0; col < COL; col++)
            x[row][col] = row * 10 + col; // x[1][1] starts as 11
    }
    printf("\nx[1][1] = %f\n", x[1][1]);

#pragma acc parallel loop vector pcopy(x[0:ROW][0:COL]) // gang, worker, and vector are added here in turn
    for (row = 0; row < ROW; row++)
        sqab(&x[row][0], COL);
    printf("\nx[1][1] = %f\n", x[1][1]);
    //getchar();
    return 0;
}

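● Output when no parallelism-level clause is added on line 28, the `#pragma acc parallel loop` line in main (gang is used by default)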

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
sqab:
, Generating Tesla code
, #pragma acc loop vector /* threadIdx.x */
, Loop is parallelizable
main:
, Generating copy(x[:][:])
Accelerator kernel generated
Generating Tesla code
, #pragma acc loop gang /* blockIdx.x */

D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe

x[1][1] = 11.000000
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= num_gangs= num_workers= vector_length= grid= block= // 8 gangs at the blockIdx.x level, 1 worker, vector at the threadIdx.x level

x[1][1] = 3.316625
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us):
: compute region reached time
: kernel launched time
grid: [] block: []
elapsed time(us): total= max= min= avg=
: data region reached times
: data copyin transfers:
device time(us): total= max= min= avg=
: data copyout transfers:
device time(us): total= max= min= avg=
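The profiler message above asks for a call to `acc_shutdown(acc_device_nvidia)` at the end of the application. A minimal sketch of one way to add it follows; the wrapper name `shutdown_device_if_acc` is hypothetical, and the `_OPENACC` guard is an assumption added so that the same file still builds with a compiler that has no OpenACC support.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper (not in the original program): shuts the NVIDIA
   device down so the profiler can flush its timing data.
   Returns 1 if acc_shutdown was called, 0 if OpenACC is not enabled. */
int shutdown_device_if_acc(void)
{
#ifdef _OPENACC
    acc_shutdown(acc_device_nvidia);
    return 1;
#else
    return 0;
#endif
}
```

In the program above, the call would go just before `return 0;` in main, after the last use of the device.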

● Output when the worker clause is added on line 28

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
sqab:
, Generating Tesla code
, #pragma acc loop vector /* threadIdx.x */
, Loop is parallelizable
main:
, Generating copy(x[:][:])
Accelerator kernel generated
Generating Tesla code
, #pragma acc loop worker(4) /* threadIdx.y */
, Loop is parallelizable

D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe

x[1][1] = 11.000000
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= num_gangs= num_workers= vector_length= grid= block=32x4 // 1 gang, 4 workers at the threadIdx.y level, using a 2-D thread block

x[1][1] = 3.316625
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us):
: compute region reached time
: kernel launched time
grid: [] block: [32x4]
device time(us): total= max= min= avg=
: data region reached times
: data copyin transfers:
device time(us): total= max= min= avg=
: data copyout transfers:
device time(us): total= max= min= avg=

● Output when the vector clause is added on line 28

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
sqab:
, Generating Tesla code
, #pragma acc loop vector /* threadIdx.x */
, Loop is parallelizable
main:
, Generating copy(x[:][:])
Accelerator kernel generated
Generating Tesla code
, #pragma acc loop seq
, Loop is parallelizable

D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe

x[1][1] = 11.000000
launch CUDA kernel file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
line= device= threadid= num_gangs= num_workers= vector_length= grid= block= // 1 gang, 1 worker; all of the parallelism is stacked at the threadIdx.x level

x[1][1] = 3.316625
PGI: "acc_shutdown" not detected, performance results might be incomplete.
Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
main NVIDIA devicenum=
time(us):
: compute region reached time
: kernel launched time
grid: [] block: []
elapsed time(us): total= max= min= avg=
: data region reached times
: data copyin transfers:
device time(us): total= max= min= avg=
: data copyout transfers:
device time(us): total= max= min= avg=

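● If the parallelism level declared for the custom function is the same as or higher than the level of the calling loop, the calling loop's clause is demoted to seq (as in the vector run above). If the level declared in the routine directive is lower than the level of a loop clause inside the function, the compiler issues a warning and ignores the inner clause: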

#pragma acc routine vector
void sqab(float *a, const int m)
{
#pragma acc loop worker // higher level than the routine allows; the compiler ignores it
    for (int idx = 0; idx < m; idx++)
        a[idx] = sqrtf(fabsf(a[idx]));
}

● Compiler output (the run-time output is the same as in the worker case above, so it is omitted)

D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
PGC-W--acc loop worker clause ignored in acc routine vector procedure (main.c: )
sqab:
, Generating Tesla code
, #pragma acc loop vector /* threadIdx.x */
, Loop is parallelizable
