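Custom functions inside OpenACC compute constructs

▶ How user-defined functions created with the routine directive differ in the way they are called in parallel

● Code: define a custom function sqab that uses the built-in functions fabsf and sqrtf to compute the square root of the absolute value of every element of a matrix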

 #include <stdio.h>
 #include <stdlib.h>
 #include <math.h>
 #include <openacc.h>

 #define ROW 8
 #define COL 64

 // Vector-level routine: the loop inside is spread across vector lanes
 #pragma acc routine vector
 void sqab(float *a, const int m)
 {
 #pragma acc loop
     for (int idx = 0; idx < m; idx++)
         a[idx] = sqrtf(fabsf(a[idx]));
 }

 int main()
 {
     float x[ROW][COL];
     int row, col;

     // Initialise so that x[1][1] == 11
     for (row = 0; row < ROW; row++)
     {
         for (col = 0; col < COL; col++)
             x[row][col] = row * 10 + col;
     }
     printf("\nx[1][1] = %f\n", x[1][1]);

 #pragma acc parallel loop vector pcopy(x[0:ROW][0:COL]) // gang, worker and vector are each tried here in the runs below
     for (row = 0; row < ROW; row++)
         sqab(&x[row][0], COL);

     printf("\nx[1][1] = %f\n", x[1][1]);
     //getchar();
     return 0;
 }

● Output when the parallel loop directive in main (line 28 of main.c) has no level-of-parallelism clause (gang is used by default)

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
 sqab:
      , Generating Tesla code
          , #pragma acc loop vector /* threadIdx.x */
      , Loop is parallelizable
 main:
      , Generating copy(x[:][:])
          Accelerator kernel generated
          Generating Tesla code
          , #pragma acc loop gang /* blockIdx.x */

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
 x[1][1] = 11.000000
 launch CUDA kernel  file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
 line= device= threadid= num_gangs= num_workers= vector_length= grid= block=      // 8 gangs at the blockIdx.x level, 1 worker, vector at the threadIdx.x level
 x[1][1] = 3.316625
 PGI: "acc_shutdown" not detected, performance results might be incomplete.
  Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

 Accelerator Kernel Timing data
 D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
   main  NVIDIA  devicenum=
     time(us):
     : compute region reached  time
         : kernel launched  time
             grid: []  block: []
             elapsed time(us): total= max= min= avg=
     : data region reached  times
         : data copyin transfers:
              device time(us): total= max= min= avg=
         : data copyout transfers:
              device time(us): total= max= min= avg=
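The profiler message in the log above asks for an explicit acc_shutdown call at the end of the program. A minimal sketch of what that might look like follows, assuming the NVIDIA device type named in the message. The helper name finish_acc is hypothetical, and the _OPENACC guard is only there so that the file still builds with a compiler that ignores OpenACC.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Hypothetical helper: shuts the accelerator down so that the profiler can
 * flush its timing data. Returns 1 if a shutdown was issued, 0 otherwise. */
int finish_acc(void)
{
#ifdef _OPENACC
    acc_shutdown(acc_device_nvidia);
    return 1;
#else
    return 0; /* built without OpenACC: nothing to shut down */
#endif
}
```

In the listing above, such a call would go just before the final return in main, after the last printf.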

● Output when the parallel loop directive in main (line 28 of main.c) carries the level-of-parallelism clause worker

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
 sqab:
      , Generating Tesla code
          , #pragma acc loop vector /* threadIdx.x */
      , Loop is parallelizable
 main:
      , Generating copy(x[:][:])
          Accelerator kernel generated
          Generating Tesla code
          , #pragma acc loop worker(4) /* threadIdx.y */
      , Loop is parallelizable

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
 x[1][1] = 11.000000
 launch CUDA kernel  file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
 line= device= threadid= num_gangs= num_workers= vector_length= grid= block=32x4    // 1 gang, 4 workers at the threadIdx.y level, using a 2-dimensional thread block
 x[1][1] = 3.316625
 PGI: "acc_shutdown" not detected, performance results might be incomplete.
  Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

 Accelerator Kernel Timing data
 D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
   main  NVIDIA  devicenum=
     time(us):
     : compute region reached  time
         : kernel launched  time
             grid: []  block: [32x4]
              device time(us): total= max= min= avg=
     : data region reached  times
         : data copyin transfers:
              device time(us): total= max= min= avg=
         : data copyout transfers:
              device time(us): total= max= min= avg=
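In the worker run the compiler picked worker(4) and a 32x4 block by itself. The same shape can also be requested explicitly with the num_gangs, num_workers and vector_length clauses. The sketch below is an illustration, not the author's code: square_row and square_all are hypothetical names, and the routine squares each element instead of calling sqrtf so that it builds without the math library. A compiler is free to adjust the requested sizes.

```c
#define ROW 8
#define COL 64

/* Hypothetical stand-in for sqab: squares each element in place. */
#pragma acc routine vector
void square_row(float *a, const int m)
{
#pragma acc loop vector
    for (int idx = 0; idx < m; idx++)
        a[idx] = a[idx] * a[idx];
}

/* Requests 1 gang, 4 workers and 32 vector lanes, mirroring block=32x4.
 * Rows are shared among the workers; each row is handled by the vector routine. */
void square_all(float x[ROW][COL])
{
#pragma acc parallel loop worker num_gangs(1) num_workers(4) vector_length(32) copy(x[0:ROW][0:COL])
    for (int row = 0; row < ROW; row++)
        square_row(&x[row][0], COL);
}
```

Without OpenACC enabled the pragmas are ignored and the code runs sequentially on the host, which is a convenient way to check the results.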

● Output when the parallel loop directive in main (line 28 of main.c) carries the level-of-parallelism clause vector

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
 sqab:
      , Generating Tesla code
          , #pragma acc loop vector /* threadIdx.x */
      , Loop is parallelizable
 main:
      , Generating copy(x[:][:])
          Accelerator kernel generated
          Generating Tesla code
          , #pragma acc loop seq
      , Loop is parallelizable

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>main_acc.exe
 x[1][1] = 11.000000
 launch CUDA kernel  file=D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c function=main
 line= device= threadid= num_gangs= num_workers= vector_length= grid= block=      // 1 gang, 1 worker, all parallelism piled onto the threadIdx.x level
 x[1][1] = 3.316625
 PGI: "acc_shutdown" not detected, performance results might be incomplete.
  Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

 Accelerator Kernel Timing data
 D:\Code\OpenACC\OpenACCProject\OpenACCProject\main.c
   main  NVIDIA  devicenum=
     time(us):
     : compute region reached  time
         : kernel launched  time
             grid: []  block: []
             elapsed time(us): total= max= min= avg=
     : data region reached  times
         : data copyin transfers:
              device time(us): total= max= min= avg=
         : data copyout transfers:
              device time(us): total= max= min= avg=
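In the vector run the loop in main ends up as seq, because the vector level is already claimed by the routine. If the intent is to spread all ROW*COL elements across gangs and vector lanes from the caller, one possible alternative (not the author's code) is an element-wise routine seq function called from a collapsed loop nest. The names abs_val and abs_all are hypothetical, and the element operation is a plain absolute value so that no math library is needed.

```c
#define ROW 8
#define COL 64

/* Hypothetical element-wise helper. routine seq means it contains no
 * parallel loops itself, so it can be called from any level. */
#pragma acc routine seq
float abs_val(float v)
{
    return v < 0.0f ? -v : v;
}

void abs_all(float x[ROW][COL])
{
    /* collapse(2) merges both loops into one iteration space, which the
     * compiler is asked to distribute across gangs and vector lanes. */
#pragma acc parallel loop gang vector collapse(2) copy(x[0:ROW][0:COL])
    for (int row = 0; row < ROW; row++)
        for (int col = 0; col < COL; col++)
            x[row][col] = abs_val(x[row][col]);
}
```

The trade-off is that the routine no longer owns any parallelism; all of it is expressed at the call site.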

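● If the level declared on the routine is as high as or higher than the level requested on the calling loop (as with the vector routine called from a vector loop above), the calling loop's clause is turned into seq. If a loop inside the routine requests a higher level than the routine itself declares, the compiler issues a warning and ignores the inner clause: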

 #pragma acc routine vector
 void sqab(float *a, const int m)
 {
 #pragma acc loop worker // worker is above the routine's declared vector level
     for (int idx = 0; idx < m; idx++)
         a[idx] = sqrtf(fabsf(a[idx]));
 }

● Compile output (the run-time output is the same as in the worker case above and is omitted)

 D:\Code\OpenACC\OpenACCProject\OpenACCProject>pgcc main.c -acc -Minfo -o main_acc.exe
 PGC-W--acc loop worker clause ignored in acc routine vector procedure  (main.c: )
 sqab:
      , Generating Tesla code
          , #pragma acc loop vector /* threadIdx.x */
      , Loop is parallelizable
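The warning above goes away when the routine's declared level covers every loop inside it. The following is a minimal sketch under that assumption, not the author's code: negate_block and negate_all are hypothetical names, the routine is declared at the worker level so that it may contain a worker loop with a vector loop nested inside, and the caller then uses only the gang level.

```c
/* Hypothetical routine declared at the worker level: a worker loop is
 * allowed inside it, with a vector loop nested beneath. */
#pragma acc routine worker
void negate_block(float *a, const int rows, const int cols)
{
#pragma acc loop worker
    for (int r = 0; r < rows; r++)
    {
#pragma acc loop vector
        for (int c = 0; c < cols; c++)
            a[r * cols + c] = -a[r * cols + c];
    }
}

/* The caller keeps to the gang level, leaving worker and vector to the routine. */
void negate_all(float *x, const int blocks, const int rows, const int cols)
{
#pragma acc parallel loop gang copy(x[0:blocks * rows * cols])
    for (int b = 0; b < blocks; b++)
        negate_block(x + b * rows * cols, rows, cols);
}
```

For example, negate_all(buf, 2, 4, 8) would treat a 64-element buffer as two 4x8 blocks, one per gang iteration.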
