
CUDA Fortran Vector Addition

Introduction

This sample shows a minimal conversion of our vector addition CPU code to PGI CUDA Fortran; consider it a CUDA Fortran 'Hello World'. Please direct any questions or comments to help@nccs.gov

vecAdd.cuf

module kernel
    contains
    ! CUDA kernel. Each thread takes care of one element of c
    attributes(global) subroutine vecAdd_kernel(n, a, b, c)
        integer, value :: n
        real(8), device :: a(n), b(n), c(n)
        integer :: id

        ! Get our global thread ID
        id = (blockidx%x-1)*blockdim%x + threadidx%x

        ! Make sure we do not go out of bounds
        if (id <= n) then
            c(id) = a(id) + b(id)
        endif
    end subroutine vecAdd_kernel
end module kernel

program main
    use cudafor
    use kernel

    type(dim3) :: blockSize, gridSize
    real(8) :: sum
    integer :: i

    ! Size of vectors
    integer :: n = 100000

    ! Host input vectors
    real(8),dimension(:),allocatable :: h_a
    real(8),dimension(:),allocatable :: h_b
    !Host output vector
    real(8),dimension(:),allocatable :: h_c

    ! Device input vectors
    real(8),device,dimension(:),allocatable :: d_a
    real(8),device,dimension(:),allocatable :: d_b
    !Device output vector
    real(8),device,dimension(:),allocatable :: d_c

    ! Allocate memory for each vector on host
    allocate(h_a(n))
    allocate(h_b(n))
    allocate(h_c(n))

    ! Allocate memory for each vector on GPU
    allocate(d_a(n))
    allocate(d_b(n))
    allocate(d_c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        h_a(i) = sin(i*1D0)*sin(i*1D0)
        h_b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Implicit copy of host vectors to device
    d_a = h_a(1:n)
    d_b = h_b(1:n)

    ! Number of threads in each thread block
    blockSize = dim3(1024,1,1)

    ! Number of thread blocks in grid
    gridSize = dim3(ceiling(real(n)/real(blockSize%x)), 1, 1)

    ! Execute the kernel
    call vecAdd_kernel<<<gridSize, blockSize>>>(n, d_a, d_b, d_c)

    ! Implicit copy of device array to host
    h_c = d_c(1:n)

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0.0d0
    do i=1,n
        sum = sum +  h_c(i)
    enddo
    sum = sum/real(n)
    print *, 'final result: ', sum

    ! Release device memory
    deallocate(d_a)
    deallocate(d_b)
    deallocate(d_c)

    ! Release host memory
    deallocate(h_a)
    deallocate(h_b)
    deallocate(h_c)

end program main

Changes

Kernel:

The kernel is the heart of our CUDA code. When we launch a kernel we specify the number of threads per block (blockdim) and the number of blocks per grid (griddim). The total number of threads is blockdim * griddim. Each thread evaluates one copy of the kernel.

module kernel
    contains
    !CUDA kernel. Each thread takes care of one element of c
    attributes(global) subroutine vecAdd_kernel(n, a, b, c)
        integer, value :: n
        real(8), device :: a(n), b(n), c(n)
        integer :: id

        !Get our global thread ID
        id = (blockidx%x-1)*blockdim%x + threadidx%x

        !Make sure we do not go out of bounds
        if (id <= n) then
            c(id) = a(id) + b(id)
        endif
    end subroutine vecAdd_kernel
end module kernel

Let’s take a look at what makes up this simple kernel.

module kernel
    contains
end module kernel

The interface for the kernel subroutine must be explicitly known to the calling program; one way of achieving this is to place the subroutine in a module.

attributes(global) subroutine vecAdd_kernel(n, a, b, c)

The global attribute specifies that this is a CUDA kernel; otherwise, normal Fortran subroutine syntax is used.

!Get our global thread ID
id = (blockidx%x-1)*blockdim%x + threadidx%x

We calculate the thread's global ID using CUDA Fortran provided derived types. blockidx contains the block's position in the grid, ranging from 1 to griddim. threadidx is the thread's index inside its associated block, ranging from 1 to blockdim.
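The 1-based mapping can be illustrated outside of Fortran. The sketch below (plain Python, using scalar stand-ins for blockidx%x, blockdim%x, and threadidx%x, and a deliberately tiny block size) enumerates the global IDs produced by the first two blocks:

```python
# Illustration of the 1-based CUDA Fortran thread-ID mapping:
#   id = (blockidx - 1) * blockdim + threadidx
blockdim = 4  # small block size for illustration (the tutorial uses 1024)

ids = []
for blockidx in (1, 2):                       # first two blocks in the grid
    for threadidx in range(1, blockdim + 1):  # thread indices are 1-based
        ids.append((blockidx - 1) * blockdim + threadidx)

print(ids)  # block 1 yields ids 1..4, block 2 yields ids 5..8
```

Consecutive blocks thus tile the index range with no gaps or overlaps, which is exactly why each thread can safely own one array element.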

if (id <= n) then

Unless the array length is divisible by blockdim, the number of threads launched will not match the number of valid array elements. To avoid overstepping the array, we simply test that the global thread ID does not exceed the length of the array.
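For the sizes used in this tutorial the mismatch is easy to quantify (a quick back-of-the-envelope check in Python, not part of the CUDA code):

```python
import math

n = 100000          # array length from the tutorial
blockdim = 1024     # threads per block

gridsize = math.ceil(n / blockdim)   # number of blocks needed to cover n
total_threads = gridsize * blockdim  # threads actually launched
excess = total_threads - n           # threads that must fail the bounds check

print(gridsize, total_threads, excess)
```

Here 98 blocks of 1024 threads launch 100352 threads, so 352 of them have id > n and do no work; without the `if (id <= n)` guard they would write past the end of c.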

c(id) = a(id) + b(id)

The thread ID is used to index the arrays, which reside in global device memory. Each thread loads a value from a and b and writes the sum to c.

Modules

use cudafor
use kernel

cudafor is a PGI-provided module that supplies CUDA-specific derived types such as dim3. The kernel module we created must also be included.

Array Pointers

!Host input vectors
real(8),dimension(:),allocatable :: h_a
real(8),dimension(:),allocatable :: h_b
!Host output vector
real(8),dimension(:),allocatable :: h_c

!Device input vectors
real(8),device,dimension(:),allocatable :: d_a
real(8),device,dimension(:),allocatable :: d_b
!Device output vector
real(8),device,dimension(:),allocatable :: d_c

With the host CPU and GPU having separate memory spaces, we must keep two sets of pointers: one for the host arrays and one for the device arrays. Here the h_ and d_ prefixes differentiate them. Notice that the device pointers carry the device attribute.

Allocate device memory

!Allocate memory for each vector on GPU
allocate(d_a(n))
allocate(d_b(n))
allocate(d_c(n))

Given an array with the device attribute, memory will be allocated in the GPU's global memory space.

Copy to device

!Implicit copy of host vectors to device
d_a = h_a(1:n)
d_b = h_b(1:n)

After we initialize the data on the host, we must copy it to the device; the assignment syntax above does this implicitly. This is convenient, but care must be taken: the operation can be very time consuming, as it initiates a DMA transfer of data from host memory, over the PCI bus, to device memory.

Thread mapping

!Number of threads in each thread block
blockSize = dim3(1024,1,1)

!number of thread blocks in grid
gridSize = dim3(ceiling(real(n)/real(blockSize%x)), 1, 1)

To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of both depends on the device's compute capability. Normally the block size is chosen based on kernel memory requirements, and the grid size is calculated so that enough threads are launched to cover the kernel domain. dim3 is a CUDA Fortran provided derived type with three components; since we are dealing with a one-dimensional block and grid, we specify 1 for the other two dimensions.
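The use of ceiling here matters: plain integer division would drop the final partial block and leave elements of the array uncomputed. A quick check of the arithmetic with the tutorial's values (Python, for illustration only):

```python
import math

n = 100000        # array length
blocksize_x = 1024  # threads per block

grid_ceil = math.ceil(n / blocksize_x)  # rounds up: covers all n elements
grid_floor = n // blocksize_x           # rounds down: leaves a remainder uncovered
uncovered = n - grid_floor * blocksize_x

print(grid_ceil, grid_floor, uncovered)
```

With these sizes, ceiling gives 98 blocks while truncating division gives 97, which would leave the last 672 elements of c untouched.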

Launch kernel

call vecAdd_kernel<<< gridSize, blockSize >>>(n, d_a, d_b, d_c)

We launch our kernel with a modified Fortran subroutine call syntax that lets us specify the grid and block sizes. Kernel calls are non-blocking.

Copy to host

!Implicit copy of device array to host
h_c = d_c(1:n)

The implicit copy from the device to the host will block until the kernel has completed.

Release device memory

deallocate(d_a)
deallocate(d_b)
deallocate(d_c)

Device memory is deallocated in the same manner as host-allocated arrays.

Compiling:

The PGI environment must be set up before compiling:

$ module load pgi
$ ftn vecAdd.cuf -o vecAdd.out

Running:

$ aprun ./vecAdd.out
final result: 1.000000
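The printed value follows from the identity sin²(i) + cos²(i) = 1: every element of c is 1 (to within rounding), so the mean is 1 regardless of n. The check below reproduces the computation on the host in Python, independently of the GPU code:

```python
import math

n = 100000
# Same initialization as the Fortran code: a(i) = sin(i)^2, b(i) = cos(i)^2
total = sum(math.sin(i)**2 + math.cos(i)**2 for i in range(1, n + 1))

print(total / n)  # close to 1.0, matching the program's final result
```

If the GPU run prints something noticeably different from 1.0, that usually indicates a launch-configuration or bounds error rather than a numerical issue.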