Parallel Programming Seminar

August 5-7, 2009

Interdisciplinary Mathematics Institute
University of South Carolina

Introduction

This web site provides a summary of the Parallel Programming Seminar held August 5-7, 2009 at the Interdisciplinary Mathematics Institute, University of South Carolina. The three days of lectures focused on GPU programming, primarily using CUDA. This was a hands-on event in which participants had the opportunity to use CUDA on Apple iMac computers.

Contact: Emil Dotchevski (emil@revergestudios.com).

Special thanks to Matt Hielsberg, Peter Binev and Robert Sharpley for organizing and hosting this seminar.

Installation instructions

This web site contains links for downloading the source code from the hands-on sessions. The source code is portable, but since it was presented on Apple iMacs, only Makefiles for building under Mac OS X are included.

To build and run the source code, you will need to download and install a CUDA-enabled driver from NVIDIA. You also need to ensure that the DYLD_LIBRARY_PATH environment variable (the Mac OS X counterpart of LD_LIBRARY_PATH) includes /usr/local/cuda/lib. Open a Terminal window and type:

export DYLD_LIBRARY_PATH=/usr/local/cuda/lib

Day 1

The first day includes a two-part presentation introducing the basic principles of GPU programming, followed by a simple "Hello World!"-style CUDA program.

Part 1 (PDF) focuses on the design and evolution of the GPU, to help explain the core principles of its operation:

  • Span-based vs. tile-based rasterization
  • Comparison between CPU and GPU design
  • Introduction to SIMD and lock-step execution
  • SIMD-style conditional execution
  • Performance and memory latency
  • Texturing

Part 2 (PDF) presents CUDA as a specific interface for programming NVIDIA GPUs:

  • NVIDIA device hardware model
  • Thread hierarchy
  • Memory architecture
  • Coalesced and non-coalesced memory access
  • Global memory reads using texture cache
  • Local memory access
  • Register/shared memory access
  • Avoiding bank conflicts when accessing shared memory
  • Latency/occupancy
  • Communication between the CPU and the GPU

Part 3 introduces a simple CUDA program that fills a memory buffer with the value 42.
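
For illustration, here is a minimal sketch of what such a program looks like (the kernel name, buffer size, and launch configuration are illustrative, not the seminar's exact code):

#include <vector>
#include <cassert>

__global__ void
fill42( int * buf, int n )
    {
    //Each thread computes its global index from the thread hierarchy.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < n )
        buf[i] = 42; //Consecutive threads write consecutive addresses: coalesced.
    }

int
main()
    {
    int const n = 1024;
    int * buf = 0;
    cudaMalloc( (void * *)&buf, n * sizeof(int) );
    fill42<<<n / 256, 256>>>( buf, n ); //4 blocks of 256 threads each
    std::vector<int> host(n);
    cudaMemcpy( &host[0], buf, n * sizeof(int), cudaMemcpyDeviceToHost );
    cudaFree( buf );
    assert( host[0] == 42 && host[n - 1] == 42 );
    return 0;
    }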

Day 2

The second day is a 100% hands-on session designed to introduce various CUDA programming techniques. For this purpose, we focus on a single trivial problem: transposing a matrix. Further, we require that the matrix is square and that its dimension is a power of two. While unrealistic, this restriction keeps the problem exceptionally simple, which highlights the differences between the five implementations we will discuss.

To streamline the presentation, we have created a few simple facilities common to all five implementations. They include a pair of classes that use RAII to allocate and free CUDA memory buffers, as well as functions to parse a simple command line, fill matrices with random elements, transpose a matrix on the CPU, and verify that a given CUDA transpose implementation is correct. This is all arranged in a single header file, common.h:

template <class T>
class
cuda_buffer //non-copyable
    {
    private:
    // ...

    public:

    cuda_buffer();
    explicit cuda_buffer( int size );
    ~cuda_buffer();
    int size() const;
    T * ptr() const;
    void swap( cuda_buffer & other );
    };

template <class T>
class
cuda_buffer_2d //non-copyable
    {
    private:
    // ...

    public:

    cuda_buffer_2d();
    explicit cuda_buffer_2d( int width, int height );
    ~cuda_buffer_2d();
    int width() const;
    int height() const;
    T * ptr() const;
    int pitch() const;
    void swap( cuda_buffer_2d & other );
    };

int cmd_num_runs( int argc, char const * argv[] );
int cmd_matrix_dim( int argc, char const * argv[] );
void make_random_matrix( std::vector<float> &, int dim );
void make_cuda_matrix( cuda_buffer<float> &, std::vector<float> const & m, int dim );
void transpose( std::vector<float> & result, std::vector<float> const & matrix, int dim );
void cout_matrix( std::vector<float> const & matrix, int dim );
int check_error( std::vector<float> const & m1, cuda_buffer<float> const & m2, int dim );
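
For example, a test driver built on these helpers might look roughly like this (the overall flow and the meaning of check_error's return value are assumptions; the seminar's actual drivers are in the linked source):

#include "common.h"

int
main( int argc, char const * argv[] )
    {
    int const dim = cmd_matrix_dim( argc, argv );
    std::vector<float> m;
    make_random_matrix( m, dim );
    cuda_buffer<float> cm;
    make_cuda_matrix( cm, m, dim );
    //...launch one of the five CUDA transpose kernels on cm here...
    std::vector<float> expected;
    transpose( expected, m, dim ); //reference transpose on the CPU
    return check_error( expected, cm, dim ); //assumed: non-zero indicates a mismatch
    }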

Using this interface, we develop five different solutions for transposing a matrix in CUDA (sketches of the first two approaches follow the list):

  1. matrix_transpose_naive
  2. matrix_transpose_shared
  3. matrix_transpose_texture_2d
  4. matrix_transpose_texture_1d
  5. matrix_transpose_swizzle_texture_1d
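
To give a feel for the first two approaches, here are minimal sketches of a naive kernel and a shared-memory kernel (the 16x16 block size and all names are illustrative; see the linked source for the versions actually presented). The naive kernel accesses global memory directly, so its writes are strided and non-coalesced; the shared-memory version stages a 16x16 tile in shared memory so that both the global reads and the global writes are coalesced, padding the tile by one column to avoid bank conflicts:

__global__ void
transpose_naive( float * out, float const * in, int dim )
    {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    out[x * dim + y] = in[y * dim + x]; //coalesced read, strided (non-coalesced) write
    }

__global__ void
transpose_shared( float * out, float const * in, int dim )
    {
    __shared__ float tile[16][16 + 1]; //the extra column avoids shared memory bank conflicts
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * dim + x]; //coalesced read
    __syncthreads();
    x = blockIdx.y * 16 + threadIdx.x;
    y = blockIdx.x * 16 + threadIdx.y;
    out[y * dim + x] = tile[threadIdx.x][threadIdx.y]; //coalesced write
    }

Because the matrix is square with a power-of-two dimension, a 16x16 block divides it evenly (for dimensions of 16 or more) and no bounds checks are needed.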

Click here for the complete source code from Day 2.

Day 3

The last day focuses on image processing and introduces a CUDA program for applying a 3x3 convolution filter to an arbitrary image file. It demonstrates the use of textures, filtering, texture addressing modes, and coalesced memory writes.
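
As a rough sketch of the core of such a program (using the texture reference API from the CUDA releases of that time; the filter storage and all names are illustrative, not the seminar's exact code):

__constant__ float weights[9]; //3x3 filter coefficients, uploaded from the host with cudaMemcpyToSymbol

//2D texture bound to the input image on the host; with the addressing mode set to
//cudaAddressModeClamp, reads past the image border are clamped automatically.
texture<float, 2, cudaReadModeElementType> tex;

__global__ void
convolve_3x3( float * out, int width, int height )
    {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if( x < width && y < height )
        {
        float sum = 0;
        for( int dy = -1; dy <= 1; ++dy )
            for( int dx = -1; dx <= 1; ++dx )
                sum += weights[(dy + 1) * 3 + (dx + 1)] * tex2D( tex, x + dx, y + dy );
        out[y * width + x] = sum; //consecutive threads write consecutive addresses: coalesced
        }
    }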