Skip to content

Latest commit

 

History

History
134 lines (93 loc) · 5.27 KB

File metadata and controls

134 lines (93 loc) · 5.27 KB

ROCgdb

NOTE: these exercises have been tested on MI210 and MI300A accelerators using a container environment. To see details on the container environment (such as operating system and modules available) please see README.md on this repo.

We show a simple example on how to use the main features of the ROCm debugger rocgdb.

Saxpy Debugging

Let us consider the saxpy kernel in the HIP examples:

cd HPCTrainingExamples/HIP/saxpy

Get an allocation of a GPU and load software modules:

salloc -N 1 --gpus=1
module load rocm

You can see some information on the GPU you will be running on by doing:

rocm-smi

To introduce an error in your program, comment out the hipMalloc calls at lines 81 and 82 in saxpy.hip. You can do this manually or use the following sed commands:

sed -i 's|hipCheck( hipMalloc(\&d_x, size) );|// hipCheck( hipMalloc(\&d_x, size) );|' saxpy.hip
sed -i 's|hipCheck( hipMalloc(\&d_y, size) );|// hipCheck( hipMalloc(\&d_y, size) );|' saxpy.hip

Since the code uses the hipCheck error-checking macro on the hipMemcpy calls at lines 83 and 84, you also need to remove these wrappers. Otherwise, the invalid pointer error will be caught at the hipMemcpy stage before reaching the kernel launch:

sed -i 's|hipCheck( hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice) );|hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice);|' saxpy.hip
sed -i 's|hipCheck( hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice) );|hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice);|' saxpy.hip

Then compile with:

mkdir build && cd build
cmake ..
make VERBOSE=1

Running the program, you will see the expected runtime error:

./saxpy
Memory access fault by GPU node-2 (Agent handle: 0x2284d90) on address (nil). Reason: Unknown.
Aborted (core dumped)

To run the code with the rocgdb debugger, do:

rocgdb saxpy

Note that there are also two options for graphical user interfaces that can be turned on by doing:

rocgdb -tui saxpy
cgdb -d rocgdb saxpy 

For the latter command above, you need to have cgdb installed on your system.

In the debugger, type run (or just r) and you will get an error similar to this one:

Thread 5 "saxpy" received signal SIGSEGV, Segmentation fault.
[Switching to thread 5, lane 0 (AMDGPU Lane 1:1:1:1/0 (0,0,0)[0,0,0])]
0x00007ffff623198c in saxpy() at saxpy.hip:67
67        y[i] += a*x[i];

Note that the cmake build type is set to RelWithDebInfo (see line 8 in CMakeLists.txt). With this build type, the debugger will be aware of the debug symbols. If that was not the case (for instance if compiling in Release mode), running the code with the debugger you would get an error message without line info, and also a warning like this one:

Reading symbols from saxpy...
(No debugging symbols found in saxpy)

The error report is at a thread on the GPU. We can display information on the threads by typing info threads (or i th). It is also possible to move to a specific thread with thread <ID> (or t <ID>) and see the location of this thread with where. For instance, if we are interested in the thread with ID 1:

i th
th 1
where

You can add breakpoints with break (or b) followed by the line number. For instance to put a breakpoint right after the hipMalloc lines do b 83.

When possible, it is also advised to compile without optimization flags (so using -O0) to avoid seeing breakpoints placed on lines different than those specified with the breakpoint command.

You can also add a breakpoint directly at the start of the GPU kernel with b saxpy. To run to the next breakpoint, type continue (or c).

To list all the breakpoints that have been inserted type info break (or i b):

(gdb) i b
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000000000020b334 in main() at /HPCTrainingExamples/HIP/saxpy/saxpy.hip:85
2       breakpoint     keep y   0x000000000020b350 in main() at /HPCTrainingExamples/HIP/saxpy/saxpy.hip:88

A breakpoint can be removed with delete <Num> (or d <Num>): note that <Num> is the breakpoint ID displayed above. For instance, to remove the breakpoint at line 85, you have to do d 1.

To proceed to the next line you can do next (or n). To step into a function, do step (or s) and to get out do finish. Note that if a breakpoint is at a kernel, doing n or s will switch between different threads. To avoid this behavior, it is necessary to disable the breakpoint at the kernel with disable <Num>.

It is possible to have information on the architecture (below shown on MI250):

(gdb) info agents
  Id State Target Id                  Architecture Device Name                             Cores Threads Location
* 1  A     AMDGPU Agent (GPUID 64146) gfx90a       Aldebaran/MI200 [Instinct MI250X/MI250] 416   3328    29:00.0

We can also get information on the thread grid:

(gdb) info dispatches
  Id   Target Id                      Grid      Workgroup Fence   Kernel Function
* 1    AMDGPU Dispatch 1:1:1 (PKID 0) [256,1,1] [128,1,1] B|Aa|Ra saxpy(int, float const*, int, float*, int)

For the rocgdb documentation, please see: /opt/rocm-<version>/share/doc/rocgdb.