NOTE: these exercises have been tested on MI210 and MI300A accelerators using a container environment.
To see details on the container environment (such as operating system and modules available) please see README.md on this repo.
We show a simple example on how to use the main features of the ROCm debugger rocgdb.
Let us consider the saxpy kernel in the HIP examples:
cd HPCTrainingExamples/HIP/saxpy
Get an allocation of a GPU and load software modules:
salloc -N 1 --gpus=1
module load rocmYou can see some information on the GPU you will be running on by doing:
rocm-smiTo introduce an error in your program, comment out the hipMalloc calls at lines 81 and 82 in saxpy.hip. You can do this manually or use the following sed commands:
sed -i 's|hipCheck( hipMalloc(\&d_x, size) );|// hipCheck( hipMalloc(\&d_x, size) );|' saxpy.hip
sed -i 's|hipCheck( hipMalloc(\&d_y, size) );|// hipCheck( hipMalloc(\&d_y, size) );|' saxpy.hipSince the code uses the hipCheck error-checking macro on the hipMemcpy calls at lines 83 and 84, you also need to remove these wrappers. Otherwise, the invalid pointer error will be caught at the hipMemcpy stage before reaching the kernel launch:
sed -i 's|hipCheck( hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice) );|hipMemcpy(d_x, h_x, size, hipMemcpyHostToDevice);|' saxpy.hip
sed -i 's|hipCheck( hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice) );|hipMemcpy(d_y, h_y, size, hipMemcpyHostToDevice);|' saxpy.hipThen compile with:
mkdir build && cd build
cmake ..
make VERBOSE=1Running the program, you will see the expected runtime error:
./saxpy
Memory access fault by GPU node-2 (Agent handle: 0x2284d90) on address (nil). Reason: Unknown.
Aborted (core dumped)To run the code with the rocgdb debugger, do:
rocgdb saxpyNote that there are also two options for graphical user interfaces that can be turned on by doing:
rocgdb -tui saxpy
cgdb -d rocgdb saxpy For the latter command above, you need to have cgdb installed on your system.
In the debugger, type run (or just r) and you will get an error similar to this one:
Thread 5 "saxpy" received signal SIGSEGV, Segmentation fault.
[Switching to thread 5, lane 0 (AMDGPU Lane 1:1:1:1/0 (0,0,0)[0,0,0])]
0x00007ffff623198c in saxpy() at saxpy.hip:67
67 y[i] += a*x[i];Note that the cmake build type is set to RelWithDebInfo (see line 8 in CMakeLists.txt). With this build type, the debugger will be aware of the debug symbols. If that was not the case (for instance if compiling in Release mode), running the code with the debugger you would get an error message without line info, and also a warning like this one:
Reading symbols from saxpy...
(No debugging symbols found in saxpy)The error report is at a thread on the GPU. We can display information on the threads by typing info threads (or i th). It is also possible to move to a specific thread with thread <ID> (or t <ID>) and see the location of this thread with where. For instance, if we are interested in the thread with ID 1:
i th
th 1
whereYou can add breakpoints with break (or b) followed by the line number. For instance to put a breakpoint right after the hipMalloc lines do b 83.
When possible, it is also advised to compile without optimization flags (so using -O0) to avoid seeing breakpoints placed on lines different than those specified with the breakpoint command.
You can also add a breakpoint directly at the start of the GPU kernel with b saxpy. To run to the next breakpoint, type continue (or c).
To list all the breakpoints that have been inserted type info break (or i b):
(gdb) i b
Num Type Disp Enb Address What
1 breakpoint keep y 0x000000000020b334 in main() at /HPCTrainingExamples/HIP/saxpy/saxpy.hip:85
2 breakpoint keep y 0x000000000020b350 in main() at /HPCTrainingExamples/HIP/saxpy/saxpy.hip:88A breakpoint can be removed with delete <Num> (or d <Num>): note that <Num> is the breakpoint ID displayed above. For instance, to remove the breakpoint at line 85, you have to do d 1.
To proceed to the next line you can do next (or n). To step into a function, do step (or s) and to get out do finish. Note that if a breakpoint is at a kernel, doing n or s will switch between different threads. To avoid this behavior, it is necessary to disable the breakpoint at the kernel with disable <Num>.
It is possible to have information on the architecture (below shown on MI250):
(gdb) info agents
Id State Target Id Architecture Device Name Cores Threads Location
* 1 A AMDGPU Agent (GPUID 64146) gfx90a Aldebaran/MI200 [Instinct MI250X/MI250] 416 3328 29:00.0We can also get information on the thread grid:
(gdb) info dispatches
Id Target Id Grid Workgroup Fence Kernel Function
* 1 AMDGPU Dispatch 1:1:1 (PKID 0) [256,1,1] [128,1,1] B|Aa|Ra saxpy(int, float const*, int, float*, int)For the rocgdb documentation, please see: /opt/rocm-<version>/share/doc/rocgdb.