CUDA has vector loads/stores, which reduce the number of instructions (3 to 1). That helps with the memory latency issues that profiling has shown in our functions.
Since NumPy defaults to row-major ordering, this change makes our data handling a lot easier: I no longer have to convert everything to column-major format. It also simplifies slicing grid points, which is a nice efficiency bonus for our calculations.
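
To make the row-major point concrete, here is a minimal sketch. The grid shape (4 points with 3 coordinates each) and all variable names are made up for illustration and are not taken from our code. It shows that a default NumPy array is row-major, that slicing a run of grid points gives a contiguous view without a copy, and that the column-major version needs an explicit conversion and loses that property.

```python
import numpy as np

# Hypothetical grid: 4 points, 3 coordinates (x, y, z) per point.
n_points, n_dims = 4, 3
grid = np.arange(n_points * n_dims, dtype=np.float32).reshape(n_points, n_dims)

# NumPy creates C-ordered (row-major) arrays unless told otherwise.
is_row_major = grid.flags["C_CONTIGUOUS"]

# A run of consecutive grid points is a contiguous view into the same buffer.
block = grid[1:3]
block_is_contiguous = block.flags["C_CONTIGUOUS"]
block_is_view = np.shares_memory(block, grid)

# Column-major layout requires an explicit conversion, which copies the data.
grid_f = np.asfortranarray(grid)
conversion_copied = not np.shares_memory(grid_f, grid)

# The same slice of the column-major array is strided, not contiguous.
block_f = grid_f[1:3]
block_f_is_contiguous = (
    block_f.flags["C_CONTIGUOUS"] or block_f.flags["F_CONTIGUOUS"]
)
```

With the row-major layout, `block` can be handed to a kernel as one contiguous chunk. The column-major slice would need another copy first.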