In one of our applications I need to bin 3D point clouds into a voxel grid / 3D histogram. I initially used scipy.stats.binned_statistic_dd, but it was far too slow to keep up with the real-time application we are building. To speed up the process we decided to try doing the binning on the GPU with Numba's CUDA support. The program below was my first attempt at this, and it already gives a nice speedup over the SciPy version:
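For reference, here is roughly what the SciPy version looked like (a sketch; the array sizes and the 0-50 range are illustrative, chosen to match the kernel launch further down):

import numpy as np
from scipy.stats import binned_statistic_dd

# Illustrative CPU baseline: one pass per statistic over the point cloud.
points = np.random.rand(70000, 3) * 100.   # columns are x, y, z
values = np.random.rand(70000) * 256.
edges = [np.linspace(0., 50., 101)] * 3    # 100 bins per axis

counts, _, _ = binned_statistic_dd(points, values, statistic='count', bins=edges)
sums, _, _ = binned_statistic_dd(points, values, statistic='sum', bins=edges)
means, _, _ = binned_statistic_dd(points, values, statistic='mean', bins=edges)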

import numpy as np
import numba
from numba import cuda
import time

@numba.jit(nopython=True)
def compute_bin(x, n, xmin, xmax):
    # computes the bin that the given location should be in
    # special case to mirror NumPy behavior for last bin
    if x == xmax:
        return n - 1 # a_max always in last bin

    bin_number = np.int32(n * (x - xmin) / (xmax - xmin))

    if bin_number < 0 or bin_number >= n:
        return None  # point lies outside the histogram range
    else:
        return bin_number

@cuda.jit()
def histogram(dim1, dim2, dim3, dim1min, dim1max, dim2min, dim2max, dim3min, dim3max, val, histogram_out, histogram_sum_out, histogram_mean_out, nbins):
    # Adds the given pointcloud to the three histograms that are on the gpu
    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for i in range(start, dim1.shape[0], stride):
        # note that calling a numba.jit function from CUDA automatically
        # compiles an equivalent CUDA device function!
        bin_number_1 = compute_bin(dim1[i], nbins, dim1min, dim1max)
        bin_number_2 = compute_bin(dim2[i], nbins, dim2min, dim2max)
        bin_number_3 = compute_bin(dim3[i], nbins, dim3min, dim3max)

        if (bin_number_1 >= 0 and bin_number_1 < histogram_out.shape[0])\
            and (bin_number_2 >= 0 and bin_number_2 < histogram_out.shape[0])\
            and (bin_number_3 >= 0 and bin_number_3 < histogram_out.shape[0]):
            index = bin_number_3 + bin_number_2 * nbins + bin_number_1 * nbins**2
            cuda.atomic.add(histogram_out, index, 1)
            cuda.atomic.add(histogram_sum_out, index, val[i])
            # the min/max pair emulates an atomic store of the (non-negative)
            # running mean for this voxel
            meanval = histogram_sum_out[index] / histogram_out[index]
            cuda.atomic.min(histogram_mean_out, index, 0)
            cuda.atomic.max(histogram_mean_out, index, meanval)

@cuda.jit
def set_histogram_to_zeros(histogram_out, histogram_sum_out, histogram_mean_out):
    # set all histograms on the gpu back to 0.
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, histogram_out.shape[0], stride):
        # the histograms only hold non-negative values, so an atomic min
        # with 0 clears them
        cuda.atomic.min(histogram_out, i, 0)
        cuda.atomic.min(histogram_sum_out, i, 0)
        cuda.atomic.min(histogram_mean_out, i, 0)

The following class keeps track of the binned data, where bins is the number of bins per dimension; three separate histograms track the count, the sum and the mean of the values of the incoming clouds.

class histoholder:
    # Takes care of the interface to the CUDA functions; creates three histograms of the appropriate size.
    def __init__(self, bins):
        self.bins = bins
        self.histogram_out = cuda.to_device(np.zeros(shape=(bins * bins * bins,), dtype=np.int32))
        self.histogram_out_sum = cuda.to_device(np.zeros(shape=(bins * bins * bins,), dtype=np.float32))
        self.histogram_out_mean = cuda.to_device(np.zeros(shape=(bins * bins * bins,), dtype=np.float32))

    def numba_gpu_histogram(self, dim1, dim2, dim3, w):

        # Move data to GPU so we can do two operations on it

        dim1_gpu = cuda.to_device(dim1)
        dim2_gpu = cuda.to_device(dim2)
        dim3_gpu = cuda.to_device(dim3)
        w_gpu = cuda.to_device(w)

        # add the pointcloud to the existing histograms
        histogram[128, 512](dim1_gpu, dim2_gpu, dim3_gpu, 0., 50., 0., 50., 0., 50., w_gpu, self.histogram_out,
                         self.histogram_out_sum, self.histogram_out_mean, self.bins)

    def get_histograms(self):
        # copies the histograms on the gpu back to the cpu so they can be used. 
        cuda.synchronize()
        cpu_histogram = self.histogram_out.copy_to_host()
        cpu_histogram_sum = self.histogram_out_sum.copy_to_host()
        cpu_histogram_mean = self.histogram_out_mean.copy_to_host()
        return np.reshape(cpu_histogram, (self.bins, self.bins, self.bins)), \
           np.reshape(cpu_histogram_sum, (self.bins, self.bins, self.bins)), \
           np.reshape(cpu_histogram_mean, (self.bins, self.bins, self.bins))

    def reset_histogram(self):
        # set all bins in the histograms back to 0. 
        set_histogram_to_zeros[128, 512](self.histogram_out, self.histogram_out_sum, self.histogram_out_mean)
        cuda.synchronize()
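As a sanity check of the kernel itself (a sketch with a hypothetical helper; points exactly on a bin edge may land differently because compute_bin truncates in floating point), the count and sum histograms of a fresh histoholder can be compared against np.histogramdd after a single binning call:

import numpy as np

def check_against_numpy(bins=100):
    # Hypothetical check: a single cloud into a fresh histoholder should
    # reproduce NumPy's histogramdd counts and weighted sums.
    hh = histoholder(bins=bins)
    x = np.random.rand(70000) * 100.
    y = np.random.rand(70000) * 100.
    z = np.random.rand(70000) * 100.
    w = np.random.rand(70000) * 256.
    hh.numba_gpu_histogram(x, y, z, w)
    counts, sums, _ = hh.get_histograms()

    sample = np.stack((x, y, z), axis=1)
    ranges = [(0., 50.)] * 3  # same range as the kernel launch above
    ref_counts, _ = np.histogramdd(sample, bins=bins, range=ranges)
    ref_sums, _ = np.histogramdd(sample, bins=bins, range=ranges, weights=w)

    assert np.array_equal(counts, ref_counts.astype(np.int32))
    assert np.allclose(sums, ref_sums, rtol=1e-3)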

When using this in our application, the binning seems to stall at random moments, for arbitrary amounts of time, while the kernel is running on the GPU.

To try to reproduce the stalling behaviour outside of the application, I then wrote a test that calls the functions indefinitely. The example below runs indefinitely without problems, which seems to tell me that the code itself is not necessarily the issue:

def testcycles():

    bins = 100
    hh = histoholder(bins=bins)
    reset_counter = 0
    max_time = 0
    while True:
        xvals = np.random.rand(70000) * 100.
        yvals = np.random.rand(70000) * 100.
        zvals = np.random.rand(70000) * 100.
        wvals = np.random.rand(70000) * 256.
        reset_counter += 1
        tmptime = time.time()
        hh.numba_gpu_histogram(xvals, yvals, zvals, wvals)
        _, _, _ = hh.get_histograms()

        if reset_counter > 5000:
            reset_counter = 0
            hh.reset_histogram()
        time_taken = time.time() - tmptime
        if time_taken > max_time:
            max_time = time_taken
        print('%.4f, %.5f' % (time_taken, max_time))

This test runs completely fine. In the application where I use the code, however, the GPU kernel appears to stall for indeterminate amounts of time (ranging from 2 seconds up to 43 seconds, at which point I had to kill the application).

I have already tried completely reinstalling everything.

The application will run for half an hour, or ten minutes, and then suddenly hang (we bin at roughly 25 Hz with about 70,000 data points per cloud). During a hang, calling nvidia-smi in a terminal shows "Volatile GPU-Util" constantly at 100%, and the CPU core running the process is also at 100%. The temperature of the GPU is around 50 °C, which makes me doubt it is a thermal problem (NVIDIA states that anything up to 85 °C is perfectly fine). The memory in use on the card also never exceeds 100 MB of the 8 GB available.

I have tried different versions of CUDA and different versions of the NVIDIA driver, all without success.

No error or exception is raised, either. The GPU I use for the computations is the second one in the PC, and as far as I know, on Ubuntu kernel calls on a GPU that is not connected to a display are not subject to a watchdog timeout.

So my question is: how do I go about debugging this problem?

GPU: NVIDIA Quadro M4000 (the other card is a GeForce GTX 960)

OS: Ubuntu 16.04

numba: 0.40.1

CUDA: 9.2.148-1

NVIDIA driver: 410.78

Any tips on improving this Numba code are of course also very welcome!
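One improvement I am already considering myself: drop histogram_mean_out and compute the mean on the host from the count and sum histograms, since the in-kernel division reads the two slots non-atomically between other threads' updates, and the atomic min/max pair only emulates a plain store. A minimal sketch:

import numpy as np

def mean_on_host(cpu_histogram, cpu_histogram_sum):
    # Sketch: per-voxel mean from the arrays returned by get_histograms().
    with np.errstate(divide='ignore', invalid='ignore'):
        mean = cpu_histogram_sum / cpu_histogram
    # report empty voxels as 0.0, matching what the GPU version produces
    return np.nan_to_num(mean)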