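In an application I need to bin 3D point clouds into a voxel grid, i.e. a 3D histogram. Initially I used scipy.stats.binned_statistic_dd, but it was too slow to keep up with the real-time application we are building. To speed this up, we decided to try numba with CUDA to do the binning on the GPU. The following program is my first attempt at this, and it gives a nice speedup over scipy: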
import time

import numpy as np
import numba
from numba import cuda
@numba.jit(nopython=True)
def compute_bin(x, n, xmin, xmax):
    # Computes the bin that the given location should be in.
    # Special case to mirror NumPy behavior for the last bin.
    if x == xmax:
        return n - 1  # xmax always falls in the last bin
    bin = np.int32(n * (x - xmin) / (xmax - xmin))
    if bin < 0 or bin >= n:
        # Out of range: return an integer sentinel instead of None so that
        # every return path has the same type.
        return -1
    else:
        return bin
@cuda.jit
def histogram(dim1, dim2, dim3, dim1min, dim1max, dim2min, dim2max, dim3min, dim3max,
              val, histogram_out, histogram_sum_out, histogram_mean_out, nbins):
    # Adds the given point cloud to the three flattened histograms on the GPU.
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, dim1.shape[0], stride):
        # Note that calling a numba.jit function from CUDA automatically
        # compiles an equivalent CUDA device function!
        bin_number_1 = compute_bin(dim1[i], nbins, dim1min, dim1max)
        bin_number_2 = compute_bin(dim2[i], nbins, dim2min, dim2max)
        bin_number_3 = compute_bin(dim3[i], nbins, dim3min, dim3max)
        # The histograms are flattened to nbins**3 entries, so each per-axis
        # bin is checked against nbins rather than histogram_out.shape[0].
        if (bin_number_1 >= 0 and bin_number_1 < nbins) \
                and (bin_number_2 >= 0 and bin_number_2 < nbins) \
                and (bin_number_3 >= 0 and bin_number_3 < nbins):
            index = bin_number_3 + bin_number_2 * nbins + bin_number_1 * nbins * nbins
            cuda.atomic.add(histogram_out, index, 1)
            cuda.atomic.add(histogram_sum_out, index, val[i])
            # NOTE: this read is not atomic with respect to the two adds above,
            # so threads hitting the same bin concurrently can store a stale mean.
            meanval = histogram_sum_out[index] / histogram_out[index]
            cuda.atomic.min(histogram_mean_out, index, 0)
            cuda.atomic.max(histogram_mean_out, index, meanval)
@cuda.jit
def set_histogram_to_zeros(histogram_out, histogram_sum_out, histogram_mean_out):
    # Set all histograms on the GPU back to 0.
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, histogram_out.shape[0], stride):
        # Each index is written by exactly one thread, so plain stores are
        # enough; atomic.min(..., 0) would also leave negative sums untouched.
        histogram_out[i] = 0
        histogram_sum_out[i] = 0.0
        histogram_mean_out[i] = 0.0
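The mean update in the histogram kernel reads the sum and the count of a bin while other threads may still be adding to them, so the stored mean can lag behind. One possible alternative, sketched here under the assumption that the mean is only needed once the histograms are back on the host, is to derive it from the sum and count arrays afterwards. The helper name `mean_from_sum_and_count` is made up for this illustration and is not part of the code above.

```python
import numpy as np


def mean_from_sum_and_count(hist_sum, hist_count):
    """Return the per-bin mean; bins that received no points stay at 0."""
    sums = np.asarray(hist_sum, dtype=np.float32)
    counts = np.asarray(hist_count).astype(np.float32)
    mean = np.zeros_like(sums)
    # Only divide where at least one point landed in the bin.
    np.divide(sums, counts, out=mean, where=counts > 0)
    return mean
```

With this approach the third GPU array and the two extra atomics per point would no longer be needed, at the cost of one element-wise division on the CPU per readout.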
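This class keeps track of the binned data. Here bins is the number of bins along each of the three dimensions, and the three histograms track the count, the sum and the mean of the values in the incoming point clouds.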
class histoholder:
    # Class that takes care of the interface to the CUDA functions.
    # It creates three flattened histograms of the appropriate size.
    def __init__(self, bins):
        self.bins = bins
        self.histogram_out = cuda.to_device(np.zeros(shape=(bins * bins * bins,), dtype=np.int32))
        self.histogram_out_sum = cuda.to_device(np.zeros(shape=(bins * bins * bins,), dtype=np.float32))
        self.histogram_out_mean = cuda.to_device(np.zeros(shape=(bins * bins * bins,), dtype=np.float32))

    def numba_gpu_histogram(self, dim1, dim2, dim3, w):
        # Move data to GPU so we can do two operations on it
        dim1_gpu = cuda.to_device(dim1)
        dim2_gpu = cuda.to_device(dim2)
        dim3_gpu = cuda.to_device(dim3)
        w_gpu = cuda.to_device(w)
        # Add the point cloud to the existing histograms
        histogram[128, 512](dim1_gpu, dim2_gpu, dim3_gpu, 0., 50., 0., 50., 0., 50., w_gpu, self.histogram_out,
                            self.histogram_out_sum, self.histogram_out_mean, self.bins)

    def get_histograms(self):
        # Copies the histograms on the GPU back to the CPU so they can be used.
        cuda.synchronize()
        cpu_histogram = self.histogram_out.copy_to_host()
        cpu_histogram_sum = self.histogram_out_sum.copy_to_host()
        cpu_histogram_mean = self.histogram_out_mean.copy_to_host()
        return np.reshape(cpu_histogram, (self.bins, self.bins, self.bins)), \
            np.reshape(cpu_histogram_sum, (self.bins, self.bins, self.bins)), \
            np.reshape(cpu_histogram_mean, (self.bins, self.bins, self.bins))

    def reset_histogram(self):
        # Set all bins in the histograms back to 0.
        # The kernel needs an explicit launch configuration, like histogram above.
        set_histogram_to_zeros[128, 512](self.histogram_out, self.histogram_out_sum, self.histogram_out_mean)
        cuda.synchronize()
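To check that the GPU result is correct at all, the count and sum histograms can be compared against a CPU reference. The following is a minimal sketch using `np.histogramdd`, assuming the same range on all three axes (as in the `0., 50.` arguments above). The function name `cpu_reference` is hypothetical. Like `compute_bin`, `np.histogramdd` puts a value equal to the upper edge into the last bin and drops values outside the range.

```python
import numpy as np


def cpu_reference(x, y, z, w, nbins, lo, hi):
    """Count and weighted-sum 3D histograms computed on the CPU."""
    sample = np.column_stack((x, y, z))
    axis_range = [(lo, hi)] * 3
    count, _ = np.histogramdd(sample, bins=nbins, range=axis_range)
    total, _ = np.histogramdd(sample, bins=nbins, range=axis_range, weights=w)
    return count.astype(np.int32), total.astype(np.float32)
```

The returned arrays have shape `(nbins, nbins, nbins)` in C order, so element `[b1, b2, b3]` corresponds to the flat index `b3 + b2 * nbins + b1 * nbins**2` used in the kernel and can be compared directly with the reshaped output of `get_histograms`.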
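When I use this in our application, the binning seems to stall at random moments, for an arbitrary amount of time, while the kernel is running on the GPU.
To reproduce the stall outside the application, I then wrote a test that calls the function in an endless loop. This example runs indefinitely without problems, which suggests to me that the code itself is not necessarily at fault.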
def testcycles():
    bins = 100
    hh = histoholder(bins=bins)
    reset_counter = 0
    max_time = 0
    while True:
        xvals = np.random.rand(70000) * 100.
        yvals = np.random.rand(70000) * 100.
        zvals = np.random.rand(70000) * 100.
        wvals = np.random.rand(70000) * 256.
        reset_counter += 1
        tmptime = time.time()
        hh.numba_gpu_histogram(xvals, yvals, zvals, wvals)
        _, _, _ = hh.get_histograms()
        if reset_counter > 5000:
            reset_counter = 0
            hh.reset_histogram()
        time_taken = time.time() - tmptime
        if time_taken > max_time:
            max_time = time_taken
        # Prints the time of this cycle and the worst time seen so far.
        print('%.4f, %.5f' % (time_taken, max_time))
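The loop above measures upload, kernel launch, download and the occasional reset as a single number. A minimal sketch of how the stages could be timed separately, so that a stall can be attributed to one of them, is shown below. `StageTimer` is a hypothetical helper, not something numba provides. Because kernel launches are asynchronous, a `cuda.synchronize()` would have to be placed inside the timed block for the kernel stage to be meaningful.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer(object):
    """Collects wall-clock durations per named stage."""

    def __init__(self):
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name):
        start = time.time()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.samples[name].append(time.time() - start)

    def worst(self):
        # Longest duration observed for each stage so far.
        return {name: max(values) for name, values in self.samples.items()}


# Intended usage (not executed here, it needs the GPU code above):
#   with timer.stage('upload'):   dim1_gpu = cuda.to_device(dim1)
#   with timer.stage('kernel'):   histogram[128, 512](...); cuda.synchronize()
#   with timer.stage('download'): cpu_histogram = histogram_out.copy_to_host()
```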
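In this test run everything works perfectly fine. In the application where I actually use it, however, the GPU kernel seems to stall for an indeterminate amount of time (anywhere from 2 seconds up to 43 seconds, at which point I had to kill the application).
I have already tried a complete reinstall of everything.
It runs for ten minutes or half an hour and then suddenly hangs (the binning rate is about 25 Hz, with roughly 70,000 data points each time). During the hang, calling "nvidia-smi" in a terminal shows "Volatile GPU Util" at a constant 100%, and the CPU core running the process is also at 100%. The GPU temperature is around 50 degrees, which makes me doubt that this is a temperature problem (NVIDIA claims that up to 85 degrees is perfectly fine). The memory used on the card also never exceeds 100 MB, out of 8 GB.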
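Since the CPU is also at 100% during the hang, it might help to see which Python line the process is sitting in when it stalls. One option is the standard library's `faulthandler` module (Python 3), which can dump the tracebacks of all threads if a block takes longer than a given timeout. This is only a sketch: `stall_watchdog` is a name invented here, and the 5 second timeout in the usage comment is an arbitrary assumption.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def stall_watchdog(timeout_s, stream=None):
    """Dump all Python tracebacks every timeout_s seconds while the block runs."""
    target = sys.stderr if stream is None else stream
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Disarm the watchdog as soon as the block finishes.
        faulthandler.cancel_dump_traceback_later()


# Intended usage around one binning cycle (not executed here):
#   with stall_watchdog(5.0):
#       hh.numba_gpu_histogram(xvals, yvals, zvals, wvals)
#       hh.get_histograms()
```

If the dump shows the process inside `cuda.synchronize()` or `copy_to_host()`, the host is waiting on the GPU; if it is somewhere else in the application, the stall may not originate in the binning code at all.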
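I have tried different versions of CUDA and different versions of the NVIDIA driver, all without success.
No error or exception is raised at any point. The GPU I use for the computation is the second one in the PC. As far as I know, on Ubuntu there is no timeout on kernel calls for a GPU that is not connected to a display.
So my question is: how can I debug this problem?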
GPU: NVIDIA Quadro M4000 (the other card is a GeForce GTX 960)
OS: Ubuntu 16.04
numba: 0.40.1
CUDA: 9.2.148-1
NVIDIA driver: 410.78
As an aside, any tips on improving this numba code are also welcome!