Error when using TensorFlow-GPU with Python multiprocessing?

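I noticed strange behavior when using TensorFlow-GPU together with Python multiprocessing.

I have implemented a DCGAN with some customizations and my own dataset. Since I condition the DCGAN on certain features, I have training data and test data for evaluation.

Because of the size of my dataset, I wrote data loaders that run concurrently using Python's multiprocessing and preload batches into a queue.

The code is structured roughly like this: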

class ConcurrentLoader:
    def __init__(self, dataset):
        ...

class DCGAN:
    ...

net = DCGAN()
training_data = ConcurrentLoader(path_to_training_data)
test_data = ConcurrentLoader(path_to_test_data)

With CUDA 8.0, this code runs fine on TensorFlow-CPU and on TensorFlow-GPU <= 1.3.0, but when I run exactly the same code with TensorFlow-GPU 1.4.1 and CUDA 9 (the latest TF and CUDA releases as of December 2017), it crashes:

2017-12-20 01:15:39.524761: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.527795: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.529548: E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2017-12-20 01:15:39.535341: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-12-20 01:15:39.535383: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-12-20 01:15:39.535397: F tensorflow/core/kernels/conv_ops.cc:667] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms) 
[1]    32299 abort (core dumped)  python dcgan.py --mode train --save_path ~/tf_run_dir/test --epochs 1

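What confuses me is that the error does not occur if I simply remove test_data. So, for some strange reason, TensorFlow-GPU 1.4.1 with CUDA 9 works with a single ConcurrentLoader but crashes when more than one loader is initialized.

Even more interesting, after the exception I have to shut down the Python processes manually, because the GPU's VRAM, the system's RAM, and even the Python processes themselves stay alive after the script crashes.

Furthermore, there must be some odd connection to Python's multiprocessing module, because when I implement the same model in Keras (with the TF backend!), the code runs fine with 2 concurrent loaders. My guess is that Keras somehow provides an abstraction layer that keeps TF from crashing.

Where might I have misused the multiprocessing module so that it causes a crash like this?

These are the parts of ConcurrentLoader that use multiprocessing: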

def __init__(self, dataset):
    ...
    self._q = mp.Queue(64)
    self._file_cycler = cycle(img_files)
    self._worker = mp.Process(target=self._worker_func, daemon=True)
    self._worker.start()

def _worker_func(self):
    while True:
        ... # gets next filepaths from self._file_cycler
        buffer = list()
        for im_path in paths:
            ... # uses OpenCV to load each image & puts it into the buffer
        self._q.put(np.array(buffer).astype(np.float32))

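...and that's it.

Where have I written "unstable" or "non-pythonic" multiprocessing code? I thought daemon=True should ensure that every process is killed as soon as the main process terminates? Unfortunately, that is not the case for this particular error.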
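
On the daemon=True point: per the Python multiprocessing documentation, a process attempts to terminate its daemonic children when it exits, which happens during normal interpreter shutdown. The log above ends in `abort (core dumped)`, and a hard abort skips that shutdown, so workers that stay alive would be consistent with it. The following is a minimal sketch, not the original loader, of a worker that also stops on its own. The names `parent_is_alive`, `worker` and `run_demo`, the dummy integer batches and the timeouts are all hypothetical, and the `os.getppid()` check assumes a POSIX system where an orphaned child is handed to a new parent.

```python
import multiprocessing as mp
import os
import queue


def parent_is_alive(original_ppid):
    """Return True while the process that started us is still our parent.

    Assumption: POSIX semantics, where an orphaned child is re-parented
    and os.getppid() therefore changes.
    """
    return os.getppid() == original_ppid


def worker(q, original_ppid, stop_event):
    """Hypothetical worker: emits dummy batch ids until stopped or orphaned."""
    batch_id = 0
    while not stop_event.is_set() and parent_is_alive(original_ppid):
        try:
            # The timeout keeps the loop from blocking forever on a full
            # queue, so the stop conditions are re-checked regularly.
            q.put(batch_id, timeout=0.5)
            batch_id += 1
        except queue.Full:
            continue
    # Do not wait for buffered items to be flushed when exiting.
    q.cancel_join_thread()


def run_demo():
    """Start one worker, read three batches, then shut it down explicitly."""
    ctx = mp.get_context("spawn")
    q = ctx.Queue(4)
    stop_event = ctx.Event()
    proc = ctx.Process(
        target=worker, args=(q, os.getpid(), stop_event), daemon=True
    )
    proc.start()
    try:
        return [q.get(timeout=10) for _ in range(3)]
    finally:
        stop_event.set()            # ask the worker to finish
        proc.join(timeout=5)        # give it a moment to exit cleanly
        if proc.is_alive():         # fall back to a hard stop
            proc.terminate()
            proc.join()


if __name__ == "__main__":
    print(run_demo())
```

The explicit `stop_event` plus `join` covers the normal exit path, while the parent check is only a fallback for the case where the main process dies without running its cleanup.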

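Am I misusing the default multiprocessing.Process or multiprocessing.Queue here? I thought that simply writing a class in which I store batches of images in a Queue and access them through methods/instance variables should be fine.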
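
This thread does not establish the cause of the crash. One thing that can be checked in a setup like this is the process start method: on Linux the default has traditionally been fork, which copies the parent's state into the child, and CUDA does not support reusing an already initialized context in a forked child. Whether that applies here is an assumption, not a confirmed diagnosis. As a rough sketch under that assumption, the loader can be started with the "spawn" start method, using a module-level worker function so that it can be pickled. The helper `chunked_cycle`, the `next_batch` and `close` methods, the `batch_size` parameter and the string-uppercasing placeholder for image loading are hypothetical and not part of the original code.

```python
import multiprocessing as mp
from itertools import cycle


def chunked_cycle(items, batch_size):
    """Yield endless batches of `batch_size` items, cycling through `items`."""
    it = cycle(items)
    while True:
        yield [next(it) for _ in range(batch_size)]


def _worker_func(q, img_files, batch_size):
    # Module-level function: it can be pickled, which "spawn" requires.
    for paths in chunked_cycle(img_files, batch_size):
        # Placeholder for the real image loading (OpenCV in the question).
        batch = [p.upper() for p in paths]
        q.put(batch)


class ConcurrentLoader:
    def __init__(self, img_files, batch_size=2, start_method="spawn"):
        # A context object selects the start method for this loader only,
        # without changing the global default.
        ctx = mp.get_context(start_method)
        self._q = ctx.Queue(64)
        self._worker = ctx.Process(
            target=_worker_func,
            args=(self._q, list(img_files), batch_size),
            daemon=True,
        )
        self._worker.start()

    def next_batch(self, timeout=30):
        """Return the next preloaded batch, waiting up to `timeout` seconds."""
        return self._q.get(timeout=timeout)

    def close(self):
        """Stop the worker explicitly instead of relying on daemon=True."""
        self._worker.terminate()
        self._worker.join()


if __name__ == "__main__":
    # Two loaders side by side, as in the question.
    training_data = ConcurrentLoader(["a.png", "b.png", "c.png"])
    test_data = ConcurrentLoader(["x.png", "y.png"])
    try:
        print(training_data.next_batch())
        print(test_data.next_batch())
    finally:
        training_data.close()
        test_data.close()
```

With "spawn", the `if __name__ == "__main__":` guard is required, because each child re-imports the main module before running the worker function.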

1 Answer

1
I ran into the same error when trying to use TensorFlow together with multiprocessing:
```
E tensorflow/stream_executor/cuda/cuda_blas.cc:366] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
```
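but in a different environment: TF 1.4, CUDA 8.0, cuDNN 6.0. The matrixMulCUBLAS program from the sample code works fine. I would also like to know the correct solution! The question "failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED on a AWS p2.xlarge instance" did not help in my case.
Answered on 2024-05-07T02:32:34+08:00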
