
MXNet R package CUDA failure on AWS EC2 (Ubuntu 14.04)


I am installing the MXNet R package on Amazon Web Services EC2 (Ubuntu 14.04 LTS), following the instructions at http://mxnet.io/get_started/ubuntu_setup.html.

First I downloaded the CUDA 8 toolkit from NVIDIA.

sudo dpkg -i cuda-repo-ubuntu1404_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda

Then I downloaded the latest cuDNN file (cudnn-8.0-linux-x64-v6.0.tgz) and transferred it to the EC2 instance via scp.

In the EC2 console (accessed via SSH), I typed

tar xvzf cudnn-8.0-linux-x64-v5.1-ga.tgz
sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
sudo ldconfig
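
The post names a v6.0 archive but the command extracts a v5.1 one, so it may help to confirm which cuDNN version actually ended up in the CUDA include directory. The following is a minimal sketch, not part of the original post: the helper name `cudnn_version` is hypothetical, and the default header path is assumed from the copy commands above.

```shell
#!/usr/bin/env bash
# Print the cuDNN version declared in a cudnn.h header as MAJOR.MINOR.PATCH.
# The default path is an assumption based on the copy commands in the post.
cudnn_version() {
    local header="${1:-/usr/local/cuda/include/cudnn.h}"
    if [ ! -r "$header" ]; then
        echo "cannot read $header" >&2
        return 1
    fi
    # Each macro sits on its own line, e.g. "#define CUDNN_MAJOR 5".
    awk '
        $1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
        $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
        $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
        END { print major "." minor "." patch }
    ' "$header"
}

# Usage: report the version of the installed header, if there is one.
cudnn_version || true
```

If the printed version does not match the archive you meant to install, an older header is probably still in place.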

(I had initially transferred the CUDA installation files to /usr/local/, so I copied those two lines of code to my local directory.)

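Then I cloned the MXNet source from git, generated the config.mk file, and modified it to set USE_CUDA=1 and so on (for GPU use). I then moved to the setup-utils directory and ran the shell script for the Ubuntu R build.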

git clone https://github.com/dmlc/mxnet.git ~/mxnet --recursive

cd ~/mxnet
cp make/config.mk .
# If building with GPU, add configurations to config.mk file:
echo "USE_CUDA=1" >>config.mk
echo "USE_CUDA_PATH=/usr/local/cuda" >>config.mk
echo "USE_CUDNN=1" >>config.mk

cd ~/mxnet/setup-utils
bash install-mxnet-ubuntu-r.sh
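
Because the template config.mk already contains assignments and the `echo` lines append new ones, it can be useful to see which value make would pick up. The sketch below assumes plain `NAME = value` lines, where the last assignment wins; the helper name `effective_setting` is hypothetical. It only inspects the file in the current directory and cannot confirm that the install script actually builds from that file.

```shell
#!/usr/bin/env bash
# Print the value make would use for a variable in a config.mk-style file.
# For simple "NAME = value" lines the last assignment wins, so an appended
# "USE_CUDA=1" overrides an earlier "USE_CUDA = 0" from the template.
effective_setting() {
    local name="$1" file="${2:-config.mk}"
    grep -E "^[[:space:]]*${name}[[:space:]]*=" "$file" \
        | tail -n 1 \
        | sed -E 's/^[^=]*=[[:space:]]*//; s/[[:space:]]+$//'
}

# Usage: run from ~/mxnet after editing config.mk.
if [ -r config.mk ]; then
    for name in USE_CUDA USE_CUDA_PATH USE_CUDNN; do
        echo "$name=$(effective_setting "$name" config.mk)"
    done
fi
```

Running it again after the install script finishes would show whether the file was changed during the build.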

Of course, I added the environment variables with the following commands:

export CUDA_HOME=/usr/local/cuda-8.0
export CUDA_ROOT=/usr/local/cuda-8.0/bin
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda
PATH=${CUDA_HOME}/bin:${PATH}
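
Note that the commands above assign LD_LIBRARY_PATH twice, and the first assignment discards any value it held before. A minimal sketch of one way to set the same variables without losing existing entries follows. The helper name `prepend_path` is hypothetical and the CUDA path is taken from the post. These variables only affect how libraries are located at run time, so they are unlikely on their own to change an error about how MXNet was compiled.

```shell
#!/usr/bin/env bash
# Prepend a directory to a colon-separated list unless it is already present.
prepend_path() {
    local dir="$1" list="$2"
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;                 # already listed
        *)          printf '%s\n' "${dir}${list:+:$list}" ;; # add in front
    esac
}

# Path taken from the post; adjust if CUDA lives elsewhere.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-8.0}"
export CUDA_HOME
LD_LIBRARY_PATH="$(prepend_path "$CUDA_HOME/lib64" "${LD_LIBRARY_PATH:-}")"
PATH="$(prepend_path "$CUDA_HOME/bin" "$PATH")"
export LD_LIBRARY_PATH PATH
```

Putting these lines in a shell startup file, or in a wrapper script that later calls R CMD BATCH, would make them apply to non-interactive runs as well.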

FYI, I checked with the 'nvidia-smi' command that the NVIDIA driver is installed correctly.

I launched R and typed,

library(mxnet)

and the output was

Rcpp Init>

I ran some test code for MXNet, and it worked fine.

So I went on to run the GPU code (LeNet):

require(mxnet)
train <- read.csv('train.csv', header=TRUE)
test <- read.csv('test.csv', header=TRUE)
train <- data.matrix(train)
test <- data.matrix(test)

train.x <- train[,-1]
train.y <- train[,1]

train.x <- t(train.x/255)
test <- t(test/255)

# input
data <- mx.symbol.Variable('data')
# first conv
conv1 <- mx.symbol.Convolution(data=data, kernel=c(5,5), num_filter=20)
tanh1 <- mx.symbol.Activation(data=conv1, act_type="tanh")
pool1 <- mx.symbol.Pooling(data=tanh1, pool_type="max",
                      kernel=c(2,2), stride=c(2,2))
# second conv
conv2 <- mx.symbol.Convolution(data=pool1, kernel=c(5,5), num_filter=50)
tanh2 <- mx.symbol.Activation(data=conv2, act_type="tanh")
pool2 <- mx.symbol.Pooling(data=tanh2, pool_type="max",
                      kernel=c(2,2), stride=c(2,2))
# first fullc
flatten <- mx.symbol.Flatten(data=pool2)
fc1 <- mx.symbol.FullyConnected(data=flatten, num_hidden=500)
tanh3 <- mx.symbol.Activation(data=fc1, act_type="tanh")
# second fullc
fc2 <- mx.symbol.FullyConnected(data=tanh3, num_hidden=10)
# loss
lenet <- mx.symbol.SoftmaxOutput(data=fc2)

train.array <- train.x
dim(train.array) <- c(28, 28, 1, ncol(train.x))
test.array <- test
dim(test.array) <- c(28, 28, 1, ncol(test))
n.gpu <- 4
device.gpu <- lapply(0:(n.gpu-1), function(i) {
mx.gpu(i)
})
mx.set.seed(0)
tic <- proc.time()
model <- mx.model.FeedForward.create(lenet, X=train.array, y=train.y,
                                ctx=device.gpu, num.round=5, array.batch.size=100,
                                learning.rate=0.05, momentum=0.9, wd=0.00001,
                                eval.metric=mx.metric.accuracy,
                                  epoch.end.callback=mx.callback.log.train.metric(100))

This is the basic tutorial code from the MXNet page.

But I got the following error message:

Auto-select kvstore type = local_update_cpu
Start training with 4 devices
[07:05:37] /root/mxnet/dmlc-core/include/dmlc/logging.h:300: [07:05:37] src/storage/storage.cc:77: Compile with USE_CUDA=1 to enable GPU usage

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f296b8659cc]
[bt] (1) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(+0xed1be3) [0x7f296c51cbe3]
[bt] (2) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(+0xed43c3) [0x7f296c51f3c3]
[bt] (3) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x3f) [0x7f296c51f77f]
[bt] (4) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(MXNDArrayCreate+0x63d) [0x7f296c0e83bd]
[bt] (5) /usr/local/lib/R/site-library/mxnet/libs/mxnet.so(_ZN5mxnet1R7NDArray5EmptyERKN4Rcpp9DimensionERKNS2_6VectorILi19ENS2_15PreserveStorageEEE+0xdd) [0x7f295ac7ebbd]
[bt] (6) /usr/local/lib/R/site-library/mxnet/libs/mxnet.so(_ZN4Rcpp12CppFunction2INS_4XPtrIN5mxnet1R6NDBlobENS_15PreserveStorageEXadL_ZNS_25standard_delete_finalizerIS4_EEvPT_EELb0EEERKNS_9DimensionERKNS_6VectorILi19ES5_EEEclEPP7SEXPREC+0xd2) [0x7f295ac8b552]
[bt] (7) /usr/local/lib/R/site-library/Rcpp/libs/Rcpp.so(_Z23InternalFunction_invokeP7SEXPREC+0xd1) [0x7f2971c69cd1]
[bt] (8) /usr/lib/R/lib/libR.so(+0xce3c1) [0x7f29762a83c1]
[bt] (9) /usr/lib/R/lib/libR.so(Rf_eval+0x6fb) [0x7f29762ed5ab]

Error in mx.nd.internal.empty.array(shape, ctx) :
  [07:05:37] src/storage/storage.cc:77: Compile with USE_CUDA=1 to enable GPU usage

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f296b8659cc]
[bt] (1) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(+0xed1be3) [0x7f296c51cbe3]
[bt] (2) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(+0xed43c3) [0x7f296c51f3c3]
[bt] (3) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(_ZN5mxnet11StorageImpl5AllocEmNS_7ContextE+0x3f) [0x7f296c51f77f]
[bt] (4) /usr/local/lib/R/site-library/mxnet/libs/libmxnet.so(MXNDArrayCreate+0x63d) [0x7f296c0e83bd]
[bt] (5) /usr/local/lib/R/site-library/mxnet/libs/mxnet.so(_ZN5mxnet1R7NDArray5EmptyERKN4Rcpp9DimensionERKNS2_6VectorILi19ENS2_15PreserveStorageEEE+0xdd) [0x7f295ac7ebbd]
[bt] (6) /usr/local/lib/R/site-library/mxnet/libs/mxnet.so(_ZN4Rcpp12CppFunction2INS_4XPtrIN5mxnet1R6NDBlobENS_15PreserveStorageEXadL_ZNS_25standard_delete_finalizerIS4_EEvPT_EEL

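I want to make a few things clear:

  • I modified the config.mk file 'before' compiling with the 'bash install-mxnet-ubuntu-r.sh' command.

  • I changed the environment variables in as many ways as I could.

  • I repeated the steps above at least 7 times.

  • My final goal is to run code containing the MXNet LeNet through a batch file (R CMD BATCH ~.R).

I would be very grateful if anyone could help solve my problem.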

1 Answer

  • 1

    It looks like you compiled without CUDA support. For R on Linux, the following build command is needed:

    make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
    

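    Note that there is also a dependency on OpenBLAS.

    I followed the exact instructions on the installation page (selecting Linux | R | GPU). It took me about an hour starting from scratch. When I imported the mxnet library in the R environment, I got: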

    > library(mxnet)
    libdc1394 error: Failed to initialize libdc1394
    

    But according to this post it is related to the camera, so I disabled it with:

    sudo ln /dev/null /dev/raw1394
    

    After that, I ran the example provided on the installation page, and it worked fine:

    > library(mxnet)
    > a <- mx.nd.ones(c(2,3), ctx = mx.gpu())
    > b <- a * 2 + 1
    > b
         [,1] [,2] [,3]
    [1,]    3    3    3
    [2,]    3    3    3
    >
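
Another way to check from the shell whether the installed library was built with CUDA is to look at its dynamic dependencies. The sketch below is only a heuristic and assumes the CUDA runtime is linked dynamically: a GPU build of libmxnet.so would then be expected to list libcudart in its `ldd` output. The helper name `links_cuda` is hypothetical, and the library path is the one shown in the stack trace in the question.

```shell
#!/usr/bin/env bash
# Report whether a shared library is dynamically linked against the CUDA runtime.
# Returns 0 if libcudart appears in the ldd output, 1 if not, 2 if unreadable.
links_cuda() {
    local lib="$1" deps
    if [ ! -r "$lib" ]; then
        echo "cannot read $lib" >&2
        return 2
    fi
    deps="$(ldd "$lib" 2>/dev/null || true)"
    case "$deps" in
        *libcudart*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Path taken from the stack trace in the question.
LIB=/usr/local/lib/R/site-library/mxnet/libs/libmxnet.so
if links_cuda "$LIB"; then
    echo "libmxnet.so links against the CUDA runtime"
else
    echo "no libcudart dependency found (CPU-only build, or library missing)"
fi
```

If no CUDA dependency shows up, rebuilding with the make command given in the answer and reinstalling the R package would be the next thing to try.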
    
