I am currently training some custom models that need about 12 GB of GPU memory. My setup has about 96 GB of GPU memory, yet Python/Jupyter still manages to consume all of it, to the point where I get a ResourceExhaustedError. I am stuck on this particular problem, so any help would be appreciated.

Right now, when I load a VGG16-based model like this:

import keras
from keras.applications.vgg16 import VGG16
from keras.models import Model, Sequential
from keras.layers import Dense, Flatten

input_shape = (512, 512, 3)
base_model = VGG16(input_shape=input_shape, weights=None, include_top=False)

# Image branch: flatten the VGG16 convolutional features
pixel_branch = base_model.output
pixel_branch = Flatten()(pixel_branch)

new_model = Model(inputs=base_model.input, outputs=pixel_branch)

# Text branch: a single scalar input through a small Dense layer
text_branch = Sequential()
text_branch.add(Dense(32, input_shape=(1,), activation='relu'))

# merged = Merge([new_model, text_branch], mode='concat')
merged = keras.layers.concatenate([new_model.output, text_branch.output])

# Regression head on the concatenated features
age = Dense(1000, activation='relu')(merged)
age = Dense(1000, activation='relu')(age)
age = Dense(1)(age)

# show model
# model.summary()
model = Model(inputs=[base_model.input, text_branch.input], outputs=age)

When I run this code in a Jupyter cell and monitor GPU usage with nvidia-smi, it stays at 0%. However, when I replace the code in the Jupyter cell above with the following:

import keras
from keras.applications.inception_v3 import InceptionV3
from keras.models import Model, Sequential
from keras.layers import Dense, Flatten

input_shape = (512, 512, 3)
base_model = InceptionV3(input_shape=input_shape, weights=None, include_top=False)

# Image branch: flatten the InceptionV3 convolutional features
pixel_branch = base_model.output
pixel_branch = Flatten()(pixel_branch)

new_model = Model(inputs=base_model.input, outputs=pixel_branch)

# Text branch: a single scalar input through a small Dense layer
text_branch = Sequential()
text_branch.add(Dense(32, input_shape=(1,), activation='relu'))

# merged = Merge([new_model, text_branch], mode='concat')
merged = keras.layers.concatenate([new_model.output, text_branch.output])

# Regression head on the concatenated features
age = Dense(1000, activation='relu')(merged)
age = Dense(1000, activation='relu')(age)
age = Dense(1)(age)

# show model
# model.summary()
model = Model(inputs=[base_model.input, text_branch.input], outputs=age)

GPU usage goes crazy: suddenly almost all of the memory on all of the GPUs is gone, even before I run model.compile() or model.fit() in Keras!
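As a sanity check, the size difference between the two graphs can be quantified by printing the parameter count right after construction (a quick sketch, not part of my original cells):

# Sanity-check sketch: the first Dense(1000) after Flatten() dominates the
# total, since Flatten() turns the entire convolutional feature map into
# one very long vector.
model.summary()
print('Trainable parameters:', model.count_params())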

I have also tried allow_growth and per_process_gpu_memory_fraction in TensorFlow. I still get a ResourceExhaustedError when I run model.fit with the Inception-based model. Note that I do not think this is a GPU memory error, since I am using an instance with 8 Tesla K80s, for a total of about 96 GB of GPU memory.
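For reference, I set those options in the usual TF1/Keras way, roughly like this:

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
# Alternatively, cap the share of each GPU a single process may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.4
K.set_session(tf.Session(config=config))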

Also note that my batch size is 2.
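For completeness, the training call that triggers the error looks roughly like this (with hypothetical dummy arrays standing in for my real data):

import numpy as np

# Hypothetical stand-ins for the real dataset, shapes matching the model
images = np.zeros((4, 512, 512, 3), dtype=np.float32)  # image input
texts = np.zeros((4, 1), dtype=np.float32)             # scalar text input
targets = np.zeros((4, 1), dtype=np.float32)           # regression target

model.compile(optimizer='adam', loss='mse')
model.fit([images, texts], targets, batch_size=2, epochs=1)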