Lately the GPU utilization of my experiments has been inexplicably low, hovering around 8%. Part of that is admittedly because each of my batches is small, but then runs also started dying partway through with a mysterious error (I had previously done a sanity run with a large batch, and it finished fine, no problems). The traceback:
Traceback (most recent call last):
  File "run.py", line 91, in <module>
    siam.y_true: y_true})
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1110, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
    run_metadata)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: CUB reduce errorinvalid configuration argument
	 [[{{node siamese/fc1/summaries/Mean}} = Mean[T=DT_FLOAT, Tidx=DT_INT32, keep_dims=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](siamese/fc1/Variable/read, siamese/fc1/summaries/range)]]

Caused by op u'siamese/fc1/summaries/Mean', defined at:
  File "run.py", line 33, in <module>
    siam = network.siamese()
  File "/home/maolongchun/mnist_lab/network.py", line 27, in __init__
    self.output1 = self.deepnn(self.x1) # shape:(1000,10) or (1,10)
  File "/home/maolongchun/mnist_lab/network.py", line 74, in deepnn
    self.variable_summaries(w_fc1)
  File "/home/maolongchun/mnist_lab/network.py", line 142, in variable_summaries
    mean = tf.reduce_mean(var)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 1492, in reduce_mean
    name=name))
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4778, in mean
    name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): CUB reduce errorinvalid configuration argument
	 [[{{node siamese/fc1/summaries/Mean}} = Mean[T=DT_FLOAT, Tidx=DT_INT32, keep_dims=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](siamese/fc1/Variable/read, siamese/fc1/summaries/range)]]
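For context: line 142 of network.py in the traceback sits inside my variable_summaries helper, and its tf.reduce_mean call is the op that failed on the GPU. The sketch below is the standard helper from the TensorFlow tutorials; only the tf.reduce_mean line is confirmed by the traceback, the rest I'm assuming matches the tutorial version:

```python
import tensorflow as tf

def variable_summaries(var):
    """Attach mean/stddev/min/max/histogram summaries to a tensor.
    (Standard TF-tutorial helper; assumed to match network.py.)"""
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)  # the op that raised the CUB reduce error
        tf.summary.scalar('mean', mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.summary.scalar('stddev', stddev)
        tf.summary.scalar('max', tf.reduce_max(var))
        tf.summary.scalar('min', tf.reduce_min(var))
        tf.summary.histogram('histogram', var)
```

Every call to this helper adds several summary ops per variable, and all of them get evaluated whenever the merged summary op runs.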
So it looks like a summary-logging problem? I went back and commented out every summary call on the weights and biases (crude but effective), keeping only the loss and learning-rate summaries during training. On the next run, GPU utilization went up!! Thinking it over, this is probably because writing summaries requires frequent write operations, which slows the computation down. So if you care about GPU utilization, watch how often you log summaries and save model checkpoints: too frequent, and utilization drops noticeably!
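Concretely, throttling the writes looks like this. A minimal TF 1.x sketch, where the toy one-weight model, the paths, and the log_every / save_every values are placeholders for your own graph and settings:

```python
import tensorflow as tf

# Toy graph standing in for the real model (illustrative only).
x = tf.placeholder(tf.float32, [None, 1])
w = tf.Variable(tf.zeros([1]))
loss = tf.reduce_mean(tf.square(x * w - 1.0))
tf.summary.scalar('loss', loss)  # keep the cheap scalar summaries
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

merged = tf.summary.merge_all()
saver = tf.train.Saver()
log_every, save_every = 100, 1000  # tune to taste

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('./logs', sess.graph)
    for step in range(5000):
        feed = {x: [[1.0]]}
        if step % log_every == 0:
            # Evaluating `merged` runs every summary op and writes to disk;
            # pay that cost only once every `log_every` steps.
            _, summ = sess.run([train_op, merged], feed_dict=feed)
            writer.add_summary(summ, step)
        else:
            sess.run(train_op, feed_dict=feed)  # pure compute step
        if step % save_every == 0:
            saver.save(sess, './model.ckpt', global_step=step)
    writer.close()
```

The point is that sess.run([train_op, merged]) evaluates every registered summary op and copies the results back to the host, and add_summary / saver.save then hit the disk; doing all that on every single step leaves the GPU idle in between.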
I'll keep adding to this post as I find more. If you've read this far, you're welcome to join the discussion!