When I run sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys}), I get InternalError: Blas SGEMM launch failed. Here is the full error and stack trace:
InternalErrorTraceback (most recent call last)
<ipython-input-9-a3261a02bdce> in <module>()
1 batch_xs, batch_ys = mnist.train.next_batch(100)
----> 2 sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict, options, run_metadata)
338 try:
339 result = self._run(None, fetches, feed_dict, options_ptr,
--> 340 run_metadata_ptr)
341 if run_metadata:
342 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _run(self, handle, fetches, feed_dict, options, run_metadata)
562 try:
563 results = self._do_run(handle, target_list, unique_fetches,
--> 564 feed_dict_string, options, run_metadata)
565 finally:
566 # The movers are no longer used. Delete them.
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
635 if handle is None:
636 return self._do_call(_run_fn, self._session, feed_dict, fetch_list,
--> 637 target_list, options, run_metadata)
638 else:
639 return self._do_call(_prun_fn, self._session, handle, feed_dict,
/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.pyc in _do_call(self, fn, *args)
657 # pylint: disable=protected-access
658 raise errors._make_specific_exception(node_def, op, error_message,
--> 659 e.code)
660 # pylint: enable=protected-access
661
InternalError: Blas SGEMM launch failed : a.shape=(100, 784), b.shape=(784, 10), m=100, n=10, k=784
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_Placeholder_0/_4, Variable/read)]]
Caused by op u'MatMul', defined at:
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py", line 3, in <module>
app.launch_new_instance()
File "/usr/local/lib/python2.7/dist-packages/traitlets/config/application.py", line 596, in launch_instance
app.start()
File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelapp.py", line 442, in start
ioloop.IOLoop.instance().start()
File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/ioloop.py", line 162, in start
super(ZMQIOLoop, self).start()
File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 883, in start
handler_func(fd_obj, events)
File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
self._handle_recv()
File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
self._run_callback(callback, msg)
File "/usr/local/lib/python2.7/dist-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
callback(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 275, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 276, in dispatcher
return self.dispatch_shell(stream, msg)
File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 228, in dispatch_shell
handler(stream, idents, msg)
File "/usr/local/lib/python2.7/dist-packages/ipykernel/kernelbase.py", line 391, in execute_request
user_expressions, allow_stdin)
File "/usr/local/lib/python2.7/dist-packages/ipykernel/ipkernel.py", line 199, in do_execute
shell.run_cell(code, store_history=store_history, silent=silent)
File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2723, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2825, in run_ast_nodes
if self.run_code(code, result):
File "/usr/local/lib/python2.7/dist-packages/IPython/core/interactiveshell.py", line 2885, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-4-d7414c4b6213>", line 4, in <module>
y = tf.nn.softmax(tf.matmul(x, W) + b)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1036, in matmul
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 911, in _mat_mul
transpose_b=transpose_b, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2154, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1154, in __init__
self._traceback = _extract_stack()
Stack: EC2 g2.8xlarge machine, Ubuntu 14.04
Answer 0 (score: 93)
if 'session' in locals() and session is not None:
    print('Close interactive session')
    session.close()
Answer 1 (score: 6)
I ran into this problem and solved it by setting allow_soft_placement=True and gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3), which explicitly limits the fraction of GPU memory the process may use. I guess this helps avoid two TensorFlow processes competing for GPU memory.
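gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3)
# gpu_options must be passed to ConfigProto, otherwise the memory limit is never applied
sess = tf.Session(config=tf.ConfigProto(
    allow_soft_placement=True, log_device_placement=True,
    gpu_options=gpu_options))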
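As a rough sanity check on the 0.3 figure above, the arithmetic can be sketched in plain Python. This is only an illustration: the helper name max_concurrent_processes and the reserved_fraction parameter are hypothetical, and it assumes every process reserves exactly its configured fraction, which ignores CUDA context overhead and memory used by other programs.

```python
import math


def max_concurrent_processes(fraction_per_process, reserved_fraction=0.0):
    """Estimate how many processes fit on one GPU.

    Assumes (hypothetically) that each process takes exactly
    `fraction_per_process` of GPU memory and that `reserved_fraction`
    is already taken by something else.
    """
    if not 0.0 < fraction_per_process <= 1.0:
        raise ValueError("fraction_per_process must be in (0, 1]")
    if not 0.0 <= reserved_fraction < 1.0:
        raise ValueError("reserved_fraction must be in [0, 1)")
    usable = 1.0 - reserved_fraction
    # Small epsilon guards against floating-point results such as 3.9999999
    return int(math.floor(usable / fraction_per_process + 1e-9))


print(max_concurrent_processes(0.3))
```

With a fraction of 0.3, at most three such processes would fit under these assumptions; in practice the real limit may be lower.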
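Answer 2 (score: 4)
I got this error when running distributed TensorFlow. Did you check whether any of the workers reported a CUDA_OUT_OF_MEMORY error? If so, it may have to do with where you place your weight and bias variables. For example: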
with tf.device("/job:paramserver/task:0/cpu:0"):
    W = weight_variable([input_units, num_hidden_units])
    b = bias_variable([num_hidden_units])
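Answer 3 (score: 4)
My environment is Python 3.5, TensorFlow 0.12 and Windows 10 (no Docker). I am training neural networks on both CPU and GPU. Whenever I trained on the GPU, I hit the same error, InternalError: Blas SGEMM launch failed.
I could not find the reason why this error happens, but I managed to run my code on the GPU by avoiding the TensorFlow function tensorflow.contrib.slim.one_hot_encoding(). Instead, I do the one-hot encoding in numpy (for the input and output variables).
The following code reproduces the error and the fix. It is a minimal setup that learns the function y = x ** 2 using gradient descent.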
import numpy as np
import tensorflow as tf
import tensorflow.contrib.slim as slim

def test_one_hot_encoding_using_tf():
    # This function raises the "InternalError: Blas SGEMM launch failed" when run in the GPU
    # Initialize
    tf.reset_default_graph()
    input_size = 10
    output_size = 100
    input_holder = tf.placeholder(shape=[1], dtype=tf.int32, name='input')
    output_holder = tf.placeholder(shape=[1], dtype=tf.int32, name='output')
    # Define network
    input_oh = slim.one_hot_encoding(input_holder, input_size)
    output_oh = slim.one_hot_encoding(output_holder, output_size)
    W1 = tf.Variable(tf.random_uniform([input_size, output_size], 0, 0.01))
    output_v = tf.matmul(input_oh, W1)
    output_v = tf.reshape(output_v, [-1])
    # Define updates
    loss = tf.reduce_sum(tf.square(output_oh - output_v))
    trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    update_model = trainer.minimize(loss)
    # Optimize
    init = tf.initialize_all_variables()
    steps = 1000
    # Force CPU/GPU
    config = tf.ConfigProto(
        # device_count={'GPU': 0}  # uncomment this line to force CPU
    )
    # Launch the tensorflow graph
    with tf.Session(config=config) as sess:
        sess.run(init)
        for step_i in range(steps):
            # Get sample
            x = np.random.randint(0, 10)
            y = np.power(x, 2).astype('int32')
            # Update
            _, l = sess.run([update_model, loss], feed_dict={input_holder: [x], output_holder: [y]})
    # Check model
    print('Final loss: %f' % l)
def test_one_hot_encoding_no_tf():
    # This function does not raise the "InternalError: Blas SGEMM launch failed" when run in the GPU
    def oh_encoding(label, num_classes):
        return np.identity(num_classes)[label:label + 1].astype('int32')
    # Initialize
    tf.reset_default_graph()
    input_size = 10
    output_size = 100
    input_holder = tf.placeholder(shape=[1, input_size], dtype=tf.float32, name='input')
    output_holder = tf.placeholder(shape=[1, output_size], dtype=tf.float32, name='output')
    # Define network
    W1 = tf.Variable(tf.random_uniform([input_size, output_size], 0, 0.01))
    output_v = tf.matmul(input_holder, W1)
    output_v = tf.reshape(output_v, [-1])
    # Define updates
    loss = tf.reduce_sum(tf.square(output_holder - output_v))
    trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    update_model = trainer.minimize(loss)
    # Optimize
    init = tf.initialize_all_variables()
    steps = 1000
    # Force CPU/GPU
    config = tf.ConfigProto(
        # device_count={'GPU': 0}  # uncomment this line to force CPU
    )
    # Launch the tensorflow graph
    with tf.Session(config=config) as sess:
        sess.run(init)
        for step_i in range(steps):
            # Get sample
            x = np.random.randint(0, 10)
            y = np.power(x, 2).astype('int32')
            # One hot encoding
            x = oh_encoding(x, 10)
            y = oh_encoding(y, 100)
            # Update
            _, l = sess.run([update_model, loss], feed_dict={input_holder: x, output_holder: y})
    # Check model
    print('Final loss: %f' % l)
Answer 4 (score: 3)
Maybe you have not fully released your GPU. If you are on Linux, try "ps -ef | grep python" to see which jobs are using the GPU, then kill them.
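The same check can be sketched in Python by filtering the output of ps -ef. This is only an illustration: the function names are hypothetical, it assumes the usual eight-column Linux ps -ef layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD), and matching on "python" only suggests which processes might be holding the GPU. The nvidia-smi tool lists the processes that actually hold GPU memory.

```python
import subprocess


def find_python_processes(ps_output):
    """Return (pid, command) pairs for `ps -ef` lines whose command mentions python."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)  # the 8th field keeps the full command line
        if len(fields) < 8:
            continue
        pid, command = fields[1], fields[7]
        if "python" in command:
            matches.append((int(pid), command))
    return matches


def list_running_python_processes():
    """Run `ps -ef` (Linux/Unix only) and filter its output."""
    output = subprocess.check_output(["ps", "-ef"]).decode("utf-8", "replace")
    return find_python_processes(output)
```

Calling list_running_python_processes() returns the candidate PIDs; whether to kill any of them is left as a manual decision, since the list can include the current interpreter and unrelated jobs.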
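Answer 5 (score: 2)
In my case, I had two Python consoles open, both using Keras/TensorFlow. As soon as I closed the old console (forgotten from the previous day), everything started working correctly.
So it is worth checking that you do not have multiple consoles or processes occupying the GPU.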
Answer 6 (score: 1)
I closed all other Jupyter sessions and that solved the problem. I think it was a GPU memory issue.
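Answer 7 (score: 1)
For me, this problem appeared when I tried to run multiple TensorFlow processes (e.g. 2) that all needed access to GPU resources.
A simple solution is to make sure only one TensorFlow process is running at a time.
For more details, see here.
"To be clear, TensorFlow will try (by default) to consume all available GPUs. It cannot be run with other programs also active. Closing. Feel free to reopen if this is actually another problem."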
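On a machine with more than one GPU, one possible way to let two processes coexist is to give each a different device through the CUDA_VISIBLE_DEVICES environment variable, set before TensorFlow is imported. The sketch below is an assumption-laden illustration, not something taken from the answers above: the helper name restrict_visible_gpus is hypothetical, and it does nothing for two processes that must share the same single GPU.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA in this process.

    Must run before TensorFlow (or any CUDA library) initializes.
    An empty list hides every GPU, which forces CPU execution.
    """
    value = ",".join(str(gpu_id) for gpu_id in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


restrict_visible_gpus([0])
# import tensorflow as tf  # import only after the variable has been set
```

A second process could call restrict_visible_gpus([1]) so that the two never touch the same device.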
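Answer 8 (score: 1)
In my case, I first ran
conda clean --all
to clean up tarballs and unused packages.
Then I restarted the IDE (PyCharm in this case) and it worked fine. Environment: Anaconda Python 3.6, Windows 10 64-bit. I installed tensorflow-gpu using the command provided on the Anaconda website.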
Answer 9 (score: 1)
2.0-compatible answer: providing 2.0 code for erko's answer, for the benefit of the community.
session = tf.compat.v1.Session()
if 'session' in locals() and session is not None:
    print('Close interactive session')
    session.close()
Answer 10 (score: 0)
I hit this error when running Keras CuDNN tests in parallel with pytest-xdist. The solution was to run them serially.
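Answer 11 (score: 0)
For me, this error occurred when using Keras with TensorFlow as the backend. It happened because the deep learning environment in Anaconda was not activated properly, so TensorFlow did not start up properly either. I noticed this because, since the last time I activated my deep learning environment (called dl), the prompt in my Anaconda Prompt had changed to:
(dl) C:\Users\georg\Anaconda3\envs\dl\etc\conda\activate.d>set "KERAS_BACKEND=tensorflow"
whereas before it showed only dl. So what I did to get rid of the error was to close my Jupyter notebook and the Anaconda Prompt, then restart them, several times.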
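Answer 12 (score: 0)
I recently hit this error after switching my OS to Windows 10; I had never seen it while using Windows 7.
The error occurs if I load my GPU TensorFlow model while another GPU program is running. That program is my JCuda model, loaded as a socket server, and it is not large. If I close the other GPU program, the TensorFlow model loads without any problem.
The JCuda program is not large at all, only about 70 MB, whereas the TensorFlow model is over 500 MB. But I am using a 1080 Ti, which has plenty of memory. So it is probably not an out-of-memory problem, and may instead be some tricky internal issue in TensorFlow related to the OS or CUDA. (PS: I am using CUDA version 8.0.44 and have not downloaded a newer version.)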
Answer 13 (score: 0)
Restarting my Jupyter processes was not enough; I had to reboot my computer.
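Answer 14 (score: 0)
In my case, it was enough to open the Jupyter notebooks in separate servers.
I only get this error when I try to use more than one TensorFlow/Keras model on the same server. It does not matter if I open one notebook, execute it, close it, and then try to open another. If they are loaded in the same Jupyter server, the error always occurs.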
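Answer 15 (score: -1)
In my case, the network filesystem that libcublas.so lived on had simply died. The node was rebooted and everything was fine. Just adding another data point to the dataset.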