Training with Metal Performance Shaders - MPSNNGraph.encodeBatch

Time: 2018-12-16 14:20:30

Tags: ios swift machine-learning metal metal-performance-shaders

I'm building a (fairly) trivial convolutional neural network (CNN) in MPSCNN and have run into a problem exporting the weights.

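During training, I run into problems when calling MPSNNGraph.encodeBatch with a batch size greater than 4 (which is somewhat peculiar, given that an MTLTexture has 4 channels). Whenever I increase the batch size, the weights and bias coefficients come back as nan, whether I read them from the data source's locally stored MPSCNNConvolutionWeightsAndBiasesState (from the update method) or export the weights from the associated filter node.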

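I've bumped the precision of the graph and of all the nodes' resultImages to float32, and I've also added clipping to the optimizer, with no luck. Is there any way to tell whether this is a memory issue or an overflow of the data type being used? Would the problem lie in the optimizer, the gradients, the states, or the transfer from GPU to CPU?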
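One rough way to start separating "overflow" from "something else" is to scan the values once they are on the CPU. The sketch below assumes the weights or gradients have already been copied into a plain `[Float]`; it does not touch any Metal API, and the names `FloatStats` and `inspect` are hypothetical helpers, not part of MPS. It counts nan and infinite entries and reports whether the largest finite magnitude would exceed the half-precision limit of 65504, which would point to a Float16 stage somewhere in the pipeline overflowing.

```swift
import Foundation

/// Summary of the numeric health of a buffer of Float32 values.
/// (Hypothetical helper type, not part of Metal Performance Shaders.)
struct FloatStats {
    var nanCount = 0
    var infCount = 0
    var maxFiniteMagnitude: Float = 0

    /// Largest finite value representable in IEEE 754 half precision.
    static let float16Max: Float = 65504

    /// True if at least one finite value could not be stored in a Float16.
    var exceedsFloat16Range: Bool {
        return maxFiniteMagnitude > FloatStats.float16Max
    }
}

/// Scans the values once, counting nan and inf entries and tracking the
/// largest finite magnitude seen.
func inspect(_ values: [Float]) -> FloatStats {
    var stats = FloatStats()
    for v in values {
        if v.isNaN {
            stats.nanCount += 1
        } else if v.isInfinite {
            stats.infCount += 1
        } else {
            stats.maxFiniteMagnitude = max(stats.maxFiniteMagnitude, abs(v))
        }
    }
    return stats
}
```

Calling `inspect` on the same layer after every step shows when the first nan or inf appears and how large the finite values were just before it. Finite values that grow steadily before turning into inf or nan look like arithmetic overflow; values that are already huge or nan on the very first step would need a different explanation.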

Any suggestions are appreciated - I've been stuck on this "challenge" for a few weeks now. Cheers.

--- Update --- Some further information

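The graph's default storage format is set to Float32. I varied the batch size (keeping everything else the same) and set the training style to CPU so that I could capture the gradients of my top-most layer. The results are below (printing the first 10 coefficients); (a, b, ...) just denote re-runs, each one performing a backward pass.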

BATCH SIZE = 4

Gradient weights l1 (a) ... 1568 ... [0.0032182545, 0.0018722187, 0.004452133, 0.0027766703, 0.004814127, 0.002290076, 0.0005896213, 0.002064481, 0.0019948026, 0.0055566807, 0.003961149]

Gradient weights l1 (b) ... 1568 ... [0.0032182545, 0.0018722187, 0.004452133, 0.0027766703, 0.004814127, 0.002290076, 0.0005896213, 0.002064481, 0.0019948026, 0.0055566807, 0.003961149]

Gradient weights l1 (c) ... 1568 ... [0.0032182545, 0.0018722187, 0.004452133, 0.0027766703, 0.004814127, 0.002290076, 0.0005896213, 0.002064481, 0.0019948026, 0.0055566807, 0.003961149]

BATCH SIZE = 8

Gradient weights l1 (a) ... 1568 ... [-0.35463914, 0.58976394, -0.59485054, 0.22903103, -0.51804817, 0.59701616, 0.5051392, 0.074297816, 0.4284085, -0.8984931, -0.10788263]

Gradient weights l1 (b) ... 1568 ... [-0.8611915, 0.12668955, -0.20884266, -0.102241494, -0.6502063, -0.23424746, -0.4674223, -0.6518867, -0.23104043, -0.40736914, -0.31194344]

BATCH SIZE = 16

Gradient weights l1 (a) ... 1568 ... [1.26359e+35, 5.4729107e+35, 3.3159668e+35, 5.214483e+35, 3.2493971e+35, 9.169122e+35, 9.311691e+35, 2.1583421e+35, 3.952557e+35, 2.3942557e+35, 3.6645236e+35]

Gradient weights l1 (b) ... 1568 ... [0.09119261, 0.05756697, 0.07213145, 0.014482293, 0.09319483, 0.038098965, 0.06368228, 0.09818763, 0.034319896, 0.032822747, 0.011597654]

Gradient weights l1 (c) ... 1568 ... [-nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan]

BATCH SIZE = 32

Gradient weights l1 (a) ... 1568 ... [1.2068136e+35, -2.3001325e+34, 2.1084688e+35, -2.9456847e+35, 9.786839e+33, -6.9434864e+35, -1.4935384e+35, -1.0668826e+35, -1.9871346e+35, 7.397618e+34, -2.4444336e+35]

Gradient weights l1 (b) ... 1568 ... [-1.3880644e+35, -2.4221317e+34, -1.1778572e+35, -1.7336298e+35, -1.8964465e+35, -2.3253935e+35, -4.467901e+35, -2.1361668e+35, -8.294703e+34, -1.3844599e+35, -2.800067e+35]

Gradient weights l1 (c) ... 1568 ... [-nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan, -nan]

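I would have thought the gradients would be averaged? Even so, things seem to become unstable as the batch size increases: a batch size of 4 gives the result you'd expect, namely consistent gradients across re-runs (with all other values held constant and dropout removed).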
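To make "averaged" and "consistent" concrete, here is a minimal CPU-side sketch. It says nothing about how MPS reduces gradients internally; `averageGradients` and `maxAbsDifference` are hypothetical helpers that assume per-sample gradients and re-run outputs are available as plain `[Float]` arrays. With a true mean, feeding identical samples should give the same gradient whatever the batch size, and two re-runs on identical inputs should differ by roughly zero.

```swift
import Foundation

/// Element-wise mean of per-sample gradients.
/// All rows must have the same length; an empty batch yields an empty result.
func averageGradients(_ perSample: [[Float]]) -> [Float] {
    guard let first = perSample.first else { return [] }
    var sum = [Float](repeating: 0, count: first.count)
    for gradient in perSample {
        precondition(gradient.count == first.count, "gradient lengths differ")
        for i in gradient.indices {
            sum[i] += gradient[i]
        }
    }
    let n = Float(perSample.count)
    return sum.map { $0 / n }
}

/// Largest absolute element-wise difference between two re-runs.
/// Returns nil if the lengths differ or if any difference is nan,
/// since nan cannot be compared meaningfully.
func maxAbsDifference(_ a: [Float], _ b: [Float]) -> Float? {
    guard a.count == b.count else { return nil }
    var worst: Float = 0
    for (x, y) in zip(a, b) {
        let d = abs(x - y)
        if d.isNaN { return nil }
        worst = max(worst, d)
    }
    return worst
}
```

Applied to the dumps above, `maxAbsDifference` on re-runs (a) and (b) would be zero for batch size 4 and far from zero for the larger batches. A deterministic graph fed identical inputs should not vary like that, so it may be worth checking how the batch images and states are allocated and synchronized before suspecting the optimizer.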

Could this be a memory issue, such as an overflow somewhere in the process?

0 answers:

No answers