What gives the logits this unexpected shape?

Date: 2017-02-19 16:38:34

Tags: machine-learning tensorflow conv-neural-network

I am currently developing an audio classifier with TensorFlow's Python API, using the UrbanSound8K dataset, reading exactly 176400 data points from each file, and trying to distinguish between 10 mutually exclusive classes.

I have adapted this example code for a convolutional neural network: https://www.tensorflow.org/get_started/mnist/pros

Unfortunately, I am getting the following error:

Traceback (most recent call last):
  ...
tensorflow.python.framework.errors_impl.InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [7000,10] and labels shape [10]
     [[Node: xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"](read/add, _recv_y_0/_9)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "urban-cnn.py", line 124, in <module>
    sess.run(optimizer, feed_dict={x: batch_x, y: batch_y, keep_prob: .5})
  ...
tensorflow.python.framework.errors_impl.InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [7000,10] and labels shape [10]
     [[Node: xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"](read/add, _recv_y_0/_9)]]

Caused by op 'xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits', defined at:
  File "urban-cnn.py", line 102, in <module>
    xent = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=y_conv), name="xent")
  ...

InvalidArgumentError (see above for traceback): logits and labels must have the same first dimension, got logits shape [7000,10] and labels shape [10]
     [[Node: xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"](read/add, _recv_y_0/_9)]]

Here is a slightly edited version of the code:

import tensorflow as tf
import soundfile as sfx
import numpy as np
import math
import glob

batch_size = 10
n_epochs = 10

input_width = 176400  # data points read from each audio file

n_labels = 10  # ten mutually exclusive classes

widths = [5, 5, 7]  # kernel widths; the last entry is used below to size the dense layer
channels = [1, 8, 64, 512, n_labels]  # channel counts between successive layers

learning_rate = 1e-4

def load_data():
    data_x = []
    data_y = []

    for path in glob.glob("./UrbanSound8K/audio/fold1/*.wav"):
        name = path.split("/")[-1].split(".")[0]
        x, sample_rate = sfx.read(path, frames=input_width, fill_value=0.)
        y = int(name.split("-")[1])  # UrbanSound8K file names encode the class ID in the second field

        if x.ndim > 1:
            x = x.take(0, axis=1)  # keep only the first channel of multi-channel files

        data_x.append(x)
        data_y.append(y)

    return data_x, data_y

data_x, data_y = load_data()
data_split = int(len(data_x) * .9)

train_x = data_x[:data_split]
train_y = data_y[:data_split]

test_x = data_x[data_split:]
test_y = data_y[data_split:]

x = tf.placeholder(tf.float32, [None, input_width], name="x")
y = tf.placeholder(tf.int64, [None], name="y")

x_reshaped = tf.reshape(x, [-1, 1, input_width, channels[0]], name="x_reshaped")

def weights_x(shape, name):
    w = tf.Variable(tf.truncated_normal(shape, stddev=0.1), name=name)
    tf.summary.histogram("weights", w)
    return w

def weights(layer, name):
    return weights_x([1, widths[layer], channels[layer], channels[layer+1]], name)

def biases(layer, name):
    b = tf.Variable(tf.constant(0.1, shape=[channels[layer+1]]), name=name)
    tf.summary.histogram("biases", b)
    return b

def convolution(p, w, b, name):
    c = tf.nn.relu(tf.nn.conv2d(p, w, strides=[1, 1, 1, 1], padding="SAME") + b, name=name)
    tf.summary.histogram("convolution", c)
    return c

def pooling(c, name):
    # max-pool over the width dimension with stride 6 (the width shrinks ~6x per layer)
    p = tf.nn.max_pool(c, ksize=[1, 1, 6, 1], strides=[1, 1, 6, 1], padding="SAME", name=name)
    tf.summary.histogram("pooling", p)
    return p

with tf.name_scope("conv1"):
    w1 = weights(0, "w1")
    b1 = biases(0, "b1")
    c1 = convolution(x_reshaped, w1, b1, "c1")
    p1 = pooling(c1, "p1")

with tf.name_scope("conv2"):
    w2 = weights(1, "w2")
    b2 = biases(1, "b2")
    c2 = convolution(p1, w2, b2, "c2")
    p2 = pooling(c2, "p2")

with tf.name_scope("dens"):
    n_edges = widths[2] * channels[2]
    wf1 = weights_x([n_edges, channels[3]], "wf1")
    bf1 = biases(2, "bf1")
    pf1 = tf.reshape(p2, [-1, n_edges], name="pf1")
    f1 = tf.nn.relu(tf.matmul(pf1, wf1) + bf1, name="f1")

with tf.name_scope("drop"):
    keep_prob = tf.placeholder(tf.float32, name="keep_prob")
    dropout = tf.nn.dropout(f1, keep_prob)

with tf.name_scope("read"):
    wf2 = weights_x([channels[3], channels[4]], "wf2")
    bf2 = biases(3, "bf2")
    y_conv = tf.matmul(dropout, wf2) + bf2

with tf.name_scope("xent"):
    xent = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=y_conv), name="xent")
    tf.summary.scalar("xent", xent)

with tf.name_scope("optimizer"):
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(xent)

with tf.name_scope("accuracy"):
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), y, name="correct_prediction")
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="accuracy")
    tf.summary.scalar("accuracy", accuracy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print("Initialized Global Variables")

    for epoch in range(n_epochs):
        n_itr = len(train_x)//batch_size

        for itr in range(n_itr):
            left, right = itr*batch_size, (itr+1)*batch_size
            batch_x, batch_y = train_x[left:right], train_y[left:right]

            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y, keep_prob: .5})
        print("epoch: ", epoch + 1)

    print("accuracy: ", sess.run(accuracy, feed_dict={x: test_x, y: test_y, keep_prob: 1.}))

When inspecting the tensor shapes before the sess.run(...) calls, everything works out as expected.
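
For example, a static inspection along these lines (a minimal sketch of such a check) only reports a wildcard batch dimension, so nothing looks wrong:

print(p2.get_shape())      # (?, 1, 4900, 64)
print(pf1.get_shape())     # (?, 448)
print(y_conv.get_shape())  # (?, 10)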

So why do the logits have the shape [7000, n_labels] instead of [batch_size, n_labels]?

1 Answer:

Answer 0 (score: 1)

Your network architecture is incorrect; the key problem is here:

with tf.name_scope("dens"):
    n_edges = widths[2] * channels[2]
    wf1 = weights_x([n_edges, channels[3]], "wf1")
    bf1 = biases(2, "bf1")
    pf1 = tf.reshape(p2, [-1, n_edges], name="pf1")
    f1 = tf.nn.relu(tf.matmul(pf1, wf1) + bf1, name="f1")

p2 has the shape [10, 1, 4900, 64], so n_edges should equal 4900 * 64 = 313600 rather than 448 (a far too small layer!). If you set n_edges = 313600 everything works, though whether that is the architecture you actually intended is up to you. It looks like you have merged two incompatible things: you used the shape of the convolution kernel to compute the size of the flattened layer. That is not how convolution works, however: the shape of a layer's output depends on the size of the input, the kernel, and the padding. Consequently it is in general much bigger, and in this example the fully connected layer should really have over 300k input neurons, not 448 as in your code. The crucial point is that the fully connected layer is applied to the output of the convolution, not to its parameters.

The 7000 is simply the result of the reshape of pf1: batch_size * (4900 * 64) / n_edges = 10 * 313600 / 448 = 7000.
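
As a sanity check, this arithmetic can be reproduced outside of TensorFlow (a minimal sketch using the numbers from the code above):

input_width = 176400                  # data points per file
pooled_width = input_width // 6 // 6  # two stride-6 max-pools: 4900
flat_size = pooled_width * 64         # values per example in p2: 313600

n_edges = 7 * 64                      # widths[2] * channels[2], as in the question: 448
batch_size = 10

# The reshape preserves the total element count, so the first dimension becomes:
print(batch_size * flat_size // n_edges)  # 7000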

A more generic fix is:

p2s = p2.get_shape()
n_edges = int(p2s[1] * p2s[2] * p2s[3])

since at this point all dimensions of p2 (except the 0th, the batch dimension) are known, and can therefore be read and used to construct the remainder of the network.
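
Put together, the "dens" scope would then read as follows (a sketch of the fix above, leaving the rest of the question's code unchanged):

with tf.name_scope("dens"):
    # Derive the flattened size from p2's actual shape instead of the kernel shape.
    p2s = p2.get_shape()
    n_edges = int(p2s[1] * p2s[2] * p2s[3])  # 1 * 4900 * 64 = 313600
    wf1 = weights_x([n_edges, channels[3]], "wf1")
    bf1 = biases(2, "bf1")
    pf1 = tf.reshape(p2, [-1, n_edges], name="pf1")
    f1 = tf.nn.relu(tf.matmul(pf1, wf1) + bf1, name="f1")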