Given sound files (WAV format) of piano notes, I am trying to build an RNN that guesses which notes are being played. I currently slice the WAV clips into 10-second chunks (2D) and zero-pad the shorter ones out to 10 seconds, so the inputs are all regular. However, when I pass a chunk to the RNN, the output it gives drops down a dimension (1D) (this happens when I take the last state — should I be taking the series of states instead?).
I built a simpler RNN that analyzes single-note files (2D) and produces one output (1D), and that worked. But when I try to apply the same technique to full clips with multiple notes and note starts/stops, it seems to fall apart, because I can't figure out how to change the output shape.
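For context, the padding step looks roughly like this (a simplified sketch; pad_chunk and its arguments are placeholder names, my real code is messier):

import numpy as np

def pad_chunk(chunk, step_count):
    """Zero-pad a (time, frequency) chunk along the time axis to step_count steps."""
    padded = np.zeros((step_count, chunk.shape[1]), dtype=chunk.dtype)
    padded[:chunk.shape[0]] = chunk[:step_count]
    return padded

Here is the model code: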
import tensorflow as tf
from tensorflow.contrib import rnn

def weight_variable(shape):
    # Weights drawn from N(0, 0.01)
    initer = tf.truncated_normal_initializer(stddev=0.01)
    return tf.get_variable('W', dtype=tf.float32, shape=shape, initializer=initer)

def bias_variable(shape):
    # Biases initialized to zero
    initial = tf.constant(0., shape=shape, dtype=tf.float32)
    return tf.get_variable('b', dtype=tf.float32, initializer=initial)

def RNN(x, weights, biases, timesteps, num_hidden):
    # Unstack (batch, timesteps, num_input) into a list of `timesteps`
    # tensors, each of shape (batch, num_input)
    x = tf.unstack(x, timesteps, 1)
    # Define a rnn cell with tensorflow
    lstm_cell = rnn.LSTMCell(num_hidden)
    states_series, current_state = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)
    # current_state is an LSTMStateTuple (c, h); index 1 is the hidden state h
    return tf.matmul(current_state[1], weights) + biases
    # return [tf.matmul(temp, weights) + biases for temp in states_series]
    # does this even make sense
# x is for data, y is for targets; shapes are [index, time, frequency] and [index, time, output note(s)] respectively
x_train, x_valid, y_train, y_valid = load_data() # removed test
print("Size of:")
print("- Training-set:\t\t{}".format(y_train.shape[0]))
print("- Validation-set:\t{}".format(y_valid.shape[0]))
# print("- Test-set\t{}".format(len(y_test)))
learning_rate = 0.001 # The optimization initial learning rate
epochs = 1000 # Total number of training epochs
batch_size = 100 # Training batch size
display_freq = 100 # Frequency of displaying the training results
threshold = 0.7 # Threshold for determining a "note"
num_hidden_units = 15 # Number of hidden units of the RNN
# Placeholders for inputs (x) and outputs (y)
x = tf.placeholder(tf.float32, shape=(None, stepCount, num_input))  # (batch, time, frequency)
y = tf.placeholder(tf.float32, shape=(None, stepCount, n_classes))  # (batch, time, notes)
# create weight matrix initialized randomly from N~(0, 0.01)
W = weight_variable(shape=[num_hidden_units, n_classes])
# create bias vector initialized as zero
b = bias_variable(shape=[n_classes])
output_logits = RNN(x, W, b, stepCount, num_hidden_units)  # (None, n_classes) -- the time axis is gone
y_pred = tf.nn.softmax(output_logits)
# Define the loss function, optimizer, and accuracy, etc.
# (code removed, irrelevant)
# Creating the op for initializing all variables
init = tf.global_variables_initializer()
sess = tf.InteractiveSession()
sess.run(init)
global_step = 0
# Number of training iterations in each epoch
num_tr_iter = int(y_train.shape[0] / batch_size)
for epoch in range(epochs):
    print('Training epoch: {}'.format(epoch + 1))
    x_train, y_train = randomize(x_train, y_train)
    for iteration in range(num_tr_iter):
        global_step += 1
        start = iteration * batch_size
        end = (iteration + 1) * batch_size
        x_batch, y_batch = get_next_batch(x_train, y_train, start, end)
        # Run optimization op (backprop)
        feed_dict_batch = {x: x_batch, y: y_batch}
        sess.run(optimizer, feed_dict=feed_dict_batch)
        if iteration % display_freq == 0:
            # Calculate and display the batch loss and accuracy
            loss_batch, acc_batch = sess.run([loss, accuracy],
                                             feed_dict=feed_dict_batch)
            print("iter {0:3d}:\t Loss={1:.2f},\tTraining Accuracy={2:.01%}".
                  format(iteration, loss_batch, acc_batch))
            testLoss.append(loss_batch)
            testAcc.append(acc_batch)
    # Run validation after every epoch
    feed_dict_valid = {x: x_valid[:1000].reshape((-1, stepCount, num_input)),
                       y: y_valid[:1000]}
    loss_valid, acc_valid = sess.run([loss, accuracy], feed_dict=feed_dict_valid)
    print('---------------------------------------------------------')
    print("Epoch: {0}, validation loss: {1:.2f}, validation accuracy: {2:.01%}".
          format(epoch + 1, loss_valid, acc_valid))
    print('---------------------------------------------------------')
    validLoss.append(loss_valid)
    validAcc.append(acc_valid)  # was acc_batch, which recorded the last training batch by mistake
At the moment this outputs a 1D array of predictions, which really doesn't make sense for my case, but I'm not sure how to change it (it should output predictions for every timestep, i.e. a prediction of which notes are playing at each point in time).
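The direction I'm considering (based on my commented-out line above) is to apply the output layer to every element of states_series and stack the results, and to switch from softmax to a per-class sigmoid since several notes can sound at once. A sketch of what I mean (RNN_all_steps and notes_on are just names I made up; I haven't verified this is the right approach):

def RNN_all_steps(x, weights, biases, timesteps, num_hidden):
    x = tf.unstack(x, timesteps, 1)
    lstm_cell = rnn.LSTMCell(num_hidden)
    states_series, current_state = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)
    # One logit vector per timestep, stacked back to (batch, timesteps, n_classes)
    logits = [tf.matmul(h, weights) + biases for h in states_series]
    return tf.stack(logits, axis=1)

output_logits = RNN_all_steps(x, W, b, stepCount, num_hidden_units)  # (None, stepCount, n_classes)
# Multiple notes can be active at once, so sigmoid per class (multi-label)
# rather than softmax across classes:
y_pred = tf.nn.sigmoid(output_logits)
notes_on = tf.cast(y_pred > threshold, tf.float32)  # "note is playing" mask per timestep

Is something like this how I should be reshaping the output, or is there a better way to get a prediction at every timestep?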