Question

I am having trouble getting the output dimensions right between layers.

Background

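My dataset consists of .wav audio files of varying length, and each file has only one label output: x values, one percentage per emotion category. I want to be able to process live speech after training, so I thought of applying a sliding-window approach to the raw audio. I am using Keras and am considering TimeDistributed() so that all sliding windows of a file are classified together against 1 label. Think of it as classifying a video (the audio file) that has 1 output label array by feeding the network frame by frame (sliding window by sliding window).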
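To make the sliding-window idea concrete, here is a minimal NumPy sketch of how one raw signal could be cut into overlapping frames. The helper name `sliding_windows` and the `width` and `hop` parameters are hypothetical and chosen only for illustration; zero-padding of short signals is likewise an assumption, not something the setup above prescribes.

```python
import numpy as np

def sliding_windows(wav, width, hop):
    """Split a 1-D signal into frames of `width` samples, `hop` samples apart."""
    wav = np.asarray(wav, dtype=np.float32)
    if len(wav) < width:
        # Assumption: pad short signals with zeros so at least one frame exists.
        wav = np.pad(wav, (0, width - len(wav)))
    n_frames = 1 + (len(wav) - width) // hop
    # Trailing samples that do not fill a whole frame are dropped.
    return np.stack([wav[i * hop : i * hop + width] for i in range(n_frames)])

wav = np.array([0.32, 0.21, 0.64, 0.33])
frames = sliding_windows(wav, width=2, hop=1)
print(frames.shape)  # (3, 2): three windows of two samples each
```

With `width=1` and `hop=1` this reduces to the one-sample windows used in the dummy code.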

Dummy code

# Python 3
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

# raw audio
wavs = [np.array([0.12, 0.43, 0.75]), np.array([0.32, 0.21, 0.64, 0.33])]
# x% emotion 1, x% emotion 2
labels = [np.array([0.4, 0.6]), np.array([0.9, 0.1])]

# Generator function for Keras
def data_generator():
    for i, wav in enumerate(wavs):
        wav_length = len(wav)
        # init empty array with shape (len wav / sliding window width, sliding ww)
        # sliding window of 1
        x = np.zeros([len(wav), 1])
        for idx in range(len(wav)):
            x[idx] = wav[idx]

        # LSTM reshape
        x = x.reshape(x.shape[0], 1, x.shape[1])

        y = labels[i]

        # 1 file == 1 batch
        yield(x, y)

gen = data_generator()
# Example: (array([[[0.12]], [[0.43]], [[0.75]]]), array([0.4, 0.6]))

def model_setup():
    input_shape = (None, 1, 1)

    # TimeDistributed
    model = Sequential()
    model.add(TimeDistributed(LSTM(128), input_shape=input_shape))
    model.add(TimeDistributed(Dense(2, activation='sigmoid')))  # emotion yes/no
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

ks_model = model_setup()
model_hist = ks_model.fit_generator(gen, steps_per_epoch = 2,
                                 epochs = 1, verbose=2, class_weight=None)

Error

However, no matter whether I change x.reshape(x.shape[0], 1, x.shape[1]) to x.reshape(1, x.shape[0], 1, x.shape[1]) or add TimeDistributed(), I cannot get it to run.

The error is similar to: ValueError: Error when checking input: expected time_distributed_15_input to have 4 dimensions, but got array with shape (3, 1, 1)
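The rank mismatch in that message can be reproduced with NumPy alone. Keras' input_shape leaves out the batch axis, so input_shape = (None, 1, 1) means the model expects 4-D batches, while the generator yields a 3-D array. The sketch below only counts dimensions; the names `expected_ndim` and `x_batched` are illustrative.

```python
import numpy as np

input_shape = (None, 1, 1)             # as passed to TimeDistributed in the dummy code
expected_ndim = len(input_shape) + 1   # Keras prepends the batch axis

wav = np.array([0.12, 0.43, 0.75])
x = wav.reshape(len(wav), 1, 1)        # what the generator yields: shape (3, 1, 1)
print(x.ndim, expected_ndim)           # 3 4

# Adding a leading axis treats the whole file as one batch entry of three windows.
x_batched = x[np.newaxis, ...]
print(x_batched.shape)                 # (1, 3, 1, 1)
```

Adding the batch axis addresses only the input check. As far as I can tell from the layer shapes, TimeDistributed(Dense(2)) produces one output per window, shape (batch, windows, 2), so a single per-file label of shape (2,) would then fail the target check instead. That is my reading of the shapes, not something confirmed in the question.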

Inspired by

My code above is inspired by the video classification code given here: https://github.com/keras-team/keras/issues/5338#issuecomment-278844603

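However, they go from a TimeDistributed layer into an LSTM layer (with matching dimensions). In my case, I am already using TimeDistributed with the LSTM, so I cannot simply use a plain LSTM because the dimensions do not match.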

Question

Does anyone know how to get the above code to compile and train a model?

Different approach

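I know that I could replicate the labels to match the number of sliding-window frames, but semantically it makes no sense to assign emotion percentages to time windows of 100 ms or so.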
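For reference, the label replication mentioned above would look roughly like this in NumPy. It is only a sketch of the approach I would rather avoid; `n_windows` is a hypothetical frame count for one file.

```python
import numpy as np

label = np.array([0.4, 0.6])   # per-file emotion percentages
n_windows = 3                  # hypothetical number of sliding windows in the file

# Repeat the file-level label once per window: shape (n_windows, 2).
y_per_window = np.tile(label, (n_windows, 1))
print(y_per_window.shape)      # (3, 2)
```

Every window then carries the same target, which is exactly the semantic problem: a 100 ms window is being told it contains the emotion mix of the whole file.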

Edit: uploaded Jupyter Notebook

https://colab.research.google.com/drive/1C_aZcdUULT59i1jhSoB_dgqKp9JgpCH6

DL Keras - Raw speech classification of variable length with 1 label per file

0 answers