I'm having trouble getting the dimensions right between layers.
My dataset consists of .wav audio files of unequal length, each with x label outputs: a percentage per emotion class. After training I want to be able to process live speech, so I came up with a sliding-window approach over the raw audio. I'm using Keras and thought of using TimeDistributed() so I could classify all the sliding windows of a file against its single label. Think of it as classifying a video (the audio file) that has one output label array by feeding the network frame by frame (sliding window by sliding window).
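For context, the sliding-window framing described above can be sketched in plain NumPy; the window width of 3 and hop of 1 here are purely illustrative values, not ones from my setup:

```python
import numpy as np

def sliding_windows(signal, width, hop):
    """Split a 1-D signal into overlapping frames of `width` samples, `hop` apart."""
    n_frames = (len(signal) - width) // hop + 1
    return np.stack([signal[i * hop : i * hop + width] for i in range(n_frames)])

wav = np.array([0.12, 0.43, 0.75, 0.32, 0.21])
frames = sliding_windows(wav, width=3, hop=1)
print(frames.shape)  # (3, 3): 3 overlapping frames of 3 samples each
```

Each frame would then be one "timestep" fed to the network, with the whole file sharing one label.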
# Python 3
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

# raw audio
wavs = [np.array([0.12, 0.43, 0.75]), np.array([0.32, 0.21, 0.64, 0.33])]
# x% emotion 1, x% emotion 2
labels = [np.array([0.4, 0.6]), np.array([0.9, 0.1])]

# Generator function for Keras
def data_generator():
    for i, wav in enumerate(wavs):
        # init empty array with shape (len(wav) / sliding window width, window width)
        # sliding window width of 1
        x = np.zeros([len(wav), 1])
        for idx in range(len(wav)):
            x[idx] = wav[idx]
        # reshape for the LSTM
        x = x.reshape(x.shape[0], 1, x.shape[1])
        y = labels[i]  # was labels[0], which always yielded the first file's label
        # 1 file == 1 batch
        yield (x, y)

gen = data_generator()
# Example yield: (array([[[0.12]], [[0.43]], [[0.75]]]), array([0.4, 0.6]))

def model_setup():
    input_shape = (None, 1, 1)
    # TimeDistributed
    model = Sequential()
    model.add(TimeDistributed(LSTM(128), input_shape=input_shape))
    model.add(TimeDistributed(Dense(2, activation='sigmoid')))  # emotion yes/no
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

ks_model = model_setup()
model_hist = ks_model.fit_generator(gen, steps_per_epoch=2,
                                    epochs=1, verbose=2, class_weight=None)
However, I cannot get it to run, whether I change x.reshape(x.shape[0], 1, x.shape[1]) to x.reshape(1, x.shape[0], 1, x.shape[1]) or add another TimeDistributed(). The errors look like: ValueError: Error when checking input: expected time_distributed_15_input to have 4 dimensions, but got array with shape (3, 1, 1)
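As I understand the error, TimeDistributed(LSTM(128), input_shape=(None, 1, 1)) makes Keras expect 4-D batches of shape (batch, outer timesteps, inner timesteps, features), while my generator yields 3-D arrays. A purely illustrative shape check (plain NumPy, no claim that this makes the model meaningful) would be:

```python
import numpy as np

# What the generator yields for a 3-sample wav: (timesteps, 1, features)
x = np.zeros([3, 1])
x3 = x.reshape(x.shape[0], 1, x.shape[1])
print(x3.shape)  # (3, 1, 1) -- 3 dimensions, hence the ValueError

# The 4-D shape the TimeDistributed(LSTM) input would require,
# obtained by prepending a batch axis of 1 (one file == one batch):
x4 = x.reshape(1, x.shape[0], 1, x.shape[1])
print(x4.shape)  # (1, 3, 1, 1)
```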
My code above was inspired by the video-classification code given here: https://github.com/keras-team/keras/issues/5338#issuecomment-278844603 However, there they go from a TimeDistributed layer into an LSTM layer (matching dimensions). In my case I already use the LSTM inside TimeDistributed, so I cannot simply use a plain LSTM, because the dimensions don't match.
Does anyone know how to get the code above to compile and train the model?
I know I could duplicate the labels to match the number of sliding-window frames, but semantically it makes no sense to assign emotion percentages to a time window of around 100 milliseconds.
https://colab.research.google.com/drive/1C_aZcdUULT59i1jhSoB_dgqKp9JgpCH6