Problems implementing a multi-class CNN

Posted: 2018-05-11 13:04:41

Tags: tensorflow keras deep-learning conv-neural-network spectrogram

Disclaimer: I am new to Keras and Python.

Hello everyone. I am trying to implement in Keras the neural network described in this paper: https://arxiv.org/pdf/1605.09507.pdf

First, I have a question about the network architecture (Section III, Subsection B of the paper). Even though I follow the specifications in Subsection B, the output shapes of my network do not match the ones reported in Table I of the paper.

Here is my network code:

from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Dropout, GlobalMaxPooling2D, Dense, PReLU, LeakyReLU

filtersNumber = 32
filtersReceptiveField = (3, 3)
filtersStride = (1, 1)
maxpoolSize = (3, 3)
maxpoolStride = (1, 1)
layersNumber = 4
myActivation = 'relu'
inputShape = (128,43,1)
classesNumber = 11

def myActivationFunction(model, activation):
    if activation == 'tanh' or activation == 'relu':
        model.add(Activation(activation))
    elif activation == 'prelu':
        model.add(PReLU())
    elif activation == 'lrelu_0.01':
        model.add(LeakyReLU(alpha=0.01))
    elif activation == 'lrelu_0.33':
        model.add(LeakyReLU(alpha=0.33))
    return model

model = Sequential()
for index in range(layersNumber):
    if index == 0:  # for the first layer, specify input shape
        model.add(Conv2D(filtersNumber,filtersReceptiveField,strides=filtersStride,padding='same',input_shape=inputShape))
    else:
        model.add(Conv2D(filtersNumber,filtersReceptiveField,strides=filtersStride, padding='same'))
    model = myActivationFunction(model, myActivation)
    model.add(Conv2D(filtersNumber,filtersReceptiveField,strides=filtersStride, padding='same'))
    model = myActivationFunction(model,myActivation)

    if index != (layersNumber-1):
        model.add(MaxPooling2D(pool_size=maxpoolSize,strides=maxpoolStride))
        model.add(Dropout(0.25))
        filtersNumber = filtersNumber*2
    else: 
        model.add(GlobalMaxPooling2D())
        model.add(Dense(1024))
        model = myActivationFunction(model, myActivation)
        model.add(Dropout(0.50))
        model.add(Dense(classesNumber))
        model.add(Activation('sigmoid'))
model.summary()

Here is the output of model.summary():

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 128, 43, 32)       320       
_________________________________________________________________
activation_1 (Activation)    (None, 128, 43, 32)       0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 128, 43, 32)       9248      
_________________________________________________________________
activation_2 (Activation)    (None, 128, 43, 32)       0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 126, 41, 32)       0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 126, 41, 32)       0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 126, 41, 64)       18496     
_________________________________________________________________
activation_3 (Activation)    (None, 126, 41, 64)       0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 126, 41, 64)       36928     
_________________________________________________________________
activation_4 (Activation)    (None, 126, 41, 64)       0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 124, 39, 64)       0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 124, 39, 64)       0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 124, 39, 128)      73856     
_________________________________________________________________
activation_5 (Activation)    (None, 124, 39, 128)      0         
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 124, 39, 128)      147584    
_________________________________________________________________
activation_6 (Activation)    (None, 124, 39, 128)      0         
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 122, 37, 128)      0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 122, 37, 128)      0         
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 122, 37, 256)      295168    
_________________________________________________________________
activation_7 (Activation)    (None, 122, 37, 256)      0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 122, 37, 256)      590080    
_________________________________________________________________
activation_8 (Activation)    (None, 122, 37, 256)      0         
_________________________________________________________________
global_max_pooling2d_1 (Glob (None, 256)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1024)              263168    
_________________________________________________________________
activation_9 (Activation)    (None, 1024)              0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 1024)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 11)                11275     
_________________________________________________________________
activation_10 (Activation)   (None, 11)                0         
=================================================================
Total params: 1,446,123
Trainable params: 1,446,123
Non-trainable params: 0
_________________________________________________________________

After some experimentation I managed to reproduce exactly the dimensions of Table I with the code below. As you can see, besides padding='same' I had to insert an extra zero-padding layer before each convolutional layer, and I had to drop the max-pool stride argument (which means the max-pool stride defaults to the pool size, according to the Keras documentation).

from keras.models import Sequential
from keras.layers import ZeroPadding2D, Conv2D, Activation, MaxPooling2D, Dropout, GlobalMaxPooling2D, Dense, PReLU, LeakyReLU

filtersNumber = 32
filtersReceptiveField = (3, 3)
filtersStride = (1, 1)
maxpoolSize = (3, 3)
maxpoolStride = (1, 1)  # no longer used: MaxPooling2D strides default to pool_size
zeroPadding = (1, 1)
layersNumber = 4
myActivation = 'relu'
inputShape = (128,43,1)
classesNumber = 11

def myActivationFunction(model, activation):
    if activation == 'tanh' or activation == 'relu':
        model.add(Activation(activation))
    elif activation == 'prelu':
        model.add(PReLU())
    elif activation == 'lrelu_0.01':
        model.add(LeakyReLU(alpha=0.01))
    elif activation == 'lrelu_0.33':
        model.add(LeakyReLU(alpha=0.33))
    return model

model = Sequential()
for index in range(layersNumber):
    if index == 0: # for the first layer, specify input shape
        model.add(ZeroPadding2D(zeroPadding,input_shape=inputShape))
    else:
        model.add(ZeroPadding2D(zeroPadding))
    model.add(Conv2D(filtersNumber, filtersReceptiveField, strides=filtersStride, padding='same'))
    model = myActivationFunction(model, myActivation)
    model.add(ZeroPadding2D(zeroPadding))
    model.add(Conv2D(filtersNumber,filtersReceptiveField,strides=filtersStride, padding='same'))
    model = myActivationFunction(model, myActivation)
    if index != (layersNumber-1):
        model.add(MaxPooling2D(pool_size=maxpoolSize))
        model.add(Dropout(0.25))
        filtersNumber = filtersNumber*2
    else: # for the last layer
        model.add(GlobalMaxPooling2D())
        model.add(Dense(1024))
        model = myActivationFunction(model, myActivation)
        model.add(Dropout(0.50))
        model.add(Dense(classesNumber))
        model.add(Activation('sigmoid'))
model.summary()

First question: considering that the authors say "the input of each convolution layer is zero-padded with 1×1 padding to preserve the spatial resolution", shouldn't padding='same' alone be enough for the zero padding? And is max-pool stride = 1 a mistake by the authors, or am I missing something?
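
To double-check the arithmetic, here is a minimal sketch (my own, not from the paper) of the standard pooling output-size formula. It reproduces the 126×41 of the first summary above when the stride is 1, and shows how much faster the feature maps shrink with the default stride:

def pool_out(size, pool=3, stride=None):
    # Keras MaxPooling2D uses strides = pool_size when strides is None
    stride = stride if stride is not None else pool
    return (size - pool) // stride + 1

# stride-1 convolutions with padding='same' leave the spatial size unchanged,
# so only the pooling layers change the height/width
print(pool_out(128, stride=1), pool_out(43, stride=1))  # 126 41 (stride 1, as in the first summary)
print(pool_out(128), pool_out(43))                      # 42 14  (default: stride = pool size)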

By the way, with these new specifications I tried to train the network, but unfortunately the loss and val_loss did not improve, and training stopped because val_loss did not decrease for more than three epochs, as prescribed in the paper.

Train on 5699 samples, validate on 1006 samples
Epoch 1/1000
5699/5699 [==============================] - 559s 98ms/step - loss: 2.4453 - acc: 0.0635 - val_loss: 2.3979 - val_acc: 0.0447
Epoch 2/1000
5699/5699 [==============================] - 583s 102ms/step - loss: 2.9140 - acc: 0.0602 - val_loss: 3.4699 - val_acc: 0.0447
Epoch 3/1000
5699/5699 [==============================] - 571s 100ms/step - loss: 3.4037 - acc: 0.0604 - val_loss: 3.4699 - val_acc: 0.0447
Epoch 4/1000
5699/5699 [==============================] - 592s 104ms/step - loss: 4.2809 - acc: 0.0598 - val_loss: 4.5773 - val_acc: 0.0447  

Here is my training code (specifications taken from Section III, Subsection C):

import numpy as np
import os
import keras
from keras.models import Sequential
from keras.layers import ZeroPadding2D,Conv2D,Activation,MaxPooling2D,Dropout,GlobalMaxPooling2D,Dense,PReLU,LeakyReLU
from keras.callbacks import EarlyStopping

def myActivationFunction(model, convAct): 
    if convAct == 'tanh' or convAct == 'relu':
        model.add(Activation(convAct))
    elif convAct == 'prelu':
        model.add(PReLU())
    elif convAct == 'lrelu_0.01':
        model.add(LeakyReLU(alpha=0.01))
    elif convAct == 'lrelu_0.33':
        model.add(LeakyReLU(alpha=0.33))
    return model


def buildCNN(inputShape, classesNumber, myActivation):
    # Paper: section III, subsection B: Network Architecture
    filtersNumber = 32
    filtersReceptiveField = (3, 3)
    filtersStride = (1, 1)
    zeroPadding = (1, 1)
    maxpoolSize = (3, 3)
    maxpoolStride = (1, 1)  # no longer used: MaxPooling2D strides default to pool_size
    layersNumber = 4

    model = Sequential()
    for index in range(layersNumber):
        if index == 0:  
            model.add(ZeroPadding2D(zeroPadding, input_shape=inputShape))
        else:
            model.add(ZeroPadding2D(zeroPadding))
        model.add(Conv2D(filtersNumber, filtersReceptiveField, strides=filtersStride, padding='same'))
        model = myActivationFunction(model, myActivation)
        model.add(ZeroPadding2D(zeroPadding))
        model.add(Conv2D(filtersNumber, filtersReceptiveField, strides=filtersStride, padding='same'))
        model = myActivationFunction(model, myActivation)
        if index != (layersNumber - 1):
            model.add(MaxPooling2D(
                pool_size=maxpoolSize)) 
            model.add(Dropout(0.25))
            filtersNumber = filtersNumber * 2
        else: 
            model.add(GlobalMaxPooling2D())
            model.add(Dense(1024))
            model = myActivationFunction(model, myActivation)
            model.add(Dropout(0.50))
            model.add(Dense(classesNumber))
            model.add(Activation('sigmoid'))
    return model



if __name__ == '__main__':

    import argparse
    parser = argparse.ArgumentParser(description="Trains the network using training dataset")
    parser.add_argument("-w", "--window", type=float, default=3.0, choices=[0.5, 1.0, 1.5, 3.0],
                        help="Analysis window size. Choose from 0.5, 1.0, 1.5, 3.0. Default: 1.0")
    parser.add_argument("-t", "--threshold", type=float, default=0.55, choices=[0.20, 0.25, 0.30, 0.35, 0.40, 0.45,
                                                                 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80],
                        metavar="[0.20:0.05:0.80]",
                        help="Identification threshold. Choose from 0.20 to 0.80 (step size 0.05). Default: 0.55")
    parser.add_argument("-a", metavar=" ", default="relu",
                        choices=["tanh", "relu", "prelu", "lrelu_0.01", "lrelu_0.33"],
                        help="activation function. Choose from tanh, relu, prelu, lrelu_0.01, lrelu_0.33. Default: relu")
    parser.add_argument("-p", "--path", default="Preproc", help="path of preprocessed files (default: Preproc)")
    args = parser.parse_args()

    X_train = np.load(args.path+"/X_train_"+str(args.window)+"s.npy")
    Y_train = np.load(args.path+"/Y_train_"+str(args.window)+"s.npy")


    batchSize = 128
    epochsNum = 1000 

    model = buildCNN((X_train.shape[1],X_train.shape[2],X_train.shape[3]),Y_train.shape[1], args.a)
    # model.summary()

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    earlyStopping = EarlyStopping(monitor='val_loss', patience=3) 

    model.fit(X_train, Y_train, batch_size=batchSize, epochs=epochsNum, validation_split=0.15, callbacks=[earlyStopping])

At this point I thought something might be wrong with my training-data preprocessing, but after reviewing it many times I could not find any error. Here is my preprocessing code (the specifications are in Section III, Subsection A of the paper):

import librosa
import librosa.display
import numpy as np
import os
import shutil
import keras
# import matplotlib.pyplot as plt


# Paper: section III, subsection A: Audio Preprocessing
def preprocess_dataset(input_path, output_path):
    for root, directories, filenames in os.walk(input_path):
        for directory in directories:  
            if not os.path.exists(output_path + os.path.join(os.path.relpath(root, input_path), directory)):  
                os.makedirs(output_path + os.path.join(os.path.relpath(root, input_path), directory)) 
            else:
                return

        for filename in filenames:  
            if filename.endswith(".wav"):  
                audio_signal, sample_rate = librosa.load(
                    os.path.join(root, filename))  # audio is mixed to mono and resampled to 22050 Hz
                normalized_audio_signal = librosa.util.normalize(audio_signal)  # audio normalization by its max value
                # Compute mel-spectrogram with the following specs:
                # - STFT window length: 1024 samples
                # - hop size: 512 samples
                # - mel frequency bins: 128
                mel_spect = librosa.feature.melspectrogram(normalized_audio_signal, sample_rate, n_fft=1024, hop_length=512,
                                                          n_mels=128)

                log_mel_spect = np.log(np.maximum(1e-10, mel_spect)) # add a threshold to avoid -inf results
                log_mel_spect = log_mel_spect[:,:,np.newaxis]  # add new axis for keras channel last mode


                filename, fileExtension = os.path.splitext(filename)  # split file name from extension
                np.save(output_path + os.path.join(os.path.relpath(root, input_path), filename), log_mel_spect)  # save as .npy file
                # librosa.display.specshow(log_mel_spect, y_axis='mel', x_axis='time')
                # plt.show()
            elif filename.endswith(".txt"):  # copy files containing testing labels
                shutil.copy(os.path.join(root, filename), output_path + os.path.join(os.path.relpath(root, input_path), filename))


def training_vectors_init(training_path, chunks_numb):
    classes_names = sorted(os.listdir(training_path))  
    total_classes = len(classes_names)  
    audio_path = training_path + classes_names[0] + '/'  
    infilename = os.listdir(audio_path)[0]  
    melgram = np.load(audio_path + infilename)  
    melgram_dimensions = melgram.shape  
    total_training_files = 0  # count all training files
    for dirpath, dirnames, filenames in os.walk(training_path):
        total_training_files = total_training_files + len(filenames)

    melgram_chunk_length = int(melgram_dimensions[1] / chunks_numb) 
    x_train = np.zeros(((total_training_files * chunks_numb), melgram_dimensions[0], melgram_chunk_length, melgram_dimensions[2])) 
    y_train = np.zeros(((total_training_files * chunks_numb), total_classes)) 
    return classes_names,total_classes,x_train,y_train,melgram_chunk_length


def shuffle_xy(x, y): 
    assert x.shape[0] == y.shape[0], "Dimensions problem"
    idx = np.array(range(y.shape[0]))  
    np.random.shuffle(idx)  
    new_x = np.copy(x)  
    new_y = np.copy(y)
    for i in range(len(idx)):  
        new_x[i] = x[idx[i], :, :, :]
        new_y[i] = y[idx[i], :]
    return new_x, new_y


def build_training_dataset(preproc_path, training_win_len):
    training_path = preproc_path + "Training/"
    training_audio_length = 3  # training audio length (seconds)
    chunks_numb = int(training_audio_length / training_win_len)
    classes_names,total_classes,x_train,y_train,melgram_chunk_length = training_vectors_init(training_path, chunks_numb)
    count = 0
    for class_index, class_name in enumerate(classes_names):  
        one_hot_label = keras.utils.to_categorical(class_index,
                                           num_classes=total_classes) 
        file_names = os.listdir(training_path + class_name) 

        for file_name in file_names:  
            audio_path = training_path + class_name + '/' + file_name  
            mel = np.load(audio_path)  

            for i in range(chunks_numb):  
                x_train[count,:,:,:] = mel[:,(melgram_chunk_length*i):(melgram_chunk_length*(i+1)),:]  
                y_train[count,:] = one_hot_label 
                count = count + 1

    x_train, y_train = shuffle_xy(x_train, y_train)  
    np.save(preproc_path + "X_train_" + str(training_win_len) + 's', x_train)  
    np.save(preproc_path + "Y_train_" + str(training_win_len) + 's', y_train)
    return melgram_chunk_length,classes_names


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(
        description="preprocess_data: convert samples to .npy data format for faster loading")
    parser.add_argument("-i", "--inpath", help="input directory for audio samples (default: IRMAS-Sample)",
                        default="IRMAS-Sample")
    parser.add_argument("-o", "--outpath", help="output directory for preprocessed files (default: Preproc)",
                        default="Preproc")
    args = parser.parse_args()

    preprocess_dataset(args.inpath + '/', args.outpath + '/')

    winLengths = [0.5, 1.0, 1.5, 3.0]
    for winLen in winLengths:
        melChunkLen, classesNames = build_training_dataset(args.outpath + '/', winLen)
    ...

Second question: what could be wrong with the network? I also tried training it on very few samples, using those same samples as validation data, but val_loss gets stuck, as you can see here:

Epoch 1/1000
3/3 [==============================] - 1s 256ms/step - loss: 2.3653 - acc: 0.3333 - val_loss: 2.1726 - val_acc: 0.3333
Epoch 2/1000
3/3 [==============================] - 0s 108ms/step - loss: 2.0382 - acc: 0.3333 - val_loss: 1.5727 - val_acc: 0.3333
Epoch 3/1000
3/3 [==============================] - 0s 104ms/step - loss: 1.3635 - acc: 0.3333 - val_loss: 1.1036 - val_acc: 0.6667
Epoch 4/1000
3/3 [==============================] - 0s 109ms/step - loss: 1.1281 - acc: 0.3333 - val_loss: 1.0986 - val_acc: 0.3333
Epoch 5/1000
3/3 [==============================] - 0s 102ms/step - loss: 1.0986 - acc: 0.6667 - val_loss: 1.0986 - val_acc: 0.3333
Epoch 6/1000
3/3 [==============================] - 0s 104ms/step - loss: 1.0986 - acc: 0.3333 - val_loss: 1.0986 - val_acc: 0.3333
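
One detail that may be relevant (my own observation, not from the paper): the plateau value 1.0986 is exactly ln(3), which is what Keras's categorical_crossentropy returns when all outputs are equal, because it rescales the predictions to sum to 1 before taking the log. A minimal sketch, assuming the tiny run has 3 classes:

import numpy as np

y_true = np.array([1.0, 0.0, 0.0])
y_pred = np.full(3, 0.5)                 # equal sigmoid outputs
y_pred = y_pred / y_pred.sum()           # rescaled to sum to 1, as Keras does internally
print(-np.sum(y_true * np.log(y_pred)))  # 1.0986... = ln(3)

So the network appears to collapse to identical outputs for every input.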

Does anyone know what is going on with this network?

1 answer:

Answer 0 (score: 0)

I am the author of the paper. Regarding the padding: it may indeed be unnecessary to increase the size of the data when we consider only this particular case. In the paper, we used the extra padding in front of every conv. layer to keep the network architecture as unchanged as possible across the various input sizes (0.5 s, 2.0 s, 3.0 s).

Another thing: the max-pooling stride should just be the default, which is the same as the pool size. I think saying stride = 1 in the paper was my mistake :(
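
In Keras terms, a minimal illustration of that default (my own sketch, using a toy 9×9 input):

from keras.models import Sequential
from keras.layers import MaxPooling2D

m = Sequential([MaxPooling2D(pool_size=(3, 3), input_shape=(9, 9, 1))])
print(m.output_shape)  # (None, 3, 3, 1): strides defaults to pool_size

m = Sequential([MaxPooling2D(pool_size=(3, 3), strides=(1, 1), input_shape=(9, 9, 1))])
print(m.output_shape)  # (None, 7, 7, 1): stride 1 barely reduces the size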