Question

我正在尝试将音频信号从语音分类为情绪。为此，我正在提取音频信号的MFCC功能，并将它们提供给一个简单的神经网络（FeedForwardNetwork使用PyBrain的BackpropTrainer进行训练）。不幸的是，结果非常糟糕。从5个类中，网络似乎几乎总是提出相同的类。

我有5类情绪和大约7000个标记的音频文件，我将其划分为每个班级的80％用于训练网络，20％用于测试网络。

我们的想法是使用小窗口并从中提取MFCC功能以生成大量训练示例。在评估中，评估来自一个文件的所有窗口，并且多数投票决定预测标签。

Training examples per class: 
{0: 81310, 1: 60809, 2: 58262, 3: 105907, 4: 73182}

Example of scaled MFCC features:
[ -6.03465056e-01   8.28665733e-01  -7.25728303e-01   2.88611116e-05
1.18677218e-02  -1.65316583e-01   5.67322809e-01  -4.92335095e-01   
3.29816126e-01  -2.52946780e-01  -2.26147779e-01   5.27210979e-01   
-7.36851560e-01]

Layers________________________:  13 20 5 (also tried 13 50 5 and 13 100 5)
Learning Rate_________________:  0.01 (also tried 0.1 and 0.3)
Training epochs_______________:  10  (error rate does not improve at all during training)

Truth table on test set:
[[   0.    4.    0.  239.   99.]
 [   0.   41.    0.  157.   23.]
 [   0.   18.    0.  173.   18.]
 [   0.   12.    0.  299.   59.]
 [   0.    0.    0.   85.  132.]]

Success rate overall [%]:  34.7314201619
Success rate Class 0 [%]:  0.0
Success rate Class 1 [%]:  18.5520361991
Success rate Class 2 [%]:  0.0
Success rate Class 3 [%]:  80.8108108108
Success rate Class 4 [%]:  60.8294930876

好的，现在，你可以看到结果在类上的分布是非常糟糕的。永远不会预测0级和2级。我认为，这暗示了我的网络或更多可能是我的数据的问题。

我可以在这里发布大量代码，但我认为在下图中显示我正在采取的所有步骤以获得MFCC功能更有意义。请注意，我使用整个信号而不用窗口只是为了说明。这看起来不错吗？ MFCC的价值非常巨大，它们不应该小得多吗？（我将它们缩放到网络之前，用minmaxscaler将所有数据都加到[-2,2]，同时尝试[0,1]）

这是我用于Melfilter银行的代码，我在离散余弦变换之前直接应用以提取MFCC特征（我从这里得到它：stackoverflow）：

def freqToMel(freq):
  '''
  Calculate the Mel frequency for a given frequency 
  '''
  return 1127.01048 * math.log(1 + freq / 700.0)

def melToFreq(mel):
  '''
  Calculate the frequency for a given Mel frequency 
  '''
  return 700 * (math.exp(freq / 1127.01048 - 1))

def melFilterBank(blockSize):
  numBands = int(mfccFeatures)
  maxMel = int(freqToMel(maxHz))
  minMel = int(freqToMel(minHz))

  # Create a matrix for triangular filters, one row per filter
  filterMatrix = numpy.zeros((numBands, blockSize))

  melRange = numpy.array(xrange(numBands + 2))

  melCenterFilters = melRange * (maxMel - minMel) / (numBands + 1) + minMel

  # each array index represent the center of each triangular filter
  aux = numpy.log(1 + 1000.0 / 700.0) / 1000.0
  aux = (numpy.exp(melCenterFilters * aux) - 1) / 22050
  aux = 0.5 + 700 * blockSize * aux
  aux = numpy.floor(aux)  # Arredonda pra baixo
  centerIndex = numpy.array(aux, int)  # Get int values

  for i in xrange(numBands):
    start, centre, end = centerIndex[i:i + 3]
    k1 = numpy.float32(centre - start)
    k2 = numpy.float32(end - centre)
    up = (numpy.array(xrange(start, centre)) - start) / k1
    down = (end - numpy.array(xrange(centre, end))) / k2

    filterMatrix[i][start:centre] = up
    filterMatrix[i][centre:end] = down

  return filterMatrix.transpose()

我可以做些什么来获得更好的预测结果？

Answer 1

在这里，我从言语中提出了性别鉴定的一个例子。我在这个例子中使用了Hyke-dataset ¹。它只是一个快速成长的例子。如果一个人想做严肃的性别认同，可能会做得更好。但总的来说，错误率会降低：

Build up data...
Train network...
Number of training patterns:  94956
Number of test patterns:      31651
Input and output dimensions:  13 2
Train network...
epoch:    0   train error: 62.24%   test error: 61.84%
epoch:    1   train error: 34.11%   test error: 34.25%
epoch:    2   train error: 31.11%   test error: 31.20%
epoch:    3   train error: 30.34%   test error: 30.22%
epoch:    4   train error: 30.76%   test error: 30.75%
epoch:    5   train error: 30.65%   test error: 30.72%
epoch:    6   train error: 30.81%   test error: 30.79%
epoch:    7   train error: 29.38%   test error: 29.45%
epoch:    8   train error: 31.92%   test error: 31.92%
epoch:    9   train error: 29.14%   test error: 29.23%

我使用了来自scikits.talkbox的MFCC实现。也许下面的代码可以帮助您。（性别鉴定肯定比情绪检测更容易......也许你需要更多不同的功能。）

import glob

from scipy.io.wavfile import read as wavread
from scikits.talkbox.features import mfcc

from pybrain.datasets            import ClassificationDataSet
from pybrain.utilities           import percentError
from pybrain.tools.shortcuts     import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules   import SoftmaxLayer

def report_error(trainer, trndata, tstdata):
    trnresult = percentError(trainer.testOnClassData(), trndata['class'])
    tstresult = percentError(trainer.testOnClassData(dataset=tstdata), tstdata['class'])
    print "epoch: %4d" % trainer.totalepochs, "  train error: %5.2f%%" % trnresult, "  test error: %5.2f%%" % tstresult  

def main(auido_path, coeffs=13):
    dataset = ClassificationDataSet(coeffs, 1, nb_classes=2, class_labels=['male', 'female'])
    male_files = glob.glob("%s/male_audio/*/*_1.wav" % auido_path)
    female_files = glob.glob("%s/female_audio/*/*_1.wav" % auido_path)

    print "Build up data..."
    for sex, files in enumerate([male_files, female_files]):
        for f in files:
            sr, signal = wavread(f)
            ceps, mspec, spec = mfcc(signal, nwin=2048, nfft=2048, fs=sr, nceps=coeffs)
            for i in range(ceps.shape[0]):
                dataset.appendLinked(ceps[i], [sex])

    tstdata, trndata = dataset.splitWithProportion(0.25)
    trndata._convertToOneOfMany()
    tstdata._convertToOneOfMany()

    print "Number of training patterns: ", len(trndata)
    print "Number of test patterns:     ", len(tstdata)
    print "Input and output dimensions: ", trndata.indim, trndata.outdim

    print "Train network..."
    fnn = buildNetwork(coeffs, int(coeffs*1.5), 2, outclass=SoftmaxLayer, fast=True)
    trainer = BackpropTrainer(fnn, dataset=trndata, learningrate=0.005)

    report_error(trainer, trndata, tstdata)
    for i in range(100):
        trainer.trainEpochs(1)
        report_error(trainer, trndata, tstdata)

if __name__ == '__main__':
    main("/path/to/hyke/audio_data")

<小时/> ¹ Azarias Reda，Saurabh Panjwani和Edward Cutrell： Hyke：面向发展中地区的低成本远程考勤跟踪系统，第五届ACM发展中地区网络系统研讨会（NSDR））。

Python音频信号分类MFCC具有神经网络

1 个答案: