Classifier predictions are unreliable — is my GMM classifier not trained correctly?

Asked: 2016-06-29 19:29:14

Tags: python machine-learning speech-recognition mfcc

I train two GMM classifiers, one per label, on MFCC values. I concatenate all the MFCC frames of a class and fit them to that class's classifier. Then, for each classifier, I sum up the probabilities it assigns to its own label over the test sample's frames.

def createGMMClassifiers():
    label_samples = {}
    for label, sample in training.iteritems():
        labelstack = np.empty((50,13))
        for feature in sample:
            #debugger.set_trace()
            labelstack = np.concatenate((labelstack,feature))
        label_samples[label]=labelstack
    for label in label_samples:
        #debugger.set_trace()
        classifiers[label] = mixture.GMM(n_components = n_classes)
        classifiers[label].fit(label_samples[label])
    for sample in testing['happy']:
        classify(sample)
def classify(testMFCC):
    probability = {'happy':0,'sad':0}
    for name, classifier in classifiers.iteritems():
        prediction = classifier.predict_proba(testMFCC)
        for probforlabel in prediction:
            probability[name]+=probforlabel[0]
    print 'happy ',probability['happy'],'sad ',probability['sad']

    if(probability['happy']>probability['sad']):
        print 'happy'
    else:
        print 'sad'

But my results do not seem consistent, and I find it hard to believe it is only due to the RandomSeed = None state, because the predictions are usually the same label for all of the test data, yet each run often gives the complete opposite (see Output 1 and Output 2).
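For reference, pinning the seed would look roughly like this (a minimal sketch on my part; the pre-0.18 sklearn.mixture.GMM API accepts a random_state argument, and train_gmm is only an illustrative helper, not something in my code):

    from sklearn import mixture

    # Sketch only: a fixed random_state makes the GMM initialisation deterministic,
    # so repeated runs on the same data give the same model (pre-0.18 sklearn API).
    def train_gmm(frames, n_components, seed=0):
        gmm = mixture.GMM(n_components=n_components, random_state=seed)
        gmm.fit(frames)  # frames: (n_frames, 13) MFCC matrix for one label
        return gmm

    # e.g. classifiers[label] = train_gmm(label_samples[label], n_classes)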

So my question is: am I doing something obviously wrong when training the classifiers?

Output 1:

happy  123.559202732 sad  122.409167294
happy

happy  120.000879032 sad  119.883786657
happy

happy  124.000069307 sad  123.999928962
happy

happy  118.874574047 sad  118.920941127
sad

happy  117.441353421 sad  122.71924156
sad

happy  122.210579428 sad  121.997571901
happy

happy  120.981752603 sad  120.325940128
happy

happy  126.013713257 sad  125.885047394
happy

happy  122.776016525 sad  122.12320875
happy

happy  115.064172476 sad  114.999513909
happy

Output 2:

happy  123.559202732 sad  122.409167294
happy

happy  120.000879032 sad  119.883786657
happy

happy  124.000069307 sad  123.999928962
happy

happy  118.874574047 sad  118.920941127
sad

happy  117.441353421 sad  122.71924156
sad

happy  122.210579428 sad  121.997571901
happy

happy  120.981752603 sad  120.325940128
happy

happy  126.013713257 sad  125.885047394
happy

happy  122.776016525 sad  122.12320875
happy

happy  115.064172476 sad  114.999513909
happy

Earlier I asked a related question and got a correct answer. The link is below.

Having different results every run with GMM Classifier

Edit: Added the main function, which collects the data and splits it into training and testing sets.

def main():
    happyDir = dir+'happy/'
    sadDir = dir+'sad/'
    training["sad"]=[]
    training["happy"]=[]
    testing["happy"]=[]
    #TestSet
    for wavFile in os.listdir(happyDir)[::-1][:10]:
        #print wavFile
        fullPath = happyDir+wavFile
        testing["happy"].append(sf.getFeatures(fullPath))
    #TrainSet
    for wavFile in os.listdir(happyDir)[::-1][10:]:
        #print wavFile
        fullPath = happyDir+wavFile
        training["happy"].append(sf.getFeatures(fullPath))
    for wavFile in os.listdir(sadDir)[::-1][10:]:
        fullPath = sadDir+wavFile
        training["sad"].append(sf.getFeatures(fullPath))
    #Ensure the number of files in set
    print "Test(Happy): ", len(testing['happy'])
    print "Train(Happy): ", len(training['happy'])
    createGMMClassifiers()

Edit 2: Changed the code according to the answer. I still get similarly inconsistent results.

2 Answers:

Answer 0 (score: 0):

For classification tasks, tuning the parameters you pass to the classifier matters, and many classification algorithms are sensitive to model selection, which means that simply changing some of a model's parameters can produce dramatically different results. It is also important to try different algorithms rather than using one algorithm for every classification task.

For this problem, you can try different classification algorithms to check that your data is fine, and try different parameters and values for each classifier; then you can pin down where the problem lies.

Another approach is to use grid search to explore and tune the best parameters for a specific classifier; see: http://scikit-learn.org/stable/modules/grid_search.html
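A rough sketch of that idea, assuming the frame-level MFCC features are stacked into X with one label per frame in y (both built here from the asker's training dict purely for illustration), and using an SVM as an example of a different classifier to grid-search:

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

    # Sketch only: stack all training frames and label each frame with its class.
    X = np.vstack(training['happy'] + training['sad'])
    y = np.array(['happy'] * sum(len(s) for s in training['happy']) +
                 ['sad'] * sum(len(s) for s in training['sad']))

    # Search over a small, illustrative parameter grid.
    param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1]}
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)
    print search.best_params_, search.best_score_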

Answer 1 (score: 0):

Your code does not make much sense: you recreate the classifier for every new training sample.

The correct training code scheme should look like this:

import numpy as np
from sklearn import mixture

label_samples = {}
classifiers = {}

# First collect all samples per label into one array of frames per label
for label, sample in samples:  # samples: (label, MFCC-array) pairs
    if label not in label_samples:
        label_samples[label] = sample
    else:
        label_samples[label] = np.concatenate((label_samples[label], sample))

# Then train one classifier on every label's data
for label in label_samples:
    classifiers[label] = mixture.GMM(n_components = n_classes)
    classifiers[label].fit(label_samples[label])

Your decoding code is fine.
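A common alternative, for comparison, is to compare the total log-likelihood of the test frames under each label's GMM; a minimal sketch (assuming the pre-0.18 API, where GMM.score returns one log-probability per frame):

    import numpy as np

    # Sketch of likelihood-based decoding; classifiers maps label -> fitted GMM.
    def classify_loglik(test_mfcc, classifiers):
        scores = {}
        for label, gmm in classifiers.items():
            scores[label] = np.sum(gmm.score(test_mfcc))  # total log-likelihood of all frames
        return max(scores, key=scores.get)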