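I am using this library, https://code.google.com/p/libmfcc/, to generate MFCC coefficients from a magnitude-squared power spectrum.
However, as far as I know, the first coefficient should represent the overall energy. That is not the case in my results, which makes me doubt the whole feature set.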
F0:-3.77, F1:-2.78, F2:2.13, F3:4.47, F4:2.76, F5:-0.00, F6:-0.58, F7:0.76, F8:1.49, F9:0.62, F10:-0.44, F11:-0.26, F12:0.58
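To make the expectation about the first coefficient concrete, here is a minimal sketch. It assumes a plain, unnormalized DCT-II over log filterbank energies, which is the textbook MFCC definition and not necessarily what libmfcc computes; the function name `dct2_coefficient` and the energy values are made up for illustration. Under that assumption, coefficient 0 is just the sum of the log energies, so it tracks loudness but can be negative whenever the energies are below 1.

```python
import math

def dct2_coefficient(log_energies, m):
    """Unnormalized DCT-II coefficient m of a list of log filterbank energies."""
    n = len(log_energies)
    return sum(v * math.cos(math.pi * m * (l + 0.5) / n)
               for l, v in enumerate(log_energies))

# Hypothetical filterbank energies for one frame (made-up numbers).
energies = [0.02, 0.05, 0.10, 0.04]
gain = 100.0  # the same frame with every energy scaled up, i.e. recorded louder
louder = [gain * e for e in energies]

quiet_log = [math.log(e) for e in energies]
loud_log = [math.log(e) for e in louder]

# Coefficient 0: the cosine term is 1, so this is the sum of the log energies.
c0_quiet = dct2_coefficient(quiet_log, 0)
c0_loud = dct2_coefficient(loud_log, 0)

# Coefficient 1: a constant offset in the log domain cancels out.
c1_quiet = dct2_coefficient(quiet_log, 1)
c1_loud = dct2_coefficient(loud_log, 1)

print(c0_quiet, c0_loud)
print(c1_quiet, c1_loud)
```

In this sketch a uniform gain shifts only coefficient 0 and leaves the higher coefficients unchanged, so a negative first coefficient by itself would point to small input values rather than to a wrong feature set.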
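The only reason I want this coefficient is to help rule out problems in my project. I pass in a power spectrum of 256 real values (from an original 512-point FFT), with a sampling frequency of 16000 Hz. I am fairly sure the FFT is correct, because I ran tests to check the frequencies it produces.
I am trying to use these features for speaker recognition, but at the moment I keep getting false positives. I have tried using the generated features with neural networks, vector quantization, and simple brute-force Euclidean and Spearman comparisons. Nothing I do seems to bring out any uniqueness in the coefficients between voices, so I end up with false positives.
I have been stuck on this for months, and I think my features are at fault. Any help would be greatly appreciated!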
Answer 0: (score: 1)
Your FFT values are unusual. Here is an example FFT from a speech frame:
12406.376 317135.746 995981.334 626224.382 2005596.535 4058142.702 1183111.796 1866254.816 3522858.721 340289.386 6767139.243 10894041.353 511321.852 27681515.387 32174731.584 2294241.072 3673880.557 4752891.334 1069708.546 5207759.171 5264486.273 305515.352 1036866.968 1332550.402 150743.522 3417229.415 2512512.261 546054.633 2096752.637 1243709.121 70430.472 1657224.619 1288489.174 915992.292 4282845.277 2132087.811 576691.932 4625295.075 1869747.185 14309491.048 40317789.470 10781189.643 7169652.741 30153832.551 3933090.444 13867788.202 26961212.666 6052446.164 5232152.170 8754440.126 4239680.973 814935.042 8643209.234 8493450.137 869299.756 8647922.201 1814417.128 652202.156 934195.600 72344.850 599552.325 520781.731 94066.862 24987.524 30704.365 14786.379 38961.829 25425.752 457.993 16805.918 21014.001 25724.770 64765.894 31916.339 5772.055 26097.199 14997.984 15845.304 33384.312 10655.138 12742.130 27660.958 4208.045 104839.618 126015.679 126905.152 92657.454 5423.333 6252.982 26137.014 8101.993 23840.536 96350.180 155396.746 111640.103 67379.170 191046.213 53822.423 199623.939 521401.332 240488.616 26096.585 27258.739 56939.019 6054.077 33565.473 17344.580 584.597 27900.058 72742.464 61239.311 13451.726 5192.935 4261.550 439.073 9722.589 18140.512 6855.937 26066.804 19903.202 1091.290 33014.134 42059.955 11662.442 534.955 13736.420 13481.058 48308.510 33231.743 12317.196 48160.791 115668.828 211469.841 163739.245 35339.914 47145.795 37257.335 9065.769 756.579 8372.643 8419.709 1815.682 1017.977 64.215 17711.483 25315.887 44022.134 91004.399 49687.288 1524.393 19627.319 23474.766 9001.670 729.851 11901.670 16078.190 26974.342 13843.501 5620.484 18436.224 27086.375 31720.334 42472.198 143007.306 138588.920 87433.057 108255.923 101891.401 73553.860 76565.005 31125.667 23054.414 75971.499 23780.864 68413.973 240216.065 148102.903 19623.293 8194.448 2725.753 32133.461 60279.038 21668.906 539.175 61133.950 80454.478 6585.491 21330.695 265.198 14129.337 800.514 4 
1091.336 66797.293 42455.636 20263.426 973.230 2763.689 1136.641 5300.404 3128.763 2635.018 15487.226 16915.816 5770.127 4770.271 16645.390 13957.322 27129.323 13908.576 2281.975 63947.522 50889.733 697.118 18690.955 12249.632 1006.608 12672.938 4463.555 4693.099 2048.688 1486.160 12965.033 89367.085 57248.261 23332.704 18483.057 1450.837 4288.211 8512.221 9461.348 3105.038 976.106 8155.822 26873.908 44851.560 30956.465 7607.291 4517.811 25642.189 22606.560 12422.574 44612.224 74799.536 25034.774 197.800 2410.775 237.717 3106.175 7980.360 3960.008 8073.620 31488.422 8950.003 3459.935 666.708 7.372
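Also, I am concerned that you write "FFT per utterance". Speech must be analyzed window by window, not as a whole utterance. You need to split the signal into windows first.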
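A minimal sketch of that windowing step, in pure Python. The 512-sample frame and 16000 Hz rate come from the question; the 50% overlap, the Hamming window, the helper names, and the synthetic 1 kHz test tone are my assumptions for illustration. The naive DFT is only there to keep the example self-contained; a real FFT routine would replace it.

```python
import cmath
import math

def split_into_frames(signal, frame_len=512, hop=256):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def power_spectrum(frame, n_bins):
    """Magnitude-squared spectrum of one frame via a naive DFT (slow, illustration only)."""
    n = len(frame)
    spectrum = []
    for k in range(n_bins):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum

sample_rate = 16000
# One second of a 1 kHz tone stands in for recorded speech.
signal = [math.sin(2 * math.pi * 1000 * t / sample_rate)
          for t in range(sample_rate)]

frames = split_into_frames(signal)   # 512 samples = 32 ms per frame
window = hamming(512)

# Window the first frame and keep 256 bins, as in the question.
windowed = [s * w for s, w in zip(frames[0], window)]
spec = power_spectrum(windowed, 256)
peak_bin = max(range(256), key=spec.__getitem__)
print(len(frames), peak_bin)
```

Each frame would then be passed to the MFCC routine separately, giving one feature vector per frame instead of one per utterance.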