I have a final project in my first year: I want to build a neural network that takes the first 13 MFCC coefficients of a wav file and returns which person, out of a group of speakers, is talking in that audio file.
I would like you to notice that:
I defined:
X = mfcc(sound_voice)
Y = a zero array with 1 in the i-th position (where i is 0 for the first speaker, 1 for the second, 2 for the third ...)
Then I trained the machine, and then checked its output for some files...
That is what I did... but unfortunately the results look completely random...
Can you help me understand why?
Here is my code in Python -
from sklearn.neural_network import MLPClassifier
import python_speech_features
import scipy.io.wavfile as wav
import numpy as np
from os import listdir
from os.path import isfile, join
from random import shuffle
import matplotlib.pyplot as plt
from tqdm import tqdm

winner = []  # counts how many correct predictions ("bingos") we get when testing the NN
for TestNum in tqdm(range(5)):  # each round: build the NN on X, Y and hold out 50 samples to test it
    X = []
    Y = []
    onlyfiles = [f for f in listdir("FinalAudios/") if isfile(join("FinalAudios/", f))]  # files in dir
    names = []  # names of the speakers
    for file in onlyfiles:  # for each wav sound
        # UNNECESSARY FOR UNDERSTANDING THE CODE
        if " " not in file.split("_")[0]:
            names.append(file.split("_")[0])
        else:
            names.append(file.split("_")[0].split(" ")[0])
    names = list(dict.fromkeys(names))  # unique names of speakers
    vector_names = []  # one-hot vector for each name
    i = 0
    vector_for_each_name = [0] * len(names)
    for name in names:
        vector_for_each_name[i] += 1
        vector_names.append(np.array(vector_for_each_name))
        vector_for_each_name[i] -= 1
        i += 1
    for f in onlyfiles:
        if " " not in f.split("_")[0]:
            f_speaker = f.split("_")[0]
        else:
            f_speaker = f.split("_")[0].split(" ")[0]
        (rate, sig) = wav.read("FinalAudios/" + f)  # read the file
        try:
            mfcc_feat = python_speech_features.mfcc(sig, rate, winlen=0.2, nfft=512)  # mfcc coeffs
            for index in range(len(mfcc_feat)):  # add each mfcc vector to X; if there are 50000 vectors,
                # X will be [first vector, second, ..., 50000th] and Y will be [f_speaker_vector] * 50000
                X.append(np.array(mfcc_feat[index]))
                Y.append(np.array(vector_names[names.index(f_speaker)]))
        except IndexError:
            pass
    Z = list(zip(X, Y))
    shuffle(Z)  # shuffle X, Y together so the test split is random
    X, Y = zip(*Z)
    X = list(X)
    Y = list(Y)
    X = np.asarray(X)
    Y = np.asarray(Y)
    Y_test = Y[:50]  # choose 50 samples for test, the rest for train
    X_test = X[:50]
    X = X[50:]
    Y = Y[50:]
    clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=2)  # create the NN
    clf.fit(X, Y)  # train it
    for sample in range(len(X_test)):  # append 1 to winner if we were correct and 0 if not; at the end we plot it
        if list(clf.predict([X[sample]])[0]) == list(Y_test[sample]):
            winner.append(1)
        else:
            winner.append(0)

# plot the running accuracy of winner
plot_x = []
plot_y = []
for i in range(1, len(winner)):
    plot_y.append(sum(winner[0:i]) * 1.0 / len(winner[0:i]))
    plot_x.append(i)
plt.plot(plot_x, plot_y)
plt.xlabel('x - axis')
# naming the y axis
plt.ylabel('y - axis')
# giving a title to my graph
plt.title('My first graph!')
# function to show the plot
plt.show()
Here is a zip file with my code and the audio samples: https://ufile.io/eggjm1gw
Answer (score: 2)
There are quite a few issues in your code, and getting everything right in one go is near impossible, but let's try. There are two major problems. One of them: effectively you use only the first 25 ms of each recording (see the winlen parameter of python_speech_features). In each of these recordings, the first 25 ms will be close to identical, so even if you had 10k recordings per speaker, this approach would get you nowhere.
I will give you concrete advice, but I won't do all the coding; it is your homework, after all.
Use numpy arrays: they are faster and more memory-efficient. There are plenty of tutorials, including scikit-learn's, that demonstrate how to use numpy in this context. In essence, you will create two arrays: one holding the training data, the second holding the labels. Example: if the speaker omersk "produces" 50000 MFCC vectors, you will get a (50000, 13) training array. The corresponding label array would hold 50000 entries with a single constant value (an id) per speaker (say, 0 for omersk, 1 for lucas, and so on). I would also consider taking a longer window (perhaps 200 ms; experiment!) to reduce the variance.
Don't forget to split your data into training, validation, and test sets. You will have more than enough data. Also, for this exercise I would watch out for feeding too much data from any single speaker; take steps to make sure the algorithm is not biased.
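For illustration only, here is a minimal sketch of that array construction; the file list and speaker ids are made up, and nfft is simply bumped to the next power of two that covers the 200 ms window:

import numpy as np
import python_speech_features
import scipy.io.wavfile as wav

# Hypothetical (path, speaker id) pairs; substitute your own file list.
files = [('audio/omersk_1.wav', 0), ('audio/lucas_1.wav', 1)]

X_parts, y_parts = [], []
for path, speaker_id in files:
    rate, sig = wav.read(path)
    nfft = 1 << (int(0.2 * rate) - 1).bit_length()  # next power of two >= window length
    feats = python_speech_features.mfcc(sig, rate, winlen=0.2, winstep=0.1, nfft=nfft)
    X_parts.append(feats)                                 # (n_frames, 13) per file
    y_parts.append(np.full(feats.shape[0], speaker_id))   # one id per frame

X = np.concatenate(X_parts)  # (total_frames, 13) training array
y = np.concatenate(y_parts)  # (total_frames,) label array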
Later, when you make a prediction, you will compute MFCCs for the speaker again. With a 10-second recording, a 200 ms window, and 100 ms overlap, you will get 99 MFCC vectors of shape (99, 13). Run the model on each of the 99 vectors, producing a probability for each. When you sum them (and normalise, to make it nice) and take the top value, you get the most likely speaker.
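Continuing the sketch above, that aggregation could look like this, assuming clf is an already-trained classifier exposing predict_proba and the recording path is made up:

# Score one unknown recording frame by frame, then aggregate.
rate, sig = wav.read('audio/unknown.wav')       # hypothetical recording
nfft = 1 << (int(0.2 * rate) - 1).bit_length()
frames = python_speech_features.mfcc(sig, rate, winlen=0.2, winstep=0.1, nfft=nfft)

proba = clf.predict_proba(frames)   # (n_frames, n_speakers): one row per MFCC vector
scores = proba.sum(axis=0)          # sum the per-frame probabilities per speaker
scores /= scores.sum()              # normalise, "to make it nice"
most_likely_speaker = scores.argmax()  # id of the highest-scoring speaker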
There are plenty of other things that would normally be taken into account, but in this case (homework) I would focus on getting the basics right.
EDIT: I decided to take a stab at building a model from your idea, but with the basics fixed. It is not exactly clean Python, all because it was adapted from the Jupyter Notebook I was running.
import python_speech_features
import scipy.io.wavfile as wav
import numpy as np
import glob
import os
from collections import defaultdict
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

audio_files_path = glob.glob('audio/*.wav')
win_len = 0.04  # in seconds
step = win_len / 2
nfft = 2048

mfccs_all_speakers = []
names = []
data = []
# Compute MFCCs for every recording and keep them per file.
for path in audio_files_path:
    fs, audio = wav.read(path)
    if audio.size > 0:
        mfcc = python_speech_features.mfcc(audio, samplerate=fs, winlen=win_len,
                                           winstep=step, nfft=nfft, appendEnergy=False)
        filename = os.path.splitext(os.path.basename(path))[0]
        speaker = filename[:filename.find('_')]
        data.append({'filename': filename,
                     'speaker': speaker,
                     'samples': mfcc.shape[0],
                     'mfcc': mfcc})
    else:
        print(f'Skipping {path} due to 0 file size')

# Count how many MFCC vectors each speaker has in total.
speaker_sample_size = defaultdict(int)
for entry in data:
    speaker_sample_size[entry['speaker']] += entry['samples']

person_with_fewest_samples = min(speaker_sample_size, key=speaker_sample_size.get)
print(person_with_fewest_samples)

# Cap every speaker at 80% of the smallest speaker's total, so no one dominates training.
max_accepted_samples = int(speaker_sample_size[person_with_fewest_samples] * 0.8)
print(max_accepted_samples)

# Assign whole recordings to the training set until the per-speaker cap is reached.
training_idx = []
test_idx = []
accumulated_size = defaultdict(int)
for entry in data:
    if entry['speaker'] not in accumulated_size:
        training_idx.append(entry['filename'])
        accumulated_size[entry['speaker']] += entry['samples']
    elif accumulated_size[entry['speaker']] < max_accepted_samples:
        accumulated_size[entry['speaker']] += entry['samples']
        training_idx.append(entry['filename'])

X_train = []
label_train = []
X_test = []
label_test = []
for entry in data:
    if entry['filename'] in training_idx:
        X_train.append(entry['mfcc'])
        label_train.extend([entry['speaker']] * entry['mfcc'].shape[0])
    else:
        X_test.append(entry['mfcc'])
        label_test.extend([entry['speaker']] * entry['mfcc'].shape[0])

X_train = np.concatenate(X_train, axis=0)
X_test = np.concatenate(X_test, axis=0)

assert (X_train.shape[0] == len(label_train))
assert (X_test.shape[0] == len(label_test))

print(f'Training: {X_train.shape}')
print(f'Testing: {X_test.shape}')

# Encode the speaker names as integer class labels.
le = preprocessing.LabelEncoder()
y_train = le.fit_transform(label_train)
y_test = le.transform(label_test)

clf = MLPClassifier(solver='lbfgs', alpha=1e-2, hidden_layer_sizes=(5, 3), random_state=42, max_iter=1000)
cv_results = cross_validate(clf, X_train, y_train, cv=4)
print(cv_results)
{'fit_time': array([3.33842635, 4.25872731, 4.73704267, 5.9454329 ]),
'score_time': array([0.00125694, 0.00073504, 0.00074005, 0.00078583]),
'test_score': array([0.40380048, 0.52969121, 0.48448687, 0.46043165])}
The test_score is not stellar. There is plenty to improve (for starters, the choice of algorithm), but the basics are there. Notice, first of all, how I obtain the training samples. It is not random: I only ever consider whole recordings. Samples from a given recording must not land in both training and test, because test is supposed to be novel.
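If you would rather not hand-roll that bookkeeping, scikit-learn's GroupShuffleSplit does a recording-level split directly. A small sketch, where X and y are the frame arrays from the earlier sketch and recording_ids is an assumed array with one group label per frame naming the recording it came from:

from sklearn.model_selection import GroupShuffleSplit

# Frames that share a group label (a recording) never straddle train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=recording_ids))
X_tr, X_te = X[train_idx], X[test_idx]
y_tr, y_te = y[train_idx], y[test_idx]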
So what does not work in your code? I would say a lot. You were taking 200 ms samples, yet a very short fft; python_speech_features has probably been complaining to you that the fft should be longer than the frame you are processing.
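To make that concrete: python_speech_features truncates any frame longer than nfft (and logs a warning saying so), which silently throws away most of each 200 ms window. A quick check, assuming 16 kHz audio as an example:

import math

rate = 16000                    # example sample rate (an assumption)
winlen = 0.2                    # the 200 ms window from the question
frame_len = int(winlen * rate)  # 3200 samples per frame
nfft = 512                      # the question's value
print(frame_len > nfft)                      # True: frames get truncated to 512 samples
print(2 ** math.ceil(math.log2(frame_len)))  # 4096: smallest power of two that fits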
I will leave testing the model to you. It is not good, but it is a start.