So I need to do audio processing as part of a uni semester project, but I want to build basic speech detection and go a step further. The guides and info I've looked at, particularly on GMMs in TensorFlow, typically don't let you do the audio processing yourself.
I want to plug my own code for decoding WAV files, building a spectrogram, and converting to MFCCs into training a model that can then classify simple word samples. But I just can't find any information or helpers on this anywhere.
So far I've tried many different things, but the Simple Audio Recognition tutorial on TensorFlow's website seems very close to what I eventually want to accomplish; it even uses MFCCs as the default audio-processing method. I tried reverse engineering it, breaking down the different function calls, but everything is extremely convoluted.
I would like to do what the tutorial does, but with my own audio processing. The complexity of the machine-learning part doesn't really matter; any pre-existing methods or very simple plug-and-play options would be completely fine.
Here is my code for the MFCCs and spectrogram, if needed:
import numpy
import scipy.io.wavfile
import matplotlib.pyplot as plt
from scipy.fftpack import dct


def do_mfcc(spectrogram, upper_frequency_limit=4000, lower_frequency_limit=0, dct_coefficient_count=12):
    # Note: the frequency-limit arguments are currently unused.
    mfcc = dct(spectrogram, type=2, axis=1, norm='ortho')[:, 1:(dct_coefficient_count + 1)]  # Keep coefficients 2-13
    mfcc -= (numpy.mean(mfcc, axis=0) + 1e-8)  # Mean normalisation of MFCC
    return mfcc


def gimmeDaSPECtogram(wav_path, sample_rate, window_size_ms=30.0, stride_ms=10.0, pre_emphasis=0.97,
                      NFFT=512, triangular_filters=40, magnitude_squared=False, name=None):
    sample_rate, signal = scipy.io.wavfile.read(wav_path)  # Overwrites the sample_rate argument with the file's actual rate
    signal = signal[0:int(1.0 * sample_rate)]  # Keep only the first second
    window_size_s = window_size_ms / 1000
    stride_s = stride_ms / 1000
    emphasized_signal = numpy.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    frame_length, frame_step = window_size_s * sample_rate, stride_s * sample_rate  # Convert from seconds to samples
    signal_length = len(emphasized_signal)
    frame_length = int(round(frame_length))
    frame_step = int(round(frame_step))
    num_frames = int(numpy.ceil(float(numpy.abs(signal_length - frame_length)) / frame_step))  # Make sure that we have at least 1 frame
    # Pad the signal so that all frames have an equal number of samples,
    # without truncating any samples from the original signal
    pad_signal_length = num_frames * frame_step + frame_length
    z = numpy.zeros(pad_signal_length - signal_length)
    pad_signal = numpy.append(emphasized_signal, z)
    indices = (numpy.tile(numpy.arange(0, frame_length), (num_frames, 1))
               + numpy.tile(numpy.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T)
    frames = pad_signal[indices.astype(numpy.int32, copy=False)]
    frames *= numpy.hamming(frame_length)
    mag_frames = numpy.absolute(numpy.fft.rfft(frames, NFFT))  # Magnitude of the FFT
    pow_frames = (1.0 / NFFT) * (mag_frames ** 2)  # Power spectrum
    # Mel filter bank: triangular filters, equally spaced on the mel scale
    low_freq_mel = 0
    high_freq_mel = 2595 * numpy.log10(1 + (sample_rate / 2) / 700)
    mel_points = numpy.linspace(low_freq_mel, high_freq_mel, triangular_filters + 2)  # Equally spaced in mel scale
    hz_points = 700 * (10 ** (mel_points / 2595) - 1)  # Convert mel back to Hz
    bins = numpy.floor((NFFT + 1) * hz_points / sample_rate)
    fbank = numpy.zeros((triangular_filters, int(numpy.floor(NFFT / 2 + 1))))
    for m in range(1, triangular_filters + 1):
        f_m_minus = int(bins[m - 1])  # left
        f_m = int(bins[m])            # center
        f_m_plus = int(bins[m + 1])   # right
        for k in range(f_m_minus, f_m):
            fbank[m - 1, k] = (k - bins[m - 1]) / (bins[m] - bins[m - 1])
        for k in range(f_m, f_m_plus):
            fbank[m - 1, k] = (bins[m + 1] - k) / (bins[m + 1] - bins[m])
    filter_banks = numpy.dot(pow_frames, fbank.T)
    filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)  # Numerical stability
    filter_banks = 20 * numpy.log10(filter_banks)  # dB
    filter_banks = do_mfcc(filter_banks, upper_frequency_limit=4000, lower_frequency_limit=0,
                           dct_coefficient_count=12)

    # Code below is for visualising the MFCC
    plt.subplot(312)
    plt.imshow(filter_banks.T, cmap=plt.cm.jet, aspect='auto')
    plt.xticks(numpy.arange(0, (filter_banks.T).shape[1], int((filter_banks.T).shape[1] / 4)),
               ['0s', '0.25s', '0.5s', '0.75s', '1s'])
    plt.yticks(numpy.arange(1, (filter_banks.T).shape[0], int((filter_banks.T).shape[0] / 4)),
               ['0', '3', '6', '9', '12'])
    ax = plt.gca()
    ax.invert_yaxis()
    plt.show()
    return filter_banks


gimmeDaSPECtogram("samples/leftTest.wav", 16000, window_size_ms=30.0, stride_ms=10.0, pre_emphasis=0.97)
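To double-check the framing/indexing trick used above, I extracted it into a small standalone check (the toy signal and frame sizes here are just for illustration, not my real parameters):

```python
import numpy

def frame_signal(signal, frame_length, frame_step):
    # Same overlapping-frame trick as in gimmeDaSPECtogram: build a
    # (num_frames, frame_length) matrix of sample indices with numpy.tile,
    # then fancy-index into the zero-padded signal.
    signal_length = len(signal)
    num_frames = int(numpy.ceil(float(abs(signal_length - frame_length)) / frame_step))
    pad_signal_length = num_frames * frame_step + frame_length
    pad_signal = numpy.append(signal, numpy.zeros(pad_signal_length - signal_length))
    indices = (numpy.tile(numpy.arange(0, frame_length), (num_frames, 1))
               + numpy.tile(numpy.arange(0, num_frames * frame_step, frame_step),
                            (frame_length, 1)).T)
    return pad_signal[indices.astype(numpy.int32)]

frames = frame_signal(numpy.arange(10), frame_length=4, frame_step=2)
# Each row is one frame; consecutive frames overlap by frame_length - frame_step samples.
```

This behaves the way I expect on toy input, so I'm fairly confident the framing itself is correct.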
I expect to use the MFCCs as input for training a model on each WAV file in the dataset, so that I can then use the classifier to recognise basic words.
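To make the goal concrete, here is a rough sketch of the kind of pipeline I have in mind (everything here is a placeholder: the file names, the `compute_mfcc` stand-in for my `gimmeDaSPECtogram`/`do_mfcc` code, and the tiny model are assumptions, not working project code). As far as I understand, `tf.py_function` lets arbitrary numpy code run inside a `tf.data` input pipeline:

```python
import numpy
import tensorflow as tf

def compute_mfcc(path):
    # Stand-in for the numpy pipeline above (gimmeDaSPECtogram + do_mfcc);
    # it just returns a dummy (frames, 12) MFCC matrix so the sketch runs.
    return numpy.random.randn(98, 12).astype(numpy.float32)

def load_features(path, label):
    # tf.py_function wraps plain Python/numpy code so it can run inside
    # the tf.data pipeline; the output shape must be declared manually.
    features = tf.py_function(lambda p: compute_mfcc(p.numpy().decode()),
                              [path], tf.float32)
    features.set_shape([None, 12])  # (frames, MFCC coefficients)
    return features, label

paths = ["samples/leftTest.wav", "samples/rightTest.wav"]  # hypothetical files
labels = [0, 1]                                            # word class per file
dataset = (tf.data.Dataset.from_tensor_slices((paths, labels))
           .map(load_features)
           .padded_batch(2, padded_shapes=([None, 12], [])))

# A deliberately tiny classifier: average the MFCC frames over time,
# then predict one of two words with a softmax layer.
model = tf.keras.Sequential([
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(dataset, epochs=10)
```

If something like this is viable, the real `compute_mfcc` would just call my existing functions on each file path.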
Any help or advice with implementing custom audio processing in TensorFlow is appreciated. Even links to guides or pointers in the right direction would be greatly appreciated!