How to use custom audio processing with MFCC to train a TensorFlow classifier

Asked: 2019-04-16 22:19:28

Tags: python tensorflow audio speech mfcc

I need to do audio processing as part of my uni semester project, and I want to build basic speech detection and take it a step further. The guides and information I've found, particularly on GMMs in TensorFlow, typically don't let you do the audio processing yourself.

I want to plug my own code for decoding wav files, building a spectrogram and converting it to MFCCs into the training of a model that can then classify specific simple word samples. But I just can't find any information or helpers on this anywhere.

So far I've tried many different things. The Simple Audio Recognition tutorial on TensorFlow's website seems very close to what I want to eventually accomplish, and it even uses MFCC as its default audio processing method. I tried reverse-engineering it, breaking down the different function calls, but everything is extremely convoluted.

I would like to do what the tutorial does, but with my own audio processing. The complexity of the machine learning part doesn't really matter; any pre-existing methods or very simple plug'n'play stuff would be completely fine, as sketched below.
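To make the goal concrete, this is roughly what I'm imagining (just a hypothetical sketch using Keras; the dense network, the (97, 12) feature shape and the two-word label set are placeholder assumptions, not code I actually have working):

import numpy as np
import tensorflow as tf

# Hypothetical: X holds one MFCC array per clip from my own processing code,
# y holds integer word labels (e.g. 0 = "left", 1 = "right").
X = np.load("mfcc_features.npy")   # shape (num_samples, 97, 12), precomputed by me
y = np.load("labels.npy")          # shape (num_samples,)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(97, 12)),   # flatten the time x coefficient grid
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'),  # one output per word
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X, y, epochs=10, validation_split=0.2)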

Here is my code for the MFCC and spectrogram, if needed:

import numpy
import scipy.io.wavfile
from scipy.fftpack import dct
import matplotlib.pyplot as plt

def do_mfcc(spectrogram, upper_frequency_limit=4000, lower_frequency_limit=0, dct_coefficient_count=12):
    # DCT-II of the log mel filter bank energies; keep coefficients 2-13.
    # (The frequency limit arguments are currently unused.)
    mfcc = dct(spectrogram, type=2, axis=1, norm='ortho')[:, 1:(dct_coefficient_count + 1)]
    mfcc -= (numpy.mean(mfcc, axis=0) + 1e-8)  # Mean normalisation of the MFCCs

    return mfcc

def gimmeDaSPECtogram(wav_path, sample_rate, window_size_ms=30.0, stride_ms=10.0, pre_emphasis=0.97, NFFT=512, triangular_filters=40, magnitude_squared=False, name=None):
    # NOTE: sample_rate is overwritten by the rate read from the wav file below;
    # magnitude_squared and name are currently unused.
    sample_rate, signal = scipy.io.wavfile.read(wav_path)  # File assumed to be in the same directory
    signal = signal[0:int(1.0 * sample_rate)]  # Keep only the first second
    window_size_ms = window_size_ms / 1000  # Convert ms to seconds
    stride_ms = stride_ms / 1000


    # Pre-emphasis filter: y[n] = x[n] - pre_emphasis * x[n-1]
    emphasized_signal = numpy.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    frame_length, frame_step = window_size_ms * sample_rate, stride_ms * sample_rate  # Convert from seconds to samples
    signal_length = len(emphasized_signal)
    frame_length = int(round(frame_length))
    frame_step = int(round(frame_step))
    num_frames = int(numpy.ceil(
        float(numpy.abs(signal_length - frame_length)) / frame_step))  # Make sure that we have at least 1 frame


    pad_signal_length = num_frames * frame_step + frame_length
    z = numpy.zeros((pad_signal_length - signal_length))
    pad_signal = numpy.append(emphasized_signal,
                              z)  # Pad Signal to make sure that all frames have equal number of samples without truncating any samples from the original signal

    indices = numpy.tile(numpy.arange(0, frame_length), (num_frames, 1)) + numpy.tile(
        numpy.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
    frames = pad_signal[indices.astype(numpy.int32, copy=False)]

    frames *= numpy.hamming(frame_length)  # Apply a Hamming window to each frame

    mag_frames = numpy.absolute(numpy.fft.rfft(frames, NFFT))  # Magnitude of the FFT
    pow_frames = ((1.0 / NFFT) * ((mag_frames) ** 2))  # Power Spectrum

    low_freq_mel = 0
    high_freq_mel = (2595 * numpy.log10(1 + (sample_rate / 2) / 700))  # Convert Hz (Nyquist) to Mel
    mel_points = numpy.linspace(low_freq_mel, high_freq_mel, triangular_filters + 2)  # Equally spaced in Mel scale
    hz_points = (700 * (10 ** (mel_points / 2595) - 1))  # Convert Mel to Hz
    bins = numpy.floor((NFFT + 1) * hz_points / sample_rate)  # FFT bin indices of the filter edges

    # Build the triangular mel filter bank
    fbank = numpy.zeros((triangular_filters, int(numpy.floor(NFFT / 2 + 1))))
    for m in range(1, triangular_filters + 1):
        f_m_minus = int(bins[m - 1])  # left
        f_m = int(bins[m])  # center
        f_m_plus = int(bins[m + 1])  # right

        for k in range(f_m_minus, f_m):
            fbank[m - 1, k] = (k - bins[m - 1]) / (bins[m] - bins[m - 1])
        for k in range(f_m, f_m_plus):
            fbank[m - 1, k] = (bins[m + 1] - k) / (bins[m + 1] - bins[m])
    filter_banks = numpy.dot(pow_frames, fbank.T)
    filter_banks = numpy.where(filter_banks == 0, numpy.finfo(float).eps, filter_banks)  # Numerical Stability

    filter_banks = 20 * numpy.log10(filter_banks)  # Convert to dB
    mfcc_features = do_mfcc(filter_banks, upper_frequency_limit=4000, lower_frequency_limit=0, dct_coefficient_count=12)

    # Code below is for visualising the MFCCs
    plt.subplot(312)
    plt.imshow(mfcc_features.T, cmap=plt.cm.jet, aspect='auto')
    plt.xticks(numpy.arange(0, (mfcc_features.T).shape[1],
                            int((mfcc_features.T).shape[1] / 4)),
               ['0s', '0.25s', '0.5s', '0.75s', '1s'])
    plt.yticks(numpy.arange(1, (mfcc_features.T).shape[0],
                            int((mfcc_features.T).shape[0] / 4)),
               ['0', '3', '6', '9', '12'])
    ax = plt.gca()
    ax.invert_yaxis()
    plt.show()

    # Shape is (num_frames, dct_coefficient_count); roughly (97, 12) for a
    # 1 s clip at 16 kHz with 30 ms windows and a 10 ms stride.
    return mfcc_features




gimmeDaSPECtogram("samples/leftTest.wav", 16000, window_size_ms=30.0, stride_ms=10.0, pre_emphasis=0.97)

I expect to use the MFCCs as input to train a model on each wav file in the dataset, so that I can use the classifier to recognise basic words. Something like the loop sketched below is what I have in mind for building the training data.
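For building the training data itself, I'm picturing a loop over a labelled folder structure along these lines (the samples/&lt;word&gt;/*.wav layout and the label map are assumptions for illustration; I'd also disable the plt.show() inside gimmeDaSPECtogram when running over a whole dataset):

import os
import glob
import numpy as np

label_map = {'left': 0, 'right': 1}  # hypothetical word-to-label mapping
features, labels = [], []
for word, label in label_map.items():
    for wav_file in glob.glob(os.path.join('samples', word, '*.wav')):
        mfcc = gimmeDaSPECtogram(wav_file, 16000, window_size_ms=30.0, stride_ms=10.0, pre_emphasis=0.97)
        features.append(mfcc)   # roughly (97, 12) per clip, assuming each clip is at least 1 s long
        labels.append(label)

X = np.stack(features)  # (num_samples, 97, 12)
y = np.array(labels)    # (num_samples,)
# X and y would then go into a TensorFlow/Keras classifier like the sketch above.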

Any help or advice on implementing custom audio processing in TensorFlow is appreciated. Even links to guides or pointers in the right direction would be greatly appreciated!

0 Answers:

No answers yet.