How to map audio to a target text transcription

Date: 2019-04-22 07:59:51

Tags: python-3.x tensorflow deep-learning lstm speech-to-text

I am new to deep learning, and I am building a basic end-to-end speech recognizer using the TensorFlow API, an LSTM model, and the CTC loss function. I have extracted the audio features as MFCCs. I don't really know how to map the audio to its transcription; I know CTC is used for this purpose, and I understand how CTC works, but I don't know the code to implement it.

Here is my feature-extraction code:

import numpy as np
import glob
import scipy.io.wavfile as wav
from python_speech_features import mfcc, logfbank

# Walk the training directory and extract features from each wav file
for f in glob.glob('Downloads/DataVoices/Training/**/*.wav', recursive=True):
    (rate, sig) = wav.read(f)
    sig = sig.astype(np.float64)
    # Optionally keep only the first 10,000 samples for analysis
    #sig = sig[:10000]
    mfcc_feat = mfcc(sig, rate, winlen=0.025, winstep=0.01,
                     numcep=13, nfilt=26, nfft=512, lowfreq=0, highfreq=None,
                     preemph=0.97, ceplifter=22, appendEnergy=True)
    fbank_feat = logfbank(sig, rate)
    # Shape: (time_steps, 13 + 26) -- MFCCs plus log filterbank energies
    acoustic_features = np.concatenate((mfcc_feat, fbank_feat), axis=1)
    print(acoustic_features)
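
Each file above yields a (time_steps, 39) feature array whose length varies with the utterance. Before batching these into an LSTM/CTC model, the sequences are typically zero-padded to a common length while the true frame counts are kept, since CTC needs them. A minimal numpy sketch; the helper name pad_features is hypothetical:

import numpy as np

def pad_features(feature_list):
    """Zero-pad a list of (time, n_features) arrays into one batch array.

    Returns the padded batch [batch, max_time, n_features] and the true
    sequence lengths, which CTC needs alongside the padded inputs.
    """
    seq_lens = np.array([f.shape[0] for f in feature_list], dtype=np.int32)
    max_len = seq_lens.max()
    n_features = feature_list[0].shape[1]
    batch = np.zeros((len(feature_list), max_len, n_features), dtype=np.float32)
    for i, f in enumerate(feature_list):
        batch[i, :f.shape[0], :] = f
    return batch, seq_lens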

I have also made a training list.txt file that pairs each transcription with its audio path, for example:

this is example /001/001.wav

this is example /001/001(1).wav

where 001 is the folder, and 001.wav and 001(1).wav are the wave files of two utterances.
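
Incidentally, a list file in that format can be split into (wav path, transcript) pairs with plain Python. This sketch assumes each line ends with the wav path as its last whitespace-separated token; adjust the split if the actual format differs:

pairs = []
with open('training list.txt') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # Everything before the last token is the transcript
        transcript, wav_path = line.rsplit(None, 1)
        pairs.append((wav_path, transcript))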

1 Answer:

Answer 0 (score: 0):

I am posting this as a contrived example, assuming it will give you an idea of how to read a CSV file and the files it names. You can modify it to suit your needs.

Suppose I have this CSV file. The first column is your transcript, and the file path is your audio file. In my case it is just a text file with random text.

Script1,D:/PycharmProjects/TensorFlow/script1.txt
Script2,D:/PycharmProjects/TensorFlow/script2.txt

Here is the code I used to test this. Keep in mind that it is just an example:

import tensorflow as tf

batch_size = 1
# Default values double as the column types: [transcript, file path]
record_defaults = [['Test'], ['D:/PycharmProjects/TensorFlow/script1.txt']]


def readbatch(data_queue):
    # Read up to batch_size CSV rows and split them into columns
    reader = tf.TextLineReader()
    _, rows = reader.read_up_to(data_queue, batch_size)
    transcript, wav_filename = tf.decode_csv(rows, record_defaults, field_delim=",")
    # Read the raw contents of each file named in the second column
    audioreader = tf.WholeFileReader()
    _, audio = audioreader.read(tf.train.string_input_producer(wav_filename))
    return [audio, transcript]


# Queue of CSV files to process (here just one, not shuffled)
data_queue = tf.train.string_input_producer(
    ['D:\\PycharmProjects\\TensorFlow\\script.csv'], shuffle=False)

batch_data = readbatch(data_queue)

batch_values = tf.train.batch(batch_data,
                              shapes=[tf.TensorShape(()), tf.TensorShape((batch_size,))],
                              batch_size=batch_size, enqueue_many=False)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    sess.run(tf.local_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    try:
        step = 0
        while not coord.should_stop():
            step += 1
            feat = sess.run([batch_values])
            audio = feat[0][0]    # raw file contents
            print(audio)
            script = feat[0][1]   # matching transcript
            print(script)
    except tf.errors.OutOfRangeError:
        print('training for 1 epoch, %d steps' % step)
    finally:
        coord.request_stop()
        coord.join(threads)
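
The example above only pairs raw audio bytes with their transcripts; it does not implement CTC itself, which is what the question asked about. Below is a minimal sketch, in the same TF 1.x style, of a character-level LSTM trained with tf.nn.ctc_loss. The vocabulary, layer sizes, and the to_sparse helper are illustrative assumptions, not part of the original answer:

import numpy as np
import tensorflow as tf

# Hypothetical character vocabulary; tf.nn.ctc_loss reserves the last
# class index (num_classes - 1) for the CTC blank label.
vocab = list(" abcdefghijklmnopqrstuvwxyz")
char_to_id = {c: i for i, c in enumerate(vocab)}
num_classes = len(vocab) + 1   # +1 for the blank
n_features = 39                # 13 MFCCs + 26 log filterbank energies
num_hidden = 128

# Time-major inputs [max_time, batch, n_features], as tf.nn.ctc_loss expects
inputs = tf.placeholder(tf.float32, [None, None, n_features])
seq_len = tf.placeholder(tf.int32, [None])   # true frame count per utterance
labels = tf.sparse_placeholder(tf.int32)     # sparse transcript labels

cell = tf.nn.rnn_cell.LSTMCell(num_hidden)
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, seq_len,
                               dtype=tf.float32, time_major=True)
logits = tf.layers.dense(outputs, num_classes)   # per-frame class scores

loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, seq_len))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)


def to_sparse(transcripts):
    """Convert transcript strings into the SparseTensorValue CTC expects."""
    indices, values, max_len = [], [], 0
    for b, text in enumerate(transcripts):
        ids = [char_to_id[c] for c in text.lower() if c in char_to_id]
        indices += [[b, t] for t in range(len(ids))]
        values += ids
        max_len = max(max_len, len(ids))
    return tf.SparseTensorValue(np.asarray(indices, np.int64),
                                np.asarray(values, np.int32),
                                np.asarray([len(transcripts), max_len], np.int64))

A training step would then feed the zero-padded features (transposed to time-major), their true lengths, and to_sparse(batch_transcripts) into train_op; at inference time tf.nn.ctc_greedy_decoder yields the decoded label sequences.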