Question

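I am working on a simple classification problem: training my classifier to distinguish "clapping" sounds from "non-clapping" human sounds. For this I am using Google AudioSet. In the code below, I am trying to extract the audio embeddings [for different instances of YouTube videos] and the audio labels ["Dancing", "Singing", etc.].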

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (tf.Session, tf.python_io)

# To store the features, keyed by YouTube video ID
audio_embeddings_dict = {}

# To store the labels, keyed by YouTube video ID
audio_labels_dict = {}

# Load embeddings
sess = tf.Session()
for tfrecord in tfrecord_filenames_non:
    for example in tf.python_io.tf_record_iterator(tfrecord):
        # Create a new instance from the serialized data
        tf_example = tf.train.Example.FromString(example)
        # The YouTube ID of this record
        vid_id = tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8')
        if vid_id in non_examples_clapping:
            # Store the labels of the video IDs that are present in the examples
            example_label = list(np.asarray(tf_example.features.feature['labels'].int64_list.value))
            # Parse the full record only for the video IDs that are present in the examples
            tf_seq_example = tf.train.SequenceExample.FromString(example)
            # The per-frame embeddings and how many of them there are
            frames = tf_seq_example.feature_lists.feature_list['audio_embedding'].feature
            n_frames = len(frames)

            # One list per frame; each holds a single 128-value embedding vector
            audio_frame = [[] for _ in range(n_frames)]
            for i in range(n_frames):
                audio_frame[i].append(tf.cast(tf.decode_raw(
                    frames[i].bytes_list.value[0], tf.uint8), tf.float32).eval(session=sess))
                audio_embeddings_dict[vid_id] = audio_frame
                audio_labels_dict[vid_id] = example_label

I store all the audio embeddings in a single dict variable [audio_embeddings_dict], and all the labels in audio_labels_dict.
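For reference, the per-frame decoding in the code above does not strictly need a TensorFlow session: casting `uint8` bytes to `float32` yields the same values as reading each byte as an integer. Below is a minimal sketch in plain Python, assuming each frame is stored as a byte string of quantized values. The function name `decode_embedding_frames`, the key `example_video_id`, and the sample data are hypothetical and not part of the notebook.

```python
def decode_embedding_frames(raw_frames):
    """Turn a list of byte strings into a list of float vectors.

    Each byte is read as an unsigned 8-bit integer (0-255) and converted
    to float, mirroring a uint8 -> float32 cast.
    """
    return [[float(b) for b in raw] for raw in raw_frames]


# Hypothetical stand-in data: two frames of 128 bytes each.
sample_frames = [bytes(range(128)), bytes([255] * 128)]

audio_embeddings_dict = {}
audio_embeddings_dict["example_video_id"] = decode_embedding_frames(sample_frames)

embedding = audio_embeddings_dict["example_video_id"]
print(len(embedding), len(embedding[0]))  # number of frames, values per frame
```

Storing one flat vector per frame gives a frames-by-values layout, without the extra level of nesting that comes from appending each vector to its own list.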

audio_labels_dict:

Comparison between

The complete code can be found here: https://github.com/Anirudh257/Audio-files-extraction/blob/master/Audio%20extraction%20dataset(1).ipynb

A different dataset is generated every time the Jupyter notebook is run

0 answers: