Question

This is the code I am running:

import tensorflow as tf

sess = tf.InteractiveSession()

filename = 'song.mp3' # 30 second mp3 file
SAMPLES_PER_SEC = 44100

audio_binary = tf.read_file(filename)

pcm = tf.contrib.ffmpeg.decode_audio(audio_binary, file_format='mp3', samples_per_second=SAMPLES_PER_SEC, channel_count = 1)
stft = tf.contrib.signal.stft(pcm, frame_length=1024, frame_step=512, fft_length=1024)

sess.close()

The mp3 file is decoded correctly, because print(pcm.eval().shape) returns:

(1323119, 1)

There are even actual non-zero values when I print some of them with print(pcm.eval()[1000:1010]):

[[ 0.18793298]
 [ 0.16214484]
 [ 0.16022217]
 [ 0.15918455]
 [ 0.16428113]
 [ 0.19858395]
 [ 0.22861415]
 [ 0.2347789 ]
 [ 0.22684409]
 [ 0.20728172]]

But for some reason print(stft.eval().shape) evaluates to:

(1323119, 0, 513) # why the zero dimension?

And so print(stft.eval()) gives:

[]

According to the documentation, the second dimension of the tf.contrib.signal.stft output equals the number of frames. Why are there no frames?

Answer 1

It seems that tf.contrib.ffmpeg.decode_audio returns a tensor of shape (?, 1), which is one signal of ? samples.

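However, tf.contrib.signal.stft expects a (signal_count, samples) tensor as input and cuts frames along the last axis. With shape (?, 1), each row is treated as a separate signal of length 1, which is too short to hold even one 1024-sample frame, so the tensor has to be transposed first.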

Modifying the call like this fixes it:

stft = tf.contrib.signal.stft(tf.transpose(pcm), frame_length=1024, frame_step=512, fft_length=1024)
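To see where the zero comes from, here is a rough sketch of the shape arithmetic in plain Python. stft_output_shape is a hypothetical helper written for illustration, not part of TensorFlow, and it assumes the default behaviour without end padding, where the frame count is max(0, 1 + (samples - frame_length) // frame_step) along the last axis.

```python
def stft_output_shape(input_shape, frame_length, frame_step, fft_length):
    """Hypothetical helper: predict the STFT output shape without end padding."""
    *batch_dims, samples = input_shape
    # Frames are cut from the last axis only; a signal shorter than
    # frame_length yields zero frames.
    num_frames = max(0, 1 + (samples - frame_length) // frame_step)
    # A real-input FFT of length fft_length keeps fft_length // 2 + 1 bins.
    return (*batch_dims, num_frames, fft_length // 2 + 1)


# Shape from the question: 1323119 "signals" of 1 sample each.
print(stft_output_shape((1323119, 1), 1024, 512, 1024))  # (1323119, 0, 513)

# Shape after tf.transpose(pcm): 1 signal of 1323119 samples.
print(stft_output_shape((1, 1323119), 1024, 512, 1024))  # (1, 2583, 513)
```

Under those assumptions, the transposed input should give 2583 frames of 513 frequency bins instead of an empty result.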

tf.contrib.signal.stft returns an empty matrix
