I am trying to implement the following paper, https://arxiv.org/abs/1904.08779, to get better speech-to-text results.
I am trying to implement it on top of the Mozilla DeepSpeech repository.
It uses the TensorFlow Dataset API to load the data:
dataset = (tf.data.Dataset.from_generator(generate_values,
                                           output_types=(tf.string, (tf.int64, tf.int32, tf.int64), tf.int64))
           .map(entry_to_features, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .cache(cache_path)
           .map(augment_spec, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .window(batch_size, drop_remainder=True).flat_map(batch_fn)
           .prefetch(num_gpus))
The audio is converted to a spectrogram and MFCCs are computed, so by the time the data reaches the augment_spec function it has shape (?, 26). The ? is the result of reshaping audio of variable length. I am trying to mask out some parts of the "image"; the idea is to multiply the tensor by a mask, with code like this:
def augment_spec(features, features_len, transcript):
    # print("\n\n\n\n duration", duration.eval())
    sample_rate = 8000
    mask = np.ones_like(features)
    temp = tf.Variable(tf.ones_like(features))
    print(temp)

    time_len = features_len.shape[0]
    features_len = features_len

    n_time_masks = np.random.randint(0, 4)
    n_freq_masks = np.random.randint(0, 3)

    for _ in range(n_time_masks):
        time_delta = np.random.randint(int(sample_rate / 10), int(sample_rate / 2))
        time_start = np.random.randint(0, time_len - time_delta)
        print(time_start, time_delta)
        mask[time_start:time_start + time_delta] = 0

    for _ in range(n_freq_masks):
        freq_delta = np.random.randint(1, 4)
        freq_start = np.random.randint(0, features_len - freq_delta)
        print(freq_start, freq_delta)
        mask[:, freq_start:freq_start + freq_delta] = 0

    mask = tf.convert_to_tensor(mask, dtype=tf.float32)
    return tf.math.multiply(features, mask), features_len, transcript
The problem is that these lines:
mask = np.ones_like(features)
time_len = features_len.shape[0]
do not work, because when the graph is being built the tensors do not have a defined shape yet, so I don't know how to achieve this. Could you help me? Thank you very much!
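To make the problem concrete, here is a minimal sketch (TF 1.x graph mode; the placeholder is only a stand-in for the real (?, 26) features tensor, not part of the DeepSpeech code) showing that the time dimension is simply not available as a Python integer while the graph is being built:

import tensorflow as tf

# Stand-in for the real features tensor of shape (?, 26).
features = tf.placeholder(tf.float32, shape=(None, 26))

print(features.shape)           # (?, 26): static shape, time dimension unknown
print(features.shape[0].value)  # None: not usable by np.ones_like or np.random.randint
# No concrete array exists at this point, so a NumPy mask of shape (time, 26)
# cannot be built while the graph is being constructed.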
UPDATE: after @kempy's answer, my code now looks like this:
def augment_spec(features, features_len, transcript):
    # print("\n\n\n\n duration", duration.eval())
    sample_rate = 8000
    mask = tf.Variable(tf.ones_like(features), validate_shape=False)
    time_len = tf.shape(features)[0]

    n_time_masks = np.random.randint(0, 4)
    n_freq_masks = np.random.randint(0, 3)
    # n_time_masks = tf.random.uniform(
    #     shape=(), minval=0, maxval=4, dtype=tf.int32)
    # n_freq_masks = tf.random.uniform(
    #     shape=(), minval=0, maxval=3, dtype=tf.int32)

    for _ in range(n_time_masks):
        time_delta = tf.random.uniform(
            shape=(), minval=int(sample_rate / 10), maxval=int(sample_rate / 2), dtype=tf.int32)
        time_start = tf.random.uniform(
            shape=(), minval=0, maxval=time_len - time_delta, dtype=tf.int32)
        # indexes = list(range(time_start, time_start + time_delta))
        indexes = tf.range(time_start, time_start + time_delta, delta=1, dtype=tf.int32, name='range')
        tf.scatter_update(mask, indexes, 0)

    mask = tf.transpose(mask, (1, 0))
    for _ in range(n_freq_masks):
        # freq_delta = np.random.randint(1, 4)
        # freq_start = np.random.randint(0, features_len - freq_delta)
        freq_delta = tf.random.uniform(
            shape=(), minval=1, maxval=4, dtype=tf.int32)
        freq_start = tf.random.uniform(
            shape=(), minval=0, maxval=(features_len - freq_delta), dtype=tf.int32)
        # indexes = list(range(freq_start, freq_start + freq_delta))
        indexes = tf.range(freq_start, freq_start + freq_delta, delta=1, dtype=tf.int32, name='range')
        tf.scatter_update(mask, indexes, 0)

    mask = tf.transpose(mask, (1, 0))
    mask = tf.convert_to_tensor(mask, dtype=tf.float32)
    masked = tf.multiply(features, mask)

    return masked, features_len, transcript
But now I am getting this error:
ValueError: Tensor("Variable:0", dtype=float32_ref) must be from the same graph as Tensor("tower_0/Mean:0", shape=(), dtype=float32, device=/device:GPU:0).
I don't know how to solve this. Thank you for your help!
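From what I can tell, this is the kind of error TF 1.x raises whenever an op mixes tensors that were created in different tf.Graph instances (for example, a variable created in a graph other than the one the training tower is built in). A minimal, made-up reproduction of that class of error, unrelated to the DeepSpeech code itself:

import tensorflow as tf

# A variable created in one graph ...
g1 = tf.Graph()
with g1.as_default():
    v = tf.Variable(tf.ones((3,)))

# ... used by an op in a different graph.
g2 = tf.Graph()
with g2.as_default():
    x = tf.ones((3,))
    y = tf.multiply(x, v)  # ValueError: ... must be from the same graph as ...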
Answer 0 (score: 0)
Use the tf versions instead of the np functions. tf.ones_like should work fine with an input of shape (?, 26), and you can use tf.shape(features)[0] to get the shape of the features dynamically. Further down, you should use something like tf.random.uniform.
When running TF in graph mode (the default in TF 1.X), you cannot have Python code that depends on the output of a tensor, because those outputs have not been executed yet; you should therefore use TF ops instead of Python/numpy code.
We can build a graph with a dynamic first dimension:
import numpy as np
import tensorflow as tf
# Feature dimensions
unknown_size = 3
feature_dim = 26
tf.reset_default_graph()
# features_input has dynamic first dimension
features_input = tf.placeholder(tf.int32, shape=(None, feature_dim))
# ones_like should work fine with argument of shape (?, 26)
batched_ones = tf.ones_like(features_input)
# dynamically get the shape of the features_input
time_len = tf.shape(features_input)[0]
time_start = tf.random.uniform(
    shape=(), minval=0, maxval=time_len, dtype=tf.int32)
And print the following:
print('features_input.shape:')
print(features_input.shape)
print('batched_ones.shape:')
print(batched_ones.shape)
print('time_start.shape:')
print(time_start.shape)
The output we see is:
features_input.shape:
(?, 26)
batched_ones.shape:
(?, 26)
time_start.shape:
()
If we then try to execute the graph:
with tf.Session() as sess:
    # Create some input data
    features = np.arange(feature_dim)
    batched_features = np.tile(features, (unknown_size, 1))

    # Evaluate the tensors
    features_out, ones_out, time_start_out = sess.run(
        [features_input, batched_ones, time_start],
        feed_dict={features_input: batched_features})
And print the output:
# Print out what the output looks like
print('\nOutput:')
print('\nFeatures:')
print(features_out)
print('shape:', features_out.shape)
print('\nOnes:')
print(ones_out)
print('shape:', ones_out.shape)
print('\nRandom between 0 and unknown_size:')
print(time_start_out)
print('shape:', time_start_out.shape)
We can see that it works!
Output:
Features:
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25]]
shape: (3, 26)
Ones:
[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]
shape: (3, 26)
Random between 0 and unknown_size:
0
shape: ()
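Building on this, one possible variable-free way to apply the time and frequency masks is to build the mask from comparisons against tf.range and let broadcasting do the rest, which avoids tf.Variable and tf.scatter_update inside the Dataset.map function entirely. The following is only a sketch under the same (time, 26) layout assumed above, with illustrative mask widths rather than the paper's values, and is not the implementation DeepSpeech ships with:

def augment_spec(features, features_len, transcript):
    # features: (time, 26) float32 tensor with a dynamic time dimension.
    time_len = tf.shape(features)[0]
    freq_len = tf.shape(features)[1]

    def axis_mask(length, min_width, max_width):
        # 1-D {0, 1} mask of size `length` with one random band zeroed out.
        width = tf.random.uniform(shape=(), minval=min_width, maxval=max_width, dtype=tf.int32)
        start = tf.random.uniform(shape=(), minval=0,
                                  maxval=tf.maximum(length - width, 1), dtype=tf.int32)
        positions = tf.range(length)
        band = tf.logical_and(positions >= start, positions < start + width)
        return tf.cast(tf.logical_not(band), tf.float32)

    # One time mask (rows) and one frequency mask (columns), broadcast over the other axis.
    time_mask = axis_mask(time_len, 1, 100)[:, tf.newaxis]  # shape (time, 1)
    freq_mask = axis_mask(freq_len, 1, 4)[tf.newaxis, :]    # shape (1, 26)

    masked = features * time_mask * freq_mask
    return masked, features_len, transcript

Because the function only uses ordinary TF ops, it can be passed straight to .map(...) without creating variables, which also sidesteps the "must be from the same graph" error. Note that the number of masks per axis is fixed when the graph is built; a random number of masks per example would need a tf.while_loop (or a fixed upper bound where some masks have zero width) rather than a Python loop over a tensor.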