Question

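I am trying to split a string into words and then split each resulting word into a list of characters. Ultimately, I have a file with one example per line, and I would like each line split into words, which are in turn split into characters.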

import tensorflow as tf

sess = tf.Session()

string = tf.constant(['This is the string I would like to split.'], dtype=tf.string)
words = tf.string_split(string)

print(words.eval(session=sess))

Result

SparseTensorValue(indices=array([[0, 0],
   [0, 1],
   [0, 2],
   [0, 3],
   [0, 4],
   [0, 5],
   [0, 6],
   [0, 7],
   [0, 8]]), values=array(['This', 'is', 'the', 'string', 'I', 'would', 'like', 'to',
   'split.'], dtype=object), dense_shape=array([1, 9]))

Now, I would like a SparseTensor that represents a jagged array in which each row is a word and the columns are its characters. I have tried things like:

def split_word(word):
    word = tf.expand_dims(word, axis=0)
    word = tf.string_split(word, delimiter='')
    return word.values 

split_words = tf.map_fn(split_word, words.values)

But that does not work, because map_fn builds a TensorArray and the shapes must match. Is there a clean way to accomplish this?
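
To make the target concrete, here is a minimal plain-Python sketch of the jagged structure described above. It does not use TensorFlow, and the helper name `split_into_chars` is hypothetical, introduced only for illustration. It also shows why stacking the per-word results fails: the rows have different lengths.

```python
def split_into_chars(text):
    """Split text on whitespace, then split each word into its characters."""
    return [list(word) for word in text.split()]


rows = split_into_chars('This is the string I would like to split.')

# One row per word; each row holds that word's characters.
print(rows[0])

# The rows are not all the same length, so they cannot be stacked
# into a single rectangular array without padding.
row_lengths = [len(row) for row in rows]
print(row_lengths)
```

Any TensorFlow solution has to either keep this jagged shape (for example as a sparse tensor) or pad every row to the length of the longest word.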

Answer 1

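I ended up using tf.while_loop inside Dataset.map. Below is a working example that reads a file with one example per line. It is not very elegant, but it accomplishes the goal.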

import tensorflow as tf

def split_line(line):
    # Split the line into words
    line = tf.expand_dims(line, axis=0)
    line = tf.string_split(line, delimiter=' ')

    # Loop over the resulting words, split them into characters, and stack them back together
    def body(index, words):                                                         
        next_word = tf.sparse_slice(line, start=tf.to_int64(index), size=[1, 1]).values
        next_word = tf.string_split(next_word, delimiter='')
        words = tf.sparse_concat(axis=0, sp_inputs=[words, next_word], expand_nonconcat_dim=True)
        return index+[0, 1], words
    def condition(index, words):           
        return tf.less(index[1], tf.size(line))

    i0 = tf.constant([0,1]) 
    first_word = tf.string_split(tf.sparse_slice(line, [0,0], [1, 1]).values, delimiter='')
    _, line = tf.while_loop(condition, body, loop_vars=[i0, first_word], back_prop=False) 

    # Convert to dense              
    return tf.sparse_tensor_to_dense(line, default_value=' ')

dataset = tf.data.TextLineDataset(['./example.txt'])
dataset = dataset.map(split_line)
iterator = dataset.make_initializable_iterator()
parsed_line = iterator.get_next()

sess = tf.Session()
sess.run(iterator.initializer)
for example in range(3):
    print(sess.run(parsed_line))
    print()

Result

[['T' 'h' 'i' 's' ' ']
 ['i' 's' ' ' ' ' ' ']
 ['t' 'h' 'e' ' ' ' ']
 ['f' 'i' 'r' 's' 't']
 ['l' 'i' 'n' 'e' '.']]

[['A' ' ' ' ' ' ' ' ' ' ' ' ' ' ']
 ['s' 'e' 'c' 'o' 'n' 'd' ' ' ' ']
 ['e' 'x' 'a' 'm' 'p' 'l' 'e' '.']]

[['T' 'h' 'i' 'r' 'd' '.']]
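
For checking the pipeline's output, a plain-Python reference of the same padding logic can be handy. This is only a sketch: the helper `pad_words` is hypothetical, and the three input lines are an assumption about the contents of example.txt, inferred from the output shown above.

```python
def pad_words(line, pad=' '):
    """Split a line on spaces and pad each word's characters to a common width."""
    words = line.split(' ')
    width = max(len(word) for word in words)
    return [list(word) + [pad] * (width - len(word)) for word in words]


# Assumed file contents, reconstructed from the printed result.
lines = ['This is the first line.', 'A second example.', 'Third.']

for line in lines:
    for row in pad_words(line):
        print(row)
    print()
```

Each line becomes a rectangular list of lists, with shorter words padded with the same ' ' value that the answer passes as default_value to tf.sparse_tensor_to_dense.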

Answer 2

This sounds like preprocessing; you would be much better off using the Dataset preprocessing pipeline.

https://www.tensorflow.org/programmers_guide/datasets

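You will start by importing your raw strings. Then use tf.data.Dataset.map(...) to map each string to a variable-length tensor of words. I did this just a few days ago and posted an example in this question: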

In Tensorflow's Dataset API how do you map one element into multiple elements?

You will want to follow that up with tf.data.Dataset.flat_map(...) to flatten the variable-length rows of word tokens into individual samples.
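
As a rough analogy for the map-then-flat_map step, the plain-Python sketch below shows the difference in shape between the two stages. It is not TensorFlow code, and the sample lines are made up for illustration; it only mirrors the semantics of mapping one element to a variable-length result and then flattening.

```python
from itertools import chain

# Hypothetical input lines, standing in for the raw strings in the dataset.
lines = ['This is the first line.', 'A second example.']

# "map": one element in, one element out. Each line becomes a
# variable-length list of words, so there are still two elements.
word_lists = [line.split() for line in lines]
print(len(word_lists))

# "flat_map": each element expands into a sequence and the sequences
# are concatenated, so every word becomes its own sample.
words = list(chain.from_iterable(word_lists))
print(words)
```

After flattening, each sample is a single word, which sidesteps the shape mismatch between lines with different word counts.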

The Dataset pipeline is new in TF 1.4 and appears to be the way input pipelines will be handled in TensorFlow, so it is worth the effort to learn.

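This question may also be useful to you; I ran into it while doing something similar to what you are doing. Do not start with this question if you are just getting started with the TF pipeline, but you may find it useful along the way.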

Using tensorflow's Dataset pipeline, how do I *name* the results of a `map` operation?

