Question

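I am using TFlearn's VocabularyProcessor to map documents to integer arrays. However, I don't seem to be able to initialize the VocabularyProcessor with my own vocabulary. The docs say that I can provide a vocabulary when creating the VocabularyProcessor, as in: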

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length, vocabulary=vocab)

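However, when I create the VocabularyProcessor like this, my documents are not transformed correctly. I provide the vocabulary as a dictionary, with the word indices as values: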

vocab={'hello':3, '.':5, 'world':20}

The sentences look like this:

sentences = ['hello summer .', 'summer is here .', ...]

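It is important that the VocabularyProcessor uses the given indices to transform the documents, because each index refers to a particular word embedding. When I call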

list(vocab_processor.transform(['hello world .', 'hello']))

the output is

[array([ 3, 20, 0]), array([3, 0, 0])]

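So the sentences were not transformed according to the provided vocabulary, which maps '.' to 5. How do I provide the vocabulary to the VocabularyProcessor correctly?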

Answer 1

Let's run a small experiment to answer your question:

vocab={'hello':3, '.':5, 'world':20, '/' : 10}
sentences= ['hello world . / hello', 'hello']

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length=6, vocabulary=vocab)
list(vocab_processor.transform(sentences))

The output of this snippet is

[array([ 3, 20,  3,  0,  0,  0]), array([3, 0, 0, 0, 0, 0])]

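As you can see, neither the dot ('.') nor the slash ('/') is emitted as a token by the default tokenizer; whitespace only separates tokens. So in your code, TensorFlow recognizes just two words and pads with an extra zero to reach max_document_length=3. To have those characters tokenized, you can write your own tokenizer function. An example is given below.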

def my_func(iterator):
  return (x.split(" ") for x in iterator)

vocab={'hello':3, '.':5, 'world':20, '/' : 10}
sentences= ['hello world . / hello', 'hello']

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length=6, vocabulary=vocab, tokenizer_fn = my_func)

list(vocab_processor.transform(sentences))

Now the output of the snippet is

[array([ 3, 20,  5, 10,  3,  0]), array([3, 0, 0, 0, 0, 0])]

which is the expected output. I hope this clears up your confusion.
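To make the lookup-and-pad step concrete, here is a minimal pure-Python sketch of what happens once the tokens are right. It is not the library's implementation: `transform_sketch` and `whitespace_tokenizer` are hypothetical names, and mapping unknown tokens to 0 is an assumption made for illustration.

```python
def whitespace_tokenizer(documents):
    """Yield one list of tokens per document, splitting on single spaces."""
    for doc in documents:
        yield doc.split(" ")


def transform_sketch(documents, vocab, max_document_length,
                     tokenizer=whitespace_tokenizer):
    """Map tokens to the ids in `vocab`, then truncate/pad each row with 0.

    Assumption: tokens missing from `vocab` are mapped to 0.
    """
    rows = []
    for tokens in tokenizer(documents):
        ids = [vocab.get(tok, 0) for tok in tokens[:max_document_length]]
        ids += [0] * (max_document_length - len(ids))  # right-pad with zeros
        rows.append(ids)
    return rows


vocab = {'hello': 3, '.': 5, 'world': 20, '/': 10}
sentences = ['hello world . / hello', 'hello']
print(transform_sketch(sentences, vocab, max_document_length=6))
# [[3, 20, 5, 10, 3, 0], [3, 0, 0, 0, 0, 0]]
```

The ids in the result come straight from the dictionary, so they stay aligned with the embedding rows as long as the tokenizer emits exactly the strings used as keys.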

Your next question may be what gets tokenized by default. Here is the original source, so there is no confusion:

TOKENIZER_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",
                          re.UNICODE)
def tokenizer(iterator):
  """Tokenizer generator.
  Args:
    iterator: Input iterator with strings.
  Yields:
    array of tokens per each value in the input.
  """
  for value in iterator:
    yield TOKENIZER_RE.findall(value)
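As a quick check, the regex quoted above can be run on its own with only the standard `re` module. The helper name `default_tokens` is made up for this sketch; the pattern is copied from the source above.

```python
import re

# Pattern copied from the tokenizer source quoted above.
TOKENIZER_RE = re.compile(
    r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+", re.UNICODE)


def default_tokens(text):
    """Return the tokens the quoted regex finds in a single string."""
    return TOKENIZER_RE.findall(text)


print(default_tokens('hello world . / hello'))  # ['hello', 'world', 'hello']
print(default_tokens('hello summer .'))         # ['hello', 'summer']
```

The standalone '.' and '/' never match any alternative of the pattern, so they are not produced as tokens and can never be looked up in the vocabulary.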

But my suggestion is: "write your own function and be confident".

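Also, I'd like to point out something in case you missed it (hopefully not). If you only call transform(), the min_frequency parameter has no effect, because transform() does not fit the vocabulary to the data. You can see this with the following code: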

for i in range(6):
    vocab_processor = learn.preprocessing.VocabularyProcessor(
        max_document_length=7, min_frequency=i)
    tokens = vocab_processor.transform(["a b c d e f","a b c d e","a b c" , "a b", "a"])
    print(list(tokens)[0])

Output:

[1 2 3 4 5 6 0]
[1 2 3 4 5 6 0]
[1 2 3 4 5 6 0]
[1 2 3 4 5 6 0]
[1 2 3 4 5 6 0]
[1 2 3 4 5 6 0]

Now the same code again, this time using fit_transform():

for i in range(6):
    vocab_processor = learn.preprocessing.VocabularyProcessor(
        max_document_length=7, min_frequency=i)
    tokens = vocab_processor.fit_transform(["a b c d e f","a b c d e","a b c" , "a b", "a"])
    print(list(tokens)[0])

Output:

[1 2 3 4 5 6 0]
[1 2 3 4 5 0 0]
[1 2 3 0 0 0 0]
[1 2 0 0 0 0 0]
[1 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
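The pattern in the second set of outputs can be reproduced with a small pure-Python sketch. It is inferred from the outputs above rather than taken from the library: `fit_transform_sketch` is a hypothetical name, and the rules it uses (keep tokens seen more than min_frequency times, number them from 1 by descending frequency, map everything else to 0) are assumptions that happen to match this example.

```python
from collections import Counter


def fit_transform_sketch(docs, max_document_length, min_frequency):
    """Fit a frequency-trimmed vocabulary on `docs`, then map and pad.

    Assumptions (inferred from the outputs above, not from the library):
    tokens with count <= min_frequency are dropped and become 0; the
    remaining tokens get ids 1, 2, ... in order of descending frequency,
    with ties broken alphabetically.
    """
    tokenized = [doc.split() for doc in docs]
    freq = Counter(tok for tokens in tokenized for tok in tokens)
    kept = sorted((t for t, c in freq.items() if c > min_frequency),
                  key=lambda t: (-freq[t], t))
    ids = {tok: i for i, tok in enumerate(kept, start=1)}
    rows = []
    for tokens in tokenized:
        row = [ids.get(t, 0) for t in tokens[:max_document_length]]
        row += [0] * (max_document_length - len(row))  # right-pad with zeros
        rows.append(row)
    return rows


docs = ["a b c d e f", "a b c d e", "a b c", "a b", "a"]
for i in range(6):
    print(fit_transform_sketch(docs, max_document_length=7, min_frequency=i)[0])
```

Here 'a' occurs 5 times, 'b' 4, 'c' 3, 'd' and 'e' twice each, and 'f' once, which is why one more tail of the first document turns to zeros each time min_frequency passes one of those counts. The trimming needs the counts, and the counts only exist after fitting.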

Answer 2

This should work:

processor = learn.preprocessing.VocabularyProcessor(
    max_document_length=4, 
    vocabulary={'hello':2, 'world':20})

list(processor.transform(['world hello']))
>> [array([20,  2,  0, 0])]

Note that the output of this method has shape (1, max_document_length), which is why the last two positions are padded with zeros.

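Update: regarding the '.' in your vocabulary, I think it is not recognized as a token by the processor's default tokenizer (hence the 0). The default tokenizer uses a very simple regex to do the actual work of identifying tokens (see the library source). To solve this, I guess you should give the VocabularyProcessor your own tokenizer by passing it as the 4th argument, tokenizer_fn.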
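One possible shape for such a tokenizer, as a sketch only: it follows the same interface as the default one (it takes an iterator of strings and yields a list of tokens per string) but also emits each punctuation character as its own token. The name `punct_tokenizer` and the regex are my own choices, not something from the library.

```python
import re

# Runs of word characters/apostrophes, or any single non-space symbol.
PUNCT_AWARE_RE = re.compile(r"[\w']+|[^\w\s]", re.UNICODE)


def punct_tokenizer(iterator):
    """Yield a list of tokens per input string, keeping punctuation."""
    for value in iterator:
        yield PUNCT_AWARE_RE.findall(value)


print(list(punct_tokenizer(['hello world.', 'summer is here .'])))
# [['hello', 'world', '.'], ['summer', 'is', 'here', '.']]
```

Unlike a plain split on spaces, this also separates a dot that is attached to the preceding word, so 'world.' still produces the '.' token that the vocabulary maps to 5.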

TFlearn - VocabularyProcessor ignores parts of given vocabulary
