I am using the TensorFlow vocabulary processor, imported as follows:
from tensorflow.contrib import learn
vocabulary = learn.preprocessing.VocabularyProcessor(length)
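I wrote a unit test to make sure I can save the vocabulary, reload it, and fit it to new sentences while keeping track of the old ones.
Here are my results: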
The fit sentence: [1 2 3 4 5 6 2 7 8 4 5 9 7]
The new fit sentence: [0 0 0 2 9 0 6 2 7 8 4 0 0]
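This part works: the word at position 1 of the first sentence (encoded as 2) has the same value as the word at position 3 of the second sentence (2), because they are the same word.
However, I noticed that all the new words are encoded as 0.
I expected my new sentence to look like this: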
[10 11 12 2 9 10 6 2 7 8 4 12 11]
How can I fix this? How do I get my vocabulary processor to learn new words?
Thanks!
Edit 1:
Here is a simplified version of my unit test:
import numpy as np
from tensorflow.contrib import learn
# A test sentence
test_sentence = "This is a test sentence. It is used to test. sentence, this, used"
test_sentence_len = len(test_sentence.split(" "))
# A vocabulary processor
vocabulary_processor = learn.preprocessing.VocabularyProcessor(test_sentence_len)
# Turning a list of sentences ( [test_sentence] ) into a list of fit test sentences and taking the first one.
fit_test_sentence = np.array(list(vocabulary_processor.fit_transform([test_sentence])))[0]
# We see that "is" ( position 1 ) and "is" ( position 6 ) are the same. They should have the same numeric value
# in the fit array as well
print("The fit sentence: ", fit_test_sentence)
# self.assertEqual(fit_test_sentence[1], fit_test_sentence[6])
initial_fit_sentence = fit_test_sentence
# Now, let's save
vocabulary_processor.save("some/path")
# Now, we load into a different variable
new_vocabulary_processor = learn.preprocessing.VocabularyProcessor.restore("some/path")
new_test_sentence = "Very different uttering is this one. It is used to test."
# Now, we fit the new sentence with the new vocabulary, which should be the old one
# We should see "is" being transformed into the same numerical value, initial_fit_sentence[1]
new_fit_sentence = np.array(list(new_vocabulary_processor.fit_transform([new_test_sentence])))[0]
print("The new fit sentence: ", new_fit_sentence)
# self.assertEqual(initial_fit_sentence[1], new_fit_sentence[3])
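I tried changing the value of test_sentence_len, thinking the vocabulary might have no room left for new words, but even when I set it to 1000 it does not learn any new words.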
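Answer 0 (score: 0)
It looks like the fit_transform method freezes the vocabulary. This means that anything not observed before that point gets the ID 0 (UNK). You can unfreeze the vocabulary with new_vocabulary_processor.vocabulary_.freeze(False):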
new_vocabulary_processor = learn.preprocessing.VocabularyProcessor.restore("some/path")
new_vocabulary_processor.vocabulary_.freeze(False)
new_test_sentence = "Very different uttering is this one. It is used to test."
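To make the freezing behaviour easier to see without TensorFlow installed, here is a minimal pure-Python sketch. TinyVocabulary and the fit_transform helper are hypothetical stand-ins written for illustration only; they are not the tf.contrib.learn implementation and only mimic the behaviour described in this answer (unknown tokens map to 0 while the vocabulary is frozen, and fitting freezes it again).

```python
class TinyVocabulary:
    """Hypothetical stand-in for a freezable vocabulary (not the TensorFlow class)."""

    def __init__(self):
        self._ids = {"<UNK>": 0}  # ID 0 is reserved for unknown tokens
        self._frozen = False

    def freeze(self, freeze=True):
        # freeze(False) lets the vocabulary accept new tokens again
        self._frozen = freeze

    def get(self, token):
        if token not in self._ids:
            if self._frozen:
                return 0  # frozen: unseen tokens collapse to UNK
            self._ids[token] = len(self._ids)  # not frozen: assign the next free ID
        return self._ids[token]


def fit_transform(vocab, sentence):
    """Assumed behaviour: learn tokens (if allowed), freeze, then encode."""
    tokens = sentence.split()
    for token in tokens:
        vocab.get(token)
    vocab.freeze()
    return [vocab.get(token) for token in tokens]


vocab = TinyVocabulary()
first = fit_transform(vocab, "this is a test")
# The first fit left the vocabulary frozen, so new words become 0
frozen = fit_transform(vocab, "very different is this")
# Unfreeze before fitting again so that new words receive fresh IDs
vocab.freeze(False)
thawed = fit_transform(vocab, "very different is this")
print(first, frozen, thawed)
```

In this sketch the second call reproduces the symptom from the question (new words encoded as 0), and the third call shows that unfreezing before fitting lets the new words receive IDs that continue after the existing ones, while old words keep their original IDs.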