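I'm just getting started with TensorFlow and am trying to process a corpus of LaTeX documents. For this I want to use a SubwordTextEncoder that uses a single token for each LaTeX macro. A macro \foo is encoded as the token "\<c<foo>", in order to satisfy the requirement that reserved tokens "contain a mix of alphanumeric and non-alphanumeric characters".
I have a list comms of such reserved tokens, whose first entry is "\<c<begin>", and a list seq of dictionaries, each of which has a "tokens" entry containing a string of space-separated tokens that represents a piece of LaTeX, with the LaTeX macros pre-encoded using the scheme above. I do: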
import tensorflow_datasets as tfds

def corpus_generator():
    for o in seq:
        yield o["tokens"]

encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_generator(), 4000, reserved_tokens=comms)
But when I test that encoder like this:
teststring = seq[0]["tokens"]
print(comms[0])
print(teststring)
encoded = encoder.encode(teststring)
for index in encoded:
    print("{} --> {}".format(index, encoder.decode([index])))
I get the following output:
\<c<begin>
\<c<begin> { definition } We call two mathematical objects \<m$< a \<m$> and \<m$< b \<m$> equal , ( written \<m$< \<c<eq> { a , b } \<m$> ) , iff there are no properties that discern them . \<c<end> { definition }
2178 --> \<c<b
2873 --> eg
1108 --> in
3491 --> >
605 --> {
611 --> definition
608 --> }
656 --> We
644 --> call
705 --> two
891 --> mathematical
888 --> objects
606 --> \<m$<
612 --> a
609 --> \<m$>
621 --> and
606 --> \<m$<
745 --> b
609 --> \<m$>
1434 --> equal
2227 --> , (
724 --> written
606 --> \<m$<
659 --> \<c<eq>
605 --> {
3526 --> a
610 --> ,
3527 --> b
608 --> }
613 --> \<m$>
671 --> ) ,
642 --> iff
672 --> there
646 --> are
798 --> no
1841 --> properties
624 --> that
2900 --> discern
3461 -->
1549 --> them
614 --> .
989 --> \<c<
1996 --> end
3491 --> >
605 --> {
611 --> definition
620 --> }
...which notably splits exactly those tokens that should be reserved. What am I doing wrong?
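For concreteness, here is a minimal sketch of the inputs described above. The helper names encode_macro and is_mixed are hypothetical and only for illustration, and every entry of comms other than the first is an assumption; the "tokens" string is the one printed in the output above.

```python
def encode_macro(name):
    # Hypothetical helper: encode the LaTeX macro \name as the token \<c<name>.
    return "\\<c<" + name + ">"

def is_mixed(token):
    # True if the token contains both alphanumeric and non-alphanumeric characters.
    has_alnum = any(ch.isalnum() for ch in token)
    has_other = any(not ch.isalnum() for ch in token)
    return has_alnum and has_other

# Reserved tokens. Only the first entry ("\<c<begin>") is known from the
# question; "end" and "eq" are assumed here because they occur in the sample.
comms = [encode_macro(name) for name in ("begin", "end", "eq")]

# One corpus entry: space-separated tokens with macros pre-encoded.
seq = [
    {
        "tokens": r"\<c<begin> { definition } We call two mathematical objects "
                  r"\<m$< a \<m$> and \<m$< b \<m$> equal , ( written "
                  r"\<m$< \<c<eq> { a , b } \<m$> ) , iff there are no "
                  r"properties that discern them . \<c<end> { definition }"
    },
]

# Sanity check: every reserved token mixes both character classes.
assert all(is_mixed(token) for token in comms)
```

With inputs of this shape, I would expect every whitespace-separated token in seq that also appears in comms to come back from the encoder as a single id.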