Question

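I'm just starting out with TensorFlow and am trying to process a corpus of LaTeX documents. For this I want to use a SubwordTextEncoder that uses a single token for each LaTeX macro. A macro \foo is encoded as the token "\<c<foo>", to satisfy the requirement that reserved tokens "contain a mix of alphanumeric and non-alphanumeric characters".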

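I have:

a list comms of the macros occurring in the corpus, whose first entry is "\<c<begin>".
a list seq of dictionaries, each with an entry "tokens" containing a string of space-separated tokens that represents a piece of LaTeX, with the LaTeX macros pre-encoded using the encoding above.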
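To make the setup concrete, here is a minimal sketch of the macro encoding and of the shape of the two inputs. The helper names `encode_macro` and `has_mixed_characters` and the sample values are hypothetical, made up for illustration; only the "\<c<foo>" scheme and the names comms and seq come from the question.

```python
def encode_macro(name: str) -> str:
    """Wrap a LaTeX macro name (e.g. 'foo' for \\foo) as the token \\<c<foo>."""
    return "\\<c<" + name + ">"


def has_mixed_characters(token: str) -> bool:
    """True if the token contains both alphanumeric and non-alphanumeric characters."""
    has_alnum = any(ch.isalnum() for ch in token)
    has_other = any(not ch.isalnum() for ch in token)
    return has_alnum and has_other


# Hypothetical sample data in the shape described above.
comms = [encode_macro(name) for name in ["begin", "end", "eq"]]
seq = [{"tokens": "\\<c<begin> { definition } \\<c<end> { definition }"}]
```

Every entry of comms built this way mixes letters with the characters \, < and >, which is the property the reserved-token requirement quoted above asks for.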

I do:

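import tensorflow_datasets as tfds

def corpus_generator():
    # Yield the pre-encoded, space-separated token string of each document.
    for o in seq:
        yield o["tokens"]
encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_generator(), target_vocab_size=4000, reserved_tokens=comms)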

But when I test the encoder like this:

teststring = seq[0]["tokens"]
print(comms[0])
print(teststring)
encoded = encoder.encode(teststring)
for index in encoded:
    print("{} --> {}".format(index,encoder.decode([index])))

I get the following output:

\<c<begin>
\<c<begin> { definition } We call two mathematical objects \<m$< a \<m$> and \<m$< b \<m$> equal , ( written \<m$< \<c<eq> { a , b } \<m$> ) , iff there are no properties that discern them . \<c<end> { definition }
2178 --> \<c<b
2873 --> eg
1108 --> in
3491 --> >
605 -->  { 
611 --> definition
608 -->  } 
656 --> We 
644 --> call 
705 --> two 
891 --> mathematical 
888 --> objects 
606 --> \<m$< 
612 --> a 
609 --> \<m$> 
621 --> and 
606 --> \<m$< 
745 --> b 
609 --> \<m$> 
1434 --> equal
2227 -->  , ( 
724 --> written 
606 --> \<m$< 
659 --> \<c<eq>
605 -->  { 
3526 --> a
610 -->  , 
3527 --> b
608 -->  } 
613 --> \<m$>
671 -->  ) , 
642 --> iff 
672 --> there 
646 --> are 
798 --> no 
1841 --> properties 
624 --> that 
2900 --> discern
3461 -->  
1549 --> them
614 -->  . 
989 --> \<c<
1996 --> end
3491 --> >
605 -->  { 
611 --> definition
620 -->  }

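...so it splits up precisely those tokens that were supposed to be reserved (\<c<begin> and \<c<end> above, although \<c<eq> survives as a single token). What am I doing wrong?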
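One way to narrow this down is to check every reserved token in isolation rather than reading through a full encoded sentence. The sketch below is only a diagnostic, not a fix; `find_split_tokens` is a hypothetical helper, and it assumes nothing about the encoder beyond the `encode` and `decode` methods already used in the test above.

```python
def find_split_tokens(encoder, reserved_tokens):
    """Return {token: pieces} for each reserved token encoded as more than one id."""
    split = {}
    for token in reserved_tokens:
        ids = encoder.encode(token)
        if len(ids) != 1:
            # Decode each id separately to see where the token was cut.
            split[token] = [encoder.decode([i]) for i in ids]
    return split
```

Calling `find_split_tokens(encoder, comms)` would list which macros are affected and how each one is cut, which may show whether the problem is tied to particular macro names.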

SubwordTextEncoder.build_from_corpus splits tokens in reserved_tokens

0 Answers: