Spans from the spaCy sentence tokenizer

Time: 2019-12-11 19:41:58

Tags: python spacy

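I am using spaCy to split a document into sentences. After tokenization, I need to be able to reconstruct the original document. How can I get the character span of each sentence?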

 import spacy

 s='this is sentence1.\nthis is sentence2.'
 nlp = spacy.load('en_core_web_sm')
 doc = nlp(s)
 for sent in doc.sents:
     print(sent.text.span)  # my attempt; sent.text is a str and has no .span

I would like to get the span of each sentence found. The expected output for these two sentences is:

 [ 0,19]
 [19,37]

Is there a way to get the span of each sent?

1 answer:

Answer 0 (score: 1)

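Since sent is of type spacy.tokens.span.Span, you can read the object's start_char and end_char attributes, which give the character offsets of the sentence within the original text: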

print( [sent.start_char, sent.end_char] )

Python test:

import spacy
nlp = spacy.load("en_core_web_sm")
s='this is sentence1.\nthis is sentence2.'
doc = nlp(s)

for sent in doc.sents:
    print( [sent.start_char, sent.end_char] )

Output:

[0, 19]
[19, 37]
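Because the question's goal is to rebuild the original document, here is a minimal sketch of how such character spans could be used for that. It deliberately does not call spaCy: the spans are hard-coded from the output reported above, and the helper names slice_by_spans and spans_cover_text are hypothetical, not part of any library. It assumes the spans are contiguous and cover the whole string, which the second helper checks.

```python
def slice_by_spans(text, spans):
    """Return the substring of `text` for each (start, end) character span."""
    return [text[start:end] for start, end in spans]


def spans_cover_text(text, spans):
    """Check that the spans are contiguous and run from 0 to len(text)."""
    expected_start = 0
    for start, end in spans:
        if start != expected_start:
            return False
        expected_start = end
    return expected_start == len(text)


s = 'this is sentence1.\nthis is sentence2.'

# Spans copied from the output above. With spaCy they would be built as
# [(sent.start_char, sent.end_char) for sent in doc.sents].
spans = [(0, 19), (19, 37)]

pieces = slice_by_spans(s, spans)
rebuilt = ''.join(pieces)

print(spans_cover_text(s, spans))
print(rebuilt == s)
```

If the spans left a gap, for example because whitespace between two sentences was not included in either span, joining the slices would not reproduce the input, and the coverage check would report that.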