I'm getting the tokens of a string with:
doc = nlp(u"This is the first sentence. This is the second sentence.")
for token in doc:
    print(token.i, token.text)
which outputs:
0 This
1 is
2 the
3 first
4 sentence
5 .
6 This
7 is
8 the
9 second
10 sentence
11 .
How do I get the sentence number as well, so that the output is (SENTENCE_NUMBER, token.i, token.text), like this:
0 0 This
0 1 is
0 2 the
0 3 first
0 4 sentence
0 5 .
1 0 This
1 1 is
1 2 the
1 3 second
1 4 sentence
1 5 .
I can reset the token numbering inside the loop, but how do I get the sentence number from the doc?
Answer 0 (score: 3)
There is no built-in sentence index, but you can iterate over the sentences:
for sent_i, sent in enumerate(doc.sents):
    for token in sent:
        print(sent_i, token.i, token.text)
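Note that token.i stays doc-level (6-11 for the second sentence). If you want the token index to restart at 0 inside each sentence, as in the desired output above, one option (a small sketch, not from the original answer) is to subtract the sentence span's start offset:
for sent_i, sent in enumerate(doc.sents):
    for token in sent:
        # sent.start is the doc-level index of the sentence's first token,
        # so the difference restarts at 0 for every sentence
        print(sent_i, token.i - sent.start, token.text)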
If you need to store it for use elsewhere, you can save the sentence index on the spans or tokens with a custom extension attribute: https://spacy.io/usage/processing-pipelines#custom-components-attributes
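A minimal sketch of that approach, assuming a custom attribute called sent_i (the name is just an example, not something defined by spaCy):
from spacy.tokens import Token

# "sent_i" is a made-up attribute name for this example
Token.set_extension("sent_i", default=None)

for sent_i, sent in enumerate(doc.sents):
    for token in sent:
        token._.sent_i = sent_i

print(doc[7]._.sent_i)  # 1, since the second "is" belongs to the second sentence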
Answer 1 (score: 0)
As a list:
doc = nlp(u"This is the first sentence. This is the second sentence.")
[sent_id for sent_id, sent in enumerate(doc.sents) for token in sent]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
or
doc = nlp(u"This is the first sentence. This is the second sentence.")
[(sent_id, token.i, token.text) for sent_id, sent in enumerate(doc.sents) for token in sent]
[(0, 0, 'This'), (0, 1, 'is'), (0, 2, 'the'), (0, 3, 'first'), (0, 4, 'sentence'), (0, 5, '.'), (1, 6, 'This'), (1, 7, 'is'), (1, 8, 'the'), (1, 9, 'second'), (1, 10, 'sentence'), (1, 11, '.')]
As a numpy array:
doc = nlp(u"This is the first sentence. This is the second sentence.")
import numpy as np
np.cumsum(doc.to_array(['SENT_START', ])) - 1
[0 0 0 0 0 0 1 1 1 1 1 1]
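To get the same (sentence, token, text) tuples from this numpy result, one option is to index back into the array with each token's position (a sketch building on the snippet above):
sent_ids = np.cumsum(doc.to_array(['SENT_START', ])) - 1
[(int(sent_ids[token.i]), token.i, token.text) for token in doc]
# same tuples as the list comprehension above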
As a pandas dataframe, using DframCy from the spaCy Universe (pip install dframcy):
doc = nlp(u"This is the first sentence. This is the second sentence.")
from dframcy import DframCy
dframcy = DframCy(nlp)
spacy_df = dframcy.to_dataframe(doc, ['is_sent_start', 'id', 'text', ]).reset_index()
spacy_df.token_is_sent_start = spacy_df.token_is_sent_start.astype(bool).cumsum() - 1
spacy_df = spacy_df.rename(columns={'token_is_sent_start': 'sentence_id',
                                    'index': 'token_id',
                                    'token_text': 'token_text', })
spacy_df
token_id sentence_id token_text
0 0 0 This
1 1 0 is
2 2 0 the
3 3 0 first
4 4 0 sentence
5 5 0 .
6 6 1 This
7 7 1 is
8 8 1 the
9 9 1 second
10 10 1 sentence
11 11 1 .
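If you would rather not add an extra dependency, a roughly equivalent dataframe can be built with plain pandas from the sentence iterator (a sketch, not from the original answer):
import pandas as pd

rows = [(token.i, sent_id, token.text)
        for sent_id, sent in enumerate(doc.sents)
        for token in sent]
pd.DataFrame(rows, columns=['token_id', 'sentence_id', 'token_text'])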