How to get the sentence number in spaCy?

Date: 2019-10-02 08:26:43

Tags: python nlp spacy

I am getting the tokens of a string:

doc = nlp(u"This is the first sentence. This is the second sentence.")
for token in doc:
    print(token.i, token.text)

with the output:

0 This
1 is
2 the
3 first
4 sentence
5 .
6 This
7 is
8 the
9 second
10 sentence
11 .

How do I get the sentence number as well, so the output is (SENTENCE_NUMBER, token.i, token.text)?

0 0 This
0 1 is
0 2 the
0 3 first
0 4 sentence
0 5 .
1 0 This
1 1 is
1 2 the
1 3 second
1 4 sentence
1 5 .

I can reset the token counter inside the loop, but how do I get the sentence number from the doc?

2 answers:

Answer 0 (score: 3):

There is no built-in sentence index, but you can iterate over the sentences:

for sent_i, sent in enumerate(doc.sents):
    for token in sent:
        print(sent_i, token.i, token.text)

If you need to store the sentence index for use elsewhere, you can save it on the spans or tokens with a custom extension attribute: https://spacy.io/usage/processing-pipelines#custom-components-attributes
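Note that in the asker's desired output the token index restarts at 0 for each sentence, while `token.i` is the document-level index. A sentence-relative index can be computed as `token.i - sent.start`. A minimal sketch of this (assuming spaCy v3, and using the rule-based sentencizer so no trained model download is needed):

```python
import spacy

# Blank English pipeline with the rule-based sentencizer,
# so no trained model is required for sentence boundaries.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. This is the second sentence.")

rows = []
for sent_i, sent in enumerate(doc.sents):
    for token in sent:
        # token.i is the doc-level index; subtracting sent.start
        # (the doc-level index of the sentence's first token)
        # gives the index within the sentence.
        rows.append((sent_i, token.i - sent.start, token.text))

print(rows)
```

This yields triples like (1, 0, 'This') for the first token of the second sentence, matching the numbering the question asks for.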

Answer 1 (score: 0):

As a list:

doc = nlp(u"This is the first sentence. This is the second sentence.")

[sent_id for sent_id, sent in enumerate(doc.sents) for token in sent]
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

doc = nlp(u"This is the first sentence. This is the second sentence.")

[(sent_id, token.i, token.text) for sent_id, sent in enumerate(doc.sents) for token in sent]
[(0, 0, 'This'), (0, 1, 'is'), (0, 2, 'the'), (0, 3, 'first'), (0, 4, 'sentence'), (0, 5, '.'), (1, 6, 'This'), (1, 7, 'is'), (1, 8, 'the'), (1, 9, 'second'), (1, 10, 'sentence'), (1, 11, '.')]

As a numpy array:

doc = nlp(u"This is the first sentence. This is the second sentence.")

import numpy as np
np.cumsum(doc.to_array(['SENT_START', ])) - 1
[0 0 0 0 0 0 1 1 1 1 1 1]
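The SENT_START column is 1 at each token that begins a sentence and 0 elsewhere, so the cumulative sum counts sentence starts; subtracting 1 makes the ids zero-based. The arithmetic can be checked with plain numpy (the input array below is written by hand to mirror the twelve tokens above, not produced by spaCy):

```python
import numpy as np

# 1 marks a token that starts a sentence, 0 marks any other token;
# this mirrors doc.to_array(['SENT_START']) for the example document.
sent_start = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

# Running total of sentence starts, shifted down by 1 to be zero-based.
sentence_ids = np.cumsum(sent_start) - 1
print(sentence_ids)  # [0 0 0 0 0 0 1 1 1 1 1 1]
```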

As a pandas DataFrame:

Using DframCy from the spaCy Universe (pip install dframcy):

doc = nlp(u"This is the first sentence. This is the second sentence.")

from dframcy import DframCy
dframcy = DframCy(nlp)
spacy_df = dframcy.to_dataframe(doc, ['is_sent_start', 'id', 'text', ]).reset_index()
spacy_df.token_is_sent_start = spacy_df.token_is_sent_start.astype(bool).cumsum() - 1
spacy_df = spacy_df.rename(columns={'token_is_sent_start': 'sentence_id',
                                    'index': 'token_id',
                                    'token_text': 'token_text', })

spacy_df
    token_id  sentence_id token_text
0          0            0       This
1          1            0         is
2          2            0        the
3          3            0      first
4          4            0   sentence
5          5            0          .
6          6            1       This
7          7            1         is
8          8            1        the
9          9            1     second
10        10            1   sentence
11        11            1          .