Extracting the word range of a sentence from its character range in a paragraph

Time: 2018-12-14 07:18:09

Tags: python regex

I have a paragraph of words:

Birds are a group of endothermic vertebrates, characterised by feathers Birds are also known as Aves They have toothless beaked jaws They have a high metabolic rate Birds are also known as Aves

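What I need to do is find the sentence "Birds are also known as Aves", which occurs more than once. So I wrote a regex to match the character indices of "Birds are also known as Aves" in this paragraph. I get two matches: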

Here, span denotes the character range.

<_sre.SRE_Match object; span=(72, 100), match='Birds are also known as Aves'>
<_sre.SRE_Match object; span=(165, 193), match='Birds are also known as Aves'>

But I need the word range, not the character range. For example, the word range would be (10, 16) for the first match and (27, 33) for the second.
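
The question does not show the regex that was used. A minimal sketch that would produce the character spans quoted above, assuming the phrase is searched literally with `re.finditer` (the names `phrase` and `char_spans` are illustrative, not from the original post):

```python
import re

s = ('Birds are a group of endothermic vertebrates, characterised by feathers '
     'Birds are also known as Aves They have toothless beaked jaws They have a '
     'high metabolic rate Birds are also known as Aves')
phrase = 'Birds are also known as Aves'

# re.escape keeps the phrase literal even if it contained regex metacharacters.
char_spans = [m.span() for m in re.finditer(re.escape(phrase), s)]
print(char_spans)  # [(72, 100), (165, 193)]
```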

2 Answers:

Answer 0: (score: 2)

Regex does not support this directly, but you can compute it on the fly like this:

import re
s = 'Birds are a group of endothermic vertebrates, characterised by feathers Birds are also known as Aves They have toothless beaked jaws They have a high metabolic rate Birds are also known as Aves'

pat = 'Birds are also known as Aves'
pat_len = len(pat.split())  # number of words in the phrase
for x in re.finditer(pat, s):
    # The number of words before the match is the word index where it starts.
    start = len(s[:x.start()].split())
    end = start + pat_len  # exclusive end index
    print(start, end)  # prints "10 16", then "27 33"
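
The same idea can be wrapped in a reusable function. This is only a sketch: the name `word_spans` is hypothetical, and it assumes words are separated by whitespace and that each match starts at the beginning of a word.

```python
import re

def word_spans(text, phrase):
    """Return (start, end) word indices, end exclusive, for each occurrence of phrase."""
    n_words = len(phrase.split())
    spans = []
    # re.escape treats the phrase as literal text rather than as a pattern.
    for m in re.finditer(re.escape(phrase), text):
        # Count the whitespace-separated words that precede the match.
        start = len(text[:m.start()].split())
        spans.append((start, start + n_words))
    return spans

s = ('Birds are a group of endothermic vertebrates, characterised by feathers '
     'Birds are also known as Aves They have toothless beaked jaws They have a '
     'high metabolic rate Birds are also known as Aves')
result = word_spans(s, 'Birds are also known as Aves')
print(result)  # [(10, 16), (27, 33)]
```

If a match could begin in the middle of a word, the partial word before it would be counted as a full word and the start index would be off by one. Adding `\b` anchors around the escaped phrase is one way to guard against that.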

Answer 1: (score: 0)

"What I need to do is find the occurrence" -> assuming there is only one:

s = ("Birds are a group of endothermic vertebrates, characterised by feathers "
     "Birds are also known as Aves They have toothless beaked jaws They have a high "
     "metabolic rate Birds are also known as Aves")
sub = "Birds are also known as Aves"
len_sub = len(sub.split())
len_left = len(s.split(sub)[0].split())
print(len_left, len_left+len_sub)
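
Since the snippet above reports only the first occurrence, here is a sketch of a different technique that finds every occurrence without using character offsets: split both strings into words and slide a window over the word list. The function name `word_spans_by_tokens` is hypothetical, and the approach assumes the phrase matches whole whitespace-separated tokens exactly (punctuation attached to a word, as in "vertebrates,", is part of that token).

```python
def word_spans_by_tokens(text, phrase):
    """Return (start, end) word indices, end exclusive, by comparing word windows."""
    words = text.split()
    target = phrase.split()
    n = len(target)
    # Compare every window of n consecutive words against the phrase's words.
    return [(i, i + n)
            for i in range(len(words) - n + 1)
            if words[i:i + n] == target]

s = ("Birds are a group of endothermic vertebrates, characterised by feathers "
     "Birds are also known as Aves They have toothless beaked jaws They have a high "
     "metabolic rate Birds are also known as Aves")
token_result = word_spans_by_tokens(s, "Birds are also known as Aves")
print(token_result)  # [(10, 16), (27, 33)]
```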