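I have a string. I want to cut it into substrings, each consisting of a word that contains a number plus (at most) 4 words on either side. If substrings overlap, they should be combined.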
Sampletext = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
re.findall('(\s[*\s]){1,4}\d(\s[*\s]){1,4}', Sampletext)
desired output = ['the way I know 54 how to take praise', 'to take praise for 65 excellent questions 34 thank you for asking']
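Answer 0 (score: 3)
Overlapping matches: use a lookahead
Wrap the whole pattern in a lookahead with a capture group, so that a match consumes no text and later matches can overlap it: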
import re

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
# Group 1 holds the window; the lookahead itself matches an empty string.
for match in re.finditer(r"(?=((?:\b\w+\b ){4}\d+(?: \b\w+\b){4}))", subject):
    print(match.group(1))
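What is a word?
The output depends on how you define a word. In the pattern above, words may contain digits (\w matches digits as well as letters), which produces the following output.
Output (digits allowed in words)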
the way I know 54 how to take praise
to take praise for 65 excellent questions 34 thank
for 65 excellent questions 34 thank you for asking
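To see what the lookahead contributes, the sketch below runs the same core pattern twice, once as an ordinary consuming pattern and once wrapped in a lookahead. The variable names (core, consuming, overlapping) are only illustrative.

```python
import re

subject = ("by the way I know 54 how to take praise for 65 excellent "
           "questions 34 thank you for asking appreciated.")
core = r"(?:\b\w+\b ){4}\d+(?: \b\w+\b){4}"

# Ordinary search: each match consumes its text, so the next search
# can only start after the end of the previous match.
consuming = re.findall(core, subject)

# Lookahead search: nothing is consumed, so every start position is tried
# and the capture group reports each window, overlapping or not.
overlapping = re.findall(r"(?=(" + core + r"))", subject)

for window in overlapping:
    marker = "" if window in consuming else "  <- only found with the lookahead"
    print(window + marker)
```

On this input the window around 65 is the one that the consuming version misses, because its leading words were already used up by the window around 54.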
Option 2: no digits in words
subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
# [a-z]+ with re.IGNORECASE accepts letters only, so a number no longer counts as a neighbouring word.
for match in re.finditer(r"(?=((?:\b[a-z]+\b ){4}\d+(?: \b[a-z]+\b){4}))", subject, re.IGNORECASE):
    print(match.group(1))
Output 2
the way I know 54 how to take praise
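The difference between the two word definitions can be checked on a few tokens from the sample text. This is a small sketch; the names tokens, loose and strict are only illustrative.

```python
import re

tokens = "praise for 65 excellent questions 34 thank".split()

# \w+ accepts letters, digits and underscores, so numbers count as words.
loose = [t for t in tokens if re.fullmatch(r"\w+", t)]

# [a-z]+ with IGNORECASE accepts letters only, so numbers are rejected.
strict = [t for t in tokens if re.fullmatch(r"[a-z]+", t, re.IGNORECASE)]

print(loose)   # every token, including 65 and 34
print(strict)  # the letter-only tokens
```

Under the strict definition, 65 and 34 each have a number among their four neighbours on one side, which is why only the window around 54 survives in Output 2.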
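Option 3: extend to four uninterrupted non-digit words
Based on your comment, this option extends to the left and to the right of the pivot (the number) until four uninterrupted non-digit words are matched. Commas are ignored.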
subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated. One Two Three Four 55 Extend 66 a b c d AA BB CC DD 71 HH DD, JJ FF"
for match in re.finditer(r"(?=((?:\b[a-z]+[ ,]+){4}(?:\d+ (?:[a-z]+ ){1,3}?)*?\d+.*?(?:[ ,]+[a-z]+){4}))", subject, re.IGNORECASE):
    print(match.group(1))
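The option 3 pattern is dense, so here is the same pattern laid out with re.VERBOSE as a reading aid. It is a sketch rather than a new technique: the literal spaces are written as [ ] because verbose mode ignores unescaped whitespace, and the comments are my reading of each part.

```python
import re

subject = ("by the way I know 54 how to take praise for 65 excellent "
           "questions 34 thank you for asking appreciated. One Two Three Four "
           "55 Extend 66 a b c d AA BB CC DD 71 HH DD, JJ FF")

pattern = re.compile(r"""
    (?=(                                 # lookahead + capture, so windows may overlap
        (?:\b[a-z]+[ ,]+){4}             # four letter-only words before the number
        (?:\d+[ ](?:[a-z]+[ ]){1,3}?)*?  # optionally, numbers each followed by 1-3 words
        \d+                              # a number (the pivot)
        .*?                              # lazily skip as little as possible until ...
        (?:[ ,]+[a-z]+){4}               # ... four letter-only words in a row follow
    ))
""", re.IGNORECASE | re.VERBOSE)

windows = pattern.findall(subject)
for window in windows:
    print(window)
```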
Output 3
the way I know 54 how to take praise
to take praise for 65 excellent questions 34 thank you for asking
One Two Three Four 55 Extend 66 a b c d
AA BB CC DD 71 HH DD, JJ FF
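As a rough alternative, the same expansion rule can be sketched without a regular expression. The function name number_windows and its width parameter are hypothetical, and the sketch makes two assumptions that differ from the pattern: any whitespace-separated token containing a digit counts as a number, and the left side takes up to four tokens of any kind instead of requiring exactly four letter-only words.

```python
def number_windows(text, width=4):
    """Return windows of up to `width` words on each side of a number.

    Hypothetical helper: the right side keeps growing while another
    number shows up before `width` non-number words in a row are seen.
    """
    words = text.split()
    # Assumption: a token is a "number" if it contains any digit.
    is_number = [any(ch.isdigit() for ch in w) for w in words]
    windows = []
    i = 0
    while i < len(words):
        if not is_number[i]:
            i += 1
            continue
        start = max(0, i - width)  # at most `width` words to the left
        end = i                    # index of the last word in the window
        run = 0                    # non-number words in a row on the right
        j = i + 1
        while j < len(words) and run < width:
            run = 0 if is_number[j] else run + 1
            end = j
            j += 1
        windows.append(" ".join(words[start:end + 1]))
        i = end + 1                # numbers inside this window are not pivots again
    return windows


sample = ("by the way I know 54 how to take praise for 65 excellent "
          "questions 34 thank you for asking appreciated.")
result = number_windows(sample)
for window in result:
    print(window)
```

On the sample text from the question this yields the two strings given there as the desired output. Near the start or end of the text it returns fewer than four words on that side, which matches the "at most 4" wording of the question but not the pattern above.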