Question

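I am trying to analyze an article to determine whether a specific substring appears.

If "Bill" appears, I want to delete the sentence that contains the substring from the article, along with every sentence after that first deleted sentence.

If "Bill" does not appear, no changes are made to the article.

Sample text: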

stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, Star Fox in the way you can rotate your craft to fit through narrow gaps. 

This is Bill, signing off. Thank you for reading. And see you tomorrow!"""

Desired result when the target substring is "Bill":

stringy = """This is Bill Everest here. A long time ago in, erm, this galaxy, a game called Star Wars Episode I: Racer was a smash hit, leading to dozens of enthusiastic magazine reviews with the byline "now this is podracing!" Unfortunately, the intervening years have been unkind to the Star Wars prequels, but does that hindsight extend to this thoroughly literally-named racing tie-in? Star Fox in the way you can rotate your craft to fit through narrow gaps.
"""

This is the code so far:

if "Bill" not in stringy[-200:]:
    print(stringy)

text = stringy.rsplit("Bill")[0]

text = text.split('.')[:-1]

text = '.'.join(text) + '.'

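This currently misbehaves when "Bill" also appears outside the last 200 characters: it cuts the text at the first instance of "Bill" (the opening sentence, "This is Bill Everest here"). How can I change this code so that it only picks up "Bill" within the last 200 characters?

Thank you very much!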

Answer 1

You can use re here:

import re

stringy = """..."""
target = "Bill"

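# Split the text into sentences: a capital letter up to the next ., ! or ?
l = re.findall(r'([A-Z][^\.!?]*[\.!?])',stringy)

# Walk backwards; the earliest matching sentence inside the window wins
for i in range(len(l)-1,0,-1):
    # Approximate distance from the last target in sentence i to the end of the text
    if target in l[i] and sum([len(a) for a in l[i:]])-sum([len(a) for a in l[i].split(target)[:-1]]) < 200:
        stringy = ' '.join(l[:i])

print(stringy)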
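
For context, the question's `rsplit("Bill")[0]` splits on every occurrence and keeps the part before the first one, which is why the text gets cut at the opening sentence. A minimal sketch of one alternative follows. It assumes plain string methods are acceptable and that sentences end in `.`, `!` or `?`. The function name `trim_from_target` and its parameters are hypothetical and not part of the answer above.

```python
def trim_from_target(text, target="Bill", window=200):
    """Drop the sentence holding the first `target` found in the last
    `window` characters, plus everything after it (illustrative helper)."""
    # Start searching `window` characters from the end, so earlier
    # occurrences (such as the opening sentence) are ignored.
    start = max(0, len(text) - window)
    pos = text.find(target, start)
    if pos == -1:
        return text  # nothing close enough to the end: leave text unchanged

    # Keep everything up to and including the last sentence terminator
    # that precedes the match (assumption: ., ! or ? end a sentence).
    head = text[:pos]
    cut = max(head.rfind(ch) for ch in ".!?")
    return head[:cut + 1]


if __name__ == "__main__":
    sample = (
        "This is Bill Everest here. "
        + "Filler sentence goes here. " * 10
        + "This is Bill, signing off. Thank you for reading."
    )
    print(trim_from_target(sample))
```

Because the search begins at `len(text) - window`, the opening "Bill Everest" is never considered unless the whole text is shorter than the window.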

Answer 2

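Here is another approach that uses a regular expression to walk through each sentence. We keep a running character count, and once we are inside the last 200 characters we check the sentence for "Bill". If it is found, everything from that sentence onward is excluded.

Hopefully the code is readable enough.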

import re

def remove_bill(stringy):
    sentences = re.findall(r'([A-Z][^\.!?]*[\.!?]\s*\n*)', stringy)
    total = len(stringy)
    count = 0
    for index, line in enumerate(sentences):
        #Check each index of 'Bill' in line
        for pos in (m.start() for m in re.finditer('Bill', line)):
            if count + pos >= total - 200:
                stringy = ''.join(sentences[:index])
                return stringy
        count += len(line)
    return stringy

stringy = remove_bill(stringy)
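
One thing to be aware of: `count` adds up only the lengths of matched sentences, so any characters the regex skips (for example a closing quote after `!`) are not counted, and the joined result omits them as well. The variant below is a sketch of the same idea using absolute match offsets from `re.finditer`, so positions refer to the original text. The names `SENTENCE` and `remove_from_target` are hypothetical, and the sentence pattern is the same simplifying assumption used above.

```python
import re

# Assumption: a sentence starts with a capital letter and ends at ., ! or ?
SENTENCE = re.compile(r'[A-Z][^.!?]*[.!?]\s*')


def remove_from_target(text, target="Bill", window=200):
    """Cut `text` at the start of the first sentence that contains
    `target` within the last `window` characters (illustrative helper)."""
    threshold = len(text) - window
    for m in SENTENCE.finditer(text):
        # Look for every occurrence of target inside this sentence,
        # using offsets into the original text.
        pos = text.find(target, m.start(), m.end())
        while pos != -1:
            if pos >= threshold:
                return text[:m.start()]
            pos = text.find(target, pos + 1, m.end())
    return text


if __name__ == "__main__":
    sample = (
        "This is Bill Everest here. "
        + "Filler sentence goes here. " * 10
        + "This is Bill, signing off. Thank you for reading."
    )
    print(remove_from_target(sample))
```

Slicing the original string at `m.start()` keeps any characters between sentences exactly as they were, including whitespace and quotes.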
