Regex does not remove quoted text from an email reply

Time: 2013-06-26 03:07:03

Tags: python regex gmail

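I am trying to parse the text of an email reply and remove the quoted text (and anything that follows it, including the signature).

This code returns:     message tests     On Tue, Jun 25, 2013 at 10:01 PM, Catie Brand <

I would like it to return simply:     message tests

What am I missing in the regex?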

import re


def format_mail_plain(value, from_address):
    # Patterns that mark the start of quoted text; each one is tested per line.
    res = [re.compile(r'From:\s*' + re.escape(from_address), re.IGNORECASE),
           re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
           re.compile(r'\s+wrote:', re.IGNORECASE  | re.MULTILINE),
           re.compile(r'On.*?wrote:.*?', re.IGNORECASE | re.MULTILINE | re.DOTALL),
           re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE),
           re.compile(r'from:\s*$', re.IGNORECASE),
           re.compile(r'^>.*$', re.IGNORECASE | re.MULTILINE)]

    whitespace_re = re.compile(r'\s+')

    lines = [line.rstrip() for line in value.split('\n')]

    result = ''
    for line in lines:
        # Stop at the first line that looks like the start of a quote.
        for reg_ex in res:
            if reg_ex.search(line):
                return result

        if not whitespace_re.match(line):
            if result == '':
                result += line
            else:
                result += '\n' + line

    return result




************************ Sample Text *****************************
message tests 
On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX < 
conversations+yB1oupeCJzMOBj@xxxx.com> wrote: 
> ** 
>    [image: Krow] <http://www.krow.com/>


************************ Result **********************************
message tests
On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX <

I would prefer the result to be:

************************ Result **********************************
message tests

2 answers:

Answer 0 (score: 1)

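In your sample input, On.*?wrote: does not match the line that begins with On, because the On ... wrote: header spans two lines and your code tests each line separately. That line has therefore already been appended to the result by the time a later line (the one ending in wrote:) triggers the return.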
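The effect can be reproduced in isolation. The following is a minimal sketch, not the asker's code: the names `sample`, `pattern`, `per_line_hits` and `whole_match` are made up for the illustration, and the address in the sample is shortened. It applies an anchored version of the pattern first to each line and then to the whole text.

```python
import re

# Illustrative sample modeled on the question's input (address shortened).
sample = (
    "message tests\n"
    "On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX <\n"
    "conversations@xxxx.com> wrote:\n"
    "> **\n"
)

pattern = re.compile(r'^On.*?wrote:', re.IGNORECASE | re.MULTILINE | re.DOTALL)

# Line by line: no single line both starts with "On" and contains "wrote:".
per_line_hits = [line for line in sample.split('\n') if pattern.search(line)]

# Whole text: DOTALL lets "." cross the newline, so the full header is found.
whole_match = pattern.search(sample)

print(per_line_hits)
print(repr(whole_match.group(0)) if whole_match else None)
```

The per-line list comes back empty, while the whole-text search returns the two-line header, which is why the pattern has to be applied before the text is split.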

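I changed the code so that it first replaces ^On.*?wrote:\s* with an empty string across the whole text, before the text is split into lines.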

import re


def format_mail_plain(value, from_address):
    # Strip the (possibly multi-line) "On ... wrote:" header from the whole
    # text before it is split into lines.
    value = re.compile(r'^On.*?wrote:\s*', re.IGNORECASE | re.MULTILINE | re.DOTALL).sub('', value)
    res = [re.compile(r'From:\s*' + re.escape(from_address), re.IGNORECASE),
           re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
           re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE),
           re.compile(r'^from:', re.IGNORECASE),
           re.compile(r'^>')]

    # Drop empty lines up front.
    lines = filter(None, [line.rstrip() for line in value.split('\n')])

    result = []
    for line in lines:
        # Keep the line unless one of the quote patterns matches it.
        result.append(line)
        for reg_ex in res:
            if reg_ex.search(line):
                result.pop()
                break

    return '\n'.join(filter(None, result))

Answer 1 (score: 0)

The regex that you expect to catch 'On Tue, Jun 25 ...' is

re.compile(r'On.*?wrote:.*?', re.IGNORECASE | re.MULTILINE | re.DOTALL)

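This will not match, because by the time the regex sees the string, the 'wrote' in the sample text has already been split off onto another line. Since you want to stop processing the message once you see that string, replace it, before splitting the text, with something that will trigger an exit from the processing loop. I would suggest the quote character '>'. falsetru caught this first, and I have incorporated the substitution idea into my answer.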
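As a rough illustration of the substitution idea on its own, here is a minimal sketch. It assumes a shortened version of the sample text, and the names `sample`, `on_wrote`, `marked` and `kept` are invented for the example; only the quote-marker check is shown, not the other patterns.

```python
import re

# Illustrative sample modeled on the question's input (address shortened).
sample = (
    "message tests\n"
    "On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX <\n"
    "conversations@xxxx.com> wrote:\n"
    "> **\n"
    ">    [image: Krow] <http://www.krow.com/>\n"
)

on_wrote = re.compile(r'^On.*?wrote:\s*',
                      re.IGNORECASE | re.MULTILINE | re.DOTALL)

# Replace the whole two-line header with the quote marker before splitting.
marked = on_wrote.sub('>', sample)

kept = []
for line in marked.split('\n'):
    if line.startswith('>'):
        break  # first quoted line: it and everything after it is dropped
    if line.strip():
        kept.append(line.rstrip())

print('\n'.join(kept))
```

One caveat of this pattern: because of DOTALL, a reply line that itself starts with "On" would also be swallowed, together with everything up to the next "wrote:". A stricter pattern may be needed for real mail.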

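Your regexes seem to be written so as not to use substitution at all. Is that for performance reasons?

I would reduce the number of regexes, filter out whitespace-only lines while the list of lines is being generated, and use plain substring tests in place of the one- and two-character regexes. Try this: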

import re


def format_mail_plain(value, from_address):
    # Replace the (possibly multi-line) "On ... wrote:" header with the quote
    # marker so that the loop below stops there.
    on_wrote_regex = re.compile(
        r'^On.*?wrote:\s*', re.IGNORECASE | re.MULTILINE | re.DOTALL)
    value = on_wrote_regex.sub('>', value)
    # Assumed intent of the first pattern: "from:" followed either by the
    # sender's address or by nothing else on the line.
    res = [re.compile(r'from:\s*(?:' + re.escape(from_address) + r'|$)', re.IGNORECASE),
           re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
           re.compile(r'\s+wrote:', re.IGNORECASE),
           re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE)]

    result = ''
    # Whitespace-only lines are skipped while the lines are generated.
    for line in (text_line.rstrip()
                 for text_line in value.split('\n')
                 if text_line.strip()):
        if line.startswith('>'):
            return result

        for reg_ex in res:
            if reg_ex.search(line):
                return result

        if not result:
            result += line
        else:
            result += '\n' + line

    return result