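I am trying to parse the text of an email reply and remove the quoted text (and anything that follows it, including the signature).
This code returns: message tests On Tue, Jun 25, 2013 at 10:01 PM, Catie Brand <
I would like it to return simply: message tests
Which regular expression am I missing?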
    def format_mail_plain(value, from_address):
        res = [re.compile(r'From:\s*' + re.escape(from_address), re.IGNORECASE),
               re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
               re.compile(r'\s+wrote:', re.IGNORECASE | re.MULTILINE),
               re.compile(r'On.*?wrote:.*?', re.IGNORECASE | re.MULTILINE | re.DOTALL),
               re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE),
               re.compile(r'from:\s*$', re.IGNORECASE),
               re.compile(r'^>.*$', re.IGNORECASE | re.MULTILINE)]
        whitespace_re = re.compile(r'\s+')
        lines = list(line.rstrip() for line in value.split('\n'))
        result = ''
        for line_number, line in zip(range(len(lines)), lines):
            for reg_ex in res:
                if reg_ex.search(line):
                    return result
            if not whitespace_re.match(line):
                if '' is result:
                    result += line
                else:
                    result += '\n' + line
        return result
************************ Sample Text *****************************
message tests
On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX <
conversations+yB1oupeCJzMOBj@xxxx.com> wrote:
> **
> [image: Krow] <http://www.krow.com/>
************************ Result **********************************
message tests
On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX <
I would prefer the result to be:
************************ Result **********************************
message tests
Answer 0 (score: 1)
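In your sample input, On.*?wrote: never matches because the On ... wrote: header spans two lines, and the code tests the patterns against one line at a time.
I changed the code to replace ^On.*?wrote:\s* with an empty string before the text is split into lines.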
    import re

    def format_mail_plain(value, from_address):
        # Remove the "On ... wrote:" header from the whole text first,
        # because it can wrap across two lines.
        value = re.compile(r'^On.*?wrote:\s*', re.IGNORECASE | re.MULTILINE | re.DOTALL).sub('', value)
        res = [re.compile(r'From:\s*' + re.escape(from_address), re.IGNORECASE),
               re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
               re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE),
               re.compile(r'^from:', re.IGNORECASE),
               re.compile(r'^>')]
        # Drop empty lines, then drop every line that matches one of the markers.
        lines = filter(None, [line.rstrip() for line in value.split('\n')])
        result = []
        for line in lines:
            result.append(line)
            for reg_ex in res:
                if reg_ex.search(line):
                    result.pop()
                    break
        return '\n'.join(filter(None, result))
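
To make the line-wrapping problem concrete, here is a minimal sketch that applies the same pattern to the sample text from the question, first one line at a time and then to the whole text. The names sample, header_re, per_line_hits and whole_text_match are only illustrative and do not appear in the original code.

```python
import re

# Sample reply from the question; the "On ... wrote:" header wraps onto a second line.
sample = (
    "message tests\n"
    "On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX <\n"
    "conversations+yB1oupeCJzMOBj@xxxx.com> wrote:\n"
    "> **\n"
    "> [image: Krow] <http://www.krow.com/>\n"
)

header_re = re.compile(r'^On.*?wrote:\s*', re.IGNORECASE | re.MULTILINE | re.DOTALL)

# Line by line: no single line contains both "On" at its start and "wrote:".
per_line_hits = [bool(header_re.search(line)) for line in sample.split('\n')]

# Whole text: DOTALL lets ".*?" cross the newline, so the header is found.
whole_text_match = header_re.search(sample)

# Removing the header leaves the message followed directly by the quoted lines.
without_header = header_re.sub('', sample)
```

One caveat with this pattern: because of DOTALL, a message line that happens to begin with "On" would be matched up to the first later "wrote:", so it may be worth tightening the pattern if your replies can contain such lines.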
Answer 1 (score: 0)
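The regular expression you expect to catch 'On Tue, Jun 25 ...' is

    re.compile(r'On.*?wrote:.*?', re.IGNORECASE | re.MULTILINE | re.DOTALL)

It will not match because, by the time the regular expression sees the string, the 'wrote' in the sample text has already been split onto a different line. Since you want to stop processing the message once you see that string, replace it, before splitting the text, with something that makes the processing loop exit. I would suggest the quote character '>'. falsetru caught this first, and I incorporated the substitution idea into my answer.
Your regular expressions appear to be written so that no substitution is needed at all. Is that for performance?
I would reduce the number of regular expressions, drop whitespace-only lines while the list of lines is being generated, and use plain substring tests in place of the one- and two-character regular expressions. Try this: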
    import re

    def format_mail_plain(value, from_address):
        on_wrote_regex = re.compile(
            r'^On.*?wrote:\s*', re.IGNORECASE | re.MULTILINE | re.DOTALL)
        # Replace the wrapped header with a quote marker so the loop below stops there.
        value = on_wrote_regex.sub('>', value)
        # "from:" followed by either the sender's address or the end of the line.
        res = [re.compile(r'from:\s*(?:' + re.escape(from_address) + r'|$)', re.IGNORECASE),
               re.compile('<' + re.escape(from_address) + '>', re.IGNORECASE),
               re.compile(r'\s+wrote:', re.IGNORECASE),
               re.compile(r'-+original\s+message-+\s*$', re.IGNORECASE)]
        result = ''
        # Whitespace-only lines are skipped here, so every line has at least one character.
        for line in (text_line.rstrip()
                     for text_line in value.split('\n')
                     if text_line.strip()):
            if line[0] == '>':
                return result
            for reg_ex in res:
                if reg_ex.search(line):
                    return result
            if result == '':
                result += line
            else:
                result += '\n' + line
        return result
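
As a quick check of the sentinel idea, here is a stripped-down sketch that keeps only the two steps described above: substitute the wrapped header with '>' and stop at the first quoted line. The helper name strip_quoted_reply and the constant ON_WROTE_RE are hypothetical, and the from_address patterns are left out on purpose to keep the example short.

```python
import re

ON_WROTE_RE = re.compile(r'^On.*?wrote:\s*', re.IGNORECASE | re.MULTILINE | re.DOTALL)


def strip_quoted_reply(text):
    """Return the lines that come before the first quoted line (hypothetical helper)."""
    # Turn the "On ... wrote:" header into a quote marker before splitting into lines.
    text = ON_WROTE_RE.sub('>', text)
    kept = []
    for line in text.split('\n'):
        line = line.rstrip()
        if not line.strip():
            continue  # skip blank and whitespace-only lines
        if line.startswith('>'):
            break  # quoted text starts here; ignore everything after it
        kept.append(line)
    return '\n'.join(kept)


# Sample reply from the question.
sample = (
    "message tests\n"
    "On Tue, Jun 25, 2013 at 10:01 PM, XXXXX XXXX <\n"
    "conversations+yB1oupeCJzMOBj@xxxx.com> wrote:\n"
    "> **\n"
    "> [image: Krow] <http://www.krow.com/>\n"
)

print(strip_quoted_reply(sample))  # prints: message tests
```

Collecting the kept lines in a list and joining them at the end avoids the special case for the first line that the string concatenation version needs.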