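In a regular expression, how do I match any number of any characters (for example, (.|\n)*) without consuming other matches that may follow? In case that question is unclear, here is my situation:
In a text file, I have a bunch of emails, headers included, all pasted together.
Edit: the clean version below has each header at the start of a new line. My actual data may or may not. Each header component (such as "From: xxx") may be preceded or followed by anything. In some cases, many emails and headers may all sit on one line, after a bunch of other cruft. Most importantly, I need to recognize other email headers that contain "From:". So I need to recognize this header pattern as a whole.
Several answers given before this edit relied on things like ^ or tab separation, which I can't count on. They look like they could work with slight modification, but I am (obviously) not very good with regular expressions and could not adapt them myself. I apologize for leaving this out earlier; only a few of the answerers caught it... another product of my inexperience with regular expressions.
Here is an ugly version: this is the string I am actually trying to match. It contains two headers and messages.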
emailsString = u"""From:\n Lastname, Firstname\n Sent:\n Monday, June 24, 2013 1:48 PM\n To:\n Othername, Name\n Subject:\n RE: Center update\n Message message message.\n Such a lovely message\n Take care,\n Firstname Lastname, MS\n Long signature\n in this email\n \n E-mail:\n email@email.com\n Web\n my blog\n From:\n Lastname, Firstname\n Sent:\n Monday, June 24, 2013 9:33 AM\n To:\n Othername, Name\n Subject:\n Center update\n Importance:\n High\n Good Morning Name,\n I hope this finds you doing well.\n I wanted to inform you of some changes. The Center will be closing August 30\n th\n . or September 1\n st\n . I\u2019ve enjoyed my experience. """
Here is a cleaner version, to show what the headers look like:
From: Lastname, Firstname
Sent: Monday, July 15th, 2011, 9:36 AM
To: Othername, Name
Subject: blah
Importance: High
Message message message
second line of message
second para of message
From: Lastname, Firstname
Sent: Thursday, July 18th, 2011, 10:45 AM
To: Othername, Name
Subject: blahblah
message
...
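I am trying to pull out the information in the headers along with the message itself. I have a regex that successfully matches all the headers, but I am struggling with the message. The problem is that the message can contain anything (or nothing). There may be multiple newlines, and so on. I want to grab all of it, but I still want to split the emails apart. My attempt (note that the 'Importance' part of the header is optional):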
for hit in re.finditer(r'[\s\n]*From:[\s\n]*(?P<from>.*)[\s\n]*Sent:[\s\n]*(?P<date>.*)[\s\n]*To:[\s\n]*(?P<to>.*)[\s\n]*Subject:[\s\n]*(?P<subject>.*)[\s\n]*(?:Importance:)?[\s\n]*.*[\s\n]*(?P<message>(.|\n)*)', allEmailsString):
    print("from: " + hit.group("from"))
    print("to: " + hit.group("to"))
    print("date: " + hit.group("date"))
    print("subject: " + hit.group("subject"))
    print("message: " + hit.group("message"))
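The problem is that the message group grabs everything. So I correctly get from/to/etc. for the first email header, and then the message contains that email's message plus all of the following email headers and messages. I need to grab 'everything up to the next email header/regex match, or up to the end of the string'.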
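As a minimal sketch of what 'everything up to the next header or the end of the string' can look like: a lazy quantifier followed by a lookahead stops the message group without consuming the next header. The input string, the two-field header, and the group names below are simplified assumptions for illustration, not the real data or the full pattern.

```python
import re

# Hypothetical, simplified input: two emails pasted together on one line.
text = "From: A Subject: hi first body From: B Subject: yo second body"

# .*? is lazy, so it grows one character at a time; the lookahead
# (?=From:|\Z) stops it at the next header or at the end of the string
# without consuming that header, so finditer can match it next.
pattern = re.compile(
    r"From:\s*(?P<sender>\S+)\s*Subject:\s*(?P<subject>\S+)\s*"
    r"(?P<message>.*?)\s*(?=From:|\Z)",
    re.DOTALL,
)

emails = [m.groupdict() for m in pattern.finditer(text)]
for email in emails:
    print(email)
```

A bare From: in the lookahead would also cut a message short if the body itself happens to contain "From:", so a real pattern would probably need a stricter lookahead that describes more of the header.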
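I already have a workaround: I can drop the message capture group and grab only the headers. Then I iterate over the match objects and slice the string based on their start/end positions. For example, message1 runs from match1.end() to match2.start().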
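The workaround described above could be sketched roughly as follows. The header pattern is deliberately reduced to two fields, and the function name split_emails and the sample string are hypothetical.

```python
import re

# Hypothetical, simplified header pattern; the real one would also
# capture Sent:, To: and the optional Importance: field.
header = re.compile(r"From:\s*(?P<sender>\S+)\s*Subject:\s*(?P<subject>\S+)")

def split_emails(text):
    """Return one dict per header, with the message sliced from the text."""
    matches = list(header.finditer(text))
    emails = []
    for i, m in enumerate(matches):
        # The message runs from the end of this header to the start of
        # the next one, or to the end of the string for the last email.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        record = m.groupdict()
        record["message"] = text[m.end():end].strip()
        emails.append(record)
    return emails

sample = "junk From: A Subject: hi first body From: B Subject: yo second body"
result = split_emails(sample)
for email in result:
    print(email)
```

This keeps the regex small, at the cost of a second pass over the match objects.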
So, I'm asking...
Answer 0 (score: 1)
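Regular expressions can be used to extract chunks of text only when the text is made up of variable parts and stable parts (or at least parts whose variability is itself stable...).
In the regex pattern below, I made some assumptions about the "stable" parts in order to increase their number, which makes it possible to tell the emails apart and extract the desired chunks from a text that otherwise has almost no definite anchors:
I assume that the 'Sent' part always contains the name of a day of the week
I assume that if an 'Importance' line is present, a single word describes that importance, hence [^ \t\r\n]+
I assume that the subject cannot span several lines, hence [^\r\n]+
If the text has too few stable parts, that is, if its structure is too loose, using a regex becomes impossible.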
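The pattern [ \t\r\n]*(?P<from>.*?[^ \t\r\n])[ \t\r\n]* has the effect of a strip on the captured group.
Consequently, if a message consists only of blank lines, the match reports the message as ''.
\Z is needed to catch the last email when no other lines follow the last message, as in my sample text.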
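To illustrate the strip effect and the role of \Z in isolation, here is a small sketch on a hypothetical two-field input; the group names and the sample text are assumptions and are much simpler than the full pattern that follows.

```python
import re

# Hypothetical two-field input with ragged whitespace around the values.
text = "From:\n   Lastname, Firstname  \n  To:\n  Othername, Name \n\n"

# `.*?[^ \t\r\n]` forces each capture to end on a non-whitespace character,
# and the surrounding `[ \t\r\n]*` eats the padding, so the groups come
# out already stripped. `\Z` anchors the end of the last field.
pattern = re.compile(
    r"From:[ \t\r\n]*(?P<sender>.*?[^ \t\r\n])[ \t\r\n]*"
    r"To:[ \t\r\n]*(?P<to>.*?[^ \t\r\n])[ \t\r\n]*\Z",
    re.DOTALL,
)

m = pattern.search(text)
fields = m.groupdict()
print(fields)
```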
import re
emailsString = (u' From:\n'
' Lastname, Firstname\n'
' Sent:\n'
' Monday, June 24, 2013 1:48 PM\n'
' To:\n'
' Othername, Name\n'
' Subject:\n'
' RE: Center update\n'
' Message message message.\n'
' Such a lovely message\n'
' Take care,\n'
' Firstname Lastname, MS\n'
' Long signature\n'
' in this email\n'
' \n'
' E-mail:\n'
' email@email.com\n'
' Web\n'
' my blog\n'
' From:\n'
' Lastname, Firstname\n'
' Sent:\n'
' Monday, June 24, 2013 9:33 AM\n'
' To:\n'
' Othername, Name\n'
' Subject:\n'
' Center update\n'
' Importance:\n'
' High\n'
' Good Morning Name,\n'
' I hope this finds you doing well.\n'
' I wanted to inform you of some changes. The Center will be closing August 30\n'
' th\n'
' . or September 1\n'
' st\n'
' . I\u2019ve enjoyed my experience. ')
allEmailsString = '''
From: FirstLastname, FirstFirstname
Sent: Monday, July 15th, 2011, 9:36 AM
To: TheOne
Subject: blah
Importance: High
Message message message
second line of message
second para of message
From: MidLastname, MidFirstname
Sent: Thursday, July 18th, 2011, 10:45 AM
To: TWOTWO
Subject: once upon
From: LastLastname, LastFirstname
Sent: Saturday, July 20th, 2011, 12:51 AM
To: Mr Three
Subject: blobloblo
Nothing to say. '''
dispat = ("* from: {from}\n"
"* to: {to}\n"
"* date: {date}\n"
"* subject: {subject}\n"
"** message (beginning on next line):\n{message}\n"
"-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-")
regx = re.compile('From:[ \t\r\n]*(?P<from>.*?[^ \t\r\n])'
'[ \t\r\n]*'
'Sent:[ \t\r\n]*'
'(?P<date>.*?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.*?[^ \t\r\n])'
'[ \t\r\n]*'
'To:[ \t\r\n]*(?P<to>.*?[^ \t\r\n])'
'[ \t\r\n]*'
'Subject:[ \t\r\n]*(?P<subject>[^\r\n]+)'
'[ \t\r\n]*'
'(?:Importance:[ \t\r\n]*(?P<importance>[^ \t\r\n]+))?'
'[ \t\r\n]*'
'(?P<message>.*?)'
'(?=[ \t\r\n]*From:.*?'
'Sent:.*?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.*?'
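r'To.*?Subject:|\Z)',  # raw string so that \Z reaches the regex engine untouched
re.DOTALL)
for s in (emailsString, allEmailsString):
    # print() with a single argument behaves the same on Python 2 and 3
    print(''.join(dispat.format(**d)
                  for d in (ma.groupdict('') for ma in regx.finditer(s))))
    print('\n#######################################\n')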
Result
* from: Lastname, Firstname
* to: Othername, Name
* date: Monday, June 24, 2013 1:48 PM
* subject: RE: Center update
** message (beginning on next line):
Message message message.
Such a lovely message
Take care,
Firstname Lastname, MS
Long signature
in this email
E-mail:
email@email.com
Web
my blog
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-* from: Lastname, Firstname
* to: Othername, Name
* date: Monday, June 24, 2013 9:33 AM
* subject: Center update
** message (beginning on next line):
Good Morning Name,
I hope this finds you doing well.
I wanted to inform you of some changes. The Center will be closing August 30
th
. or September 1
st
. I\u2019ve enjoyed my experience.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
#######################################
* from: FirstLastname, FirstFirstname
* to: TheOne
* date: Monday, July 15th, 2011, 9:36 AM
* subject: blah
** message (beginning on next line):
Message message message
second line of message
second para of message
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-* from: MidLastname, MidFirstname
* to: TWOTWO
* date: Thursday, July 18th, 2011, 10:45 AM
* subject: once upon
** message (beginning on next line):
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-* from: LastLastname, LastFirstname
* to: Mr Three
* date: Saturday, July 20th, 2011, 12:51 AM
* subject: blobloblo
** message (beginning on next line):
Nothing to say.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
#######################################
Answer 1 (score: 0)
I would just divide (split) and conquer (re.match):
import re
# `data` is your text file
delimiter = r'(^|\n)From:'
capturer = re.compile(r'From:[\n\s]*(?P<from>.*)[\n\s]*'
r'Sent:[\n\s]*(?P<date>.*)[\n\s]*'
r'To:[\n\s]*(?P<to>.*)[\n\s]*'
r'Subject:[\n\s]*(?P<subject>.*)[\n\s]*'
r'(?:Importance:)?[\n\s]*.*[\n\s]*'
r'(?P<message>(\n|.)*)')
raw_emails = ['From:' + d for d in re.split(delimiter, data) if d.strip()]
emails = []
for raw_email in raw_emails:
    parts = capturer.match(raw_email)
    # match() returns None for a chunk that lacks the full header; skip it
    if parts:
        emails.append(parts.groupdict())
For your sample data, this outputs:
[{'date': 'Monday, July 15th, 2011, 9:36 AM',
'from': 'Lastname, Firstname',
'message': 'Message message message\nsecond line of message\n\nsecond para of message\n',
'subject': 'blah',
'to': 'Othername, Name'},
{'date': 'Thursday, July 18th, 2011, 10:45 AM',
'from': 'Lastname, Firstname',
'message': '...\n',
'subject': 'blahblah',
'to': 'Othername, Name'}]
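One detail of the split step may be worth seeing in isolation: because the delimiter contains a capturing group, re.split also returns the captured text, which is why the list comprehension filters with d.strip(). The data string below is a hypothetical stand-in for the real file.

```python
import re

# Hypothetical stand-in for the text file contents.
data = "From: A\nbody one\nFrom: B\nbody two\n"

# With a capturing group, re.split keeps the captured delimiter text
# ('' for ^, '\n' otherwise) between the chunks.
pieces = re.split(r'(^|\n)From:', data)
print(pieces)

# Dropping the empty and whitespace-only pieces and restoring the keyword
# gives one raw email per entry.
raw_emails = ['From:' + d for d in pieces if d.strip()]
print(raw_emails)
```

This approach still assumes that every From: sits at the start of the string or right after a newline, which the edited question says cannot be relied on.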
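Answer 2 (score: 0)
This looks like it could be painful. The regex is expanded for clarity. It uses multi-line mode and no DotAll.
@mobabo - edited after your first comment.
Your keywords have to be clearly delimited, and they are. Your statement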
I can't count on things like '^From' to work
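shows that you did not look at the previous regex; this part is the same: ^[^\S\n]*From: versus ^From (the former also allows horizontal whitespace before the keyword).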
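A quick way to see what ^[^\S\n]*From: accepts is to run it in Python's multi-line mode against a small, hypothetical sample: [^\S\n] means "whitespace other than a newline", so indented keywords match, while a keyword glued to other text on the same line does not.

```python
import re

# Hypothetical sample: indented, unindented, and mid-line "From:".
text = "junk\n \t From: A\nFrom: B\nxFrom: C"

# ^ anchors at each line start (re.MULTILINE); [^\S\n]* skips spaces and
# tabs but never crosses a newline.
hits = re.findall(r"^[^\S\n]*From:\s*(\S+)", text, re.MULTILINE)
print(hits)
```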
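Also, there is no clear boundary between Subject and the message, or between Importance and the message. If 'Importance' is part of the email, then Subject has an end point.
I made a regex that handles the messy emails; at the bottom is a Perl program that exercises it. Its output is included. See whether it solves your problem (see below).
Unfortunately, this is about the best you can hope for.
Good luck! (Note: if Python had recursion, this regex would be 1/4 of this size.)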
# Compressed
# -------------------
# ^[^\S\n]*From:\s*(?P<from>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*)(?:\s*^[^\S\n]*Sent:\s*(?P<sent>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*To:\s*(?P<to>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*Subject:\s*(?P<subject>(?:(?!\s*^[^\S\n]*(?:(?:From|Sent|To|Subject|Importance)):)[\S\s])*)(?:\s*^[^\S\n]*Importance:\s*(?P<importance>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?)?
# Expanded
# -------------------
#
^ [^\S\n]* From: \s*
(?P<from>
(?:
(?!
\s* ^ [^\S\n]*
(?: From | Sent | To | Subject | Importance )
:
)
[\S\s]
)*
)
(?:
\s* ^ [^\S\n]* Sent: \s*
(?P<sent>
(?:
(?!
\s* ^ [^\S\n]*
(?: From | Sent | To | Subject | Importance )
:
)
[\S\s]
)*
)
)?
(?:
\s* ^ [^\S\n]* To: \s*
(?P<to>
(?:
(?!
\s* ^ [^\S\n]*
(?: From | Sent | To | Subject | Importance )
:
)
[\S\s]
)*
)
)?
(?:
\s* ^ [^\S\n]* Subject: \s*
(?P<subject>
(?:
(?!
\s* ^ [^\S\n]*
(?:
(?: From | Sent | To | Subject | Importance )
)
:
)
[\S\s]
)*
)
(?:
\s* ^ [^\S\n]* Importance: \s*
(?P<importance>
(?:
(?!
\s* ^ [^\S\n]*
(?: From | Sent | To | Subject | Importance )
:
)
[\S\s]
)*
)
)?
)?
# // Output from Perl sample code (below)
# //
# // ======================
# // From:
# // Lastname, Firstname
# // Sent:
# // Monday, July 15th, 2011, 9:36 AM
# // To:
# // Othername, Name
# // Subject:
# // blah
# // Importance/Message:
# // High
# //
# // Message message message
# // second line of message
# //
# // second para of message
# //
# //
# // ======================
# // From:
# // Lastname, Firstname
# // Sent:
# // Thursday, July 18th, 2011, 10:45 AM
# // To:
# // Othername, Name
# // Subject/Message:
# // blahblah
# //
# // message
# //
# //
# // ======================
# // From:
# // Lastname, Firstname
# // Sent:
# // Monday, June 24, 2013 1:48 PM
# // To:
# // Othername, Name
# // Subject/Message:
# // RE: Center update
# // Message message message.
# // Such a lovely message
# // Take care,
# // Firstname Lastname, MS
# // Long signature
# // in this email
# //
# // E-mail:
# // email@email.com
# // Web
# // my blog
# //
# //
# // ======================
# // From:
# // Lastname, Firstname
# // Sent:
# // Monday, June 24, 2013 9:33 AM
# // To:
# // Othername, Name
# // Subject:
# // Center update
# // Importance/Message:
# // High
# // Good Morning Name,
# // I hope this finds you doing well.
# // I wanted to inform you of some changes. The Center will be closing August 30
# //
# // th
# // . or September 1
# // st
# // . I've enjoyed my experience.
# //
# ------------------------------------------------------------
# # Perl sample code
# use strict;
# use warnings;
#
# $/ = undef;
#
# my $str = <DATA>;
#
#
#
# while ( $str =~ /
# ^[^\S\n]*From:\s*(?P<from>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*)(?:\s*^[^\S\n]*Sent:\s*(?P<sent>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*To:\s*(?P<to>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*Subject:\s*(?P<subject>(?:(?!\s*^[^\S\n]*(?:(?:From|Sent|To|Subject|Importance)):)[\S\s])*)(?:\s*^[^\S\n]*Importance:\s*(?P<importance>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?)?
# /xmg)
#
# {
# print "\n\n======================\n";
# print "From: \n\t$+{from}\n";
# if (defined $+{sent})
# {
# print "Sent: \n\t$+{sent}\n";
# }
# if (defined $+{to})
# {
# print "To: \n\t$+{to}\n";
# }
# if (defined $+{importance})
# {
# print "Subject: \n\t$+{subject}\n";
# print "Importance/Message: \n\t$+{importance}\n";
# }
# elsif (defined $+{subject})
# {
# print "Subject/Message: \n\t$+{subject}\n";
# }
# }
#
#
# __DATA__
#
# From: Lastname, Firstname
# Sent: Monday, July 15th, 2011, 9:36 AM
# To: Othername, Name
# Subject: blah
# Importance: High
#
# Message message message
# second line of message
#
# second para of message
#
# From: Lastname, Firstname
# Sent: Thursday, July 18th, 2011, 10:45 AM
# To: Othername, Name
# Subject: blahblah
#
# message
#
#
#
#
#
# From:
# Lastname, Firstname
# Sent:
# Monday, June 24, 2013 1:48 PM
# To:
# Othername, Name
# Subject:
# RE: Center update
# Message message message.
# Such a lovely message
# Take care,
# Firstname Lastname, MS
# Long signature
# in this email
#
# E-mail:
# email@email.com
# Web
# my blog
# From:
# Lastname, Firstname
# Sent:
# Monday, June 24, 2013 9:33 AM
# To:
# Othername, Name
# Subject:
# Center update
# Importance:
# High
# Good Morning Name,
# I hope this finds you doing well.
# I wanted to inform you of some changes. The Center will be closing August 30
# th
# . or September 1
# st
# . I've enjoyed my experience.
#
#
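The building block repeated throughout the regex above, (?:(?!...)[\S\s])*, consumes one character at a time as long as no keyword line starts at that point. A reduced Python sketch of the same idea, covering only From, Sent and To, might look like this; the names KEY and VALUE and the sample text are assumptions for illustration, and the pattern is not the full regex from this answer.

```python
import re

# Hypothetical messy input: each keyword starts a line, possibly indented.
text = "  From:\n Lastname, Firstname\n Sent:\n Monday\n To:\n Othername"

KEY = r"(?:From|Sent|To|Subject|Importance)"
# One "tempered" step: consume any character, but only if a keyword line
# does not begin at this position.
VALUE = rf"(?:(?!\s*^[^\S\n]*{KEY}:)[\S\s])*"

pattern = re.compile(
    rf"^[^\S\n]*From:\s*(?P<sender>{VALUE})"
    rf"(?:\s*^[^\S\n]*Sent:\s*(?P<sent>{VALUE}))?"
    rf"(?:\s*^[^\S\n]*To:\s*(?P<to>{VALUE}))?",
    re.MULTILINE,
)

m = pattern.search(text)
fields = m.groupdict()
print(fields)
```

Building the pattern from named pieces keeps the repetition manageable in Python, which is one way to work around the lack of regex recursion mentioned above.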