I'm trying to parse an email that is passed to a function. It looks like this:
From: Bob
Sent: Thursday
To: Jack
Cc: Mary, Zaphod,
Janice, Trillian
Subject: Restaurant at the End of the Universe
I can't seem to find this on the map, any help?
Is it on I-95?
The regex I'm using looks like this:
From:(?:\s){0,}(.*)(?:\s){1,}Sent:(.*)(?:\s){1,}To:(.*)(?:\s){1,}Cc:(\s|\S){1,}(?:\s){1,}Subject:(.*)(\s){1,}(.*)
The problem is that this captures only the last character of the Cc information and only one line of the email body.
I could use the DOTALL flag and change the (\s|\S){1,} after Cc and Subject to (.*):
From:(?:\s){0,}(.*)(?:\s){1,}Sent:(.*)(?:\s){1,}To:(.*)(?:\s){1,}Cc:(.*){1,}(?:\s){1,}Subject:(.*)(\s){1}(.*)
But this merges the subject and the body.
Is there a way to capture multiple characters with \s or \S, or should I use DOTALL and split the subject and body some other way?
Answer 0: (score: 1)
You can use a non-greedy match to capture everything up to Subject:
From:(?:\s){0,}(.*)(?:\s){1,}Sent:(.*)(?:\s){1,}To:(.*)(?:\s){1,}Cc:([\s\S]*?)Subject:(.*)(\s){1,}([\s\S]*)
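As a rough sketch in Python (the question does not name a language, so using the `re` module is an assumption), the snippet below shows two things. First, a quantified capturing group such as `(\s|\S){1,}` keeps only its last repetition, which is why the original pattern returned a single character for Cc. Second, the non-greedy `[\s\S]*?` version captures the whole Cc block. The names `EMAIL`, `PATTERN` and `last_only` are illustrative only.

```python
import re

# Sample message from the question.
EMAIL = (
    "From: Bob\n"
    "Sent: Thursday\n"
    "To: Jack\n"
    "Cc: Mary, Zaphod,\n"
    "Janice, Trillian\n"
    "Subject: Restaurant at the End of the Universe\n"
    "I can't seem to find this on the map, any help?\n"
    "Is it on I-95?"
)

# A repeated capturing group only remembers its final iteration,
# so this yields "c" rather than "abc".
last_only = re.search(r"(\s|\S){1,}", "abc").group(1)

# Pattern from this answer: [\s\S]*? lazily consumes the Cc block up to
# "Subject:", and the trailing [\s\S]* takes the rest as the body.
PATTERN = re.compile(
    r"From:(?:\s){0,}(.*)(?:\s){1,}Sent:(.*)(?:\s){1,}To:(.*)(?:\s){1,}"
    r"Cc:([\s\S]*?)Subject:(.*)(\s){1,}([\s\S]*)"
)

match = PATTERN.search(EMAIL)
sender, sent, to, cc, subject, _whitespace, body = match.groups()

# Most groups still carry leading or trailing whitespace, so strip them.
cc = cc.strip()
subject = subject.strip()
```

Most captured values keep their surrounding whitespace (for example the space after `Sent:`), so stripping them afterwards is usually needed.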
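To match the same pattern several times, you can use a lookahead so that each match stops where the next part begins with From: or where the input ends: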
From:(?:\s){0,}(.*)(?:\s){1,}Sent:(.*)(?:\s){1,}To:(.*)(?:\s){1,}Cc:([\s\S]*?)Subject:(.*)(\s){1,}([\s\S]*?)(?=(From:|$))
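A minimal sketch of how the lookahead version might be used on several messages at once, again assuming Python's `re` module. The second message in `THREAD` is invented for illustration and does not come from the question.

```python
import re

# Two messages back to back; the second one is a made-up example.
THREAD = (
    "From: Bob\nSent: Thursday\nTo: Jack\nCc: Mary, Zaphod,\nJanice, Trillian\n"
    "Subject: Restaurant at the End of the Universe\n"
    "I can't seem to find this on the map, any help?\nIs it on I-95?\n"
    "From: Jack\nSent: Friday\nTo: Bob\nCc: Mary\n"
    "Subject: Re: Restaurant at the End of the Universe\n"
    "Try the other end.\n"
)

# The lazy body group stops at the next "From:" or at the end of the input.
PATTERN = re.compile(
    r"From:(?:\s){0,}(.*)(?:\s){1,}Sent:(.*)(?:\s){1,}To:(.*)(?:\s){1,}"
    r"Cc:([\s\S]*?)Subject:(.*)(\s){1,}([\s\S]*?)(?=(From:|$))"
)

messages = [
    {
        "from": m.group(1).strip(),
        "subject": m.group(5).strip(),
        "body": m.group(7).strip(),
    }
    for m in PATTERN.finditer(THREAD)
]
```

One limitation of this approach: a body that itself contains the text `From:` would be cut short at that point.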
I think this has become complicated enough that it is time to switch to a more robust solution, such as an email parser.
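For illustration, here is a sketch using the `email` package from Python's standard library. It assumes the text can be supplied in the usual header layout, with continuation lines indented and a blank line before the body. The sample in the question does not follow that layout, so `RAW` below is an adjusted copy, not the original input.

```python
from email import message_from_string

# Adjusted copy of the sample: the Cc continuation line is indented and a
# blank line separates headers from body (both are assumptions).
RAW = (
    "From: Bob\n"
    "Sent: Thursday\n"
    "To: Jack\n"
    "Cc: Mary, Zaphod,\n"
    " Janice, Trillian\n"
    "Subject: Restaurant at the End of the Universe\n"
    "\n"
    "I can't seem to find this on the map, any help?\n"
    "Is it on I-95?\n"
)

msg = message_from_string(RAW)

sender = msg["From"]
sent = msg["Sent"]
# The folded Cc header keeps its line break, so collapse the whitespace.
cc = " ".join(msg["Cc"].split())
subject = msg["Subject"]
body = msg.get_payload().strip()
```

If the input really arrives in the unindented form shown in the question, a parser like this would not treat the second Cc line as part of the header, so some preprocessing or a regex would still be needed.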
Answer 1: (score: 1)
You can generalize it a little, then trim it down a bit at a time.
I added named groups, but you can remove them or change them to the Python (?P<>) form.
From:\s*\s*(?<From>.*?)\s+Sent:\s*(?<Sent>.*?)\s+To:\s*(?<To>[\S\s]*?)\s+Cc:\s*(?<Cc>[\S\s]*?)\s+Subject:\s*(?<Subject>.*?)\s*(?=\r?\n|$)(?:\r?\n(?<Message>[\S\s]+?\S[\S\s]+?)\s*$)?
Expanded
From: \s*
\s*
(?<From> .*? ) # (1), From: single line
\s+
Sent:
\s*
(?<Sent> .*? ) # (2), Sent: single line
\s+
To:
\s*
(?<To> [\S\s]*? ) # (3), To: multiple lines possible
\s+
Cc:
\s*
(?<Cc> [\S\s]*? ) # (4), Cc: multiple lines possible
\s+
Subject:
\s*
(?<Subject> .*? ) # (5), Subject: single line
\s*
(?= \r? \n | $ )
(?: # Optional message body
\r? \n
(?<Message> # (6 start), Message: multiple lines possible
[\S\s]+?
\S
[\S\s]+?
) # (6 end)
\s*
$
)?
Output
** Grp 1 [From] - ( pos 6 , len 3 )
Bob
** Grp 2 [Sent] - ( pos 19 , len 8 )
Thursday
** Grp 3 [To] - ( pos 35 , len 4 )
Jack
** Grp 4 [Cc] - ( pos 47 , len 33 )
Mary, Zaphod,
Janice, Trillian
** Grp 5 [Subject] - ( pos 93 , len 37 )
Restaurant at the End of the Universe
** Grp 6 [Message] - ( pos 134 , len 65 )
I can't seem to find this on the map, any help?
Is it on I-95?
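A possible Python port of the pattern above, written as a sketch: the named groups use the `(?P<name>...)` form and `re.VERBOSE` lets the expanded layout be kept as-is. The duplicated `\s*` after `From:` is dropped because it has no effect, and `EMAIL` and `PATTERN` are illustrative names.

```python
import re

EMAIL = (
    "From: Bob\n"
    "Sent: Thursday\n"
    "To: Jack\n"
    "Cc: Mary, Zaphod,\n"
    "Janice, Trillian\n"
    "Subject: Restaurant at the End of the Universe\n"
    "I can't seem to find this on the map, any help?\n"
    "Is it on I-95?"
)

# With re.VERBOSE, whitespace and "#" comments inside the pattern are ignored.
PATTERN = re.compile(
    r"""
    From: \s* (?P<From> .*? )            # single line
    \s+ Sent: \s* (?P<Sent> .*? )        # single line
    \s+ To: \s* (?P<To> [\S\s]*? )       # may span several lines
    \s+ Cc: \s* (?P<Cc> [\S\s]*? )       # may span several lines
    \s+ Subject: \s* (?P<Subject> .*? )  # single line
    \s*
    (?= \r? \n | $ )
    (?:                                  # optional message body
        \r? \n
        (?P<Message> [\S\s]+? \S [\S\s]+? )
        \s* $
    )?
    """,
    re.VERBOSE,
)

# groupdict() maps each group name to its captured text.
fields = PATTERN.search(EMAIL).groupdict()
```

Because each group is preceded by `\s*` or `\s+`, the captured values come back without leading whitespace, unlike the unnamed-group patterns earlier in the thread.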