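I have a sample text that looks like a chain of emails. I want to keep only the body of each message and remove names, addresses, job titles, company names, and email addresses. To be clear, I only want the content of each email between Dear/Hi/Hello and Sincerely/Regards/Thanks. How can I do this efficiently with regular expressions or some other approach?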
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Hi Roger,
Yes, an extension until June 22, 2018 is acceptable.
Regards,
Loren
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Dear Loren,
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
Best Regards,
Mr. Roger
Global Director
roger@abc.com
78 Ford st.
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
responding by June 15, 2018.check email for updates
Hello,
John Doe
Senior Director
john.doe@pqr.com
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.
Feel free to contact me with any questions.
Warm Regards,
Mr. Roger
Global Director
roger@abc.com
78 Ford st.
Center for Research
Office of New Discoveries
Food and Drug Administration
Loren@mno.com
From this text, I want only the following as OUTPUT:
Subject: [EXTERNAL] RE: QUERY regarding supplement 73
Yes, an extension until June 22, 2018 is acceptable.
We had initial discussion with the ABC team us know if you would be able to extend the response due date to June 22, 2018.
responding by June 15, 2018.check email for updates
Please refer to your January 12, 2018 data containing labeling supplements to add text regarding this
symptom. We are currently reviewing your supplements and have
made additional edits to your label.
Feel free to contact me with any questions.
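Answer 0 (score: 1)
Here is an answer that works for your current input. The code will need to be adjusted when you process examples that fall outside the parameters outlined in the code below.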
import re

# Regex to catch greeting lines such as "dear loren," or "hi roger,".
# The \b keeps body lines that merely start with "hi" (e.g. "history") from matching.
greeting_component = re.compile(r'(dear|hi)\b.*,', re.IGNORECASE)

with open('email_input.txt') as input_file:
    # Reads until EOF
    lines = input_file.readlines()

# Strip surrounding whitespace, including the trailing newline
no_new_lines = [i.strip() for i in lines]

# Convert the input to all lowercase
lowercase_lines = [i.lower() for i in no_new_lines]

# List to store the cleaned lines
clean_lines = []

# Boolean state variable to keep track of whether we want to keep lines or not
lines_to_keep = False

for line in lowercase_lines:
    # Look for lines that start with a subject line
    if line.startswith('subject: [external]'):
        # Set lines_to_keep to True and start capturing lines
        lines_to_keep = True
    # Look for lines that start with a closing. "hello," is listed here
    # because in the sample input it appears after the body, not before it.
    elif line.startswith(("regards,", "warm regards,", "best regards,", "hello,")):
        # Set lines_to_keep to False and stop capturing lines
        lines_to_keep = False

    if lines_to_keep:
        # Skip greeting lines and lines that were already captured
        # (the repeated subject line is therefore kept only once)
        if not greeting_component.match(line) and line not in clean_lines:
            clean_lines.append(line)

for item in clean_lines:
    print(item)
# output
subject: [external] re: query regarding supplement 73
yes, an extension until june 22, 2018 is acceptable.
we had initial discussion with the abc team us know if you would be able to
extend the response due date to june 22, 2018.
responding by june 15, 2018.check email for updates
please refer to your january 12, 2018 data containing labeling supplements
to add text regarding this symptom. we are currently reviewing your
supplements and have made additional edits to your label.
feel free to contact me with any questions.
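Note that the code above lowercases everything, so its output does not keep the original capitalization shown in the question. As a rough sketch of one way around that (not the original answer's code), the same state-machine idea can be applied to the unmodified lines, with case-insensitive patterns used only for matching. The function name `extract_bodies` and the three patterns are assumptions made for illustration. They cover the greetings and closings in the sample plus a few common ones, and would need extending for real mail. A greeting that appears after body text has already been captured is treated as the start of the signature, which is what the third sample message requires.

```python
import re

# Assumed patterns, chosen to fit the sample; extend them for your own data.
SUBJECT = re.compile(r'^subject:', re.IGNORECASE)
GREETING = re.compile(r'^(dear|hi|hello)\b[^,]*,\s*$', re.IGNORECASE)
CLOSING = re.compile(
    r'^((best|warm|kind)\s+)?(regards|sincerely|thanks|thank you)\b[\s,.!]*$',
    re.IGNORECASE,
)


def extract_bodies(text):
    """Return subject lines (once each) and body lines, in original case."""
    kept = []
    seen_subjects = set()
    keeping = False   # True while we are inside a message body
    has_body = False  # True once the current message produced a body line

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if SUBJECT.match(line):
            # A subject line starts a new message
            keeping, has_body = True, False
            if line.lower() not in seen_subjects:
                seen_subjects.add(line.lower())
                kept.append(line)
            continue
        if not keeping:
            # Signature lines (name, title, email, address) are skipped
            continue
        if CLOSING.match(line):
            keeping = False
        elif GREETING.match(line):
            # Before the body: skip the greeting. After the body: treat it
            # as the start of the signature and stop capturing.
            if has_body:
                keeping = False
        else:
            kept.append(line)
            has_body = True
    return kept
```

Assuming the sample is stored in `email_input.txt` as in the answer above, the function could be called as `extract_bodies(open('email_input.txt').read())` and the returned lines printed one per line. Because a closing only matches when the whole line is a sign-off, a body sentence such as "Thanks for the update on the label." would still be kept.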