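I want to remove all of the From, To, Cc, Subject and Sent labels from this text document and keep only the message bodies, so that I can use them to summarize the document's contents. What is the best way to do this in Python? I think it is best to do the extraction first and then apply preprocessing. My code is attached here, so any suggestions would be very helpful. The payload and is_multipart part of the code is not done correctly, and that is where my doubt lies, so I have added comments in that part and need help there.
The code and the .txt file are attached below for reference.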
import os, sys, csv
import glob
import re
import email
#from tika import parser
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim.summarization import summarize, keywords
# Set path to directory where files are
dirs = 'C:\\Users\\Lenovo\\.spyder-py3\\Testing\\'
#os.chdir(dirs)
for filename in glob.glob(os.path.join(dirs, '*.txt')):
    try:
        for files in filename:
            file = open(filename, 'r', encoding ='utf-8')
            filecontents = file.read()
            filecontents = re.sub(r'\s+', ' ', filecontents)
            print(filecontents)
            filecontents = filecontents.strip('\n')
            b = email.message_from_string(filecontents)# NEED
            if b.is_multipart():#HELP
                for payload in b.get_payload():#HERE
                    # if payload.is_multipart(): ...#SO
                    print (payload.get_payload())#COMMENTED
            else:#
                print (b.get_payload())#
            summary = summarize(filecontents, ratio =0.10)
            print(summary)
            kw = keywords(filecontents, words=15)
            print(kw)
            break
            #writer.writerow([file, summary, kw])
    except Exception as e:
        pass
Text file
Stephanie /ANN
From: Mr.A, <.Mr.A@abc.com>
Sent: Wednesday, July 25, 2018 2:27 PM
To: , Tim /ANN; Abd, May /ANN
Cc: Mr.A, ; Theoder Jerry,
Subject: [EXTERNAL] RE: Holdings: XXXX SPA – mfno.1322
Dear Dr. Tim A. ,
The option-2 is fine. By the way, we had received in the past Letter of Authorization for many companies other
than Spa and I guess Xxxx does not do bANNiness with them either. If yes, then need to submit withdrawal
of Letter of Authorization for those companies and send a Letter of Authorization for spa. stating for any
applications submitted. We will send an administrative filing issue letter for both the holder and the agent.
Thank you!
Regards,
Mr.A
PRODUCT Master File
CDER
Currently, there is no requirement to submit or resubmit NAs in any electronic format. However, starting May 5, 2018,
new NAs, as well as any submissions to the existing NAs mANNt be submitted electronically in legal (electronic Common
Technical Document) format specified by GROUP A in the legal guidance. NA submissions that are not submitted in legal
format after this date may be subject to rejection. For more information please check the NA website
www.GROUP A.gov/abc/bca
This communication is an informal communication consistent with which represents my best judgment
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication,
including any attachments, is intended only for the person or entity to which it is addressed and may contain
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the
sender and delete the material from any computer. Thank you.
From: Tim.@xxxx.com [mailto:Tim.@xxxx.com]
Sent: Wednesday, July 25, 2018 2:10 PM
To: Mr.A, <.Mr.A@abc.com>
Cc: May.Abd@xxxx.com
Subject: RE: Holdings: XXXX SPA ‐ dm 013383
Dear ,
XXXX
2
Thanks for your phone call to clarify your needs and to understand the situation. I have confirmed that Xxxx only does
direct bANNiness for test S intermediate with b. and not with the other companies (e,
x, etc.) that are secondary companies. Based on our discANNsion, I believe that we do not need to
provide QAs for these secondary companies or mention them in our NA file as they would be covered under a
separate QA S.p.A. to them. If this is correct, then I believe you mentioned that we have two options as
described below:
Option 1: We can issue a separate QA for each . NA to be specific on which NA is being cross‐referenced
to our NA 13383.
Option 2: We can do a single QA for and mention that they can cross‐reference any of their NAs. This
would allow them to cross‐reference any of their
If I have misunderstood or am incorrect in my response and we need to discANNs further, please let me know.
If not, when you issue your request, can you please send to me and May Abd by email?
Kind regards.
Tim
Tim A. , BsC
Director, YY SERVICES)
Xxxx ANN
Phone/FAX: 2312333
Cell: 23312123131
Email: tim.@xxxx.com
From: , Tim /ANN
Sent: Monday, July 23, 2018 7:05 AM
To: 'Mr.A, '
Cc: Abd, May /ANN
Subject: RE: [EXTERNAL] Holder: XXXX SPA - NA 013383
Dear ,
May is now on vacation and I am covering for her during her absence. Is there a good time to call you today or later this
week? Please let me know and we can schedule or please call my cell phone 21313131231 at your convenience.
Kind regards.
Tim
Tim A. , MSC
Director, PQR
Xxxx
Phone/FAX: 2312313313
Cell: 3142342424
Email: tim.@xxxx.com
XXXX
3
‐‐‐‐‐‐‐‐‐‐ Forwarded message ‐‐‐‐‐‐‐‐‐‐
From: "Mr.A, " <.Mr.A@abc.com>
Date: Jul 20, 2018 9:01 AM
Subject: [EXTERNAL] Holder: XXXX SPA ‐ NA 013383
To: "TRETE/ANN" <May.Abd@xxxx.com>
Cc: "mno.com>
Dear May Abd,
. I need to talk to you on this.
Thank you!
Regards,
Mr.A
PRODUCT Master File
CDER
Currently, there is no requirement to submit or resubmit NAs in any electronic format.
format after this date may be subject to rejection. For more information please check the NA website
www.GROUP A./cder/NA
This communication is an informal communication which represents my best judgment
at this time, but does not constitute an advisory opinion, does not necessarily represent the formal position of the
GROUP A, and does not bind or otherwise obligate or commit the agency to the views expressed. This communication,
including any attachments, is intended only for the person or entity to which it is addressed and may contain
confidential material. Any review, retransmission, distribution or other ANNe of this information by persons or entities
other than the intended recipient is prohibited. If you received this in error, please destroy any copies, contact the
sender and delete the material from any computer. Thank you.
XXXX
Answer 0 (score: 1)
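It is not clear which part of your code you need help with, what you want it to do instead of what it currently does, or how the result should be passed on for proper further processing.
However, I will note that your code has a number of problems.
Opening the file as text with a guessed encoding invites UnicodeDecodeError and other obstacles; it is safer to open the file in binary mode and let the email library handle the decoding.
The blanket except Exception: is a major bug. Perhaps you only put it in for debugging, but it actually makes debugging harder.
Multipart messages come in several shapes. A common one is multipart/alternative, where the same message is presented in different formats so that the recipient can decide whether to read it as HTML, as plain text, or occasionally as PDF, RTF or a single image, depending on the application. In addition, the HTML structure often consists of several parts itself, because the main HTML also wants to pull in small images supplied in the MIME structure (company logos, animated emoji, and other insults to the reader). Perhaps see also What are the "parts" in a multipart email?
Another complication for this answer is that Python's email library was overhauled relatively recently. The new functionality was introduced experimentally in Python 3.3, but did not become the documented, recommended version until 3.6. Most code you find in the wild will use the pre-3.6 functionality, but going forward you probably want to target the new and improved API.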
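To make the multipart/alternative structure concrete, here is a minimal sketch that builds such a message with the standard library and lists the content type of every part. The addresses and body text are made up for illustration, and the helper name content_types is hypothetical.

```python
from email.message import EmailMessage

# Build a small message with a plain-text body and an HTML alternative.
# All header values and body text here are invented for the demo.
msg = EmailMessage()
msg["From"] = "sender@example.com"
msg["To"] = "recipient@example.com"
msg["Subject"] = "Demo"
msg.set_content("Plain text version of the body.")
msg.add_alternative(
    "<html><body><p>HTML version of the body.</p></body></html>",
    subtype="html",
)


def content_types(message):
    """Return the content type of every part, in walk() order."""
    return [part.get_content_type() for part in message.walk()]


print(content_types(msg))
# ['multipart/alternative', 'text/plain', 'text/html']
```

The container itself shows up first in walk(), followed by its subparts, which is why code that only looks at the top-level payload tends to miss the actual text.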
With the legacy API, your code might look something like
import glob
import os
import email
from gensim.summarization import summarize, keywords

# dirs is the directory path defined in the question
for filename in glob.glob(os.path.join(dirs, '*.txt')):
    # Not useful; we already have a filename
    #for files in filename:
    # Open in binary mode, don't try to guess encoding
    # Use a context manager so we don't leave the file open
    with open(filename, 'rb') as file:
        # Just let the email library take it from here
        #filecontents = file.read()
        #filecontents = re.sub(r'\s+', ' ', filecontents)
        #print(filecontents)
        #filecontents = filecontents.strip('\n')
        b = email.message_from_binary_file(file)
    main_part = None
    if b.is_multipart():
        # There are a number of things you could do to pick out
        # one or more payloads for analysis, but let's just take
        # the first text/plain part and call it "main_part"
        for part in b.walk():
            if part.get_content_type() == 'text/plain':
                main_part = part.get_payload()
                break
    else:
        main_part = b.get_payload()
    if main_part is None:
        # No text/plain part was found; nothing to summarize
        continue
    summary = summarize(main_part, ratio=0.10)
    print(summary)
    kw = keywords(main_part, words=15)
    print(kw)
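One caveat with the legacy API: get_payload() without arguments returns the payload as stored, so a base64 or quoted-printable part comes back still encoded. A minimal sketch of decoding it follows, assuming a single text part; the sample message is invented and the helper name decoded_text is hypothetical.

```python
import base64
import email

# A tiny hand-made message whose body is base64-encoded (demo data only).
ENCODED_BODY = base64.b64encode(b"Hello, world!")
RAW = (
    b"From: a@example.com\r\n"
    b"To: b@example.com\r\n"
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + ENCODED_BODY + b"\r\n"
)


def decoded_text(part):
    """Return the part's text with the transfer encoding and charset undone."""
    payload = part.get_payload(decode=True)  # bytes, or None for multipart containers
    if payload is None:
        return ""
    # Fall back to UTF-8 when the part does not declare a charset (an assumption).
    charset = part.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")


msg = email.message_from_bytes(RAW)
print(repr(msg.get_payload()))  # still base64 text
print(decoded_text(msg))        # Hello, world!
```

In the loop above you would call such a helper on the chosen text/plain part instead of calling part.get_payload() directly.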
To use the new 3.6+ API, you would adapt this to something like
from email.policy import default as default_email_policy
...
b = email.message_from_binary_file(file, policy=default_email_policy)
main_part = b.get_body(['related', 'plain', 'html'])
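This produces a new email.message.EmailMessage object, which has different methods and different behavior from the legacy email.message.Message class. The documentation suggests that the default policy will one day be passed in by default, at which point old code will switch to the new behavior (and probably also run into some unpleasant surprises and outright breakage).
Also note the get_body() method, which is part of the new 3.6 API and makes it easy to pick out the "probable main part"; though if no text/plain part is available, the code above will fall back to HTML, which you will then need to process further to extract the actual text (maybe look at BeautifulSoup?).
There is no technical, robust, reliable way to separate boilerplate (headers, signatures and so on) from the actual content of an email message. Some HTML email clients may provide hints in the generated message about which <div> contains the content the user typed, but in the general case you are simply left to resort to (frankly, rather hopeless) heuristics.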
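Here is a rough sketch of how the body could be turned into a string with the new API. Note that get_body() returns a message part, not text, so get_content() is still needed. This sketch passes an explicit preference list of plain first and HTML second, and for the HTML fallback it swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The names body_text, html_to_text and _TextExtractor are made up for this example, and the extractor is naive (it would also keep the text inside script or style elements).

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.policy import default as default_email_policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document (very naive)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def body_text(raw_bytes):
    """Return the probable main body of a raw message as plain text."""
    msg = message_from_bytes(raw_bytes, policy=default_email_policy)
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    text = body.get_content()
    if body.get_content_type() == "text/html":
        text = html_to_text(text)
    return text.strip()


# Demo with an invented HTML-only message.
demo = EmailMessage()
demo["Subject"] = "Demo"
demo.set_content("<p>Hello <b>there</b></p>", subtype="html")
print(body_text(demo.as_bytes()))
```

The returned string can then be handed to summarize() and keywords() in place of main_part.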
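Answer 1 (score: 0)
If you only want to remove the "From", "Sent", "To", "Cc", "Subject" and "Forwarded message" labels from the email, you can use a regular expression. The snippet below prints the lines that the pattern identifies as headers.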
import re

# Compile the pattern once, outside the loop
email_component = re.compile(r'((From:|Sent:|To:|Cc:|Subject:|Forwarded message).*)', re.IGNORECASE)

with open('email_input.txt', 'r') as infile:
    lines = infile.readlines()

no_new_lines = [i.strip() for i in lines]
for line in no_new_lines:
    remove_component = re.findall(email_component, line)
    if remove_component:
        print(line)
# output
‐‐‐‐‐‐‐‐‐‐ Forwarded message ‐‐‐‐‐‐‐‐‐‐
From: Mr.A, <.Mr.A@abc.com>
Sent: Wednesday, July 25, 2018 2:27 PM
To: , Tim /ANN; Abd, May /ANN
Cc: Mr.A, ; Theoder Jerry,
Subject: [EXTERNAL] RE: Holdings: XXXX SPA – mfno.1322
As for removing the content after the sign-off: I did not add this to my regex, because emails can be signed off in many ways. Here are some of the most common:
Best,
Best regards,
Best wishes,
Fond regards,
Kind regards,
Regards,
Sincerely,
Sincerely yours,
Thank you,
With appreciation,
With gratitude,
Yours sincerely,
Updated answer one
The updated answer below cleans up more of your email input, but it still needs further cleaning.
import re

with open('email_input.txt', 'r') as infile:
    lines = infile.readlines()

# Strip leading and trailing whitespace from each line
no_new_lines = [i.strip() for i in lines]

# regex to catch header lines
email_component = re.compile(r'((From:|Sent:|To:|Cc:|Subject:|Date:|Forwarded message).*)', re.IGNORECASE)
remove_headers = [x for x in no_new_lines if not email_component.findall(x)]

# regex to catch greeting lines
greeting_component = re.compile(r'(Dear.*)', re.IGNORECASE)
remove_greeting = [x for x in remove_headers if not greeting_component.findall(x)]

# regex to catch lines with contact details
contact_component = re.compile(r'(Phone.*:)|(Cell:.*)|(Email:.*)', re.IGNORECASE)
remove_contacts = [x for x in remove_greeting if not contact_component.findall(x)]

# regex to catch sign-off lines
email_salutation_component = re.compile(r'Best,(.*?)|Best regards,(.*?)|Best wishes,(.*?)|Fond regards,(.*?)|'
                                        r'Kind regards(.*?)|Regards,(.*?)|Sincerely,(.*?)|Sincerely yours,(.*?)|'
                                        r'Thank you,(.*?)|With appreciation,(.*?)|Yours sincerely,(.*?)', re.IGNORECASE)
remove_salutations = [x for x in remove_contacts if not email_salutation_component.findall(x)]

# do something else
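Updated answer two
The updated answer below uses the Python email library. My input file was a raw email exported from my email client. With the code below I was able to extract the body of every email I tried. I also tested the gensim module, and it worked correctly.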
import email
from gensim.summarization import summarize, keywords

with open('email_input.txt', 'r') as infile:
    raw_message = infile.read()

email_body = ''
# Return a message object structure from a string
msg = email.message_from_string(raw_message)
# iterate over all the parts and subparts of a message object tree
for part in msg.walk():
    # Keep the payload of the text/plain part
    if part.get_content_type() == 'text/plain':
        email_body = part.get_payload()

summary = summarize(email_body, ratio=0.10)
print(summary)
kw = keywords(email_body, words=15)
print(kw)
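Final answer
This is my final answer to this question. I hope one of these four answers meets your requirements.
You will have to do some minor cleanup of the output, because I do not know all of your requirements.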
import re

# Compile the greeting pattern once, outside the loop
greeting_component = re.compile(r'(Dear.*)', re.IGNORECASE)

with open('email_input.txt') as infile:
    # Boolean state variable to keep track of whether we want to be printing lines or not
    lines_to_keep = False
    for line in infile:
        # Look for lines that start with a greeting
        if line.startswith("Dear"):
            # set lines_to_keep true and start capturing lines
            lines_to_keep = True
        # Look for lines that start with a sign-off
        elif line.startswith("Regards") or line.startswith("Kind regards"):
            # set lines_to_keep false and stop capturing lines
            lines_to_keep = False
        if lines_to_keep:
            # Skip the greeting line itself
            remove_greeting = re.match(greeting_component, line)
            if not remove_greeting:
                print(line.rstrip('\n'))
# output
The option-2 is fine. By the way, we had received in the past Letter of Authorization for many companies other than Spa and I guess Xxxx does not do bANNiness with them either. If yes, then need to submit withdrawal of Letter of Authorization for those companies and send a Letter of Authorization for spa. stating for any applications submitted. We will send an administrative filing issue letter for both the holder and the agent.
more here....
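If you would rather collect the bodies than print them, the same greeting and sign-off state machine can be wrapped in a function that returns one string per message in the thread. This is only a sketch of the heuristic above, not a reliable parser: the function name extract_bodies and the short list of sign-off phrases are assumptions, and any email that does not open with "Dear" or close with one of the listed phrases will be missed or cut in the wrong place.

```python
import re

# Heuristic patterns (assumed; extend them for your own data).
GREETING = re.compile(r"^dear\b", re.IGNORECASE)
SIGN_OFF = re.compile(
    r"^(kind regards|best regards|regards|thank you|sincerely)\b",
    re.IGNORECASE,
)


def extract_bodies(lines):
    """Return the text found between each greeting line and the next sign-off."""
    bodies, current, keeping = [], [], False
    for raw in lines:
        line = raw.strip()
        if GREETING.match(line):
            # A greeting starts a new body; drop the greeting line itself.
            keeping, current = True, []
            continue
        if keeping and SIGN_OFF.match(line):
            if current:
                bodies.append(" ".join(current))
            keeping = False
            continue
        if keeping and line:
            current.append(line)
    # Keep a trailing body that never reached a sign-off line.
    if keeping and current:
        bodies.append(" ".join(current))
    return bodies


sample = [
    "From: a@example.com",
    "Dear Tim,",
    "The option-2 is fine.",
    "We will send a letter.",
    "Thank you!",
    "Regards,",
    "Mr.A",
    "From: b@example.com",
    "Dear ,",
    "May is on vacation.",
    "Kind regards.",
    "Tim",
]
for body in extract_bodies(sample):
    print(body)
```

With a real file you could call extract_bodies(infile) on the open file object and pass each returned string to summarize() and keywords().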