Parsing attached .MSG files with Python 3

Date: 2019-04-09 19:06:11

Tags: python-3.x

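I am trying to monitor a phishing inbox that can receive both normal emails (i.e. HTML/text with potential attachments) and emails that have a .MSG file attached.

The goal is for users to send emails to phishing@company.com. Once I have parsed out the various links (potentially malicious) and attachments (also potentially malicious), I will run some analysis on them.

The problem I am running into is the body of the attached .msg file.

With the code below, I can pull the To, From, and Subject of the original email, as well as all of its links. It also pulls down any attachments of the .msg file (i.e. in my test I was able to pull down a PDF inside the .msg). However, I cannot get the recipient, subject, or body of the .msg file itself.

When I print it raw, I get some of that content in a very ugly format, but since the message contains multiple parts, I am clearly doing something wrong when trying to get at that information.

I am still fairly new to Python, so any help would be greatly appreciated.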

import imaplib
import base64
import os
import email
import email.header  # used below to decode the From/To/Subject headers
from bs4 import BeautifulSoup

server = 'mail.server.com'
email_user = 'phishing@company.com'
email_pass = 'XXXXXXXXXXXX'
output_dir = '/tmp/attachments/'
body = ""

def get_body(msg):
    if msg.is_multipart():
        return get_body(msg.get_payload(0))
    else:
        return msg.get_payload(None, True)

def get_attachments(msg):
    for part in msg.walk():
        if part.get_content_maintype()=='multipart':
            continue
        if part.get('Content-Disposition') is None:
            continue
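        fileName = part.get_filename()
        payload = part.get_payload(decode=True)

        # get_payload(decode=True) returns None for message/rfc822 parts,
        # so skip anything without a filename or a decodable payload.
        if fileName and payload is not None:
            # Attachment names come from untrusted mail; drop directory components.
            filePath = os.path.join(output_dir, os.path.basename(fileName))
            with open(filePath, 'wb') as f:
                f.write(payload)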

mail = imaplib.IMAP4_SSL(server)
mail.login(email_user, email_pass)
mail.select('INBOX')

result, data = mail.search(None, 'UNSEEN')
mail_ids = data[0]
id_list = mail_ids.split()
print(id_list)

for emailid in id_list:
    result, email_data = mail.fetch(emailid, '(RFC822)')
    raw_email = email_data[0][1]
    # Parse the raw bytes directly; decoding as UTF-8 first fails on other charsets.
    email_message = email.message_from_bytes(raw_email)
    email_from = str(email.header.make_header(email.header.decode_header(email_message['From'])))
    email_to = str(email.header.make_header(email.header.decode_header(email_message['To'])))
    subject = str(email.header.make_header(email.header.decode_header(email_message['Subject'])))
    print('From: ' + email_from)
    print('To: ' + email_to)
    print('Subject: ' + subject)
	
    # get_attachments() calls .walk(), so it needs the parsed message, not the raw bytes
    get_attachments(email_message)

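    # Only the first sub-part of the top-level message is inspected here, so
    # anything nested inside the attached message is never reached.
    for part in email_message.walk():
        body = part.get_payload(0)
        content = body.get_payload(decode=True)
        soup = BeautifulSoup(content, 'html.parser')
        # href=True skips anchors without an href, which would otherwise be None
        for link in soup.find_all('a', href=True):
            print('Link: ' + link['href'])
        break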
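
Because the difficulty described above comes from the message having several nested parts, it may help to print the MIME tree before trying to extract anything. The following is only a minimal sketch: `describe_structure` is a hypothetical helper, not part of the code above, and the sample message is built in memory purely for illustration.

```python
from email.message import EmailMessage


def describe_structure(msg, depth=0):
    """Return one indented line per MIME part (content type, plus filename if any)."""
    label = msg.get_content_type()
    filename = msg.get_filename()
    if filename:
        label += " (filename=" + filename + ")"
    lines = ["  " * depth + label]
    # Multipart containers and message/rfc822 parts both hold child messages.
    if msg.is_multipart():
        for child in msg.get_payload():
            lines.extend(describe_structure(child, depth + 1))
    return lines


# Illustrative data only: an HTML mail forwarded as an attachment.
inner = EmailMessage()
inner["From"] = "Mallory <mallory@example.com>"
inner["Subject"] = "Your invoice"
inner.set_content('<a href="http://example.com/login">Pay now</a>', subtype="html")

outer = EmailMessage()
outer["From"] = "Reporter <user@company.com>"
outer["Subject"] = "FW: suspicious mail"
outer.set_content("Please check the attached mail.")
outer.add_attachment(inner)  # stored as a message/rfc822 part

lines = describe_structure(outer)
print("\n".join(lines))
```

In the fetch loop above, the same helper could be called on `email_message` to see at which depth the text/html part of the attached mail actually sits.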

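1 Answer:

Answer 0 (score: 0)

I got this working with the following code. I basically had to run several for loops while walking the parts of the .msg, and only then extract the relevant information from the text/html part.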

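import re  # needed for the header regexes below, in addition to the question's imports

for emailid in id_list:
    result, data = mail.fetch(emailid, '(RFC822)')
    # Parse the fetched bytes directly into a message object
    raw = email.message_from_bytes(data[0][1])
    get_attachments(raw)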
    #print(raw)

    header_from = mail.fetch(emailid, "(BODY[HEADER.FIELDS (FROM)])")
    header_from_str = str(header_from)
    mail_from = re.search(r'From:\s.+<(\S+)>', header_from_str)

    header_subject = mail.fetch(emailid, "(BODY[HEADER.FIELDS (SUBJECT)])")
    header_subject_str = str(header_subject)
    mail_subject = re.search(r"Subject:\s(.+)'\)", header_subject_str)
    #mail_body = mail.fetch(emailid, "(BODY[TEXT])")
    # re.search returns None when a header does not match, so guard before .group()
    if mail_from:
        print(mail_from.group(1))
    if mail_subject:
        print(mail_subject.group(1))


    for part in raw.walk():
        if part.get_content_type() == 'message/rfc822':
            part_string = str(part)
            original_from = re.search(r'From:\s.+<(\S+)>\n', part_string)
            original_to = re.search(r'To:\s.+<(\S+)>\n', part_string)
            original_subject = re.search(r'Subject:\s(.+)\n', part_string)
            # Print only the headers that were actually found in the attached message
            for match in (original_from, original_to, original_subject):
                if match:
                    print(match.group(1))
        if part.get_content_type() == 'text/html':
            content = part.get_payload(decode=True)
            #print(content)
            soup = BeautifulSoup(content, 'html.parser')
            # href=True skips anchors without an href, which would otherwise be None
            for link in soup.find_all('a', href=True):
                print('Link: ' + link['href'])
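
As an alternative to the regular expressions above, the headers of the attached mail can probably be read through the email package itself. This is a rough sketch and rests on one assumption: the attachment arrives as a message/rfc822 part, as in the code above. A binary Outlook .msg file would need a different parser. The names `attached_messages`, `summarize`, and `RAW` are hypothetical, and the sample bytes stand in for what `mail.fetch` returns.

```python
import email
from email.header import decode_header, make_header
from email.utils import parseaddr

# Illustrative stand-in for data[0][1] from mail.fetch(emailid, '(RFC822)')
RAW = (
    b"From: Reporter <user@company.com>\r\n"
    b"To: phishing@company.com\r\n"
    b"Subject: FW: suspicious mail\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XYZ"\r\n'
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"Please check the attached mail.\r\n"
    b"--XYZ\r\n"
    b"Content-Type: message/rfc822\r\n"
    b'Content-Disposition: attachment; filename="original.eml"\r\n'
    b"\r\n"
    b"From: Mallory <mallory@example.com>\r\n"
    b"To: Victim <user@company.com>\r\n"
    b"Subject: Your invoice\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b'<a href="http://example.com/login">Pay now</a>\r\n'
    b"--XYZ--\r\n"
)


def decoded(value):
    """Decode an RFC 2047 encoded header value; return '' if the header is missing."""
    return str(make_header(decode_header(value))) if value else ""


def attached_messages(msg):
    """Yield every message embedded as a message/rfc822 part."""
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            yield part.get_payload(0)


def summarize(msg):
    """Collect the sender address, recipient address, and subject of a message."""
    return {
        "from": parseaddr(decoded(msg["From"]))[1],
        "to": parseaddr(decoded(msg["To"]))[1],
        "subject": decoded(msg["Subject"]),
    }


message = email.message_from_bytes(RAW)
originals = [summarize(m) for m in attached_messages(message)]
for original in originals:
    print(original)
```

Reading headers from the parsed message avoids depending on how `str(part)` happens to lay out the text, and a missing header yields an empty string rather than a failed match.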