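Python-Imap: Retrieve a MIMEMultipart email and get the body with its HTML tags

Time: 2018-04-19 10:25:53

Tags: python html email tags imap

My situation: I am currently retrieving all the Outlook (2016) emails from my inbox. More specifically, I am retrieving a table: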

¦ Product ¦ Currency ¦ Tenor (months) ¦  Code 1 ¦   
¦ MyItem  ¦  USD     ¦   12           ¦ AAA01   ¦

My goal is to grab the body of each email and then store its contents in an MS SQL Server database.
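The storage step could look roughly like the sketch below. It is only an illustration under stated assumptions: it is written for Python 3, it uses the standard-library `sqlite3` module as a stand-in for SQL Server (a real setup would need a SQL Server driver and connection string instead), and the table name `rfq` and its column names are hypothetical.

```python
import sqlite3

# Rows as they might look once the table has been extracted from an email body.
rows = [
    {"Product": "MyItem", "Currency": "USD", "Tenor (months)": "12", "Code 1": "AAA01"},
]

# In-memory SQLite database: a stand-in for the real SQL Server connection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rfq (product TEXT, currency TEXT, tenor_months INTEGER, code1 TEXT)"
)

# Parameter placeholders keep the email content out of the SQL text itself.
conn.executemany(
    "INSERT INTO rfq (product, currency, tenor_months, code1) VALUES (?, ?, ?, ?)",
    [(r["Product"], r["Currency"], int(r["Tenor (months)"]), r["Code 1"]) for r in rows],
)
conn.commit()

stored = conn.execute("SELECT product, currency, tenor_months, code1 FROM rfq").fetchall()
print(stored)
```

Passing the values as parameters rather than formatting them into the SQL string matters here because the values come from email content that the sender controls.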

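I had a hard time understanding the term 'Multipart'; after a couple of (long) hours it is now somewhat clearer.

So right now my process is:

  • Go through all the emails in the inbox
  • Build a list of email IDs
  • For every ID in this list, check whether the email is Multipart
    • If yes -> I use body = part.get_payload(decode=True)
    • Retrieve the body
    • If no -> I use body = b.get_payload(decode=True)
    • Retrieve the body

So in both cases I use get_payload(decode=True).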
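To see what `get_payload(decode=True)` is being applied to in the multipart case, it can help to list the parts of a message. The sketch below is written for Python 3 and builds a hypothetical sample message locally instead of fetching one over IMAP. It assumes the common layout in which an HTML email is sent as `multipart/alternative` with a `text/plain` and a `text/html` rendering of the same body.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Build a sample message: one multipart/alternative container
# holding two renderings of the same body.
msg = MIMEMultipart("alternative")
msg["Subject"] = "RFQ"
msg.attach(MIMEText("Product\nCurrency\n\nMyItem\nUSD\n", "plain", "utf-8"))
msg.attach(MIMEText("<table><tr><td>MyItem</td><td>USD</td></tr></table>", "html", "utf-8"))

# walk() yields the container first, then each sub-part in order.
content_types = [part.get_content_type() for part in msg.walk()]
print(content_types)

# get_payload(decode=True) returns None for the multipart container
# and the decoded bytes for each leaf part.
payloads = {part.get_content_type(): part.get_payload(decode=True) for part in msg.walk()}
print(payloads["text/html"])
```

With a layout like this, which rendering ends up in `body` depends entirely on which content type the loop accepts first.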

When my email is "Multipart", the body shows up in my debugger as plain text:

Product
Currency
Tenor (months)
Code 1

MyItem  
USD
12
AAA01

When my email is not Multipart, the body shows up in my debugger with its HTML tags:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>
<body>
    <table>
        <tr>
            <td><b>Product</b></td>
            <td><b>Currency</b></td>
            <td><b>Tenor (months)</b></td>
            <td><b>Code 1</b></td>
        </tr>
        <tr>
            <td>MyItem</td>
            <td>USD</td>
            <td>12</td>
            <td>AAA01</td>
        </tr>
    </table>
</body>
</html>

How can I retrieve the body of a Multipart email with its HTML tags instead of as plain text?
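One possible approach, sketched here for Python 3, is to walk the parts and return the first `text/html` part that is not an attachment. The helper name `extract_html_body`, the `utf-8` fallback charset, and the locally built sample message (standing in for the raw bytes returned by `mailbox.fetch`) are all assumptions made for illustration.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def extract_html_body(msg):
    """Return the first non-attachment text/html part as str, or None if there is none."""
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        if "attachment" in str(part.get("Content-Disposition", "")).lower():
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        charset = part.get_content_charset() or "utf-8"  # assumed fallback
        return payload.decode(charset, errors="replace")
    return None


# Hypothetical sample standing in for data[0][1] from mailbox.fetch(...).
sample = MIMEMultipart("alternative")
sample.attach(MIMEText("MyItem USD 12 AAA01", "plain", "utf-8"))
sample.attach(MIMEText("<table><tr><td>MyItem</td></tr></table>", "html", "utf-8"))
raw_bytes = sample.as_bytes()

parsed = email.message_from_bytes(raw_bytes)
html_body = extract_html_body(parsed)
print(html_body)
```

Because `walk()` also yields the message itself when it is not multipart, the same helper covers a single-part HTML email, so the two cases do not need separate branches.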

I need the HTML tags to identify each header and its corresponding value, so that I can use Beautiful Soup and insert all of this data into my MSSQL Server.
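Pairing each header with its value could be sketched as follows. To keep the example self-contained it uses the standard-library `html.parser` module in place of Beautiful Soup, so it is a swapped-in technique and not the Beautiful Soup code itself. It is written for Python 3, the class name `TableRows` is hypothetical, and it assumes the first table row holds the headers, as in the sample above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, None outside <tr>
        self._cell = None  # text fragments of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


html_body = """
<table>
  <tr><td><b>Product</b></td><td><b>Currency</b></td><td><b>Tenor (months)</b></td><td><b>Code 1</b></td></tr>
  <tr><td>MyItem</td><td>USD</td><td>12</td><td>AAA01</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html_body)

# First row is treated as the header row; every later row becomes one record.
header, *values = parser.rows
records = [dict(zip(header, row)) for row in values]
print(records)
```

Each record is a plain dict keyed by the header text, which is a convenient shape for building the rows of a database insert.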

Thank you for your help in getting me to understand MIMEMultipart better!

Here is my (messy) Python code:

import email
import imaplib

@app.route('/ps_rfq_imap', methods=['GET', 'POST'])
def ps_rfq_imap():
    # Connect to the Outlook mailbox over IMAP
    url = 'outlook.mycompayny.com'
    mailbox = imaplib.IMAP4_SSL(url, 993)
    user, password = ('mymail@mycompany.com', 'mypassword')
    mailbox.login(user, password)

    mailbox.list()  # lists all folders on the server
    mailbox.select('INBOX')  # open the inbox

    # search() returns message sequence numbers, not UIDs
    typ, data = mailbox.search(None, 'ALL')
    # to get the UIDs of all emails instead:
    # typ, data = mailbox.uid('search', None, 'ALL')
    ids = data[0]
    id_list = ids.split()
    print id_list
    # get the most recent email id
    latest_email_id = int(id_list[-1])

    # go from the newest message down to message 1
    for i in range(latest_email_id, 0, -1):
        print 'EMAIL ID:'
        print i
        typ, data = mailbox.fetch(i, '(RFC822)')
        # data[0][1] holds the raw RFC 822 message
        b = email.message_from_string(data[0][1])
        body = ""

        email_from = b['from']
        email_subject = b['subject']
        print 'FROM:'
        print email_from
        print 'SUBJECT'
        print email_subject

        if b.is_multipart():
            for part in b.walk():
                ctype = part.get_content_type()
                cdispo = str(part.get('Content-Disposition'))

                # take the first text/plain part that is not an attachment, then stop
                if ctype == 'text/plain' and 'attachment' not in cdispo:
                    body = part.get_payload(decode=True)  # decode
                    print '***** MULTIPART body content *****'
                    print body
                    break
                elif ctype == 'text/html':
                    print 'HTML PART'
                    continue
        else:
            # not multipart - i.e. plain text, no attachments, keeping fingers crossed
            body = b.get_payload(decode=True)
            print '***** SIMPLE body content *****'
            print body

    return body

Edit: here is my updated code, which only captures the HTML part of my emails, in case it helps someone:

from bs4 import BeautifulSoup  # needed at module level

    # this runs inside the loop over message ids
    typ, data = mailbox.fetch(i, '(RFC822)')
    b = email.message_from_string(data[0][1])
    body = ""

    if b.is_multipart():
        email_from = b['from']
        email_subject = b['subject']
        for part in b.walk():
            ctype = part.get_content_type()
            cdispo = str(part.get('Content-Disposition'))
            # skip the text/plain rendering of the body
            if ctype == 'text/plain' and 'attachment' not in cdispo:
                continue

            elif ctype == 'text/html':
                print 'HTML PART'

                body = part.get_payload(decode=True)  # decode

                soup = BeautifulSoup(body, "html.parser")

                # find_all() returns a list, so test for emptiness rather than None
                meta_tags = soup.find_all('meta')

                if meta_tags:
                    print 'WE HAVE FOUND THE BODY: time to process it with BS to get the values of the table'
                    tables = soup.find_all('table')

                continue
    else:
        # not multipart: skip this message
        continue

Best regards,

0 answers:

No answers