Question

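I am using the function below to parse emails. It handles "simple" multipart emails, but when an email defines multiple boundaries (sub-parts) it raises an error (UnboundLocalError: local variable 'html' referenced before assignment). I want the script to separate the text and html parts and return only the html part (or the text part if there is no html part).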

def get_text(msg):
    text = ""
    if msg.is_multipart():
        for part in msg.get_payload():
            if part.get_content_charset() is None:
                charset = chardet.detect(str(part))['encoding']
            else:
                charset = part.get_content_charset()
            if part.get_content_type() == 'text/plain':
                text = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'text/html':
                html = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
        if html is None:
            return text.strip()
        else:
            return html.strip()
    else:
        text = unicode(msg.get_payload(decode=True),msg.get_content_charset(),'ignore').encode('utf8','replace')
        return text.strip()

Answer 1

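As the comments say, you always check html, but you only assign it in one specific case. That is what the error is telling you: you are referencing html before it has been assigned. In Python, testing whether a name is None is invalid if nothing has been assigned to that name yet. For example, open an interactive Python prompt: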

>>> if y is None:
...   print 'none'
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'y' is not defined

As you can see, you cannot test whether a variable exists simply by comparing it with None. Now back to your specific case.

You need to set html to None first, and then check later whether it is still None. That is, edit your code like this:

def get_text(msg):
    text = ""
    if msg.is_multipart():
        html = None
        for part in msg.get_payload():
            if part.get_content_charset() is None:
                charset = chardet.detect(str(part))['encoding']
            else:
                charset = part.get_content_charset()
            if part.get_content_type() == 'text/plain':
                text = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'text/html':
                html = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
        if html is None:
            return text.strip()
        else:
            return html.strip()
    else:
        text = unicode(msg.get_payload(decode=True),msg.get_content_charset(),'ignore').encode('utf8','replace')
        return text.strip()

This explains it a bit further: http://code.activestate.com/recipes/59892-testing-if-a-variable-is-defined/
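To make the difference concrete, here is a minimal sketch written for Python 3. The function names `pick_unsafe` and `pick_safe` and the `(content_type, body)` tuples are hypothetical stand-ins for the email parts, not part of the original code. The first function reproduces the bug; the second initialises `html` before the loop so the later `is None` test is always valid.

```python
def pick_unsafe(parts):
    """Mimics the question's bug: `html` is bound only if an HTML part is seen."""
    text = ""
    for content_type, body in parts:
        if content_type == "text/plain":
            text = body
        if content_type == "text/html":
            html = body
    # If no HTML part was seen, `html` was never assigned and this line
    # raises UnboundLocalError.
    if html is None:
        return text
    return html


def pick_safe(parts):
    """Same logic, but `html` always has a value before it is tested."""
    text = ""
    html = None  # bind the name up front
    for content_type, body in parts:
        if content_type == "text/plain":
            text = body
        if content_type == "text/html":
            html = body
    return text if html is None else html
```

Calling `pick_unsafe` with only a `text/plain` entry raises the error from the question, while `pick_safe` returns the plain text.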

Answer 2

Here is the same code with OlliM's helpful suggestion applied. Without this change, you cannot correctly parse "multipart/alternative" containers in emails.

import chardet

def get_text(msg):
    """ Parses email message text, given message object
    This doesn't support infinite recursive parts, but mail is usually not so naughty.
    """
    text = ""
    if msg.is_multipart():
        html = None
        for part in msg.get_payload():
            if part.get_content_charset() is None:
                charset = chardet.detect(str(part))['encoding']
            else:
                charset = part.get_content_charset()
            if part.get_content_type() == 'text/plain':
                text = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'text/html':
                html = unicode(part.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
            if part.get_content_type() == 'multipart/alternative':
                for subpart in part.get_payload():
                    if subpart.get_content_charset() is None:
                        charset = chardet.detect(str(subpart))['encoding']
                    else:
                        charset = subpart.get_content_charset()
                    if subpart.get_content_type() == 'text/plain':
                        text = unicode(subpart.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')
                    if subpart.get_content_type() == 'text/html':
                        html = unicode(subpart.get_payload(decode=True),str(charset),"ignore").encode('utf8','replace')

        if html is None:
            return text.strip()
        else:
            return html.strip()
    else:
        text = unicode(msg.get_payload(decode=True),msg.get_content_charset(),'ignore').encode('utf8','replace')
        return text.strip()

Writing a more elegant structure that does not repeat any code is left as an exercise for the reader.

Also, take a look at this helpful diagram of container structure.
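One possible shape for the less repetitive structure mentioned above, as a sketch rather than the answerer's own solution: the standard library's `Message.walk()` visits the message and every nested sub-part, so the duplicated inner loop is not needed. This version targets Python 3 and, as an assumption of the sketch, falls back to UTF-8 instead of calling chardet; `decode_part` is a hypothetical helper name.

```python
from email.message import Message


def decode_part(part: Message) -> str:
    """Decode a leaf part to str, assuming UTF-8 when no charset is declared."""
    payload = part.get_payload(decode=True)  # bytes, or None for containers
    if payload is None:
        return ""
    charset = part.get_content_charset() or "utf-8"
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:  # the declared charset name is unknown to Python
        return payload.decode("utf-8", errors="replace")


def get_text(msg: Message) -> str:
    """Return the HTML body if one exists, otherwise the plain-text body."""
    text = ""
    html = None
    # walk() yields msg itself and then all sub-parts, at any nesting depth.
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no body of their own
        content_type = part.get_content_type()
        if content_type == "text/plain":
            text = decode_part(part)
        elif content_type == "text/html":
            html = decode_part(part)
    return (text if html is None else html).strip()
```

As in the code above, the last matching part wins, so a text/plain attachment could overwrite the body text. Unlike the original, a non-multipart message that is neither text/plain nor text/html yields an empty string.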

Answer 3

My code needed the following changes: unicode changed to str, and str(part) changed to bytes(part) in charset = chardet.detect(str(part))['encoding']. I then applied bs4 to the html. The code works well for my project. Thanks.
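The changes described above might look roughly like this on Python 3. This is a sketch under assumptions, not the commenter's actual code: `part_charset` and `part_text` are hypothetical helper names, chardet is treated as optional, and the UTF-8 fallback is an addition of the sketch. The bs4 step is not shown.

```python
from email.message import Message

try:
    import chardet  # third-party package; treated as optional in this sketch
except ImportError:
    chardet = None


def part_charset(part: Message) -> str:
    """Return the declared charset, or a guess when none is declared."""
    declared = part.get_content_charset()
    if declared is not None:
        return declared
    if chardet is not None:
        # On Python 3, chardet.detect() expects bytes, hence bytes(part)
        # rather than str(part).
        guessed = chardet.detect(bytes(part))["encoding"]
        if guessed:
            return guessed
    return "utf-8"  # assumed fallback, not part of the original answer


def part_text(part: Message) -> str:
    """Decode a part's payload to str (Python 3 str is already Unicode)."""
    raw = part.get_payload(decode=True) or b""
    return raw.decode(part_charset(part), errors="replace")
```

Because Python 3 `str` is already Unicode, the `.encode('utf8', 'replace')` step from the Python 2 version is dropped here. A charset name that Python does not recognise would still raise `LookupError` in `part_text`.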

Parsing multipart emails with sub-parts using Python
