BeautifulSoup find_all returns duplicated output

Time: 2018-01-31 11:17:40

Tags: python beautifulsoup

I have HTML stored in a list named html.

html = [u'<body bgcolor="#F2F2F2" lang=EN-GB link=blue vlink=purple style=\'tab-interval:36.0pt\'>\r\n\r\n<div class=WordSection1>\r\n\r\n<p class=MsoNormal style=\'margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid-align:none;text-autospace:none\'>\r\n<b><span style=\'font-family:"Calibri",sans-serif;mso-bidi-font-family:Calibri; color:black\'>From:<span style=\'mso-tab-count:1\'></span></span></b>\r\n\r\n\r\n\r\n']

I parse it with BeautifulSoup:

from bs4 import BeautifulSoup, Comment

for i in html:
    soup = BeautifulSoup(i, "html.parser")

    # Strip HTML comments before collecting tags
    for comment in soup(text=lambda text: isinstance(text, Comment)):
        comment.extract()

    scrape_selected_tags = soup.find_all(["a", "abbr", "acronym", "address", "b", "big", "br", "caption",
                                          "cite", "code", "datalist", "dd", "dfn", "dir", "dl", "dt",
                                          "div", "em", "figcaption", "footer", "h1", "h2", "h3", "h4",
                                          "h5", "h6", "header", "i", "img", "iframe", "label", "legend",
                                          "li", "mark", "ol", "p", "pre", "q", "small", "source",
                                          "strike", "strong", "span", "sub", "sup", "table", "tbody",
                                          "td", "th", "time", "title", "tt", "tr", "u", "ul", "video",
                                          "wbr"], recursive=False)

However, the output seems to be duplicated. This is what I get when I print scrape_selected_tags:

[<div class="WordSection1">\n<p class="MsoNormal" style="margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid-align:none;text-autospace:none">\n<b><span style='font-family:"Calibri",sans-serif;mso-bidi-font-family:Calibri; color:black'>From:<span style="mso-tab-count:1"></span></span></b>\n</p></div>, <p class="MsoNormal" style="margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid-align:none;text-autospace:none">\n<b><span style='font-family:"Calibri",sans-serif;mso-bidi-font-family:Calibri; color:black'>From:<span style="mso-tab-count:1"></span></span></b>\n</p>, <b><span style='font-family:"Calibri",sans-serif;mso-bidi-font-family:Calibri; color:black'>From:<span style="mso-tab-count:1"></span></span></b>, <span style='font-family:"Calibri",sans-serif;mso-bidi-font-family:Calibri; color:black'>From:<span style="mso-tab-count:1"></span></span>, <span style="mso-tab-count:1"></span>]

Does anyone know why this happens and how to stop it?
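One likely explanation, offered as a sketch rather than a confirmed answer: the printed list looks like what find_all returns when it searches recursively (its default). Each matching tag is returned as its own element, and printing a tag also prints everything nested inside it, so the inner p, b and span tags appear once inside the div and again on their own. The example below uses a simplified stand-in for the markup and a shortened tag list (both assumptions), and outermost_matches is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the markup in the question (an assumption,
# not the original email body).
sample = (
    '<body><div class="WordSection1"><p class="MsoNormal">'
    '<b><span>From:<span></span></span></b></p></div></body>'
)
wanted = ["div", "p", "b", "span"]  # shortened tag list for illustration

soup = BeautifulSoup(sample, "html.parser")

# Default search is recursive: every matching tag at any depth is returned,
# so nested tags show up on their own and again inside their ancestors.
matches = soup.find_all(wanted)
print([tag.name for tag in matches])  # ['div', 'p', 'b', 'span', 'span']


def outermost_matches(root, names):
    """Hypothetical helper: keep only matches with no matching ancestor."""
    found = root.find_all(names)
    found_ids = {id(tag) for tag in found}
    return [
        tag for tag in found
        if not any(id(parent) in found_ids for parent in tag.parents)
    ]


outer = outermost_matches(soup, wanted)
print([tag.name for tag in outer])  # ['div']

# recursive=False only inspects direct children of the object it is called on.
direct = soup.body.find_all(wanted, recursive=False)
print([tag.name for tag in direct])  # ['div']
```

Note that with recursive=False called on the soup object itself, only its direct children are considered. In this sample that is just the body tag, which is not in the list, so the result would be empty. Calling it on soup.body, or filtering to the outermost matches as above, avoids the nested repeats.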

0 answers:

No answers yet