I have some HTML stored in a list named html.
html = [u'<body bgcolor="#F2F2F2" lang=EN-GB link=blue vlink=purple style=\'tab-interval:36.0pt\'>\r\n\r\n<div class=WordSection1>\r\n\r\n<p class=MsoNormal style=\'margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid-align:none;text-autospace:none\'>\r\n<b><span style=\'font-family:"Calibri",sans-serif;mso-bidi-font-family:Calibri; color:black\'>From:<span style=\'mso-tab-count:1\'></span></span></b>\r\n\r\n\r\n\r\n']
I parse it with BeautifulSoup:
from bs4 import BeautifulSoup, Comment

for fragment in html:
    soup = BeautifulSoup(fragment, "html.parser")
    # Remove HTML comments before collecting tags
    for comment in soup(text=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Collect only the tags of interest
    scrape_selected_tags = soup.find_all(["a", "abbr", "acronym", "address", "b", "big", "br", "caption",
                                          "cite", "code", "datalist", "dd", "dfn", "dir", "dl", "dt",
                                          "div", "em", "figcaption", "footer", "h1", "h2", "h3", "h4",
                                          "h5", "h6", "header", "i", "img", "iframe", "label", "legend",
                                          "li", "mark", "ol", "p", "pre", "q", "small", "source",
                                          "strike", "strong", "span", "sub", "sup", "table", "tbody",
                                          "td", "th", "time", "title", "tt", "tr", "u", "ul", "video",
                                          "wbr"], recursive=False)
However, the output seems to be duplicated. This is what I get when I print scrape_selected_tags:
[<div class="WordSection1">\n<p class="MsoNormal" style="margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid-align:none;text-autospace:none">\n<b><span style='font-family:"Calibri",sans-serif;mso-bidi-font-family:Calibri; color:black'>From:<span style="mso-tab-count:1"></span></span></b>\n</p></div>, <p class="MsoNormal" style="margin-left:120.0pt;text-indent:-120.0pt;tab-stops:120.0pt;mso-layout-grid-align:none;text-autospace:none">\n<b><span style='font-family:"Calibri",sans-serif;mso-bidi-font-family:Calibri; color:black'>From:<span style="mso-tab-count:1"></span></span></b>\n</p>, <b><span style='font-family:"Calibri",sans-serif;mso-bidi-font-family:Calibri; color:black'>From:<span style="mso-tab-count:1"></span></span></b>, <span style='font-family:"Calibri",sans-serif;mso-bidi-font-family:Calibri; color:black'>From:<span style="mso-tab-count:1"></span></span>, <span style="mso-tab-count:1"></span>]
Does anyone know why this happens and how to stop it?
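
For reference, here is a minimal sketch of the behavior in question. It rests on assumptions: `sample` is a shortened stand-in for the Word-generated markup, `wanted` is a small subset of the tag list, and `all_matches`, `outermost`, and `direct` are names made up for illustration. The printed list above looks like what `find_all` returns when it searches recursively: each matching tag is returned on its own, and the printed form of a parent also contains its children, so nested tags appear more than once. This may not be the only cause in the real data.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the markup in the question (assumption)
sample = (
    '<body><div class="WordSection1"><p class="MsoNormal">'
    '<b><span>From:<span></span></span></b></p></div></body>'
)
wanted = ["div", "p", "b", "span"]  # subset of the original tag list

soup = BeautifulSoup(sample, "html.parser")

# Recursive search (the default): every matching descendant is returned,
# so a nested tag shows up inside its parent and again as its own entry.
all_matches = soup.find_all(wanted)
print([tag.name for tag in all_matches])

# One way to avoid the repetition: keep only matches that have
# no matching ancestor, i.e. the outermost ones.
outermost = [tag for tag in all_matches if tag.find_parent(wanted) is None]
print([tag.name for tag in outermost])

# recursive=False inspects direct children only. The only direct child of
# the soup here is <body>, which is not in the list, so nothing matches.
print(soup.find_all(wanted, recursive=False))

# Calling it on the container itself returns its matching direct children.
direct = soup.body.find_all(wanted, recursive=False)
print([tag.name for tag in direct])
```

Whether to keep only the outermost tags or only the direct children of `body` depends on what the later processing needs. If only the text is wanted, calling `get_text()` on the outermost tags would avoid extracting the same text several times.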