BeautifulSoup: handling child tags and removing duplicates

Date: 2018-01-06 19:27:30

Tags: python html beautifulsoup

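I am trying to clean up some HTML by parsing it with BeautifulSoup in Python 2.

BeautifulSoup parses each raw_html fragment stored under a website_id in html_dict. The code also strips all attributes from the selected HTML tags (<a>, <b>, <p>).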

from bs4 import BeautifulSoup

html_dict = {"l0000": ["<a href='some url'>test</a>", "lol", "<a><b>test</b></a>"], "l0001":["<p>this is html</p>", "<p>this is html</p>"]}

clean_html = {}

for website_id, raw_html in html_dict.items():
    for fragment in raw_html:
        soup = BeautifulSoup(fragment, 'html.parser')
        scrape_selected_tags = soup.find_all(["a", "b", "p"])

        # Remove attributes from the selected HTML tags
        for tag in scrape_selected_tags:
            tag.attrs = {}

        print website_id, scrape_selected_tags

Output:

l0001 [<p>this is html</p>]
l0001 [<p>this is html</p>]
l0000 [<a>test</a>]
l0000 []
l0000 [<a><b>test</b></a>, <b>test</b>]

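I have two questions:

1) The last line of output contains "test" twice. I think this is because the text is wrapped in both an <a> and a <b> tag? How can I handle child tags so that only <a><b>test</b></a> is output?

2) Given a unique website_id, how can I remove duplicates so that <p>this is html</p> appears only once for l0001? I know that scrape_selected_tags is of type bs4.element.ResultSet, but I don't know how to process it and store the new output in clean_html in the same format as html_dict.

Thanks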

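1 Answer:

Answer 0 (score: 1)

1) Set the recursive parameter to False. find_all will then match only direct children of the soup and will not descend any deeper. The catch with this approach is that nested child tags keep their attributes, so you need one more loop to clear them.

2) Use a set (or a list comprehension) to keep only unique tags.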
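To see the effect of recursive in isolation, here is a minimal sketch using the nested fragment from the question. The variable names all_matches and top_level are illustrative only and do not appear in the original code.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a class='1'><b class='2'>test</b></a>", "html.parser")

# Default (recursive=True): matches <a> and also the <b> nested inside it,
# which is why "test" shows up twice in the question's output.
all_matches = soup.find_all(["a", "b", "p"])

# recursive=False: only direct children of the soup object are considered,
# so the nested <b> is not returned as a separate result.
top_level = soup.find_all(["a", "b", "p"], recursive=False)

print([tag.name for tag in all_matches])
print([tag.name for tag in top_level])
```

The nested <b> is still present inside the returned <a>; it is just no longer listed as its own match.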

from bs4 import BeautifulSoup

html_dict = {
    "l0000":["<a href='some url'>test</a>", "lol", "<a class='1'><b class='2'>test</b></a>"], 
    "l0001":["<p>this is html</p>", "<p>this is html</p>"]
}
clean_html = {}

for website_id, raw_html in html_dict.items():
    clean_html[website_id] = []
    for fragment in raw_html:
        soup = BeautifulSoup(fragment, 'html.parser')
        # recursive=False matches only top-level tags, not nested ones
        scrape_selected_tags = soup.find_all(["a", "b", "p"], recursive=False)

        # Clear attributes on the top-level tags
        for tag in scrape_selected_tags:
            tag.attrs = {}
        # Clear attributes on every tag nested inside them
        for child in [c for p in scrape_selected_tags for c in p.find_all()]:
            child.attrs = {}
        # A set drops duplicate tags (it does not preserve their order)
        clean_tags = list(set(scrape_selected_tags + clean_html[website_id]))
        clean_html[website_id] = clean_tags

print(clean_html)

Output:

{'l0001': [<p>this is html</p>], 'l0000': [<a><b>test</b></a>, <a>test</a>]}
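
Because a set does not preserve order, the tags within each list can come back in a different order than they appeared in the input. If order matters, one possible variation is sketched below. It is not part of the original answer: it assumes Python 3.7+ (ordered dicts), stores the cleaned tags as strings instead of Tag objects, and uses a hypothetical helper named clean_fragments.

```python
from bs4 import BeautifulSoup

html_dict = {
    "l0000": ["<a href='some url'>test</a>", "lol", "<a class='1'><b class='2'>test</b></a>"],
    "l0001": ["<p>this is html</p>", "<p>this is html</p>"],
}


def clean_fragments(fragments, tag_names=("a", "b", "p")):
    """Hypothetical helper: unique cleaned top-level tags as strings, in first-seen order."""
    seen = set()
    cleaned = []
    for fragment in fragments:
        soup = BeautifulSoup(fragment, "html.parser")
        for tag in soup.find_all(list(tag_names), recursive=False):
            # Strip attributes from the tag itself and from every nested tag.
            tag.attrs = {}
            for child in tag.find_all(True):
                child.attrs = {}
            markup = str(tag)
            # Keep only the first occurrence of each piece of markup.
            if markup not in seen:
                seen.add(markup)
                cleaned.append(markup)
    return cleaned


clean_html = {website_id: clean_fragments(raw_html)
              for website_id, raw_html in html_dict.items()}
print(clean_html)
```

Storing strings keeps clean_html in the same shape as html_dict (a dict of lists of HTML strings), which is what the question asked for; keep the Tag objects instead if you need to manipulate them further.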