Adding ID attributes to HTML tags with Python (BeautifulSoup?)

Time: 2018-08-02 18:34:47

Tags: python html beautifulsoup

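I have an HTML file containing certain tags, and I need to add an ID to each tag in the format id="rule_1", id="rule_1.1", id="rule_1.2", id="rule_1.2.1", and so on, so that each ID reflects the tag's position in the hierarchy. For example, the current HTML is: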

<div style="styles">
    <p class="classname">TEXT</p>
    <p class="classname">TEXT</p>
    <ul style="styles">
        <li>
            <p class="classname">TEXT</p>
        </li>
        <li>
            <p class="classname">TEXT</p>
        </li>
    </ul>
</div>

I need the HTML to look like this:

<div style="styles" id="rule_1">
    <p class="classname" id="rule_1.1">TEXT</p>
    <p class="classname" id="rule_1.2">TEXT</p>
    <ul style="styles" id="rule_1.3">
        <li id="rule_1.3.1">
            <p class="classname" id="rule_1.3.1.1">TEXT</p>
        </li>
        <li id="rule_1.3.2">
            <p class="classname" id="rule_1.3.2.1">TEXT</p>
        </li>
    </ul>
</div>

I could write these by hand, but I would prefer to use an existing HTML parser library. Can this be done with BeautifulSoup or another module?

I have tried something like this:

from bs4 import BeautifulSoup as html_parser

with open('outputs/HTML/{}.html'.format(deal), 'r') as read_file:
    html_source = read_file.read()

soup = html_parser(html_source, 'html.parser')
html_tags = soup.find_all(['div', 'p', 'span', 'ul', 'li'])

for each_tag in html_tags:
    each_tag.attrs['id'] = html_tags.index(each_tag)

with open('outputs/HTML/{}-id.html'.format(deal), 'w') as save_file:
    save_file.write(str(soup))

But this only adds id="1", id="2", and so on. How can I make the IDs nested, like 1, 1.1, 1.1.1, etc.?

1 Answer:

Answer 0 (score: 0)

Never mind, figured it out:

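# Maps a parent's id to the number of its children numbered so far.
curr_tags = {}

# find_all() returns tags in document order, so a parent is always
# numbered before its children. enumerate() gives the position directly
# and avoids list.index(), which compares tags by equality.
for position, each_tag in enumerate(html_tags):
    if position == 0:
        # The first matched tag is the root of the numbering.
        each_tag.attrs['id'] = 'rule_1'
    else:
        # Assumes the parent is itself a matched tag that already has an id.
        parent_id = each_tag.parent.attrs['id']
        curr_tags[parent_id] = curr_tags.get(parent_id, 0) + 1
        each_tag.attrs['id'] = '{0}.{1}'.format(parent_id, curr_tags[parent_id])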
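
The loop above depends on every matched tag's parent already carrying an id, so it raises a KeyError if there is more than one top-level tag or if a tag sits inside an element that was not in the find_all() list. As an alternative, here is a minimal recursive sketch that numbers the direct children of each element and then descends into them. The function name add_rule_ids and its prefix and sep parameters are my own, not part of BeautifulSoup, and unlike the original it numbers every tag rather than only div, p, span, ul and li.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def add_rule_ids(parent, prefix="rule_", sep="."):
    """Number the direct child tags of `parent`, then recurse into each one.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    # Only real tags are counted; whitespace and other text nodes are skipped.
    children = [child for child in parent.children if isinstance(child, Tag)]
    for number, child in enumerate(children, start=1):
        child["id"] = "{}{}".format(prefix, number)
        # Descendants extend the id of the tag that contains them.
        add_rule_ids(child, prefix=child["id"] + sep, sep=sep)


# Inline copy of the sample HTML from the question, so the sketch runs
# without any files on disk.
html_source = """
<div style="styles">
    <p class="classname">TEXT</p>
    <p class="classname">TEXT</p>
    <ul style="styles">
        <li><p class="classname">TEXT</p></li>
        <li><p class="classname">TEXT</p></li>
    </ul>
</div>
"""

soup = BeautifulSoup(html_source, "html.parser")
add_rule_ids(soup)

# Collect the ids in document order to check the numbering.
ids = [tag["id"] for tag in soup.find_all(True)]
print(ids)
```

On the sample HTML this yields rule_1 for the div, rule_1.1 to rule_1.3 for its children, and rule_1.3.1, rule_1.3.1.1, rule_1.3.2, rule_1.3.2.1 inside the list. A second top-level tag would simply become rule_2. To restrict numbering to certain tag names, the isinstance filter could also check child.name against a set of allowed names.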