Find all HTML content between heading tags and wrap it in section tags in Python

Posted: 2019-02-27 10:01:32

Tags: python-3.x

In Python, I want to wrap each section of content in a section tag. I have the following HTML content:

<h2>Heading 2.1</h2>
<p>Para 1</p>
<p>Para 2</p>
<h3>Heading 3.1</h3>
<p>Para 3</p>
<p>Para 4</p>
<h2>Heading 2.2</h2>
<p>Para 5</p>
<h3>Heading 3.2</h3>
<p>Para 6</p>

I want it to become:

<section id="1">
    <h2>Heading 2.1</h2>
    <p>Para 1</p>
    <p>Para 2</p>
    <section id="1.1">
        <h3>Heading 3.1</h3>
        <p>Para 3</p>
        <p>Para 4</p>
    </section>
</section>
<section id="2">
    <h2>Heading 2.2</h2>
    <p>Para 5</p>
    <section id="2.1">
        <h3>Heading 3.2</h3>
        <p>Para 6</p>
    </section>
</section>

1 answer:

Answer 0 (score: 0)

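I thought I would give this a try, but it turned out to be much harder than I had hoped, and I could not manage it without some undesirable behaviour, such as modifying the string (or rather replacing it, since strings are immutable) while I am iterating over it.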
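To illustrate the concern, here is a minimal sketch (the sample string and variable names are made up for this example) of how an index computed before an insertion goes stale once a new string has been built:

```python
text = "<h2>A</h2><h2>B</h2>"
# Position of the second heading in the original string.
second_heading = text.find("<h2>", 1)

# "Inserting" really builds a new string, and everything after the
# insertion point moves right by len(opening).
opening = '<section id="1">'
new_text = opening + text

# The old index no longer points at the heading...
stale = new_text[second_heading:second_heading + 4]
# ...unless it is shifted by the length of the inserted text.
shifted = second_heading + len(opening)
fresh = new_text[shifted:shifted + 4]

print(repr(stale), repr(fresh))
```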

I'm sure there is a better way, and I hope someone suggests one, but here is what I did:

html_string = '''<h2>Heading 2.1</h2>
<p>Para 1</p>
<p>Para 2</p>
<h3>Heading 3.1</h3>
<p>Para 3</p>
<p>Para 4</p>
<h2>Heading 2.2</h2>
<p>Para 5</p>
<h3>Heading 3.2</h3>
<p>Para 6</p>'''

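def depth_wrap(input_string, current_depth, base_heading_label=""):
    """Wrap each <hN> block (N = current_depth) in a numbered <section>, recursing into deeper headings."""
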
    # Index to track where we're searching from in the string.
    current_search_index = 0
    # Counter for number of sections seen at this depth.
    depth_counter = 1

    # String we'll insert on ending a section.
    end_section_string = '\n</section>\n'

    # Set the return string to the input string.
    return_string = input_string

    # Loop through looking for Headings
    while True:

        # String we'll insert if we find a new heading.
        start_string_to_insert = '<section id="' + base_heading_label + str(depth_counter) + '">\n'
        #  String to search for headings at the current depth.
        search_string = '<h' + str(current_depth) + '>'

        # Where is the next header at this depth?
        index_of_next_header = return_string.find(search_string, current_search_index)

        # Where is the next header at this depth, after the above one.
        index_of_next_header_at_same_depth = return_string.find(search_string, index_of_next_header +1)

        # If no headers found, then break the loop.
        if index_of_next_header == -1:
            break

        # Is this the last header at this depth?
        if index_of_next_header_at_same_depth == -1:
            # Extract content from this header to the end of the string.
            string_between_headers = return_string[index_of_next_header:]

            # Look for any headers at the next level down, recurse, and then end the section.
            next_level_string = depth_wrap(string_between_headers, current_depth + 1, base_heading_label + str(depth_counter) + '.')\
                                + end_section_string
            # Splice the result in by position; str.replace() could match an identical earlier section.
            return_string = return_string[:index_of_next_header] + next_level_string

            # Add the start of the section last to avoid shifting the indices.
            return_string = return_string[:index_of_next_header] + start_string_to_insert + return_string[index_of_next_header:]
            break
        else:
            # Extract content from between this header and the next.
            string_between_headers = return_string[index_of_next_header: index_of_next_header_at_same_depth]

            # Look for any headers at the next level down, recurse, and then end the section.
            next_level_string = depth_wrap(string_between_headers, current_depth + 1, base_heading_label + str(depth_counter) + '.') \
                                + end_section_string
            # Splice the result in by position; str.replace() could match an identical earlier section.
            return_string = (return_string[:index_of_next_header] + next_level_string
                             + return_string[index_of_next_header_at_same_depth:])
            # Add the start of the section last to avoid shifting the indices.
            return_string = return_string[:index_of_next_header] + start_string_to_insert + return_string[index_of_next_header:]

            # Update the search index to search from after the updated section of text.
            current_search_index = index_of_next_header + len(start_string_to_insert) + len(next_level_string)

            # Update the depth counter for labelling.
            depth_counter += 1

    # Strip any extra line returns or spaces from end and return.
    return return_string.strip()


print(depth_wrap(html_string, 2))

I also could not get the indentation to work consistently in all cases, so I left it out. A module like Beautiful Soup can prettify the output if needed.
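As one possible alternative, here is a rough sketch of a single-pass, stack-based approach that also handles the indentation. It is not the method used above: the names `wrap_sections` and `HEADING_RE` are made up for this example, and it assumes every element sits on its own line, as in the question's input. Anything less regular would be better served by a real HTML parser.

```python
import re

# Hypothetical helper pattern: matches an opening <h1>..<h6> tag at the start of a line.
HEADING_RE = re.compile(r"<h([1-6])[\s>]")


def wrap_sections(html, indent="    "):
    """Wrap heading-delimited content in nested, numbered <section> tags.

    Assumes one element per line (an assumption of this sketch).
    """
    out = []
    # Each open section is [heading_level, section_id, number_of_child_sections].
    stack = []
    top_level_count = 0

    for line in html.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING_RE.match(line)
        if match:
            level = int(match.group(1))
            # Close every open section at the same or a deeper heading level.
            while stack and stack[-1][0] >= level:
                stack.pop()
                out.append(indent * len(stack) + "</section>")
            # Build the dotted id from the parent section, if there is one.
            if stack:
                stack[-1][2] += 1
                section_id = f"{stack[-1][1]}.{stack[-1][2]}"
            else:
                top_level_count += 1
                section_id = str(top_level_count)
            out.append(indent * len(stack) + f'<section id="{section_id}">')
            stack.append([level, section_id, 0])
        # Content is indented one step per currently open section.
        out.append(indent * len(stack) + line)

    # Close whatever is still open at the end of the input.
    while stack:
        stack.pop()
        out.append(indent * len(stack) + "</section>")
    return "\n".join(out)


sample = "<h2>Heading 2.1</h2>\n<p>Para 1</p>\n<h3>Heading 3.1</h3>\n<p>Para 3</p>"
print(wrap_sections(sample))
```

Because the output is built line by line instead of by editing the input string, no search indices need adjusting, and the nesting depth of the stack doubles as the indentation level.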