In Python, I want to wrap sections of content in section tags. I have the following HTML content:
<h2>Heading 2.1</h2>
<p>Para 1</p>
<p>Para 2</p>
<h3>Heading 3.1</h3>
<p>Para 3</p>
<p>Para 4</p>
<h2>Heading 2.2</h2>
<p>Para 5</p>
<h3>Heading 3.2</h3>
<p>Para 6</p>
I want it to become:
<section id="1">
<h2>Heading 2.1</h2>
<p>Para 1</p>
<p>Para 2</p>
<section id="1.1">
<h3>Heading 3.1</h3>
<p>Para 3</p>
<p>Para 4</p>
</section>
</section>
<section id="2">
<h2>Heading 2.2</h2>
<p>Para 5</p>
<section id="2.1">
<h3>Heading 3.2</h3>
<p>Para 6</p>
</section>
</section>
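Answer 0 (score: 0)
I had a go at this, but it ended up being a lot harder than I had hoped, and it is not without some undesirable behaviour, such as modifying the string (or rather replacing it, since strings are immutable) while I am iterating over it.
I'm sure there is a better way, and hopefully someone will suggest one, but here is what I did: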
html_string = '''<h2>Heading 2.1</h2>
<p>Para 1</p>
<p>Para 2</p>
<h3>Heading 3.1</h3>
<p>Para 3</p>
<p>Para 4</p>
<h2>Heading 2.2</h2>
<p>Para 5</p>
<h3>Heading 3.2</h3>
<p>Para 6</p>'''
def depth_wrap(input_string, current_depth, base_heading_label = ""):
    # Index to track where we're searching from in the string.
    current_search_index = 0
    # Counter for number of sections seen at this depth.
    depth_counter = 1
    # String we'll insert on ending a section.
    end_section_string = '\n</section>\n'
    # Set the return string to the input string.
    return_string = input_string
    # Loop through looking for Headings
    while True:
        # String we'll insert if we find a new heading.
        start_string_to_insert = '<section id="' + base_heading_label + str(depth_counter) + '">\n'
        # String to search for headings at the current depth.
        search_string = '<h' + str(current_depth) + '>'
        # Where is the next header at this depth?
        index_of_next_header = return_string.find(search_string, current_search_index)
        # Where is the next header at this depth, after the above one.
        index_of_next_header_at_same_depth = return_string.find(search_string, index_of_next_header +1)
        # If no headers found, then break the loop.
        if index_of_next_header == -1:
            break
        # Is this the last header at this depth?
        if index_of_next_header_at_same_depth == -1:
            # Extract content from this header to the end of the string.
            string_between_headers = return_string[index_of_next_header:]
            # Look for any headers at the next level down, recurse, and then end the section.
            next_level_string = depth_wrap(string_between_headers, current_depth + 1, base_heading_label + str(depth_counter) + '.')\
                                + end_section_string
            # Replace the string between with the updated result from above.
            return_string = return_string.replace(string_between_headers, next_level_string, 1)
            # Add the start of the section last to avoid shifting the indices.
            return_string = return_string[:index_of_next_header] + start_string_to_insert + return_string[index_of_next_header:]
            break
        else:
            # Extract content from between this header and the next.
            string_between_headers = return_string[index_of_next_header: index_of_next_header_at_same_depth]
            # Look for any headers at the next level down, recurse, and then end the section.
            next_level_string = depth_wrap(string_between_headers, current_depth + 1, base_heading_label + str(depth_counter) + '.')\
                                + end_section_string
            # Replace the string between with the updated result from above.
            return_string = return_string.replace(string_between_headers, next_level_string, 1)
            # Add the start of the section last to avoid shifting the indices.
            return_string = return_string[:index_of_next_header] + start_string_to_insert + return_string[index_of_next_header:]
            # Update the search index to search from after the updated section of text.
            current_search_index = index_of_next_header + len(start_string_to_insert) + len(next_level_string)
            # Update the depth counter for labelling.
            depth_counter += 1
    # Strip any extra line returns or spaces from end and return.
    return return_string.strip()

print(depth_wrap(html_string, 2))
I also could not get the indentation to work consistently in all cases, so I left it out. A module such as Beautiful Soup can prettify the output if needed.
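For comparison, here is a minimal sketch of a different approach, not the method used above: a single pass over the lines that keeps a stack of open heading levels, so the string is never modified while it is being searched. It assumes every element sits on its own line and that a heading line starts with its `<hN` tag, as in the sample input. The names `wrap_sections`, `HEADING_RE`, `sample_html` and the `indent` parameter are made up for this illustration.

```python
import re

# Matches an opening heading tag such as <h2> or <h3 class="x"> at line start.
HEADING_RE = re.compile(r'<h([1-6])\b', re.IGNORECASE)


def wrap_sections(html, indent=''):
    """Wrap each heading and the content after it in a numbered <section>."""
    out = []       # output lines
    stack = []     # heading levels of the sections currently open
    counters = []  # counters[d] = sections opened so far at nesting depth d

    for raw_line in html.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        match = HEADING_RE.match(line)
        if match:
            level = int(match.group(1))
            # Close every open section at this heading level or deeper.
            while stack and stack[-1] >= level:
                stack.pop()
                out.append(indent * len(stack) + '</section>')
            depth = len(stack)
            # Drop counters for deeper levels, then count this section.
            del counters[depth + 1:]
            if len(counters) == depth:
                counters.append(0)
            counters[depth] += 1
            section_id = '.'.join(str(n) for n in counters)
            out.append(indent * depth + '<section id="' + section_id + '">')
            stack.append(level)
        out.append(indent * len(stack) + line)

    # Close whatever is still open at the end of the input.
    while stack:
        stack.pop()
        out.append(indent * len(stack) + '</section>')
    return '\n'.join(out)


sample_html = '''<h2>Heading 2.1</h2>
<p>Para 1</p>
<p>Para 2</p>
<h3>Heading 3.1</h3>
<p>Para 3</p>
<p>Para 4</p>
<h2>Heading 2.2</h2>
<p>Para 5</p>
<h3>Heading 3.2</h3>
<p>Para 6</p>'''

print(wrap_sections(sample_html, indent='  '))
```

Because the nesting depth is simply the length of the stack, indentation comes almost for free here: pass `indent=''` for the flat layout shown in the question. Input where several tags share one line would need a real HTML parser instead.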