I have a very long HTML file and I want to extract the part of the HTML that lies between two given markers.
<div style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt" align="justify">
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold">
<font style="DISPLAY: inline; TEXT-DECORATION: underline">ITEM 1A. RISK FACTORS</font></font></div>
---
---
---
---
<div style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt" align="justify">
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold">
<font style="DISPLAY: inline; TEXT-DECORATION: underline">ITEM 1B. UNRESOLVED STAFF COMMENTS</font></font></div>
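There is a lot of HTML above, between, and below these two snippets. I want to extract the HTML that starts at ITEM 1A. RISK FACTORS and ends at ITEM 1B. UNRESOLVED STAFF COMMENTS.
This is what I have tried so far, but it only prints the HTML that contains ITEM 1A. RISK FACTORS: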
from bs4 import BeautifulSoup as soup

page_soup = soup(page_html, "html.parser")
for item in page_soup.find_all('font'):
    if "ITEM 1A. RISK FACTORS" in item.text:
        print(item)
Answer 0 (score: 2)
You can keep a boolean outside the for loop to track whether items should be printed. Something like this:
page_soup = soup(page_html, "html.parser")
should_print = False  # becomes True once the start heading is seen
for item in page_soup.find_all('font'):
    if "ITEM 1A. RISK FACTORS" in item.text:
        should_print = True
    if "ITEM 1B. UNRESOLVED STAFF COMMENTS" in item.text:
        break  # stop before printing the end heading
    if should_print:
        print(item)  # note: only <font> elements are visited
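
Note that the loop above only visits `font` elements, so anything between the two headings that is not inside a `font` tag is skipped. If the goal is to keep all of the markup between the headings, one alternative is to slice the raw HTML string instead. The following is a minimal sketch of that different technique, not the method from the answer. The function names `last_div_before` and `slice_between` are hypothetical, and the sketch assumes that each heading's text appears verbatim in the file and sits directly inside its own `<div>`, as in the snippets from the question. If the heading text occurs more than once, the first occurrence is used.

```python
import re


def last_div_before(html, pos, floor=0):
    """Index of the last '<div' opening between floor and pos, or pos if none."""
    starts = [m.start() for m in re.finditer(r"<div\b", html[floor:pos], re.IGNORECASE)]
    return floor + starts[-1] if starts else pos


def slice_between(page_html, start_text, end_text):
    """Return raw HTML from the start heading's <div> up to, but not
    including, the end heading's <div>. Returns "" if a marker is missing."""
    start_pos = page_html.find(start_text)
    if start_pos == -1:
        return ""
    end_pos = page_html.find(end_text, start_pos + len(start_text))
    if end_pos == -1:
        return ""
    # Back up to the nearest '<div' so both cuts land on a tag boundary
    # (assumption: each heading sits directly inside its own <div>).
    start_cut = last_div_before(page_html, start_pos)
    end_cut = last_div_before(page_html, end_pos, floor=start_pos)
    return page_html[start_cut:end_cut]


sample_html = (
    "<p>before</p>"
    "<div><font><font>ITEM 1A. RISK FACTORS</font></font></div>"
    "<p>risk one</p><p>risk two</p>"
    "<div><font><font>ITEM 1B. UNRESOLVED STAFF COMMENTS</font></font></div>"
    "<p>after</p>"
)
section = slice_between(
    sample_html, "ITEM 1A. RISK FACTORS", "ITEM 1B. UNRESOLVED STAFF COMMENTS"
)
print(section)
```

The returned string can then be passed to `soup(section, "html.parser")` if further parsing is needed.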