I want to extract the part of the HTML that lies between two given markers

Time: 2019-06-08 20:37:10

Tags: python beautifulsoup

I have a very long HTML file, and I want to extract the part of the HTML that lies between two given markers.

<div style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt" align="justify">
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold">
<font style="DISPLAY: inline; TEXT-DECORATION: underline">ITEM 1A. RISK FACTORS</font></font></div>

    ---
    ---
    ---
    ---
<div style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt" align="justify">
<font style="DISPLAY: inline; FONT-FAMILY: Times New Roman; FONT-SIZE: 12pt; FONT-WEIGHT: bold">
<font style="DISPLAY: inline; TEXT-DECORATION: underline">ITEM 1B. UNRESOLVED STAFF COMMENTS</font></font></div>

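There is a lot of HTML above, between, and below these two snippets. I want to extract the HTML that starts at "ITEM 1A. RISK FACTORS" and ends at "ITEM 1B. UNRESOLVED STAFF COMMENTS".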

This is what I have tried so far, but it only prints the HTML that contains "ITEM 1A. RISK FACTORS":

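# Assumes `soup` is an alias for BeautifulSoup and `page_html` holds the document.
from bs4 import BeautifulSoup as soup

page_soup = soup(page_html, "html.parser")

# Prints only the <font> tags whose own text contains the heading.
for item in page_soup.find_all('font'):
    if "ITEM 1A. RISK FACTORS" in item.text:
        print(item)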

1 answer:

Answer 0 (score: 2)

You can keep a boolean outside the for loop to track whether the current item should be printed. Something like this:

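from bs4 import BeautifulSoup as soup  # assumed alias, as in the question

page_soup = soup(page_html, "html.parser")

should_print = False
for item in page_soup.find_all('font'):
    if "ITEM 1A. RISK FACTORS" in item.text:
        should_print = True   # start printing at the first heading
    if "ITEM 1B. UNRESOLVED STAFF COMMENTS" in item.text:
        break                 # stop before the second heading
    if should_print:
        print(item)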
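
Note that the loop above only visits `font` tags, so any content between the two headings that is not inside a `font` tag is skipped, and nested `font` tags are printed once per nesting level. A possible variation, offered only as a sketch, walks the siblings of the heading `div` instead. It assumes that both headings are wrapped in `div` elements that share the same parent as the content between them, which matches the snippets in the question but may not hold for the whole file. The names `heading_div` and `extract_between` are hypothetical helpers introduced here, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

START = "ITEM 1A. RISK FACTORS"
END = "ITEM 1B. UNRESOLVED STAFF COMMENTS"


def heading_div(soup, title):
    """Return the nearest <div> wrapping the first text node that contains `title`, or None."""
    text_node = soup.find(string=lambda s: s is not None and title in s)
    return text_node.find_parent("div") if text_node else None


def extract_between(page_html, start_title=START, end_title=END):
    """Return the HTML from the start heading up to, but not including, the end heading.

    Assumption: the two heading <div>s are siblings. If they are not, the loop
    never meets the end heading and simply runs through the remaining siblings.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    start = heading_div(soup, start_title)
    end = heading_div(soup, end_title)
    if start is None or end is None:
        return ""  # one of the headings was not found

    parts = [str(start)]
    for sibling in start.next_siblings:
        if sibling is end:  # identity check: stop at the end heading itself
            break
        parts.append(str(sibling))
    return "".join(parts)
```

Calling `extract_between(page_html)` returns a single string of HTML, which can be printed or parsed again with `BeautifulSoup` if further processing is needed.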