find_all captures the tags I want, but find_all_previous doesn't (and it seems like it should)

Time: 2019-06-06 15:47:01

Tags: python beautifulsoup

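I'm using BeautifulSoup to parse an HTML document. find_all_previous() seems to find only the immediately preceding item (or at least it captures only item 2 of the 2 items it should capture). Am I misunderstanding how it is used, or is there a bug in my code?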

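The HTML I'm parsing contains information about six properties, each inside a <tr class="property shaded"> or <tr class="property"> tag. Two are current properties and four are past properties; the two sets are separated by an <h2 id="past-property-deeds"> tag. I only want to collect information about the current properties. It seems that locating the "past properties" header tag and calling find_all_previous() on it should give me what I want (the information about properties 1 and 2), but it only captures the second property and not the first.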

import re
from bs4 import BeautifulSoup

html_doc = """
<tr>
    <td colspan="3" valign="top">
        <h2 id="current-property-deeds">Current Property Deeds (2 Found)</h2>
    </td>
</tr>
<tr><td colspan="3" class="reportstableheader">
<span>
Purchase Date: N/A</span>
</td></tr>
<tr class="property shaded">
    Info for current property 1
</tr>
<tr><td colspan="3">&nbsp;</td></tr>
<tr><td colspan="3" class="reportstableheader">
<span>
Purchase Date: N/A</span>
</td></tr>
<tr class="property ">
    Info for current property 2
</tr>
<tr><td colspan="3">&nbsp;</td></tr>
<tr>
    <td colspan="3" valign="top">
        <h2 id="past-property-deeds">Past Property Deeds (4 Found)</h2>
    </td>
</tr>
<tr><td colspan="3" class="reportstableheader">
<span>
Purchase Date: 01/01/1900</span>
</td></tr>
<tr class="property shaded">
    Info for past property 1
</tr>
<tr><td colspan="3">&nbsp;</td></tr>
<tr><td colspan="3" class="reportstableheader">
<span>
Purchase Date: 01/01/1900&nbsp;&nbsp;-&nbsp;&nbsp; Sold Date: 01/01/1900</span>
</td></tr>
<tr class="property ">
    Info for past property 2
</tr>
<tr><td colspan="3">&nbsp;</td></tr>
<tr><td colspan="3" class="reportstableheader">
<span>
Purchase Date: N/A&nbsp;&nbsp;-&nbsp;&nbsp; Sold Date: 03/30/2007</span>
</td></tr>
<tr class="property shaded">
    Info for past property 3
</tr>
<tr><td colspan="3">&nbsp;</td></tr>
<tr><td colspan="3" class="reportstableheader">
<span>
Purchase Date: 09/22/2000</span>
</td></tr>
<tr class="property ">
    Info for past property 4
</tr>
"""

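soup = BeautifulSoup(html_doc, 'html.parser')
# The header that separates the current properties from the past ones
past_property_header = soup.find("h2", id="past-property-deeds")
# Every property row in the document (all six)
all_property_info = soup.find_all("tr", class_=re.compile("^property"))
# Property rows that appear before the "past properties" header
current_property_only = past_property_header.find_all_previous("tr", class_=re.compile("^property"))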

all_property_info finds all of the tags as expected. However, current_property_only only finds the tag around property 2, whereas I think it should catch both 1 and 2.

1 Answer:

Answer 0 (score: 0)

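OK, I'm an idiot. It was returning the correct tags all along; I just didn't realize they would come back in reverse order, and I didn't spot the property 1 tag in the mess of the original, unedited HTML. Sorry, and thanks!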
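
For anyone who hits the same confusion, here is a minimal sketch of the behavior. It assumes a trimmed-down version of the HTML from the question (the shortened html_doc, the enclosing <table>, and the variable names found and in_document_order are made up for illustration). find_all_previous() walks backwards from the starting tag, so the nearest match comes first; reversing the result puts it back in document order.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the HTML in the question (illustrative only).
html_doc = """
<table>
<tr><td><h2 id="current-property-deeds">Current Property Deeds</h2></td></tr>
<tr class="property shaded"><td>Info for current property 1</td></tr>
<tr class="property "><td>Info for current property 2</td></tr>
<tr><td><h2 id="past-property-deeds">Past Property Deeds</h2></td></tr>
<tr class="property shaded"><td>Info for past property 1</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
past_property_header = soup.find("h2", id="past-property-deeds")

# Matches are collected while walking backwards from the header,
# so the row closest to the header (property 2) is first in the list.
found = past_property_header.find_all_previous("tr", class_=re.compile("^property"))

# Reverse the list to get the rows back in document order.
in_document_order = list(reversed(found))

for row in in_document_order:
    print(row.get_text(strip=True))
```

If only the text is needed, calling get_text(strip=True) on each row, as in the loop above, also makes it much easier to see which rows were matched than printing the raw tags.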