Extracting HTML data with Beautiful Soup is not working

Time: 2018-02-06 14:23:42

Tags: python html beautifulsoup

I want to retrieve all of the row data from this website: https://www.dibbs.bsm.dla.mil/Awards/AwdRecs.aspx?Category=awddt&TypeSrch=cq&Value=02-06-2018

Here is sample HTML for the rows:
  <tr class="BgWhite" style="border-color:Gray;border-width:1px;border-style:Solid;">

<td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" title="Link To Award/Basic Document" target="DIBBSDocuments"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" alt="PDF Document" width="16" height="16" hspace="2" border="0"></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" title="Link To Award/Basic Document" target="DIBBSDocuments">SP450017D0005</a></span>
</td>

<td align="center" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl43_lblCage"><a href="javascript:void(0);" onclick="return openNewWindow(&quot;https://www.dibbs.bsm.dla.mil/Refs/cage.aspx?Cage=0ZE15&quot;, &quot;CAGE&quot;, 475, 300)" title="Click to perform a CAGE Search">0ZE15</a></span>
</td>
<td align="right" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl43_lblTotalContactPrice">                   $2,341.94</span>
</td>

 </tr>

 <tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">


<td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrder" style="display:inline-block;width:175px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" alt="PDF Document" width="16" height="16" hspace="2" border="0"></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments">SP450018F2293</a> <br><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16" height="16" hspace="1" border="0" alt="-spacer-"><span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0005&amp;dlv=SP450018F2293&amp;cnt=108" title="Delivery Order Package View" target="DIBBS">Delivery Order Package View</a></span></span>
</td>
<td align="right" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrderCounter" style="display:inline-block;width:50px;">108</span>
</td>

<td align="right" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl44_lblTotalContactPrice">                   $2,341.94</span>
</td>

I want to extract the award numbers SP450017D0005 and SP450018F2293 from the HTML, so I tried this:

dibbssoup = BeautifulSoup(main_page.content, 'html5lib')

containers1 = dibbssoup.find_all("tr", {"class": "BgWhite"})
containers2 = dibbssoup.find_all("tr", {"class": "BgSilver"})

containers = containers1 + containers2

for container1 in containers:


    for page in range(row)[3:]:
        containerid = "ctl00_cph1_grdAwardSearch_ctl"+str(page)+"_lblAwardBasicNumber"

        awardid = container1.find("td", {"align": "left"}).find("span", {"id":containerid})

        print(page)
        print(containerid)
        print(awardid)
        print(" ")

The page counter increments as expected and containerid is correct, but the output for awardid is "None". What am I doing wrong, and how can I fix it?

1 answer:

Answer 0 (score: 0)

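I don't see any major flaw in your code at the moment. When working with nested HTML tags like these, it is often useful to split the chained find calls apart and print the result of each one. While debugging, you can then see clearly which find call fails. Once the problem is solved, you can recombine the calls and clean up the code.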
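The step-by-step debugging described above could look something like the sketch below. It is only an illustration: SAMPLE is a trimmed-down stand-in for one row of the page, find_award_span is a hypothetical helper name, and the built-in "html.parser" is used instead of html5lib so the snippet needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened copy of one row from the question, not the live page.
SAMPLE = """
<table>
<tr class="BgWhite">
  <td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber"><a href="#">SP450017D0005</a></span>
  </td>
</tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
row = soup.find("tr", {"class": "BgWhite"})


def find_award_span(row, page):
    """Run each find on its own and report the first lookup that fails."""
    cell = row.find("td", {"align": "left"})
    if cell is None:
        print("no <td align='left'> in this row")
        return None

    span_id = "ctl00_cph1_grdAwardSearch_ctl" + str(page) + "_lblAwardBasicNumber"
    span = cell.find("span", {"id": span_id})
    if span is None:
        print("no <span id=%r> in this cell" % span_id)
    return span


hit = find_award_span(row, 43)   # the id built for page 43 exists in this row
miss = find_award_span(row, 44)  # the id built for page 44 does not
```

With the lookups separated like this, it becomes visible that a row only matches the single page value that appears in its own span id. A loop that tries every page value against every row, as in the question, would therefore be expected to print None for most combinations.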

To get rid of the page and containerid variables, you can pass a function as the argument to find, like this:

def basic_number_filter(tag):
    return tag.name == "span" and tag.attrs.get("id", "").endswith("_lblAwardBasicNumber")

containers = soup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})

for container in containers:
    awardid = container.find("td", align="left").find(basic_number_filter)
    print(awardid)

You can find more information about this here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function

When I run this code against the sample HTML you provided, I get:

<span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0005</a></span>
None

The second awardid is None because

<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
    <td align="left" valign="top">
        <span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrder" style="display:inline-block;width:175px;">
            <a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" alt="PDF Document" width="16" height="16" hspace="2" border="0"></a>
            <a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments">SP450018F2293</a>
            <br><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16" height="16" hspace="1" border="0" alt="-spacer-">
            <span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0005&amp;dlv=SP450018F2293&amp;cnt=108" title="Delivery Order Package View" target="DIBBS">Delivery Order Package View</a></span>
        </span>
    </td>
    <td align="right" valign="top">
        <span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrderCounter" style="display:inline-block;width:50px;">108</span>
    </td>

    <td align="right" valign="top">
        <span id="ctl00_cph1_grdAwardSearch_ctl44_lblTotalContactPrice">                   $2,341.94</span>
    </td>
</tr>

does not contain a span with an id like ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber; the id of its span ends in _lblDeliveryOrder instead.
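Building on that observation, one possible way to get both award numbers is to let the filter function accept either id suffix and then read the text of the first link that actually contains text. This is a sketch under assumptions, not a tested solution for the live site: SAMPLE is a shortened version of the two rows from the question, the names SUFFIXES, award_span_filter and extract_award_numbers are made up for illustration, and "html.parser" stands in for html5lib.

```python
from bs4 import BeautifulSoup

# Assumption: shortened copies of the two sample rows, with placeholder hrefs.
SAMPLE = """
<table>
<tr class="BgWhite">
  <td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber"><a href="a.pdf"><img src="i.gif"></a><a href="a.pdf">SP450017D0005</a></span>
  </td>
</tr>
<tr class="BgSilver">
  <td align="left" valign="top">
    <span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrder"><a href="b.pdf"><img src="i.gif"></a><a href="b.pdf">SP450018F2293</a> <br><span style="font-size: 9px;"><a href="c.aspx">Delivery Order Package View</a></span></span>
  </td>
</tr>
</table>
"""

# The two id endings seen in the sample rows.
SUFFIXES = ("_lblAwardBasicNumber", "_lblDeliveryOrder")


def award_span_filter(tag):
    """Match a span whose id ends in one of the known suffixes."""
    return tag.name == "span" and tag.get("id", "").endswith(SUFFIXES)


def extract_award_numbers(html):
    """Return one award number per matching row, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    numbers = []
    for row in soup.find_all("tr", class_=["BgWhite", "BgSilver"]):
        span = row.find(award_span_filter)
        if span is None:
            continue  # row has neither kind of span; skip it
        # The first link only wraps the PDF icon, so take the first link with text.
        for link in span.find_all("a"):
            text = link.get_text(strip=True)
            if text:
                numbers.append(text)
                break
    return numbers


award_numbers = extract_award_numbers(SAMPLE)
print(award_numbers)
```

Skipping rows where no span matches avoids the AttributeError that a chained find would raise on None. Whether these two suffixes cover every row type on the real page is something to check against the full HTML.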