I want to retrieve all of the row data from this site: https://www.dibbs.bsm.dla.mil/Awards/AwdRecs.aspx?Category=awddt&TypeSrch=cq&Value=02-06-2018
Here is sample HTML for the rows:
<tr class="BgWhite" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" title="Link To Award/Basic Document" target="DIBBSDocuments"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" alt="PDF Document" width="16" height="16" hspace="2" border="0"></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" title="Link To Award/Basic Document" target="DIBBSDocuments">SP450017D0005</a></span>
</td>
<td align="center" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl43_lblCage"><a href="javascript:void(0);" onclick="return openNewWindow(&quot;https://www.dibbs.bsm.dla.mil/Refs/cage.aspx?Cage=0ZE15&quot;, &quot;CAGE&quot;, 475, 300)" title="Click to perform a CAGE Search">0ZE15</a></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl43_lblTotalContactPrice"> $2,341.94</span>
</td>
</tr>
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrder" style="display:inline-block;width:175px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" alt="PDF Document" width="16" height="16" hspace="2" border="0"></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments">SP450018F2293</a> <br><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16" height="16" hspace="1" border="0" alt="-spacer-"><span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0005&dlv=SP450018F2293&cnt=108" title="Delivery Order Package View" target="DIBBS">Delivery Order Package View</a></span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrderCounter" style="display:inline-block;width:50px;">108</span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl44_lblTotalContactPrice"> $2,341.94</span>
</td>
I want to extract the award numbers SP450017D0005 and SP450018F2293 from the HTML, so I tried this:
dibbssoup = BeautifulSoup(main_page.content, 'html5lib')
containers1 = dibbssoup.find_all("tr", {"class": "BgWhite"})
containers2 = dibbssoup.find_all("tr", {"class": "BgSilver"})
containers = containers1 + containers2
for container1 in containers:
    for page in range(row)[3:]:
        containerid = "ctl00_cph1_grdAwardSearch_ctl" + str(page) + "_lblAwardBasicNumber"
        awardid = container1.find("td", {"align": "left"}).find("span", {"id": containerid})
        print(page)
        print(containerid)
        print(awardid)
        print(" ")
The page increment works and containerid is correct, but the output for awardid is "None". What am I doing wrong, and how can I fix it?
Answer 0 (score: 0)
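I don't currently see any major flaw in your code. With nested HTML markup like this, it is often useful to split the chained find calls apart and print the result of each one. While debugging, you can then see clearly which find call failed. Once the problem is solved, you can recombine them and clean up the code.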
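As a minimal sketch of that debugging approach, each find call can be stored and checked on its own. The shortened HTML snippet and the names td, span, wanted_id and found_ids are illustrative assumptions, and html.parser is used here only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for one row of the results table.
html = """
<table>
<tr class="BgSilver">
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrder"><a href="#">SP450018F2293</a></span>
</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr", {"class": "BgSilver"})

# Step 1: look up the cell on its own and inspect the result.
td = row.find("td", {"align": "left"})
print("td found:", td is not None)

# Step 2: look up the span only once the cell is known to exist.
wanted_id = "ctl00_cph1_grdAwardSearch_ctl44_lblAwardBasicNumber"
span = td.find("span", {"id": wanted_id}) if td is not None else None
print("span found:", span is not None)

# Step 3: when the span is missing, list the ids that are actually present.
found_ids = [s.get("id") for s in td.find_all("span")] if td is not None else []
print("span ids in this cell:", found_ids)
```

Here the cell is found but the span is not, and the list of ids shows that this row carries a _lblDeliveryOrder span rather than a _lblAwardBasicNumber one.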
To get rid of the page and containerid variables, you can pass a function as the argument to find, like this:
def basic_number_filter(tag):
    # Match any span whose id ends with "_lblAwardBasicNumber"
    return tag.name == "span" and tag.attrs.get("id", "").endswith("_lblAwardBasicNumber")

# soup is the parsed page (dibbssoup in the question)
containers = soup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})
for container in containers:
    awardid = container.find("td", align="left").find(basic_number_filter)
    print(awardid)
You can find more information about this here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function
When I run this code on the sample HTML you provided, I get:
<span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/03JAN17/SP450017D0005.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0005</a></span>
None
The second awardid is None, because
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrder" style="display:inline-block;width:175px;">
<a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" alt="PDF Document" width="16" height="16" hspace="2" border="0"></a>
<a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/06FEB18/SP450017D0005SP450018F2293.PDF" title="Link To Delivery Order Document" target="DIBBSDocuments">SP450018F2293</a>
<br><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16" height="16" hspace="1" border="0" alt="-spacer-">
<span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0005&dlv=SP450018F2293&cnt=108" title="Delivery Order Package View" target="DIBBS">Delivery Order Package View</a></span>
</span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrderCounter" style="display:inline-block;width:50px;">108</span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl44_lblTotalContactPrice"> $2,341.94</span>
</td>
</tr>
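does not contain a span whose id ends in _lblAwardBasicNumber (such as ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber); the span in this row has an id ending in _lblDeliveryOrder instead.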
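One possible way to pick up both kinds of rows is sketched below. It assumes the award number is always the text of the last non-empty link directly inside the labelled span, which holds for the two sample rows but has not been checked against the full page. The helper names award_span_filter and extract_award_numbers and the shortened HTML are hypothetical and not part of the original code:

```python
from bs4 import BeautifulSoup

# Suffixes of the span ids seen in the two sample rows.
SUFFIXES = ("_lblAwardBasicNumber", "_lblDeliveryOrder")

def award_span_filter(tag):
    """Match a span whose id ends with one of the known suffixes."""
    return tag.name == "span" and tag.get("id", "").endswith(SUFFIXES)

def extract_award_numbers(soup):
    numbers = []
    for row in soup.find_all("tr", {"class": ["BgWhite", "BgSilver"]}):
        span = row.find(award_span_filter)
        if span is None:
            continue  # row without an award or delivery-order label
        # Direct child links only; this skips the nested "Package View" link.
        links = [a for a in span.find_all("a", recursive=False)
                 if a.get_text(strip=True)]
        if links:
            numbers.append(links[-1].get_text(strip=True))
    return numbers

# Shortened version of the two sample rows (hrefs replaced by "#").
html = """
<table>
<tr class="BgWhite"><td align="left">
<span id="ctl00_cph1_grdAwardSearch_ctl43_lblAwardBasicNumber"><a href="#"><img alt="PDF Document"></a><a href="#">SP450017D0005</a></span>
</td></tr>
<tr class="BgSilver"><td align="left">
<span id="ctl00_cph1_grdAwardSearch_ctl44_lblDeliveryOrder"><a href="#"><img alt="PDF Document"></a><a href="#">SP450018F2293</a><br><span style="font-size: 9px;"><a href="#">Delivery Order Package View</a></span></span>
</td></tr>
</table>
"""

numbers = extract_award_numbers(BeautifulSoup(html, "html.parser"))
print(numbers)  # ['SP450017D0005', 'SP450018F2293']
```

Matching on the id suffix avoids having to know the ctlNN counter for each row, and skipping rows where the span is missing prevents a None result from stopping the loop.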