I want to extract some values from this HTML, for example:
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="right" style="width:75px;" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0007</a></span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0007&dlv=0243&cnt=" target="DIBBS" title="Delivery Order Package View">Delivery Order Package View</a></span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
</td>
</tr>
This is the part of my code that produces the HTML above:
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import numpy as np
import re
from datetime import datetime, timedelta
containers = pagesoup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})
for batch in containers:
    for item in range(53)[2:]:
        try:
            # batch is the HTML above
            print(batch)
            uid = "ctl00_cph1_grdAwardSearch_ctl"+str(item)+"_lblAwardBasicNumber"
            print("uid id ", uid)
            awardid = batch.find_all("span", text = re.compile("_lblAwardBasicNumber"))
            print("award id is")
            print(awardid)
        except Exception as e:
            print(colorama.Fore.MAGENTA + "award error.."+ str(e) )
            # print(container1)
            continue
except Exception as e:
    raise e
print(batch) produces the HTML above, and I want to get the number SP450017D0007 from this element:
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0007</a></span>
But awardid comes back empty. How can I extract SP450017D0007?
Answer 0 (score: 0)
Solution:
To get the text SP450017D0007, I used pagesoup.find('a', text=True).text.
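As a minimal sketch of why the original lookup matched nothing: the text argument (string in newer Beautiful Soup versions) filters on a tag's text content, while "_lblAwardBasicNumber" appears only in the span's id attribute. Filtering on id instead finds the span regardless of the ctlNN row index. The trimmed html string and the variable names below are illustrative assumptions, not part of the original code.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down copy of the relevant cell from the question (assumption: same structure).
html = '''
<table><tr class="BgSilver">
<td><span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber">
<a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF"><img alt="PDF Document"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF">SP450017D0007</a>
</span></td>
</tr></table>
'''
soup = BeautifulSoup(html, "html.parser")

# string= (older alias: text=) is compared against the tag's text, not its id,
# so no span matches this pattern.
by_text = soup.find_all("span", string=re.compile("_lblAwardBasicNumber"))

# Matching on the id attribute finds the span whatever its row index is.
span = soup.find("span", id=re.compile(r"_lblAwardBasicNumber$"))
award_id = span.get_text(strip=True)

print(by_text)
print(award_id)
```

Matching on the id suffix also makes the hand-built uid string unnecessary, since the row index no longer has to be guessed.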
Note:
Your code above contains the following extra lines, which have no matching try in the snippet and should be removed:
except Exception as e:
    raise e
Code:
import requests
from bs4 import BeautifulSoup
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
import urllib3
import numpy as np
import re
import colorama  # needed by the except handler below
from datetime import datetime, timedelta
data = '''
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="right" style="width:75px;" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0007</a></span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0007&dlv=0243&cnt=" target="DIBBS" title="Delivery Order Package View">Delivery Order Package View</a></span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
</td>
</tr>
'''
pagesoup = BeautifulSoup(data, 'html.parser')
containers = pagesoup.find_all('tr', {'class': ['BgWhite', 'BgSilver']})
for batch in containers:
    for item in range(53)[2:]:
        try:
            print(batch)
            uid = "ctl00_cph1_grdAwardSearch_ctl" + str(item) + "_lblAwardBasicNumber"
            print("uid id ", uid)
            # First <a> that directly contains text (the PDF icon link has none)
            awardid = pagesoup.find('a', text=True).text
            print("award id is")
            print(awardid)
            # The posting date is the text of the span with this id
            dateid = pagesoup.find('span', id='ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate').text
            print("date id is")
            print(dateid)
        except Exception as e:
            print(colorama.Fore.MAGENTA + "award error.." + str(e))
            # print(container1)
            continue
Output:
<tr class="BgSilver" style="border-color:Gray;border-width:1px;border-style:Solid;">
<td align="right" style="width:75px;" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblRowNum" style="display:inline-block;width:50px;">124</span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber" style="display:inline-block;width:150px;"><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document"><img alt="PDF Document" border="0" height="16" hspace="2" src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/IconPdf.gif" width="16"/></a><a href="https://dibbs2.bsm.dla.mil/Downloads/Awards/27FEB17/SP450017D0007.PDF" target="DIBBSDocuments" title="Link To Award/Basic Document">SP450017D0007</a></span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrder" style="display:inline-block;width:175px;"><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/>0243 <br/><img alt="-spacer-" border="0" height="16" hspace="1" src="https://www.dibbs.bsm.dla.mil/app_themes/images/common/space.gif" width="16"/><span style="font-size: 9px;">» <a href="https://www.dibbs.bsm.dla.mil/Awards/AwdRec.aspx?contract=SP450017D0007&dlv=0243&cnt=" target="DIBBS" title="Delivery Order Package View">Delivery Order Package View</a></span></span>
</td>
<td align="right" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblDeliveryOrderCounter" style="display:inline-block;width:50px;"> </span>
</td>
<td align="left" valign="top">
<span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate" style="display:inline-block;width:75px;">04-12-2018</span>
</td>
</tr>
uid id ctl00_cph1_grdAwardSearch_ctl2_lblAwardBasicNumber
award id is
SP450017D0007
date id is
04-12-2018
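One caveat with the code above: pagesoup.find(...) searches the whole document, so on a page with several rows every iteration would return the first row's values. A possible variant, sketched below, scopes each lookup to the current row and matches on the id suffix with CSS selectors. The second row (EXAMPLE0001, 01-01-2018), the "#" hrefs and the extract_rows helper are hypothetical, added only to show the per-row behaviour.

```python
from bs4 import BeautifulSoup

# Two-row sample; the first row mirrors the question, the second is made up.
html = '''
<table>
<tr class="BgSilver">
<td><span id="ctl00_cph1_grdAwardSearch_ctl26_lblAwardBasicNumber"><a href="#"><img alt="PDF Document"/></a><a href="#">SP450017D0007</a></span></td>
<td><span id="ctl00_cph1_grdAwardSearch_ctl26_lblLastModPostingDate">04-12-2018</span></td>
</tr>
<tr class="BgWhite">
<td><span id="ctl00_cph1_grdAwardSearch_ctl27_lblAwardBasicNumber"><a href="#">EXAMPLE0001</a></span></td>
<td><span id="ctl00_cph1_grdAwardSearch_ctl27_lblLastModPostingDate">01-01-2018</span></td>
</tr>
</table>
'''


def extract_rows(soup):
    """Return one dict per result row, reading only from that row (hypothetical helper)."""
    rows = []
    for row in soup.select("tr.BgWhite, tr.BgSilver"):
        # [id$="..."] matches on the end of the id, so the ctlNN index is irrelevant.
        award = row.select_one('span[id$="_lblAwardBasicNumber"]')
        date = row.select_one('span[id$="_lblLastModPostingDate"]')
        if award is None or date is None:
            continue  # skip rows that lack either field
        rows.append({
            "award_id": award.get_text(strip=True),
            "date": date.get_text(strip=True),
        })
    return rows


results = extract_rows(BeautifulSoup(html, "html.parser"))
for r in results:
    print(r["award_id"], r["date"])
```

Because each lookup starts from row rather than from the document root, the inner loop over range(53)[2:] is no longer needed.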