I'm trying to parse a website and extract the so-called PSC code from it. The site structures the PSC code like this:
<span class="results_title_text">PSC (Code): </span>
</td>
<td width="30%">
<span class="results_text">
MED & SURGICAL INSTRUMENTS,EQ & SUP
(
<a alt="Click here to drill down by PSC Code 6515" href="?q=60854+PRODUCT_OR_SERVICE_CODE%3A%226515%22&s=FPDS.GOV&templateName=1.5.1&indexName=awardfull&x=0&y=0" title="Click here to drill down by PSC Code 6515">6515</a>
)
</span>
</td>
</tr>
<tr>
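So far I've written code that finds the span containing "PSC (Code): ", but I'm not sure how to get from there to the next span, which holds the actual PSC code. This is what I have so far: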
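import urllib.request
from bs4 import BeautifulSoup

# url is assumed to hold the address of the page being scraped
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, features='lxml')
#print(soup)
span = soup.findAll('span', {'class': 'results_title_text'})
for s in span:
    if s.text == 'PSC (Code): ':
        print(s)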
This code only prints the "PSC (Code): " span wherever it is found in the HTML. Any ideas on how to proceed?
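Answer 0 (score: 0)
There are other ways to search the soup. Assuming the links that carry the code are similar enough to one another, you can search with a regular expression: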
>>> import re
>>> soup.find(title=re.compile(r'PSC Code')).text
'6515'
>>> soup.find(href=re.compile(r'PRODUCT_OR_SERVICE_CODE')).text
'6515'
>>> soup.find('a',href=re.compile(r'PRODUCT_OR_SERVICE_CODE')).text
'6515'
>>> soup.find('a',title=re.compile(r'PSC Code')).text
'6515'
>>>
If the page contains more than one match, you can use .find_all and iterate over the results.
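A minimal sketch of that loop, assuming the page holds several drill-down links. The sample HTML (including the second code, 7030) and the name `codes` are made up for illustration, and `html.parser` is used only so the snippet runs without lxml installed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample with two drill-down links; the real page may differ.
html = '''
<span class="results_text">MED &amp; SURGICAL INSTRUMENTS,EQ &amp; SUP (
<a href="?q=PRODUCT_OR_SERVICE_CODE%3A%226515%22" title="Click here to drill down by PSC Code 6515">6515</a>)</span>
<span class="results_text">EXAMPLE CATEGORY (
<a href="?q=PRODUCT_OR_SERVICE_CODE%3A%227030%22" title="Click here to drill down by PSC Code 7030">7030</a>)</span>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all returns every <a> whose href matches the pattern
links = soup.find_all('a', href=re.compile(r'PRODUCT_OR_SERVICE_CODE'))

# Collect the link text of each match, with surrounding whitespace stripped
codes = [a.get_text(strip=True) for a in links]
print(codes)
```

Matching on the href rather than the title is a judgment call; either attribute works on the markup shown in the question.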
Answer 1 (score: 0)
Use find_next; you can also look at the find_all_next method.
span = soup.find("span", {"class": "results_title_text"})
if span.text.strip() == "PSC (Code):":
    l_other = span.find_all_next(string=True)
    for l in l_other:
        print(l)
Among the strings printed you'll find:
MED & SURGICAL INSTRUMENTS,EQ & SUP
(
6515
)
Hope that helps.
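For the find_next variant mentioned above, here is a rough sketch that jumps from the label span straight to the following results_text span and reads its link. The trimmed sample HTML and the names `value_span` and `psc_code` are assumptions for illustration, not part of the original page or answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, wrapped in a table so it parses cleanly.
html = '''<table><tr><td>
<span class="results_title_text">PSC (Code): </span>
</td>
<td width="30%">
<span class="results_text">
MED &amp; SURGICAL INSTRUMENTS,EQ &amp; SUP
(
<a href="?q=PRODUCT_OR_SERVICE_CODE%3A%226515%22" title="Click here to drill down by PSC Code 6515">6515</a>
)
</span>
</td></tr></table>'''

soup = BeautifulSoup(html, 'html.parser')

psc_code = None
for label in soup.find_all('span', class_='results_title_text'):
    if label.get_text(strip=True) == 'PSC (Code):':
        # find_next returns the first matching element after the label in document order
        value_span = label.find_next('span', class_='results_text')
        link = value_span.find('a') if value_span else None
        if link is not None:
            psc_code = link.get_text(strip=True)
        break

print(psc_code)
```

Comparing the stripped text avoids depending on the trailing space in "PSC (Code): ".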
Answer 2 (score: 0)
How about this?
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<td><span class="results_title_text">PSC (Code): </span>
</td>
<td width="30%">
<span class="results_text">
MED & SURGICAL INSTRUMENTS,EQ & SUP
(
<a alt="Click here to drill down by PSC Code 6515" href="?q=60854+PRODUCT_OR_SERVICE_CODE%3A%226515%22&s=FPDS.GOV&templateName=1.5.1&indexName=awardfull&x=0&y=0" title="Click here to drill down by PSC Code 6515">6515</a>
)
</span>
</td>
</tr>
<tr>
'''
doc = SimplifiedDoc(html)
span = doc.getElementByClass('results_title_text') # use class
span = doc.getElementByText('PSC (Code):',tag='span') # use text
print (span.text)
nextSpan = span.getParent().getNexts()[0].span # Through parent-child structure
print (nextSpan.html)
nextSpan = doc.getElementByClass('results_text',start='class="results_title_text"') # Through class and location
print (nextSpan.html)
The result will be:
PSC (Code):
MED & SURGICAL INSTRUMENTS,EQ & SUP
(
<a alt="Click here to drill down by PSC Code 6515" href="?q=60854+PRODUCT_OR_SERVICE_CODE%3A%226515%22&s=FPDS.GOV&templateName=1.5.1&indexName=awardfull&x=0&y=0" title="Click here to drill down by PSC Code 6515">6515</a>)