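How to get specific text from a span, given a specific title

Time: 2019-12-04 00:09:53

Tags: python html beautifulsoup tags

I'm trying to parse a website and pull the so-called PSC code out of it. The site lays out the PSC code like this: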

<span class="results_title_text">PSC (Code): </span>
</td>
<td width="30%">
<span class="results_text">
                MED &amp; SURGICAL INSTRUMENTS,EQ &amp; SUP
                (
                                  <a alt="Click here to drill down by PSC Code 6515" href="?q=60854+PRODUCT_OR_SERVICE_CODE%3A%226515%22&amp;s=FPDS.GOV&amp;templateName=1.5.1&amp;indexName=awardfull&amp;x=0&amp;y=0" title="Click here to drill down by PSC Code  6515">6515</a>
                )
                </span>
</td>
</tr>
<tr>

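So far I have written code that finds the span whose text is "PSC (Code): ", but I'm not sure how to get from there to the next span, which holds the actual PSC code. Here is what I have so far: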

import urllib.request
from bs4 import BeautifulSoup

# url is the address of the results page being scraped
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, features='lxml')
#print(soup)
span = soup.findAll('span', {'class': 'results_title_text'})
for s in span:
    if s.text == 'PSC (Code): ':
        print(s)

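This code only prints the "PSC (Code): " span wherever it occurs in the HTML. Any ideas on how to proceed?

3 Answers:

Answer 0: (score: 0)

There are other ways to search. Assuming the links that contain the code are all similar enough, you can search with a regular expression: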

>>> import re
>>> soup.find(title=re.compile(r'PSC Code')).text
'6515'
>>> soup.find(href=re.compile(r'PRODUCT_OR_SERVICE_CODE')).text
'6515'
>>> soup.find('a',href=re.compile(r'PRODUCT_OR_SERVICE_CODE')).text
'6515'
>>> soup.find('a',title=re.compile(r'PSC Code')).text
'6515'
>>> 

If the page contains more than one match, you can use .find_all and iterate over the results.
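As a rough sketch of the .find_all variant: the snippet below collects the text of every link whose title mentions "PSC Code". The sample HTML, the second code 7030, and the helper name extract_psc_codes are made up for illustration, and html.parser is used only so that lxml is not required.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample with two result rows; only the first mirrors the question.
SAMPLE_HTML = """
<table>
<tr><td><span class="results_text">MED &amp; SURGICAL INSTRUMENTS,EQ &amp; SUP
 (<a href="?q=PRODUCT_OR_SERVICE_CODE%3A%226515%22" title="Click here to drill down by PSC Code  6515">6515</a>)</span></td></tr>
<tr><td><span class="results_text">EXAMPLE CATEGORY
 (<a href="?q=PRODUCT_OR_SERVICE_CODE%3A%227030%22" title="Click here to drill down by PSC Code  7030">7030</a>)</span></td></tr>
</table>
"""

def extract_psc_codes(html):
    """Return the text of every <a> whose title attribute mentions 'PSC Code'."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex passed as an attribute filter is matched with search().
    links = soup.find_all("a", title=re.compile(r"PSC Code"))
    return [link.get_text(strip=True) for link in links]

print(extract_psc_codes(SAMPLE_HTML))
```

On a page with no matching links, find_all returns an empty list, so the helper returns [] rather than raising.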

Answer 1: (score: 0)

Use find_next; you can also look at the find_all_next method.

span = soup.find("span", {"class": "results_title_text"})
if span.text.strip() == "PSC (Code):":
    l_other = span.find_all_next(string=True)
    for l in l_other:
        print(l)

The result will be

                MED & SURGICAL INSTRUMENTS,EQ & SUP
                (

6515

                )

Hope that helps.
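If only the code itself is wanted, find_next can jump straight from the title span to the following results_text span. This is a minimal sketch under the assumption that the page always follows the layout shown in the question; the helper name find_psc_code and the trimmed sample HTML are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question, wrapped in a table for validity.
SAMPLE_HTML = """<table><tr>
<td><span class="results_title_text">PSC (Code): </span></td>
<td width="30%">
<span class="results_text">
    MED &amp; SURGICAL INSTRUMENTS,EQ &amp; SUP
    (
    <a href="?q=PRODUCT_OR_SERVICE_CODE%3A%226515%22" title="Click here to drill down by PSC Code  6515">6515</a>
    )
</span>
</td>
</tr></table>"""

def find_psc_code(html, label="PSC (Code):"):
    """Return the link text that follows the labelled title span, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for title_span in soup.find_all("span", class_="results_title_text"):
        # Strip whitespace so the trailing space in the label does not matter.
        if title_span.get_text(strip=True) != label:
            continue
        # First results_text span that appears after the title span.
        value_span = title_span.find_next("span", class_="results_text")
        if value_span is None:
            return None
        link = value_span.find("a")
        return link.get_text(strip=True) if link else None
    return None

print(find_psc_code(SAMPLE_HTML))
```

Looping over every results_title_text span, rather than taking only the first one, matters when the page has several labelled rows and "PSC (Code):" is not the first.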

Answer 2: (score: 0)

How about this?

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''<td><span class="results_title_text">PSC (Code): </span>
</td>
<td width="30%">
<span class="results_text">
                MED &amp; SURGICAL INSTRUMENTS,EQ &amp; SUP
                (
                                  <a alt="Click here to drill down by PSC Code 6515" href="?q=60854+PRODUCT_OR_SERVICE_CODE%3A%226515%22&amp;s=FPDS.GOV&amp;templateName=1.5.1&amp;indexName=awardfull&amp;x=0&amp;y=0" title="Click here to drill down by PSC Code  6515">6515</a>
                )
                </span>
</td>
</tr>
<tr>
'''
doc = SimplifiedDoc(html)
span = doc.getElementByClass('results_title_text') # use class
span = doc.getElementByText('PSC (Code):',tag='span') # use text
print (span.text)

nextSpan = span.getParent().getNexts()[0].span # Through parent-child structure
print (nextSpan.html)

nextSpan = doc.getElementByClass('results_text',start='class="results_title_text"') # Through class and location
print (nextSpan.html)

The result will be

PSC (Code):
MED &amp; SURGICAL INSTRUMENTS,EQ &amp; SUP
                (
                                  <a alt="Click here to drill down by PSC Code 6515" href="?q=60854+PRODUCT_OR_SERVICE_CODE%3A%226515%22&amp;s=FPDS.GOV&amp;templateName=1.5.1&amp;indexName=awardfull&amp;x=0&amp;y=0" title="Click here to drill down by PSC Code  6515">6515</a>)