How do I web scrape this tag?

Time: 2020-01-19 06:18:29

Tags: python html python-3.x beautifulsoup

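This is my HTML. I am trying to get the value after the <br> tag. When I try, I get both values at once. How can I do this with Beautiful Soup? Any help would be appreciated.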

<div class="col search_price discounted responsive_secondrow">
<span style="color: #888888;"><strike>CDN$ 2.29</strike></span>
<br>CDN$ 1.48
</div>

2 answers:

Answer 0 (score: 0)

You basically have it already. Use the attrs dictionary to select the div with the right class, then find the next 'br' tag; its next sibling is your text:

from bs4 import BeautifulSoup as bs

HTML = """
<div class="col search_price discounted responsive_secondrow">
<span style="color: #888888;"><strike>CDN$ 2.29</strike></span>
<br>CDN$ 1.48
</div>
"""
soup = bs(HTML, 'html.parser')
# get all divs whose class attribute is exactly this string
divs = soup.find_all("div", attrs={'class': 'col search_price discounted responsive_secondrow'})
for div in divs:
    # find the <br> inside this div; its next_sibling is the text node we want
    # (find() stays inside the div, while find_next() would keep searching the rest of the page)
    print(div.find('br').next_sibling.strip())
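
If the class order on the real page varies, or some price divs have no <br> at all, a slightly more defensive variant may help. This is only a sketch under those assumptions: the helper name discounted_prices and the choice of the two classes search_price and discounted as the selector are mine, not something from the question.

```python
from bs4 import BeautifulSoup, NavigableString

HTML = """
<div class="col search_price discounted responsive_secondrow">
<span style="color: #888888;"><strike>CDN$ 2.29</strike></span>
<br>CDN$ 1.48
</div>
"""


def discounted_prices(html):
    """Return the text following the first <br> in each matching div.

    Hypothetical helper for illustration; the selector is an assumption
    about which classes identify a discounted price block.
    """
    soup = BeautifulSoup(html, "html.parser")
    prices = []
    # A CSS class selector matches regardless of class order or extra classes.
    for div in soup.select("div.search_price.discounted"):
        br = div.find("br")
        if br is None:
            continue  # no <br> in this div, so nothing to extract
        node = br.next_sibling
        # Only accept a non-empty text node directly after the <br>.
        if isinstance(node, NavigableString) and node.strip():
            prices.append(node.strip())
    return prices


print(discounted_prices(HTML))
```

The function returns a list, so a page with several discounted entries gives one string per entry, and divs without a <br> are skipped instead of raising AttributeError.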

Answer 1 (score: 0)

Another solution, using the simplified_scrapy library. It shows four ways to get the same value:

from simplified_scrapy.simplified_doc import SimplifiedDoc
html='''
<div class="col search_price discounted responsive_secondrow">
<span style="color: #888888;"><strike>CDN$ 2.29</strike></span>
<br>CDN$ 1.48
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.getElementsByClass('col search_price discounted responsive_secondrow')
for div in divs:
  value = div.br.nextText() # first
  print (value)
  value = doc.html[div.br._end:div._end-6] # second
  print (value)
  value = doc.removeHtml(div.getSectionByReg('<br.*>.*')) # third
  print (value)
  value = div.removeElement('span') # fourth
  print (value.text)

Result:

CDN$ 1.48
CDN$ 1.48
CDN$ 1.48
CDN$ 1.48