BeautifulSoup - Finding the link in this HTML

Date: 2019-05-07 22:41:01

Tags: python beautifulsoup

This is the code I use to fetch the HTML:

from bs4 import BeautifulSoup
import urllib.request
from fake_useragent import UserAgent

url = "https://blahblah.com"
ua = UserAgent()
ran_header = ua.random
req = urllib.request.Request(url,data=None,headers={'User-Agent': ran_header})
uClient = urllib.request.urlopen(req)
page_html = uClient.read()
uClient.close()

html_source = BeautifulSoup(page_html, "html.parser")
results = html_source.find_all("a", {"onclick": "googleTag('click-listings-item-image');"})

At this point, results is a list of elements, each holding different information. If I then run print(results[0]), I get:

<a href="https://blahblah.com//link//asdfqwersdf" onclick="googleTag('click-listings-item-image');">
    <div class="results-panel-new col-sm-12">
        <div class="row">
            <div class="col-xs-12 col-sm-3 col-lg-2 text-center thumb-table-cell">
                <span class="eq-table-new text-center"><img class="img-thumbnail" src="//images/120x90/7831a94157234bc6.jpg" /></span>
            </div>
            <div class="col-xs-12 hidden-sm hidden-md col-lg-1 text-center thumb-table-cell">
                <span class="eq-table-new text-center"><span class="hidden-sm hidden-md hidden-lg">Year: </span>2000</span>
            </div>
            <div class="col-xs-12 hidden-sm hidden-md col-lg-2 text-center thumb-table-cell">
                <span class="eq-table-new text-center">Fake City, USA</span>
            </div>
            <div class="col-xs-12 col-sm-3 col-lg-2 text-center thumb-table-cell">
                <span class="eq-table-new text-center"><span class="hidden-sm hidden-md hidden-lg">Price: </span>$900</span>
            </div>
        </div>
        <div class="row">
            <div class="hidden-xs col-sm-12 table_details_new"><span>Descriptive details</span></div>
        </div>
    </div><!-- results-panel-new -->
</a>

I can get the image, year, location, and price with lookups like this one:

ModelYear = results[0].div.find("div",{"class":"col-xs-12 hidden-sm hidden-md col-lg-1 text-center thumb-table-cell"}).span.text
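The lookup above matches the full class string, and its .span.text also includes the hidden "Year: " label. As a rough sketch only, using a trimmed copy of the markup printed above, the same cell could be selected by a single class and the label dropped by keeping only the direct text nodes. The variable names here are illustrative, not from the original code.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the markup printed in the question.
html = """
<a href="https://blahblah.com//link//asdfqwersdf" onclick="googleTag('click-listings-item-image');">
  <div class="row">
    <div class="col-xs-12 hidden-sm hidden-md col-lg-1 text-center thumb-table-cell">
      <span class="eq-table-new text-center"><span class="hidden-sm hidden-md hidden-lg">Year: </span>2000</span>
    </div>
  </div>
</a>
"""
anchor = BeautifulSoup(html, "html.parser").a

# Select by one class (col-lg-1) instead of the whole class string.
year_cell = anchor.select_one("div.col-lg-1 span.eq-table-new")

# get_text() includes the text of the nested label span.
full_text = year_cell.get_text()

# Keep only text nodes that are direct children of the cell, skipping the label span.
year_only = "".join(
    child for child in year_cell.children if isinstance(child, NavigableString)
).strip()

print(full_text)
print(year_only)
```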

How do I get the first href from results[0]?

3 Answers:

Answer 0: (score: 1)

You can use find_all() with href=True.

For example:

results[0].find_all('a', href=True)[0]
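One caveat worth checking: find_all() called on a tag searches only that tag's descendants. With the markup printed in the question, results[0] is itself the a element, so the call above would find nothing inside it and indexing [0] would fail. A small sketch of the difference follows; the wrapping div and its id are made up for illustration.

```python
from bs4 import BeautifulSoup

# The outer <div id="listings"> is a hypothetical container, not from the question.
html = """
<div id="listings">
  <a href="https://blahblah.com//link//asdfqwersdf" onclick="googleTag('click-listings-item-image');">
    <div class="row"><span>Descriptive details</span></div>
  </a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
anchor = soup.find("a", {"onclick": "googleTag('click-listings-item-image');"})

# Searches inside the anchor only; the anchor itself is not included.
nested = anchor.find_all("a", href=True)

# Searches the whole document, so the anchor is found.
from_parent = soup.find_all("a", href=True)
first_href = from_parent[0]["href"]

print(len(nested), first_href)
```

So find_all('a', href=True) is a better fit on the parsed document (or a parent element) than on the a tag itself.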

Answer 1: (score: 1)

Based on the discussion in chat, getting the href link looks simple: results[0]['href']
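As a minimal sketch of that attribute access: a Tag can be indexed like a dictionary, and .get() returns None rather than raising when the attribute is absent. The second a element below has no href and is invented purely to show that case.

```python
from bs4 import BeautifulSoup

# First anchor mirrors the question; the second (no href) is hypothetical.
html = """
<a href="https://blahblah.com//link//asdfqwersdf" onclick="googleTag('click-listings-item-image');">first</a>
<a onclick="googleTag('click-listings-item-image');">no href here</a>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("a", {"onclick": "googleTag('click-listings-item-image');"})

# Dictionary-style access; raises KeyError if the attribute is missing.
first_href = results[0]["href"]

# .get() returns None when the attribute is missing.
missing_href = results[1].get("href")

print(first_href, missing_href)
```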

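Answer 2: (score: 0)

Your selector returns a tag elements, as the printout you posted shows. So yes, you can access the href directly with results[0]['href']. You can also tell this from the page itself, because the whole panel (the card that displays a listing) is a single clickable element. If you want something a little cleaner, you can change the selector for results to #js_thumb_view ~ a. This is also a faster selector.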

results = html_source.select('#js_thumb_view ~ a')

Then collect all the links, for example:

links = [result['href'] for result in results]
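For reference, ~ is the CSS subsequent-sibling combinator: #js_thumb_view ~ a matches a elements that come after the element with id js_thumb_view and share its parent. The real page structure is not shown in the question, so the markup below is an assumed layout used only to illustrate what the selector picks up.

```python
from bs4 import BeautifulSoup

# Assumed layout: only the id js_thumb_view comes from the answer, the rest is invented.
html = """
<div id="container">
  <a href="/before">appears before the marker, so it is not matched</a>
  <div id="js_thumb_view"></div>
  <a href="/link/one">one</a>
  <p>unrelated sibling</p>
  <a href="/link/two">two</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches <a> siblings that follow #js_thumb_view under the same parent.
results = soup.select("#js_thumb_view ~ a")
links = [result["href"] for result in results]

print(links)
```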