Extracting text from a div using BeautifulSoup

Time: 2018-11-08 02:43:04

Tags: python html parsing beautifulsoup

I am using the following code snippet to try to parse part of the HTML from the link below, namely these divs:

<div id="avg-price" class="price big-price">4.02</div>
<div id="best-price" class="price big-price">0.20</div>
<div id="worst-price" class="price big-price">15.98</div>

Here is the code I am trying to use:

import requests, urllib.parse
from bs4 import BeautifulSoup, element
r = requests.get('https://herf.io/bids?search=tatuaje%20tattoo')
soup = BeautifulSoup(r.text, 'html.parser')

avgPrice = soup.find("div", {"id": "avg-price"})
lowPrice = soup.find("div", {"id": "best-price"})
highPrice = soup.find("div", {"id": "worst-price"})

print(avgPrice)
print(lowPrice)
print(highPrice)
print("Average Price: {}".format(avgPrice))
print("Low Price: {}".format(lowPrice))
print("High Price: {}".format(highPrice))

However, it doesn't include the prices between the div tags... The result looks like this:

<div class="price big-price" id="avg-price"></div>
<div class="price big-price" id="best-price"></div>
<div class="price big-price" id="worst-price"></div>
Average Price: <div class="price big-price" id="avg-price"></div>
Low Price: <div class="price big-price" id="best-price"></div>
High Price: <div class="price big-price" id="worst-price"></div>

Any ideas? I'm sure I'm overlooking something small, but I'm at my wit's end right now. Haha.

3 Answers:

Answer 0 (score: 1)

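Yes, you can, but BeautifulSoup alone works only when the data is not computed by JavaScript, and on this page it is. For this site, you can use Fiddler to find the URL that the JavaScript uses to load the data, then fetch the JSON (or whatever it returns) from that URL directly. Here is a simple example, written after I used Fiddler to find where the data comes from. Remember that you need to set verify=False when using the Fiddler certificate.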

import requests 

with requests.Session() as se:
    se.headers = {
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
        "Referer": "https://herf.io/bids?search=tatuaje%20tattoo",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Accept-Encoding":"gzip, deflate, br",
        }
    data = [
        "search=tatuaje+tattoo",
        "types=",
        "sites=",
    ]

    # requests expects cookie names as dict keys, so the name here is connect.sid.
    cookies = {
        "connect.sid": "s%3ANYNh5s6LzCVWY8yE9Gra8lxj9OGHPAK_.vGiBmTXvfF4iDScBF94YOXFDmC80PQxY%2FX9FLQ23hYI"}

    url = "https://herf.io/bids/search/open"

    price = "https://herf.io/bids/search/stats"

    req = se.post(price,data="&".join(data),cookies=cookies,verify=False)
    print(req.text)

Output

{"bottomQuarter":4.4,"topQuarter":3.31,"median":3.8,"mean":4.03,"stddev":1.44,"moe":0.08,"good":2.59,"great":1.14,"差":5.47,"差":6.91,"best":0.2,"worst":15.98,"count":1121}
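Since the response body is JSON, it can be decoded into a dict instead of being printed raw. The following is a minimal sketch, assuming the keys `mean`, `best`, and `worst` correspond to the three prices shown on the page (an assumption based on the sample output, not on any documented API). The `sample` string stands in for `req.text`, and `summarize_stats` is a hypothetical helper name.

```python
import json

# Stand-in for req.text; the values are copied from the sample output above.
sample = '{"median": 3.8, "mean": 4.03, "best": 0.2, "worst": 15.98}'


def summarize_stats(body):
    """Decode the stats JSON and pick out the three prices.

    The key names (mean, best, worst) are assumptions taken from the
    sample output; .get() returns None if a key is absent.
    """
    stats = json.loads(body)
    return {
        "Average Price": stats.get("mean"),
        "Low Price": stats.get("best"),
        "High Price": stats.get("worst"),
    }


prices = summarize_stats(sample)
for label, value in prices.items():
    print("{}: {}".format(label, value))
```

With a live response object, `req.json()` should give the same dict as `json.loads(req.text)`.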

Answer 1 (score: 0)

You can use the text attribute to extract the text:

print("Average Price: {}".format(avgPrice.text))
print("Low Price: {}".format(lowPrice.text))
print("High Price: {}".format(highPrice.text))
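Note that `.text` only returns what is present in the parsed HTML, so it yields an empty string for a div the server sends empty. A small sketch of this, using a static copy of the markup from the question in which one div is deliberately left empty to mimic the raw response; `price_text` is a hypothetical helper, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Static copy of the question's markup; the last div is left empty on
# purpose to mimic what the raw (pre-JavaScript) response contains.
html = """
<div id="avg-price" class="price big-price">4.02</div>
<div id="best-price" class="price big-price">0.20</div>
<div id="worst-price" class="price big-price"></div>
"""

soup = BeautifulSoup(html, "html.parser")


def price_text(soup, div_id):
    """Return the stripped text of the div with this id, or None."""
    tag = soup.find("div", {"id": div_id})
    if tag is None:
        return None  # find() returns None when nothing matches
    return tag.get_text(strip=True)


avg = price_text(soup, "avg-price")
low = price_text(soup, "best-price")
high = price_text(soup, "worst-price")    # empty div gives an empty string
missing = price_text(soup, "no-such-id")  # absent div gives None
print(repr(avg), repr(low), repr(high), repr(missing))
```

Checking for `None` before reading the text avoids an AttributeError when a div is absent from the page.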

Answer 2 (score: 0)

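Try

avgPrice[0].text 

This applies when avgPrice is a list, as returned by find_all; the find call in the question returns a single tag, so avgPrice.text is enough there. Do the same for the other two prices.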
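The difference between indexing a `find_all` result and reading a `find` result directly can be sketched as follows, using a static copy of one div from the question (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = '<div id="avg-price" class="price big-price">4.02</div>'
soup = BeautifulSoup(html, "html.parser")

single = soup.find("div", {"id": "avg-price"})    # a single Tag (or None)
many = soup.find_all("div", {"id": "avg-price"})  # a list-like ResultSet

text_from_find = single.text         # no index needed
text_from_find_all = many[0].text    # index into the list first

# Indexing a Tag looks up an HTML attribute by name, so single[0]
# raises KeyError rather than returning the first match.
try:
    single[0]
    index_error = None
except KeyError as exc:
    index_error = exc

print(text_from_find, text_from_find_all, type(index_error).__name__)
```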