Extracting an HTML div class using BeautifulSoup

Time: 2017-12-25 11:19:03

Tags: python web-scraping beautifulsoup python-requests python-3.6

I want to get '8.0' from the HTML below:

<div class="js-otelpuani" style="float: left;"> ==$0
 "8.0"
 <span class="greyish" style="font-size:13px; font-family:arial;"> /10</span>
 ::after
</div>

I have tried the following code to extract '8.0' from the div with class='js-otelpuani', but it does not seem to work:

import urllib
import requests
from bs4 import BeautifulSoup
import pyodbc

headers = {
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"accept-charset": "cp1254,ISO-8859-9,utf-8;q=0.7,*;q=0.3",
"accept-encoding": "gzip,deflate,sdch",
"accept-language": "tr,tr-TR,en-US,en;q=0.8",
}
r = requests.get('https://www.otelz.com/otel/elvin-deluxehotel#.WkDIBd9l_IU', headers=headers)
if r.status_code != 200:
    print("request denied")
else:
    print("ok")
    soup = BeautifulSoup(r.text) 
    score = soup.find('div',attrs={'class': 'js-otelpuani'})
    print(score)

This is the output I get, but unfortunately it does not contain the '8.0' value I want to extract:

ok
<div class="js-otelpuani" style="float: left;">
<span id="comRatingValue">.0</span>
<span class="greyish" style="font-size: 13px; font-family: arial;">
/
<span itemprop="bestRating">10</span></span>
<span id="comRatingCount" itemprop="ratingCount" style="display: none;">0</span>
<span id="comReviewCount" itemprop="reviewCount" style="display: none;">0</span>
</div>

I would appreciate any help!

3 answers:

Answer 0 (score: 2)

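If you inspect the page's HTML source and search for js-otelpuani, you will notice that it is also used inside a script tag. If you follow that script's logic, you will see that the rating itself is produced by a separate request to the GeneralPartial/Degerlendirmeler/8974 endpoint, where 8974 is the hotel ID.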

Let's reproduce that exact logic in your script: first extract the hotel ID, then make a separate request and extract the rating value:

import requests

from bs4 import BeautifulSoup


headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "accept-charset": "cp1254,ISO-8859-9,utf-8;q=0.7,*;q=0.3",
    "accept-encoding": "gzip,deflate,sdch",
    "accept-language": "tr,tr-TR,en-US,en;q=0.8",
}

with requests.Session() as session:
    # merge our headers into the session defaults; they are then sent with every request
    session.headers.update(headers)

    r = session.get('https://www.otelz.com/otel/elvin-deluxehotel#.WkDIBd9l_IU')
    if r.status_code != 200:
        print("request denied")
    else:
        print("ok")
        soup = BeautifulSoup(r.text, "html.parser")

        # get the hotel id
        hotel_id = soup.find(attrs={"data-hotelid": True})["data-hotelid"]

        # go for the hotel rating
        response = session.get("https://www.otelz.com/GeneralPartial/Degerlendirmeler/{hotel_id}".format(hotel_id=hotel_id))
        soup = BeautifulSoup(response.text, "html.parser")

        rating_value = soup.find(attrs={'data-rating-value': True})['data-rating-value']
        print(rating_value)

Prints:

8.0
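
The two attribute lookups above can be tried offline before hitting the site. The following is only a minimal sketch: the attribute names `data-hotelid` and `data-rating-value` and the endpoint path come from the script above, while the surrounding tags, the sample markup, and the helper name `first_attr` are invented for illustration and are not the site's real HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical fragments: only the attribute names come from the answer above;
# the tags and surrounding markup are made up for this sketch.
hotel_page = '<div id="hotel" data-hotelid="8974"><h1>Some hotel</h1></div>'
rating_partial = '<div class="rating" data-rating-value="8.0"></div>'


def first_attr(html, attr_name):
    """Return the value of attr_name on the first tag that carries it, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(attrs={attr_name: True})
    return tag[attr_name] if tag is not None else None


hotel_id = first_attr(hotel_page, "data-hotelid")
rating_url = "https://www.otelz.com/GeneralPartial/Degerlendirmeler/{}".format(hotel_id)
rating_value = first_attr(rating_partial, "data-rating-value")
missing = first_attr("<p>no rating here</p>", "data-rating-value")

print(hotel_id)
print(rating_url)
print(rating_value)
print(missing)
```

Returning None when the attribute is absent avoids the TypeError that subscripting the result of a failed `find` would raise, which may be useful if the site changes its markup.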

Answer 1 (score: 1)

You should use something like this:

soup.find('div', {'class' :'js-otelpuani'}).text
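
As a quick offline check of what `.text` returns, the sketch below runs it against the markup quoted in the question's output (the HTML that requests actually received). It assumes that markup is representative of the server response; against it, `.text` yields the `.0` placeholder rather than 8.0, so this call is likely only sufficient once the page has been rendered.

```python
from bs4 import BeautifulSoup

# Markup copied from the output shown in the question (what requests receives).
html = """
<div class="js-otelpuani" style="float: left;">
<span id="comRatingValue">.0</span>
<span class="greyish" style="font-size: 13px; font-family: arial;">
/
<span itemprop="bestRating">10</span></span>
<span id="comRatingCount" itemprop="ratingCount" style="display: none;">0</span>
<span id="comReviewCount" itemprop="reviewCount" style="display: none;">0</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "js-otelpuani"})

# Collapse runs of whitespace so the extracted text is easier to read.
text = " ".join(div.text.split())
print(text)
```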

Answer 2 (score: 1)

If you are willing to go with Selenium, the data you are after can be parsed easily, as follows:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.otelz.com/otel/elvin-deluxehotel#.WkDf39KWa1t')
soup = BeautifulSoup(driver.page_source,"lxml")
for item in soup.select('.js-otelpuani'):
    # drop the nested <span> elements so only the rating text remains
    for elem in item("span"):
        elem.extract()
    print(item.text.strip())
driver.quit()

Output:

8.0
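
The span-removal step can be checked without launching a browser. This is only a sketch under the assumption that the rendered DOM looks like the snippet quoted in the question (with the DevTools quoting around the text node removed); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Rendered markup as quoted in the question (browser DevTools view), simplified.
# Assumption: the quotes around 8.0 in DevTools only mark a text node.
rendered = """
<div class="js-otelpuani" style="float: left;">
 8.0
 <span class="greyish" style="font-size:13px; font-family:arial;"> /10</span>
</div>
"""

soup = BeautifulSoup(rendered, "html.parser")
for item in soup.select(".js-otelpuani"):
    # Remove nested spans (the " /10" suffix) so only the score is left.
    for span in item.find_all("span"):
        span.decompose()
    rating = item.get_text(strip=True)
    print(rating)
```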