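I want to get the rating and numVotes from zomato.com, but unfortunately the elements seem to be stuck together. It is hard to explain, so I made a short video to show what I mean.
Full code: https://pastebin.com/JFKNuK2a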
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
for zomato_container in zomato_containers:
    rating = zomato_container.find('div', {'class': 'search_result_rating'})
    # numVotes = zomato_container.find("div", {"class": "rating-votes-div"})
    print("rating: ", rating.get_text().strip())
    # print("numVotes: ", numVotes.text())
Answer 0 (score: 0)
You can use the re module to parse the vote count:
import re
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
for zomato_container in zomato_containers:
    print('name:', zomato_container.select_one('.result-title').get_text(strip=True))
    print('rating:', zomato_container.select_one('.rating-popup').get_text(strip=True))
    # Keep only the digits from the text of the votes element
    votes = ''.join(re.findall(r'\d', zomato_container.select_one('[class^="rating-votes"]').text))
    print('votes:', votes)
    print('*' * 80)
Prints:
name: The Original Ghirardelli Ice Cream and Chocolate...
rating: 4.9
votes: 344
********************************************************************************
name: Tadich Grill
rating: 4.6
votes: 430
********************************************************************************
name: Delfina
rating: 4.8
votes: 718
********************************************************************************
...and so on.
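To show what the digit-joining step does on its own, here is a minimal sketch that runs without a network request. The helper name `extract_votes` and the sample strings are hypothetical stand-ins for the text of the votes element; they are not taken from the actual page.

```python
import re

def extract_votes(text):
    """Return every digit found in `text`, joined into one string."""
    return ''.join(re.findall(r'\d', text))

# Hypothetical strings standing in for the text of the votes element.
print(extract_votes('344 votes'))          # 344
print(extract_votes('\n  1,234 votes\n'))  # 1234

# If the selected text also contains the rating, the digits run together,
# so this approach depends on the selector matching only the votes element.
print(extract_votes('4.9 344 votes'))      # 49344
```

Joining all digits tolerates surrounding whitespace and a thousands separator, but it cannot tell two separate numbers apart.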
OR:
If you don't want to use re, you can use str.split():
votes = zomato_container.select_one('[class^="rating-votes"]').get_text(strip=True).split()[0]
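As a rough illustration of the split-based variant, the sketch below takes the first whitespace-separated token and converts it to an integer. The helper name `first_token_as_int` and the sample strings are hypothetical, and treating a comma as a thousands separator is an assumption about how the site formats large counts.

```python
def first_token_as_int(text):
    """Parse the first whitespace-separated token of `text` as an int."""
    # split() with no arguments also discards leading and trailing whitespace
    token = text.split()[0]
    # Assumption: a comma in the token is a thousands separator
    return int(token.replace(',', ''))

# Hypothetical strings standing in for the text of the votes element.
print(first_token_as_int('344 votes'))         # 344
print(first_token_as_int('  1,234 votes\n'))   # 1234
```

This relies on the number being the first token of the text; if the element's text started with a label instead, `int()` would raise a `ValueError`.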
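Answer 1 (score: 0)
As the clip shows, the selector needs to be more specific so that it targets the appropriate child element rather than the parent. At present, by targeting the parent, you also pick up the text of its other children. To get just the votes element, you can use a CSS attribute selector with the starts-with operator (^=).
The selector [class^=rating-votes-div] matches elements that have a class attribute whose value starts with rating-votes-div.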
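One detail of the starts-with operator is easy to miss: it compares against the whole attribute string, not against each individual class name. The plain-Python sketch below only mimics that comparison for illustration; it is not how BeautifulSoup evaluates selectors, and the class values are hypothetical.

```python
def class_starts_with(class_attr, prefix):
    """Mimic the [class^=prefix] test on the raw class attribute string."""
    # Per the CSS Selectors rules, an empty prefix matches nothing
    return bool(prefix) and class_attr.startswith(prefix)

# Hypothetical class attribute values.
print(class_starts_with('rating-votes-div-16535', 'rating-votes-div'))      # True
print(class_starts_with('rating-votes-div', 'rating-votes-div'))            # True
# The target class is present but is not first in the attribute, so no match
print(class_starts_with('grey-text rating-votes-div', 'rating-votes-div'))  # False
```

If the class you are after might not come first in the attribute, the contains operator (`*=`) or a plain class selector may be a better fit.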
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get("https://www.zomato.com/san-francisco/restaurants?q=restaurants&page=1",headers=headers)
content = response.content
bs = BeautifulSoup(content,"html.parser")
zomato_containers = bs.find_all("div", {"class": "search-snippet-card"})
for zomato_container in zomato_containers:
    name = zomato_container.select_one('.result-title').text.strip()
    rating = zomato_container.select_one('.rating-popup').text.strip()
    numVotes = zomato_container.select_one('[class^=rating-votes-div]').text
    print('name: ', name)
    print('rating: ', rating)
    print('votes: ', numVotes)