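With the code below, I am unable to get the ratings and their corresponding dates.
I can get the rating element, but not by using .text on it. The result is: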
</div>, <div class="star-rating star-rating--medium">
<img alt="5 stars: Excellent" src="//cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg"/>
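This means I need to clean up the result, but I am sure there is a way to get just "5 stars: Excellent". I am just not sure how to do it.
As for the date, my line date = star.find("div", attrs={"class":"tooltip-container-1"}) only gives me None, and I am not sure why.
Please see my code, and the HTML for the rating and the date, below.
My code: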
import time

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
#def get_total_items(url):
#soup = BeautifulSoup(requests.get(url, format(0),headers).text, 'lxml')
stars = []
dates = []
with requests.Session() as s:
    for num in range(1, 2):
        url = "https://www.trustpilot.com/review/www.boozt.com?page={}".format(num)
        r = s.get(url, headers=headers)
        soup = BeautifulSoup(r.content, 'lxml')
        for star in soup.find_all("section", attrs={"class": "review__content"}):
            # This returns the whole <div> tag, not the rating text
            rating = star.find("div", attrs={"class": "star-rating star-rating--medium"})
            # This is the lookup that comes back as None
            date = star.find("div", attrs={"class": "tooltip-container-1"})
            #print(rating)
            stars.append(rating)
            dates.append(date)
        #data = {"Rating": stars, "Dates": dates}
        time.sleep(2)
print(dates)
Trustpilot's HTML for the rating:
<div class="star-rating star-rating--medium">
<img src="//cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg" alt="5 stars: Excellent">
</div>
Trustpilot's HTML for the date:
<div class="v-popover">
<span aria-describedby="popover_o7e1fd7whi" class="trigger" style="display: inline-block;">
<time datetime="2020-01-20T10:09:54.000Z" title="Monday, January 20, 2020, 11:09:54 AM" class="review-date--tooltip-target">Jan 20, 2020</time>
<div class="tooltip-container-1"></div> <!----></span> </div>
Answer 0 (score: 1)
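First, to get the rating value (e.g. "5 stars: Excellent"), just read the alt attribute of the img inside the div with the star-rating star-rating--medium class.
Then, getting the date value is a bit trickier, because the date you are targeting is loaded by JavaScript. You can still get it from the script tag that sits just above that div, like this: star.find('script')
I updated your code snippet a bit; here it is:
Code: [not preserved]
Result: [not preserved]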
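A minimal sketch of the approach this answer describes, run against a static HTML string instead of the live site. This is not the answerer's own snippet. The rating div is copied from the question; the script payload, the review-content-header__dates wrapper, and the publishedDate key are assumptions taken from the other answers in this thread. SAMPLE_HTML and extract_reviews are hypothetical names, and the built-in "html.parser" is used only so the sketch does not depend on lxml.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, trimmed-down review markup. The <script> contents are an
# assumption based on the 'publishedDate' key used elsewhere in this thread.
SAMPLE_HTML = """
<section class="review__content">
  <div class="star-rating star-rating--medium">
    <img src="//cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg" alt="5 stars: Excellent">
  </div>
  <div class="review-content-header__dates">
    <script type="application/json">{"publishedDate": "2020-01-20T10:09:54Z"}</script>
  </div>
</section>
"""


def extract_reviews(html):
    """Return a list of (rating, published_date) tuples found in html."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for section in soup.find_all("section", attrs={"class": "review__content"}):
        rating_div = section.find("div", class_="star-rating")
        img = rating_div.find("img") if rating_div else None
        script = section.find("script")
        # Skip reviews where either piece is missing instead of raising
        if img is None or script is None or script.string is None:
            continue
        # The rating lives in the alt attribute; the div itself has no text
        rating = img["alt"]
        # The date lives in the JSON embedded in the <script> tag
        published = json.loads(script.string)["publishedDate"]
        reviews.append((rating, published))
    return reviews
```

Calling extract_reviews(SAMPLE_HTML) returns [('5 stars: Excellent', '2020-01-20T10:09:54Z')]. The same function could be fed r.content from the question's request loop, provided the live markup still matches these assumptions.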
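Answer 1 (score: 0)
The rating is inside the img tag and the date is inside a script tag. You need to take the script tag's text, load it as JSON, and then read the value for the key you want.
Use the following CSS selectors.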
import json

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
stars = []
dates = []
with requests.Session() as s:
    for num in range(1, 2):
        url = "https://www.trustpilot.com/review/www.boozt.com?page={}".format(num)
        r = s.get(url, headers=headers)
        soup = BeautifulSoup(r.content, 'lxml')
        for star in soup.find_all("section", attrs={"class": "review__content"}):
            # The rating text is the alt attribute of the <img> inside the rating div
            rating = star.select_one(".star-rating.star-rating--medium > img")
            # The date is stored as JSON inside a <script> tag
            date = star.select_one(".review-content-header__dates > script").text
            date1 = json.loads(date)
            stars.append(rating['alt'])
            dates.append(date1['publishedDate'])
data = {"Rating": stars, "Dates": dates}
print(data)
Output:
{'Rating': ['5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '2 stars: Poor', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent', '5 stars: Excellent'], 'Dates': ['2020-01-28T05:37:13Z', '2020-01-28T00:00:48Z', '2020-01-27T23:22:58Z', '2020-01-27T21:20:32Z', '2020-01-27T21:06:42Z', '2020-01-27T19:37:16Z', '2020-01-27T19:27:38Z', '2020-01-27T18:20:48Z', '2020-01-27T17:18:42Z', '2020-01-27T16:15:17Z', '2020-01-27T15:58:49Z', '2020-01-27T15:46:29Z', '2020-01-27T15:39:23Z', '2020-01-27T15:32:43Z', '2020-01-27T15:29:21Z', '2020-01-27T15:27:30Z', '2020-01-27T14:35:29Z', '2020-01-27T13:43:40Z', '2020-01-27T13:37:53Z', '2020-01-27T12:58:58Z']}
Answer 2 (score: 0)
Change the for loop to:
# These two imports belong at the top of the script
import json
from datetime import datetime

for star in soup.find_all("section", attrs={"class": "review__content"}):
    rating = star.select("div.star-rating > img")
    date_tag = star.select("div.review-content-header__dates > script")
    # The <script> tag holds JSON; publishedDate is an ISO-style timestamp
    date = json.loads(date_tag[0].text)
    dt = datetime.strptime(date['publishedDate'], "%Y-%m-%dT%H:%M:%SZ")
    stars.append(rating[0]['alt'])
    dates.append(dt)
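One detail of the strptime call above may be worth a short illustration: the trailing Z in the format string is matched as a literal character, so the resulting datetime is naive (it carries no timezone). The sketch below attaches UTC explicitly, on the assumption that the Z suffix denotes UTC as it does in ISO 8601. parse_published_date is a hypothetical helper name, and the sample strings are copied from the output shown in answer 1.

```python
from datetime import datetime, timezone


def parse_published_date(value):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware UTC datetime."""
    # strptime matches the trailing 'Z' literally, so the result is naive
    naive = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    # Attach UTC explicitly (assumes 'Z' means UTC, as in ISO 8601)
    return naive.replace(tzinfo=timezone.utc)


# Two values copied from the sample output in answer 1
samples = ["2020-01-28T05:37:13Z", "2020-01-27T19:27:38Z"]
parsed = [parse_published_date(s) for s in samples]
```

Timezone-aware values can be compared or converted to local time without ambiguity, which is usually the safer choice if the dates are later sorted or exported.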