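I have working code that scrapes information from Trustpilot. It successfully retrieves the review, headline, timestamp and ranking for every page. I would also like to scrape the reviewer's details and location.
I tried adding consumer-information and user-information variables, but that did not work.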
import requests
from bs4 import BeautifulSoup as bs
import json
import math
import pandas as pd


def getInfo(url):
    # Fetch one page and parse the JSON-LD block that holds the review data.
    res = requests.get(url)
    soup = bs(res.content, 'lxml')
    data = json.loads(soup.select_one('[type="application/ld+json"]').text.strip()[:-1])[0]
    return data


def addItems(data):
    # Build one dict per review found on the page.
    result = []
    for item in data['review']:
        review = {
            'Headline': item['headline'],
            'Ranking': item['reviewRating']['ratingValue'],
            'Review': item['reviewBody'],
            'ReviewDate': item['datePublished']
        }
        result.append(review)
    return result


url = 'https://uk.trustpilot.com/review/instagram.com?page={}'
results = []
data = getInfo(url.format(1))
results.append(addItems(data))

# Work out the number of pages from the total review count.
totalReviews = int(data['aggregateRating']['reviewCount'])
reviewsPerPage = len(data['review'])
totalPages = math.ceil(totalReviews / reviewsPerPage)

if totalPages > 1:
    for page in range(2, totalPages + 1):
        data = getInfo(url.format(page))
        results.append(addItems(data))

# Flatten the per-page lists into a single list of reviews.
final = [item for result in results for item in result]
df = pd.DataFrame(final)
df.head()
I want to get the user and location information. Below is the error I get if I add 'user':
<ipython-input-11-91758e06aa39> in addItems(data)
17 review = {
18 'Headline': item['headline'] ,
---> 19 'User': item['user'] ,
20 'Ranking': item['reviewRating']['ratingValue'],
21 'Review': item['reviewBody'],
KeyError: 'user'
Answer 0 (score: 0)
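The item dictionary does not contain a 'user' key. It has an 'author' key instead:
    'author': {
        '@type': 'Person',
        'name': 'Mike Crocker',
        'url': 'https://uk.trustpilot.com/users/5d5ef7c9e427cd04ec0804db',
        'image': 'https://user-images.trustpilot.com/5d5ef7c9e427cd04ec0804db/73x73.png'
    }
So, to get the user name and location, change the addItems(data) function to read the name from item['author'] and to look up the location on the author's profile page.
For example: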
def addItems(data):
    result = []
    for item in data['review']:
        user_location = None
        url = item['author']['url']
        try:
            # Fetch the reviewer's profile page and read the location from it.
            profile = bs(requests.get(url).content, "lxml")
            user_location = (
                profile.find('div', {'class': 'user-summary-overview'})
                .find('div', {'class': 'user-summary-location'})
                .text.strip()
            )
        except Exception:
            # Leave the location as None if the request or the lookup fails.
            pass
        review = {
            'Headline': item['headline'],
            'Ranking': item['reviewRating']['ratingValue'],
            'Review': item['reviewBody'],
            'ReviewDate': item['datePublished'],
            'User': item['author']['name'],
            'Location': user_location
        }
        result.append(review)
    return result
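
If some items might lack an 'author' entry, indexing with item['author'] would raise the same kind of KeyError as item['user'] did. A minimal sketch of a more defensive lookup follows. The helper name get_author and the trimmed sample_item are hypothetical and only for illustration; the 'author' values are the ones shown above, and the other review fields are omitted.

```python
# Hypothetical sample item: only the 'author' entry shown above is included.
sample_item = {
    'author': {
        '@type': 'Person',
        'name': 'Mike Crocker',
        'url': 'https://uk.trustpilot.com/users/5d5ef7c9e427cd04ec0804db',
        'image': 'https://user-images.trustpilot.com/5d5ef7c9e427cd04ec0804db/73x73.png',
    }
}


def get_author(item):
    """Return (name, profile_url); either value is None if it is missing."""
    # dict.get returns None instead of raising KeyError for an absent key.
    author = item.get('author') or {}
    return author.get('name'), author.get('url')


# Listing the keys is a quick way to see what an item actually provides.
print(sorted(sample_item.keys()))
print(get_author(sample_item))
print(get_author({}))
```

Inside addItems, the returned name could fill the 'User' field, and the profile request could be skipped whenever the URL is None.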