I want to scrape user and location information from Trustpilot reviews

Time: 2019-08-23 06:39:05

Tags: python web-scraping beautifulsoup

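I have working code that scrapes information from Trustpilot. I have successfully retrieved the review text, headline, timestamp, and rating from every page. I also want to scrape each reviewer's details and location.

I have tried adding consumer-info and user-info variables, but that does not work.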

import requests
from bs4 import BeautifulSoup as bs
import json
import math
import pandas as pd

def getInfo(url):
    res=requests.get(url)
    soup = bs(res.content, 'lxml')
    data = json.loads(soup.select_one('[type="application/ld+json"]').text.strip()[:-1])[0]
    return data

def addItems(data):
    result = []
    for item in data['review']:

        review = {    
                  'Headline': item['headline'] ,
                  'Ranking': item['reviewRating']['ratingValue'],
                  'Review': item['reviewBody'],
                  'ReviewDate': item['datePublished']
                }

        result.append(review)
    return result

url = 'https://uk.trustpilot.com/review/instagram.com?page={}'
results = []
data = getInfo(url.format(1))
results.append(addItems(data))  
totalReviews = int(data['aggregateRating']['reviewCount'])
reviewsPerPage = len(data['review'])
totalPages = math.ceil(totalReviews/reviewsPerPage)

if totalPages > 1:
    for page in range(2, totalPages + 1):
        data = getInfo(url.format(page))
        results.append(addItems(data)) 

final = [item for result in results for item in result]
df = pd.DataFrame(final)
df.head()


I want to get the user and location information. Below is the error I get when I add a 'User' field:

<ipython-input-11-91758e06aa39> in addItems(data)
     17         review = {    
     18                   'Headline': item['headline'] ,
---> 19                   'User': item['user'] ,
     20                   'Ranking': item['reviewRating']['ratingValue'],
     21                   'Review': item['reviewBody'],

KeyError: 'user'

1 answer:

Answer 0: (score: 0)

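The item dictionary does not contain a user key. Instead, it has an author entry:

    'author': {
        '@type': 'Person',
        'name': 'Mike Crocker',
        'url': 'https://uk.trustpilot.com/users/5d5ef7c9e427cd04ec0804db',
        'image': 'https://user-images.trustpilot.com/5d5ef7c9e427cd04ec0804db/73x73.png'
    },

The reviewer's name is therefore available as item['author']['name']. The location is not part of this data, so to get it you need to change the addItems(data) function to fetch the profile page at item['author']['url'].

For example, addItems could be rewritten as: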
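To see why item['user'] raises a KeyError while the author fields are reachable, here is a minimal sketch using a hand-built item. The 'headline' value and the helper name get_author_field are assumptions made for illustration; only the author fields come from the data quoted above.

```python
# Hypothetical sample item; the 'author' part mirrors the entry quoted above
item = {
    'headline': 'Example headline',  # placeholder value, not real data
    'author': {
        '@type': 'Person',
        'name': 'Mike Crocker',
        'url': 'https://uk.trustpilot.com/users/5d5ef7c9e427cd04ec0804db',
    },
}


def get_author_field(item, field):
    """Return item['author'][field], or None if either key is missing."""
    return item.get('author', {}).get(field)


print(sorted(item.keys()))             # shows which top-level keys exist
print(get_author_field(item, 'name'))  # reviewer name
print(get_author_field(item, 'user'))  # missing field yields None, no KeyError
```

Using dict.get instead of square brackets may be worthwhile if some reviews lack an author entry, since one missing key would otherwise abort the whole scrape.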

def addItems(data):
    result = []
    for item in data['review']:
        user_location = None
        url = item['author']['url']
        try:
            # Fetch the reviewer's profile page and read the location element
            profile = bs(requests.get(url).content, 'lxml')
            user_location = (profile
                             .find('div', {'class': 'user-summary-overview'})
                             .find('div', {'class': 'user-summary-location'})
                             .text.strip())
        except Exception:
            # Profile unreachable or location element missing: keep None
            pass
        review = {
                  'Headline': item['headline'],
                  'Ranking': item['reviewRating']['ratingValue'],
                  'Review': item['reviewBody'],
                  'ReviewDate': item['datePublished'],
                  'User': item['author']['name'],
                  'Location': user_location
                }
        result.append(review)
    return result
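
Note that this version makes one extra HTTP request per review, so a reviewer who appears more than once is fetched repeatedly. One possible way to reduce that, sketched here under assumptions, is to remember the result for each profile URL. The names make_cached_lookup and fake_fetch are hypothetical, and the stub below stands in for the real profile request so the sketch runs without network access.

```python
def make_cached_lookup(fetch_location):
    """Wrap fetch_location(url) so each URL is fetched at most once."""
    cache = {}

    def lookup(url):
        if url not in cache:
            try:
                cache[url] = fetch_location(url)
            except Exception:
                # Treat any failure as "location unknown"
                cache[url] = None
        return cache[url]

    return lookup


# Stub standing in for the real profile-page request (hypothetical data)
calls = []


def fake_fetch(url):
    calls.append(url)
    return 'London, GB'


lookup = make_cached_lookup(fake_fetch)
profile_url = 'https://uk.trustpilot.com/users/5d5ef7c9e427cd04ec0804db'
lookup(profile_url)
lookup(profile_url)
print(len(calls))  # number of times the stub was actually invoked
```

In addItems, the try/except block could then be replaced by a single call such as lookup(item['author']['url']), with fetch_location being whatever function performs the request and parsing.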