Web scraping: iterating over results

Time: 2019-10-26 15:05:29

Tags: python-3.x web-scraping beautifulsoup

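I am trying to scrape reviews from a hotel platform website. There are two things I cannot figure out:

1. Why can't I extract all the reviews at once? Say there are 14 reviews; I only retrieve about 7. Could the server hosting the site be imposing a limit?

2. When I iterate over the review_list object, the child object I retrieve is the same every time, i.e. I get the same review_item on each pass instead of stepping through the individual li tags with class review_item (see the second snippet).

I am running Python 3.7, and an example URL is: url example. I hope you can shed some light on this.

Thanks!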

Snippet 1:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import json
import re
import sys
import warnings

if not sys.warnoptions:
    warnings.simplefilter("ignore")

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input("Enter Url - ")
html = urllib.request.urlopen(url, context=ctx).read()

soup = BeautifulSoup(html, 'html.parser')

html = soup.prettify("utf-8")

hotel_json_details = {}
hotel_json = {}
for line in soup.find_all('script',attrs={"type" : "application/ld+json"}):
    details = line.text.strip()
    details = json.loads(details)
    hotel_json_details["name"] = details["name"]
    hotel_json_details["aggregateRating"]={}
    hotel_json_details["aggregateRating"]["ratingValue"]=details["aggregateRating"]["ratingValue"]
    hotel_json_details["aggregateRating"]["reviewCount"]=details["aggregateRating"]["reviewCount"]
    hotel_json_details["address"]={}
    hotel_json_details["address"]["Street"]=details["address"]["streetAddress"]
    hotel_json_details["address"]["Locality"]=details["address"]["addressLocality"]
    hotel_json_details["address"]["Region"]=details["address"]["addressRegion"]
    hotel_json_details["address"]["Zip"]=details["address"]["postalCode"]
    hotel_json_details["address"]["Country"]=details["address"]["addressCountry"]

print(hotel_json_details)

div = soup.find_all(['li'],attrs={"class" : "review_item"})
print(div)

Snippet 2:

hotel_reviews= []
for line in soup.find_all('li', class_='review_item'): 
    review={}
    review["review_metadata"]={}
    review["review"]={}

    review["review_metadata"]["review_date"] = soup.find('p', class_='review_item_date').text.strip()
    review["review_metadata"]["review_staydate"] = soup.find('p', class_='review_staydate').text.strip()
    review["review_metadata"]["reviewer_name"] = soup.find('p', class_='reviewer_name').text.strip()
    review["review_metadata"]["reviewer_country"] = soup.find('span', class_='reviewer_country').text.strip()
    review["review_metadata"]["reviewer_score"] = soup.find('span', class_='review-score-badge').text.strip()
    review["review"]["review_pos"] = soup.find('p', class_='review_pos').text.strip()
    review["review"]["review_neg"] = soup.find('p', class_='review_neg').text.strip()
    scoreword = soup.find('span', class_='review_item_header_scoreword')
    if scoreword is not None:
        review["review_metadata"]["review_header"] = scoreword.text.strip()
    else:
        review["review_metadata"]["review_header"] = ""
    hotel_reviews.append(review)
print(hotel_reviews)

1 answer:

Answer 0 (score: 2)

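When iterating over the review items, you need to call line.find() instead of soup.find(). soup.find() always returns the first match in the whole document, which is why every iteration gave you the same review. With line.find() you look up the review fields inside each review container rather than searching the entire HTML tree: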

for line in soup.find_all('li', class_='review_item'): 
    review = {"review_metadata": {}, "review": {}}

    review["review_metadata"]["review_date"] = line.find('p', class_='review_item_date').text.strip()
    #                                          ^ HERE
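
To make the difference concrete, here is a minimal, self-contained sketch. The HTML in SAMPLE_HTML is made up for illustration and only mimics the class names from the question; the real page will differ. The helper text_or_empty and the function parse_reviews are hypothetical names, not part of BeautifulSoup. The helper also guards against fields that are missing from a review (for example, a review with no positive comment), since find() returns None in that case.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the structure described in the question.
SAMPLE_HTML = """
<ul>
  <li class="review_item">
    <p class="review_item_date">1 October 2019</p>
    <p class="reviewer_name">Alice</p>
    <p class="review_pos">Great location</p>
  </li>
  <li class="review_item">
    <p class="review_item_date">5 October 2019</p>
    <p class="reviewer_name">Bob</p>
  </li>
</ul>
"""


def text_or_empty(parent, name, class_name):
    """Return the stripped text of the first matching descendant, or ""."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else ""


def parse_reviews(html):
    """Collect one dict per review, searching inside each <li> only."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for item in soup.find_all("li", class_="review_item"):
        reviews.append({
            "review_date": text_or_empty(item, "p", "review_item_date"),
            "reviewer_name": text_or_empty(item, "p", "reviewer_name"),
            "review_pos": text_or_empty(item, "p", "review_pos"),
        })
    return reviews


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Searching from the root: every pass finds the first reviewer in the document.
buggy_names = [
    soup.find("p", class_="reviewer_name").get_text(strip=True)
    for _ in soup.find_all("li", class_="review_item")
]

# Searching from each item: one distinct result per review.
reviews = parse_reviews(SAMPLE_HTML)

print(buggy_names)
print([r["reviewer_name"] for r in reviews])
```

With this sample input, the root-level search repeats the first reviewer's name for both items, while the per-item search returns each reviewer in turn and leaves review_pos empty for the second review.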