Question

I am scraping a web page using Python and BeautifulSoup.

I need to scrape this page:

http://www.starwoodhotels.com//sheraton/property/reviews/index.html?language=en_US&propertyID=115

On this page I have successfully scraped the hotel address, but I cannot scrape the user reviews section.

Here is my code:

import requests
from bs4 import BeautifulSoup

hotel_link = "http://www.starwoodhotels.com//sheraton/property/reviews/index.html?language=en_US&propertyID=115"

# `header` is assumed to be a dict of HTTP request headers defined earlier.
hotel_page_html = requests.get(hotel_link, headers=header).text
hotel_page_soup = BeautifulSoup(hotel_page_html, "html.parser")

for hotel_address in hotel_page_soup.select("div#propertyAddressContainer ul#propertyAddress"):
    print("Address: " + hotel_address.select("li")[0].text)

# This is the call that does not return the reviews.
print(hotel_page_soup.select("div.BVRRRatingNormalOutOf"))

As you can see, with the CSS selector div#propertyAddressContainer ul#propertyAddress I get the address, but I cannot scrape the User Reviews section.

I checked the Console while the page was loading, but I did not see any AJAX call that loads the user reviews.

So how do I scrape the reviews section?

Answer 1

Why make it so complicated?

Just do this:

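# .decode('ascii') turns the bytes back into str so .replace() also works on Python 3.
soup.find("span", {"itemprop": "aggregateRating"}).text.encode('ascii', 'ignore').decode('ascii').replace('\n', ' ')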

Out[]:
Rated 3.4 out of 5by 625 reviewers.

Isn't that what you need?
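
If the numbers are needed rather than the sentence, the text can be split into its parts. The following is a minimal sketch that assumes the text always follows the format of the output shown above ("Rated X out of Y by N reviewers"); `parse_aggregate_rating` and `RATING_PATTERN` are hypothetical names, and the live page may use different wording.

```python
import re

# Assumed format, taken from the sample output in this answer; the real page may differ.
RATING_PATTERN = re.compile(
    r"Rated\s*([\d.]+)\s*out of\s*(\d+)\s*by\s*(\d+)\s*reviewers"
)


def parse_aggregate_rating(text):
    """Return (score, scale, reviewer_count), or None if the text does not match."""
    # Collapse runs of whitespace (including newlines) before matching.
    normalized = " ".join(text.split())
    match = RATING_PATTERN.search(normalized)
    if match is None:
        return None
    score, scale, count = match.groups()
    return float(score), int(scale), int(count)


print(parse_aggregate_rating("Rated 3.4 out of 5by 625 reviewers."))
# (3.4, 5, 625)
```

Returning None on a mismatch keeps a layout change on the site from turning into a silent wrong value.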

Answer 2

Working code:

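# Keep the Tag itself: .text.encode(...).replace(...) returns a plain string,
# which has no .select() method.
rev = hotel_page_soup.find("span", {"itemprop": "aggregateRating"})

# Print the text of each <span> nested inside the aggregate rating element.
for total_rating_score in rev.select("span"):
    print(total_rating_score.string)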
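
When a selector such as div.BVRRRatingNormalOutOf from the question matches nothing, one quick diagnostic is to check whether that class appears at all in the HTML that requests received. If it does not, the element is probably added by JavaScript after the page loads, and BeautifulSoup alone will not see it. The sketch below is a rough, standard-library-only check; `class_in_static_html` is a hypothetical helper, the sample markup is invented for illustration (only the two ids come from the question), and unquoted class attributes are not handled.

```python
import re


def class_in_static_html(html, class_name):
    """Return True if class_name is a whole token in any quoted class attribute."""
    pattern = r'class\s*=\s*(["\'])(.*?)\1'
    for match in re.finditer(pattern, html, flags=re.IGNORECASE | re.DOTALL):
        # A class attribute holds space-separated tokens; compare whole tokens only.
        if class_name in match.group(2).split():
            return True
    return False


# Hypothetical markup standing in for the fetched page source.
sample = (
    '<div id="propertyAddressContainer">'
    '<ul id="propertyAddress" class="addr main"><li>1 Example St</li></ul>'
    '</div>'
)

print(class_in_static_html(sample, "addr"))                   # True
print(class_in_static_html(sample, "BVRRRatingNormalOutOf"))  # False
```

With the real page, `sample` would be replaced by `hotel_page_html` from the question.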

Unable to scrape content using BeautifulSoup
