Question

I am scraping:

http://www.wotif.com/hotel/View?hotel=W3830&page=1&adults=2&startDay=2014-11-08&region=1&descriptionSearch=true#property-reviews

using the following code:

import requests
from bs4 import BeautifulSoup
hotel_page = requests.get(hotel_url).text  # hotel_url is the URL shown above
hotel_page_soup = BeautifulSoup(hotel_page, "html.parser")

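However, the result does not include the Guest Review section, because that section is loaded by an AJAX call after the page itself has loaded.

Question: how can I scrape the page only after all the AJAX calls have completed?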

Answer 1

You need to request the URL below directly and make sure the X-Requested-With header is set to XMLHttpRequest.

import requests
from bs4 import BeautifulSoup

URL = "http://www.wotif.com/review/fragment?propertyId=W3830&limit=5"

headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"}

r = requests.get(URL, headers=headers)

# The response body is JSON.
# The HTML fragment is stored under the key 'html'.
response = r.json()['html']
soup = BeautifulSoup(response, "html.parser")
reviews = soup.find(class_="review-score review-score-large").text
print(reviews)

Out[]:u'\n\n4.4\nOut of 5\n\n\n'

print(reviews.strip())

Out[]:u'4.4\nOut of 5'
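
The steps in this answer can also be wrapped in a few small helpers. This is only a sketch: the function names (build_fragment_url, ajax_headers, fragment_html, parse_score, fetch_fragment) are made up for illustration, the endpoint and the 'html' key are taken from the answer above and may no longer exist on the live site, and parse_score assumes the score text always looks like "4.4 Out of 5".

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the answer above; it may have changed since.
FRAGMENT_ENDPOINT = "http://www.wotif.com/review/fragment"


def build_fragment_url(property_id, limit=5):
    """Build the review-fragment URL for one property (hypothetical helper)."""
    query = urlencode({"propertyId": property_id, "limit": limit})
    return FRAGMENT_ENDPOINT + "?" + query


def ajax_headers(user_agent="Mozilla/5.0"):
    """Headers that mark the request as an XMLHttpRequest."""
    return {"X-Requested-With": "XMLHttpRequest", "User-Agent": user_agent}


def fragment_html(body_text):
    """Extract the HTML fragment, which the JSON body stores under 'html'."""
    return json.loads(body_text)["html"]


def parse_score(text):
    """Turn text such as '4.4\\nOut of 5' into (score, maximum).

    Assumes the first token is the score and the last token is the maximum.
    """
    parts = text.split()
    return float(parts[0]), float(parts[-1])


def fetch_fragment(property_id, limit=5):
    """Download the fragment; needs the requests package and network access."""
    import requests  # imported here so the pure helpers work without it

    r = requests.get(build_fragment_url(property_id, limit),
                     headers=ajax_headers(), timeout=10)
    r.raise_for_status()
    return fragment_html(r.text)
```

With these in place, fetch_fragment("W3830") would return the HTML that can be handed to BeautifulSoup as shown above, and parse_score(reviews) would convert the extracted text into numbers.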

Answer 2

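It is much simpler than that. If you request the URL http://www.wotif.com/review/fragment.json?propertyId=W3830&limit=100&bestThing=True, you get all the reviews in JSON format.

The URL http://www.wotif.com/review/fragment?propertyId=W3830&limit=100& gives you the reviews as HTML embedded in JSON. You will have to look at both yourself and decide which best fits your needs.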
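
Since the structure of the JSON response is not described here, a cautious first step is to look at its top-level shape before writing any extraction code. The sketch below assumes nothing about the schema; build_json_url and describe are hypothetical helper names, and the endpoint and query parameters are simply those quoted in this answer.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from this answer; it may have changed since.
JSON_ENDPOINT = "http://www.wotif.com/review/fragment.json"


def build_json_url(property_id, limit=100, best_thing=True):
    """Build the JSON review URL quoted in the answer (hypothetical helper)."""
    query = urlencode({"propertyId": property_id,
                       "limit": limit,
                       "bestThing": best_thing})
    return JSON_ENDPOINT + "?" + query


def describe(body_text):
    """Summarise the top-level shape of a JSON body without assuming a schema."""
    data = json.loads(body_text)
    if isinstance(data, dict):
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return "array of %d items" % len(data)
    return "scalar of type " + type(data).__name__
```

Passing the text of the real response to describe() shows which keys are available, which makes it easier to decide between the JSON endpoint and the HTML fragment.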

Scraping a page after its AJAX requests have completed
