Scraping a website for data hidden under "Read more"

Time: 2019-06-20 08:53:37

Tags: python web-scraping beautifulsoup

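I am trying to scrape reviews from Tripadvisor.com, and I want to get the data that sits under the site's "Read more" button. Is there any way to scrape it without Selenium?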

This is the code I have used so far:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html#REVIEWS')
rsp_soup = BeautifulSoup(resp.text, 'html.parser')
rsp_soup.find_all(attrs={"class": "hotels-review-list-parts-ExpandableReview__reviewText--3oMkH"})

But it does not pick up the content under "Read more".

2 answers:

Answer 0: (score: 1)

In general, no. It all depends on what happens when you click "Read more", that is, on where the actual data lives.

There are usually two possibilities (not mutually exclusive):

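  • The data is in the same page, hidden, and "Read more" is, for example, the label of a hidden checkbox; checking it hides the "Read more" span and shows the rest of the text. This keeps the displayed page smaller and more readable, but everything is loaded in the same call. In this case you only need to find the appropriate selector (e.g. #someotherselector+input[type=checkbox] ~ div.moreText or something similar).
  • The data is not in the page; it is loaded via AJAX after some time and kept hidden, or loaded only once "More" is clicked. This keeps the page small and fast to load even though it contains many items, which are then loaded slowly, either in the background or on demand. In this case you need to inspect the actual AJAX call (usually made with an ID or data value held in the 'Load More ...' element: <span class="loadMore" data-text-id="x19834">Read more...</span>) and issue the same call with the appropriate headers: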

    # 'element' is the BeautifulSoup tag for the "Read more" span; the URL is a placeholder
    resp2 = requests.get('https://www.tripadvisor.com.ph/whatever/api/is/used?id=' + element['data-text-id'])
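As a rough sketch of this second case, the snippet below reads the ID from the "Read more" element and builds the follow-up request without sending it. It uses only the standard library so it runs without a network connection. The endpoint `API_URL`, the `id` parameter name, and the header values are assumptions made up for illustration; the real ones have to be read from the browser's network tab.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request

# Sample markup modelled on the span shown above; a real page will differ.
SAMPLE_HTML = (
    '<p>Great stay '
    '<span class="loadMore" data-text-id="x19834">Read more...</span></p>'
)

# Hypothetical endpoint, not a real TripAdvisor URL.
API_URL = "https://example.com/api/review-text"


class LoadMoreIds(HTMLParser):
    """Collect the data-text-id of every element with class "loadMore"."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "loadMore" in classes and attrs.get("data-text-id"):
            self.ids.append(attrs["data-text-id"])


def build_request(text_id):
    """Build (but do not send) a GET request that mimics the page's AJAX call."""
    url = API_URL + "?" + urlencode({"id": text_id})
    headers = {
        "User-Agent": "Mozilla/5.0",            # assumed header values
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(url, headers=headers)


parser = LoadMoreIds()
parser.feed(SAMPLE_HTML)
requests_to_send = [build_request(text_id) for text_id in parser.ids]
print([req.full_url for req in requests_to_send])
```

Each prepared request could then be sent with `urllib.request.urlopen`, or the same URL and headers passed to `requests.get`.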

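Without knowing how the data is fetched and where the relevant elements are (e.g. the name and content of the attribute that carries the ID), it is impossible to give an answer that works every time.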
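For the first possibility (the text is already in the page, just hidden), a minimal sketch could look like the following. The markup and the class name `moreText` are invented to match the example selector above, and the parser uses only the standard library. With BeautifulSoup, passing a CSS selector such as `input[type=checkbox] ~ div.moreText` to `soup.select` would serve the same purpose.

```python
from html.parser import HTMLParser

# Hypothetical markup for the "hidden checkbox" pattern; real pages differ.
SAMPLE_HTML = """
<div class="review">
  <span class="teaser">We loved the resort</span>
  <label for="more1">Read more</label>
  <input type="checkbox" id="more1">
  <div class="moreText">and the staff were <b>very</b> helpful.</div>
</div>
"""


class HiddenTextCollector(HTMLParser):
    """Collect the text of every <div class="moreText">, nested tags included."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a moreText div
        self.chunks = []    # text pieces of the div currently being read
        self.texts = []     # finished texts, one per moreText div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1             # nested div inside the hidden block
        elif "moreText" in classes:
            self.depth = 1              # entering a hidden block

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:         # left the hidden block
                self.texts.append("".join(self.chunks))
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


collector = HiddenTextCollector()
collector.feed(SAMPLE_HTML)
print(collector.texts)
```

The point is that no click is needed here: the hidden text arrives with the first response, so a plain HTTP fetch plus the right selector is enough.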

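You may also be interested in doing this the right way. The data you are scraping is copyrighted, and TripAdvisor may change the site enough that you will have trouble maintaining the scraper.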

Answer 1: (score: 0)

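The reviews are only partially present in the HTML until you click "Read more". That button does not actually make an Ajax call; it updates the page from data contained in window.__WEB_CONTEXT__. You can access this data by looking at the <script> tag in which it appears: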

<script>
     window.__WEB_CONTEXT__={pageManifest:{"assets":["/components/dist/@ta/platform.polyfill.084d8cdf5f.js","/components/dist/runtime.56c5df2842.js", ....  }
</script>

Once you have it, you can extract the data and process it as JSON. Here is the full code:

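import json
import re

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html#REVIEWS')

# Find the <script> tag whose content assigns window.__WEB_CONTEXT__
soup = BeautifulSoup(resp.content, 'html.parser')
data = soup.find('script', string=re.compile(r'window\.__WEB_CONTEXT__')).string

# Some text processing to make the tag content valid JSON:
# drop the assignment, quote the top-level key, and strip the trailing ';'
pageManifest = json.loads(
    data.replace('window.__WEB_CONTEXT__=', '')
        .replace('{pageManifest:', '{"pageManifest":')
        .strip().rstrip(';')
)

# Keep the review list from the cache entry that contains one
reviews = []
for entry in pageManifest['pageManifest']['apolloCache']:
    try:
        reviews = entry['result']['locations'][0]['reviewList']['reviews']
    except (KeyError, IndexError, TypeError):
        pass

print([review['text'] for review in reviews])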

Output

['Do arrange for airport transfers! From the airport, you will be taking a van for around 20 minutes, then you\'ll be transferred to a banca/boat for a 25 minute ride to the resort. Upon arrival, you\'ll be greeted by a band that plays their "welcome, welcome" song and in our case, we were met by Maria (awesome gal!) who introduced the group to the resort facilities and checks you in at the bar.I booked a deluxe room, which is actually a duplex with 2 adjoining rooms, ideal for families, which accommodates 4 to a room.Rooms are clean and bed is comfortable.Potable water is provided upon check in , but is chargeable thereafter.Don\'t worry, ...FULL REVIEW...',
 "Stayed with my wife and 2 children, 10y and 13y. ...FULL REVIEW...",
 'Beginning at now been in Coron for a couple of   ...FULL REVIEW...',
 'This was the most beautiful and relaxing place   ...FULL REVIEW...',
 'We spent 2 nights at El rio. It was incredible,  ...FULL REVIEW... ']
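The text-processing step can be tried offline. The sketch below runs on an invented, heavily shortened stand-in for the page: the key names follow the code in this answer, while the review texts and the extra cache entry are made up. It pulls the object out with a regular expression instead of BeautifulSoup, which is a substitution for illustration and not the answer's method, and it assumes the assignment is the only statement in its <script> tag.

```python
import json
import re

# Invented sample; only the key structure mirrors the answer's code.
SAMPLE_HTML = """<html><body>
<script>window.__WEB_CONTEXT__={pageManifest:{"apolloCache":[{"result":{"other":1}},{"result":{"locations":[{"reviewList":{"reviews":[{"text":"First full review"},{"text":"Second full review"}]}}]}}]}};</script>
</body></html>"""


def extract_web_context(html):
    """Return the object assigned to window.__WEB_CONTEXT__ as a dict."""
    match = re.search(
        r"window\.__WEB_CONTEXT__\s*=\s*(\{.*?\})\s*;?\s*</script>",
        html,
        re.DOTALL,
    )
    if match is None:
        raise ValueError("window.__WEB_CONTEXT__ not found")
    # The top-level key is an unquoted JavaScript identifier; quote it
    # so that the text becomes valid JSON.
    raw = match.group(1).replace("{pageManifest:", '{"pageManifest":', 1)
    return json.loads(raw)


def find_review_texts(context):
    """Return the review texts from the last cache entry that has a review list."""
    reviews = []
    for entry in context["pageManifest"]["apolloCache"]:
        try:
            reviews = entry["result"]["locations"][0]["reviewList"]["reviews"]
        except (KeyError, IndexError, TypeError):
            continue  # this cache entry holds something other than reviews
    return [review["text"] for review in reviews]


context = extract_web_context(SAMPLE_HTML)
print(find_review_texts(context))
```

Catching only `KeyError`, `IndexError`, and `TypeError` keeps the "skip entries without reviews" behaviour while still surfacing unrelated errors.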