Scraping a JavaScript-loaded page with Python

Time: 2016-11-24 22:01:57

Tags: python web-scraping

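I am trying to use Python to scrape the comments on a BBC article: http://www.bbc.co.uk/news/education-37750489/comments?comments_page=1&initial_page_size=10&filter=none&sortBy=Created&sortOrder=Descending#

The comments module is rendered with JavaScript and has a button for the next page. However, I cannot find a working AJAX URL. There is a link in the network console, but it does not work: https://ssl.live.bbc.co.uk/modules/comments/?siteId=newscommentsmodule&parentUri=http%3A%2F%2Fwww.bbc.co.uk%2Fnews%2Feducation-37750489%2Fcomments&forumId=__CPS__37750489

I want to scrape multiple pages, but when I change the 'page=x' value in the first page's URL, it just takes me back to the first page.

I have considered using Selenium / dryscrape, but I am not sure how to visit each page so that I can run the scraper on it.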

1 Answer:

Answer 0: (score: 0)

You can try right-clicking the XHR request in the Network tab and choosing the Copy as cURL command.

This is what I got:

curl "https://ssl.bbc.co.uk/modules/comments/ajax/comments/?siteId=newscommentsmodule^&forumId=__CPS__37750489^&filter=none^&sortOrder=Descending^&sortBy=Created^&mock=0^&mockUser=^&parentUri=http^%^3A^%^2F^%^2Fwww.bbc.com^%^2Fnews^%^2Feducation-37750489^%^2Fcomments^%^3Fcomments_page^%^3D1^%^26initial_page_size^%^3D10^%^26filter^%^3Dnone^%^26sortBy^%^3DCreated^%^26sortOrder^%^3DDescending^&loc=en-GB^&preset=responsive^&initial_page_size=10^&transTags=0^&comments_page=4" -H "Origin: http://www.bbc.com" -H "Accept-Encoding: gzip, deflate, sdch, br" -H "Accept-Language: en-US,en;q=0.8" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.44 Safari/537.36" -H "Accept: application/json, text/javascript, */*; q=0.01" -H "Referer: http://www.bbc.com/news/education-37750489/comments?comments_page=1^&initial_page_size=10^&filter=none^&sortBy=Created^&sortOrder=Descending" -H "Cookie: BBC-UID=f5b82387c6a5f59cfb1e4e702165956d3c4a59fc40f0a0cc2289331f59e23c7f0Mozilla^%^2f5^%^2e0^%^20^%^28Windows^%^20NT^%^2010^%^2e0^%^3b^%^20Win64^%^3b^%^20x64^%^29^%^20AppleWebKit^%^2f537^%^2e36^%^20^%^28KHTML^%^2c^%^20like^%^20Gecko^%^29^%^20Chrome^%^2f55^%^2e0^%^2e2883^%^2e44^%^20Safari^%^2f537^%^2e36; BGUID=e50803c74685961b76a3bae761e263da9bbf269019d8d4abbed18707f72c1098; s1=208.5.385837657400859000FDD0E52985" -H "Connection: keep-alive" --compressed
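The same request can be sketched in Python so that only the comments_page value changes from call to call. This is a minimal sketch using the standard library: the query parameters and headers are copied from the cURL command above, while the helper names build_comments_url and fetch_comments_page are made up for illustration. It assumes the endpoint answers without the browser cookies and returns the body as raw text, since the response format is not shown here.

```python
import urllib.parse
import urllib.request

# Endpoint and article URL taken from the copied cURL command.
BASE_URL = "https://ssl.bbc.co.uk/modules/comments/ajax/comments/"
ARTICLE_COMMENTS_URL = "http://www.bbc.com/news/education-37750489/comments"


def build_comments_url(page, page_size=10):
    """Build the XHR URL for one comments page (hypothetical helper)."""
    # parentUri is the article's own comments URL, query string included.
    parent_uri = ARTICLE_COMMENTS_URL + "?" + urllib.parse.urlencode([
        ("comments_page", 1),
        ("initial_page_size", page_size),
        ("filter", "none"),
        ("sortBy", "Created"),
        ("sortOrder", "Descending"),
    ])
    # Same parameters, in the same order, as the cURL command.
    params = [
        ("siteId", "newscommentsmodule"),
        ("forumId", "__CPS__37750489"),
        ("filter", "none"),
        ("sortOrder", "Descending"),
        ("sortBy", "Created"),
        ("mock", 0),
        ("mockUser", ""),
        ("parentUri", parent_uri),
        ("loc", "en-GB"),
        ("preset", "responsive"),
        ("initial_page_size", page_size),
        ("transTags", 0),
        ("comments_page", page),
    ]
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def fetch_comments_page(page, timeout=30):
    """Download one comments page and return the body as text.

    Assumption: the cookies from the cURL command are not required.
    """
    request = urllib.request.Request(
        build_comments_url(page),
        headers={
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Referer": ARTICLE_COMMENTS_URL,
            "User-Agent": "Mozilla/5.0",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Looping over range(1, n + 1) and calling fetch_comments_page(page) for each value would then collect the pages one by one. If the server rejects the request, adding the Cookie header from the cURL command is the first thing to try.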

Or you can use Selenium to click the pagination buttons directly:

driver.find_element_by_css_selector('li.comments-pagination-page.comments-pagination-page-{} a'.format(pageNumber)).click()

Here, li.comments-pagination-page.comments-pagination-page-3 is the li tag of the third pagination button on the page.
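To visit every page in turn, the click can be wrapped in a loop. The following is a rough sketch rather than a tested scraper: pagination_selector and collect_comment_pages are hypothetical helper names, the fixed pause is a crude stand-in for a proper wait on the new comments, and it assumes the button for the next page number is present in the DOM whenever it is needed. It uses find_element-style locators via By.CSS_SELECTOR, because recent Selenium 4 releases no longer provide find_element_by_css_selector.

```python
import time


def pagination_selector(page_number):
    """CSS selector for the link inside the Nth pagination button."""
    return "li.comments-pagination-page.comments-pagination-page-{} a".format(page_number)


def collect_comment_pages(driver, url, last_page, pause_seconds=2.0):
    """Return the page source after loading each comments page (hypothetical helper)."""
    # Imported lazily so the selector helper works without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    time.sleep(pause_seconds)  # assumption: comments have rendered after this pause
    pages = [driver.page_source]

    for page_number in range(2, last_page + 1):
        # Wait until the button for the next page can be clicked, then click it.
        link = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR, pagination_selector(page_number))
            )
        )
        link.click()
        time.sleep(pause_seconds)  # crude wait for the new comments to load
        pages.append(driver.page_source)

    return pages
```

With a driver created by, say, selenium.webdriver.Chrome(), calling collect_comment_pages(driver, url, last_page=5) would return a list of five HTML snapshots that can be parsed afterwards.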