How do I scrape information from a website that has a "Show More" button?

Asked: 2019-01-21 01:06:33

Tags: python web-scraping web-crawler

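I'm trying to scrape a website that has a "Show More" button, and I can't get the information that appears after clicking it.

Right now I'm trying to scrape the links to all the articles on this page: https://www.nytimes.com/section/world

I've managed to click the "Show More" button using Selenium, but I still can't get the additional links. Here is what I have so far: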

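from selenium import webdriver

driver = webdriver.Chrome(executable_path="/Users/cherlin/Documents/北大/大一/文计/期末大作业/程序/chromedriver")
driver.get("https://www.nytimes.com/section/world")

# Click the "Show More" button (click() returns None, so there is nothing to keep)
driver.find_element_by_xpath('//*[@id="latest-panel"]/div[1]/div/div/button').click()
# Collect the article anchors currently in the DOM
links = driver.find_elements_by_css_selector('a.story-link')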

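The links come back as a list of 40 WebElement objects. I'm still working out how to get the actual URLs, but first I need to figure out how to get the hidden links.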

1 Answer:

Answer 0: (score: 0)

You can use the requests library to fetch the JSON data directly:

import requests

# Request the first three pages of the section's JSON feed
for page in range(3):
    params = {"q": "", "sort": "newest", "page": page, "dom": "www.nytimes.com", "dedupe_hl": "y"}
    r = requests.get("https://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/section/world", params=params)
    r.raise_for_status()  # stop early on an HTTP error
    json_data = r.json()

    # Print each headline (truncated to 50 characters) next to its URL
    for item in json_data['members']['items']:
        print("{:50}  {}".format(item['headline'][:50], item['url']))

This gives you output starting with:

Lunar Eclipse and Supermoon: Photos From Around th  https://www.nytimes.com/2019/01/21/science/lunar-eclipse-supermoon.html
By the Numbers, China’s Economy Is Worse Than It L  https://www.nytimes.com/2019/01/20/business/china-economy-gdp-fourth-quarter.html
Henry Sy, the Philippines’ Richest Man and a Shopp  https://www.nytimes.com/2019/01/20/world/asia/henry-sy-dead.html
Carlos Ghosn Offers Higher Bail and Security Guard  https://www.nytimes.com/2019/01/20/business/carlos-ghosn-bail-japan.html
American Airstrike in Somalia Kills 52 Shabab Extr  https://www.nytimes.com/2019/01/20/world/africa/airstrike-shabab-somalia.html
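
If you want the links in a list rather than printed, the same loop can feed a small helper. The following is only a sketch: `collect_links` and `sample_pages` are hypothetical names, and the sample payloads are made up to mirror the `members` / `items` / `url` shape used in the code above. The real response may contain other fields or omit some of these.

```python
def collect_links(pages):
    """Gather (headline, url) pairs from a sequence of decoded JSON pages.

    Each page is assumed to look like the responses in the answer:
    page['members']['items'] is a list of dicts with 'headline' and 'url'.
    """
    seen = set()
    links = []
    for page in pages:
        # .get() guards against a page without items (assumed to be possible)
        for item in page.get("members", {}).get("items", []):
            url = item.get("url")
            if url and url not in seen:  # skip repeats across pages
                seen.add(url)
                links.append((item.get("headline", ""), url))
    return links


# Hypothetical sample payloads, only to illustrate the expected shape
sample_pages = [
    {"members": {"items": [
        {"headline": "First story", "url": "https://example.com/a"},
        {"headline": "Second story", "url": "https://example.com/b"},
    ]}},
    {"members": {"items": [
        {"headline": "Second story", "url": "https://example.com/b"},
        {"headline": "Third story", "url": "https://example.com/c"},
    ]}},
]

for headline, url in collect_links(sample_pages):
    print(headline, url)
```

In practice you would append each `r.json()` result to a list inside the page loop and pass that list to the helper.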

This approach is much faster than using Selenium.
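
If you would rather stay with Selenium, the "list of 40 WebElement objects" from the question can be turned into URLs by reading each element's `href` attribute with `get_attribute`. A minimal sketch follows. `extract_hrefs` is a hypothetical helper, and `FakeLink` is a stand-in class used only so the example runs without a browser.

```python
def extract_hrefs(elements):
    """Return the non-empty href of each element, in order, without repeats."""
    seen = set()
    hrefs = []
    for element in elements:
        # Selenium's WebElement exposes get_attribute(name)
        href = element.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


class FakeLink:
    """Hypothetical stand-in for a Selenium WebElement, for demonstration only."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


demo = [
    FakeLink("https://example.com/a"),
    FakeLink(None),                     # an anchor without an href
    FakeLink("https://example.com/a"),  # a repeated link
    FakeLink("https://example.com/b"),
]
print(extract_hrefs(demo))
```

With a real driver you would call `extract_hrefs(driver.find_elements_by_css_selector('a.story-link'))`. The extra items may not be in the DOM immediately after the click, so you may need to wait before collecting the elements.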