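I am trying to scrape a website that has a "Show More" button, and I cannot get the content that appears after the button is clicked.
Right now I am trying to collect the links to all the articles on this page: "https://www.nytimes.com/section/world"
I have managed to click the "Show More" button with Selenium, but I still cannot get the additional links. Here is what I have so far: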
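from selenium import webdriver

driver = webdriver.Chrome(executable_path="/Users/cherlin/Documents/北大/大一/文计/期末大作业/程序/chromedriver")
driver.get("https://www.nytimes.com/section/world")
# Click the "Show More" button in the "latest-panel" element; click() returns None, so nothing is assigned.
driver.find_element_by_xpath('//*[@id="latest-panel"]/div[1]/div/div/button').click()
# Collect the article anchors that are currently in the DOM.
links = driver.find_elements_by_css_selector('a.story-link')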
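The result is a list of 40 WebElement objects. I am still working out how to get the actual URLs from them, but first I need to find out how to get the hidden links.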
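For the narrower problem of turning WebElement objects into URLs, a minimal sketch follows. It assumes each element is an anchor that exposes its target through `get_attribute("href")`. The helper name `hrefs_from_elements` is hypothetical and is not part of Selenium. It does not address the hidden links that load after the click.

```python
def hrefs_from_elements(elements):
    """Return the unique, non-empty href values of the given elements, in order.

    `elements` is assumed to be a list of Selenium WebElement objects
    (or anything else with a compatible get_attribute method).
    """
    seen = set()
    urls = []
    for element in elements:
        # get_attribute returns None when the attribute is absent.
        href = element.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            urls.append(href)
    return urls
```

With the `links` list from the snippet above, `hrefs_from_elements(links)` would return plain URL strings rather than element objects.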
Answer 0 (score: 0)
You can use the requests library to fetch the JSON data directly:
import requests

# Each request returns one page of the section's article feed as JSON.
for page in range(3):
    data = {"q": "", "sort": "newest", "page": page, "dom": "www.nytimes.com", "dedupe_hl": "y"}
    r = requests.get("https://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/section/world", params=data)
    json_data = r.json()

    # Print the first 50 characters of each headline next to its URL.
    for item in json_data['members']['items']:
        print("{:50} {}".format(item['headline'][:50], item['url']))
This gives output that begins:
Lunar Eclipse and Supermoon: Photos From Around th https://www.nytimes.com/2019/01/21/science/lunar-eclipse-supermoon.html
By the Numbers, China’s Economy Is Worse Than It L https://www.nytimes.com/2019/01/20/business/china-economy-gdp-fourth-quarter.html
Henry Sy, the Philippines’ Richest Man and a Shopp https://www.nytimes.com/2019/01/20/world/asia/henry-sy-dead.html
Carlos Ghosn Offers Higher Bail and Security Guard https://www.nytimes.com/2019/01/20/business/carlos-ghosn-bail-japan.html
American Airstrike in Somalia Kills 52 Shabab Extr https://www.nytimes.com/2019/01/20/world/africa/airstrike-shabab-somalia.html
This approach is much faster than using Selenium.
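If you want the links as data rather than printed text, the request parameters and the JSON parsing can be split into two small functions. The sketch below assumes the response has the `members` / `items` / `headline` / `url` shape used in the loop above. The names `BASE_URL`, `build_params`, and `extract_links` are hypothetical helpers, and the endpoint itself is undocumented, so it may change or disappear.

```python
# URL and query parameters are copied from the answer above.
BASE_URL = "https://www.nytimes.com/svc/collections/v1/publish/www.nytimes.com/section/world"


def build_params(page):
    """Build the query parameters for one page of results."""
    return {"q": "", "sort": "newest", "page": page,
            "dom": "www.nytimes.com", "dedupe_hl": "y"}


def extract_links(payload):
    """Return (headline, url) pairs from one decoded JSON response.

    Items without a "url" key are skipped; a missing "members" or
    "items" key yields an empty list instead of raising KeyError.
    """
    members = payload.get("members") or {}
    items = members.get("items") or []
    return [(item.get("headline", ""), item["url"])
            for item in items if item.get("url")]
```

Combined with requests, one page could then be read as `extract_links(requests.get(BASE_URL, params=build_params(0)).json())`, and the pairs from several pages can be concatenated into a single list.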