Question

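I am trying to write a Python script that fetches name popularity data from https://www.ssa.gov/OACT/babynames/index.html. There is a CGI script, /cgi-bin/popularnames.cgi, that takes two request parameters, year (or yob) and top (top 10, 20, etc.), and returns its output as a table. I need to aggregate the results for different years by requesting URLs with different years but the same top value, e.g. 10. However, the page does not change for different request URLs: for example, https://www.ssa.gov/cgi-bin/popularnames.cgi?yob=2000&top=10 and https://www.ssa.gov/cgi-bin/popularnames.cgi?yob=2004&top=10 return the same page.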

>>> import requests
>>> QUERY_URL = 'https://www.ssa.gov/cgi-bin/popularnames.cgi'
>>> NO_CACHE = {'Cache-Control': 'no-cache, no-store, must-revalidate'}
>>> results_page_04 = requests.get(QUERY_URL, params={'year': 2004, 'top': 10}, headers=NO_CACHE).text
>>> results_page_00 = requests.get(QUERY_URL, params={'year': 2000, 'top': 10}, headers=NO_CACHE).text

The two responses are identical and, oddly, both are actually the results for 2015.

Do I need to set some headers before sending the request? (I am using the requests library.)

Answer 1

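The site takes its search parameters from a POST request, not from GET query parameters. When the parameters are missing, it shows the page for 2015.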

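import requests

url = 'https://www.ssa.gov/cgi-bin/popularnames.cgi'

for year in range(2000, 2016):
    # Form fields go in the POST body (data=), not in the query string.
    data = {'year': year, 'number': 'p', 'top': '10'}

    response = requests.post(url, data=data)

    # response.ok is True for status codes below 400.
    if response.ok:
        print(response.text)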
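
To make the GET/POST difference concrete, here is a minimal sketch that builds both kinds of request without sending anything. It uses the standard library's urllib rather than requests, purely so the request objects can be inspected offline; the variable names are illustrative assumptions, not part of the original answer.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = 'https://www.ssa.gov/cgi-bin/popularnames.cgi'
fields = {'year': 2004, 'top': 10}

# GET: the fields are appended to the URL as a query string; there is no body.
get_req = Request(URL + '?' + urlencode(fields))

# POST: the URL stays bare and the same fields travel in the request body.
post_req = Request(URL, data=urlencode(fields).encode('ascii'))

# Nothing is sent here; the objects are only inspected.
print(get_req.get_method(), get_req.full_url, get_req.data)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

With requests, `params=` corresponds to the first form and `data=` to the second, which is why the question's `requests.get(..., params=...)` calls never delivered the year in the form the script reads.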

The developer tools in your browser are your friend: you can find everything you need by inspecting the Network tab.
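
Since the question asks for results aggregated across years, the loop can be wrapped in a small helper that collects the pages instead of printing them. This is only a sketch: the function name `fetch_pages` and its `post` parameter are hypothetical, and the form fields are simply the ones used in the answer's code. Extracting the names from the returned HTML is left out because the table markup is not shown here.

```python
URL = 'https://www.ssa.gov/cgi-bin/popularnames.cgi'


def fetch_pages(years, top=10, post=None):
    """Return a dict mapping each year to the HTML of its results page.

    `post` is a hypothetical hook: any callable with the same shape as
    requests.post(url, data=...). It defaults to requests.post.
    """
    if post is None:
        import requests  # imported lazily so a stub can be passed instead
        post = requests.post

    pages = {}
    for year in years:
        # Same form fields as in the answer's code.
        response = post(URL, data={'year': year, 'number': 'p', 'top': str(top)})
        if response.ok:  # skip years whose request failed
            pages[year] = response.text
    return pages
```

Calling `fetch_pages(range(2000, 2016))` would then give one page per year, keyed by year, ready for whatever HTML parsing you prefer.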
