Question

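I recently followed some tutorials on using BeautifulSoup with Python and learned how to scrape text and URLs from a web page. I am now trying to scrape data from the following link: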

http://www.study.cam.ac.uk/undergraduate/apply/statistics/

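At the bottom of the page there is an interactive graph generator, and I would like to scrape all of its data without tediously writing down by hand the values from every graph it can generate. I tried my limited beginner techniques, but it is not obvious to me where in the HTML the graph data comes from. The HTML also appears to be dynamic, changing with the position of the mouse on the screen.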

Question: Is it possible to scrape this data using these tools? If so, how?

Answer 1

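Using the browser developer tools, you can see that when you click the Show Graph button, a POST request is sent to http://www.study.cam.ac.uk/undergraduate/apply/statistics/data.php. The response is a JSON object containing all the data needed to build the graph.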

Simulate this request in Python, for example with the requests module:

import requests

URL = "http://www.study.cam.ac.uk/undergraduate/apply/statistics/data.php"
HEADERS = {'X-Requested-With': 'XMLHttpRequest'}

data = {
    'when': 'year',
    'year': 2014,
    'applications': 'on',
    'offers': 'on',
    'acceptances': 'on',
    'groupby': 'college',
    'for-5-years-what': 'university'
}

response = requests.post(URL, data=data, headers=HEADERS)
response.raise_for_status()  # stop early if the server returned an HTTP error
print(response.json())

There is no need for BeautifulSoup here. At least, as far as I understand your question.
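Since the goal is to collect the data behind every graph rather than a single one, the same request could be repeated in a loop. The following is only a sketch: the helper names `build_payload` and `fetch_years` are made up for illustration, and which years and `groupby` values the endpoint actually accepts is an assumption you would need to check against the form on the page.

```python
URL = "http://www.study.cam.ac.uk/undergraduate/apply/statistics/data.php"
HEADERS = {"X-Requested-With": "XMLHttpRequest"}


def build_payload(year, groupby="college"):
    """Form fields copied from the request above; only year and groupby vary.

    Hypothetical helper: other groupby values are an assumption.
    """
    return {
        "when": "year",
        "year": year,
        "applications": "on",
        "offers": "on",
        "acceptances": "on",
        "groupby": groupby,
        "for-5-years-what": "university",
    }


def fetch_years(years, post=None):
    """Send one POST per year and return {year: parsed JSON}.

    `post` defaults to requests.post; passing another callable with the
    same interface makes the function usable without network access.
    """
    if post is None:
        import requests  # imported lazily so a stand-in can be used instead
        post = requests.post

    results = {}
    for year in years:
        response = post(URL, data=build_payload(year), headers=HEADERS, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        results[year] = response.json()
    return results
```

Calling `fetch_years([2013, 2014])` would then return a dictionary keyed by year. The years here are just examples, and it would be polite to pause between requests if you loop over many combinations.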
