Web scraping hidden hrefs with Python

Question

I am using Python to get all the possible hrefs from the following web page:

http://www.congresovisible.org/proyectos-de-ley/

For example, these two:

href="ppor-medio-de-la-cual-se-dictan-medidas-para-defender-el-acceso-de-los-usuarios-del-sistema-de-salud-a-medicamentos-de-calidad-eficacia-y-seguridad-acceso-de-los-usuarios-del-sistema-de-salud-a-medicamentos/8683">

href="ppor-medio-del-cual-el-congreso-de-la-republica-facultado-por-el-numeral-17-del-articulo-150-de-la-constitucion-politica-de-colombia-y-en-aras-de-facilitar-la-paz-decreta-otorgar-amnistia-e-indulto-a-los-miembros-del-grupo-armado-organizado-al-margen-de-la-ley-farc-ep/8682">

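In the end I have a list with all the possible hrefs on that page.

However, there are more hrefs, which appear after clicking ver todos ("see all"). But if you check the page source, the total number of hrefs stays the same even if you add /#page=4 (or any other page) to the URL; in fact, the page source does not change at all. How can I get all of those hidden hrefs?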

Answer 1

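Prenote: I am assuming you are using Python 3+.

When you click "see all", the page sends a request to an API, gets the data back, and dumps it into the view. The whole thing is an AJAX process.

The hard and complicated way is to use Selenium, but that is not actually needed. With a little debugging in the browser, you can see where it loads the data.

The request for the first page goes to the URL used in the demo below. q is probably the search query, and page is simply the page number. There are 5 elements per page. You can request it with urllib or requests and parse the JSON response into a dict.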
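As a small sketch of how q and page combine into the request URL, something like the following could be used. The helper name build_page_url is hypothetical, and treating q as a free-text query is only the guess stated above; the endpoint is the one used in the demo.

```python
from urllib.parse import urlencode, quote

# Same endpoint as in the demo; only the query string changes per page.
BASE_URL = "http://www.congresovisible.org/proyectos-de-ley/search/proyectos-de-ley/"


def build_page_url(page, query=" "):
    """Return the search URL for one page (hypothetical helper)."""
    # quote_via=quote encodes the space as %20 instead of "+",
    # which matches the URL seen in the browser.
    params = urlencode([("q", query), ("page", page)], quote_via=quote)
    return BASE_URL + "?" + params


print(build_page_url(1))
```

Calling build_page_url(4) gives the same URL with page=4, which is the piece that the /#page=4 fragment in the browser never sends to the server.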

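A simple demo

I wanted to try it myself, and it seems the server needs a User-Agent header before it will serve the data; otherwise it just returns 403 (Forbidden). I am trying this with Python 3.5.1.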

from urllib.request import urlopen, Request
import json

# Creating headers as dict, to pass User-Agent. I am using my own User-Agent here.
# You can use the same or just google it.
# We need to use User-Agent, otherwise, server does not accept request and returns 403.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36 OPR/39.0.2256.48"
}

# Creating a Request object.
# See, we pass headers, below.
req = Request("http://www.congresovisible.org/proyectos-de-ley/search/proyectos-de-ley/?q=%20&page=1", headers=headers)

# Getting a response
res = urlopen(req)

# The thing is, it returns binary, we need to convert it to str in order to pass it on json.loads function.
# This is just a little bit complicated.
data_b = res.read()
data_str = data_b.decode("utf-8")

# Now, this is the magic.
data = json.loads(data_str)

print(data)

# Now you can manipulate your data. :)
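To get more than the first page, the same request can be repeated with a different page value. The following is only a sketch: fetch_page and collect_pages are hypothetical helpers, the page range is chosen by the caller because the total number of pages is not known here, and the structure of the returned JSON is not assumed, so each page is kept as the raw parsed object.

```python
import json
from urllib.request import Request, urlopen

# Same endpoint as the demo, with the page number left as a placeholder.
URL_TEMPLATE = ("http://www.congresovisible.org/proyectos-de-ley/search/"
                "proyectos-de-ley/?q=%20&page={page}")

# Same User-Agent as the demo; the server rejects requests without one.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36 OPR/39.0.2256.48"
}


def fetch_page(page):
    """Request one page and return the parsed JSON (hypothetical helper)."""
    req = Request(URL_TEMPLATE.format(page=page), headers=HEADERS)
    with urlopen(req) as res:
        return json.loads(res.read().decode("utf-8"))


def collect_pages(first, last, fetch=fetch_page):
    """Return the parsed JSON of pages first..last (inclusive) as a list."""
    return [fetch(page) for page in range(first, last + 1)]
```

For example, pages = collect_pages(1, 3) would request the first three pages. The fetch argument is there so the loop can be tried with a stand-in function before hitting the server.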

For Python 2.7

You can use urllib2. Unlike urllib in Python 3, urllib2 is not split into submodules, so all you need is from urllib2 import Request, urlopen.
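If the same script has to run on both versions, one common pattern is to try the Python 3 import first and fall back to urllib2. This is a minimal sketch; the short User-Agent value is a placeholder, and the server may need a full browser string like the one in the demo.

```python
try:
    # Python 3: Request and urlopen live in the urllib.request submodule.
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2.7: the same names come straight from urllib2.
    from urllib2 import Request, urlopen

# Building the Request object works the same way on both versions.
req = Request(
    "http://www.congresovisible.org/proyectos-de-ley/",
    headers={"User-Agent": "Mozilla/5.0"},  # placeholder value (assumption)
)
print(req.get_full_url())
```

The rest of the demo (urlopen, read, json.loads) can then stay unchanged.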
