How can I extract all page URLs from a website using Python 3?

Asked: 2019-10-27 09:42:15

Tags: python-3.x beautifulsoup python-requests

I want a list of all page URLs on a website. The following code returns nothing:

from bs4 import BeautifulSoup
import requests

base_url = 'http://www.techadvisorblog.com'
response = requests.get(base_url + '/a')
soup = BeautifulSoup(response.text, 'html.parser')

urls = []

for tr in soup.select('tbody tr'):
    urls.append(base_url + tr.td.a['href'])

2 Answers:

Answer 0 (score: 0)

The server responds with HTTP 406 (Not Acceptable). You can work around this by specifying a User-Agent header.

>>> response = requests.get(base_url + '/a', headers={"User-Agent": "XY"})

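See also the related question: Python Requests HTTP Response 406

After re-parsing the new response with BeautifulSoup, you can list the link URLs: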

>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
#content
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://www.instagram.com/techadvisorblog
//www.pinterest.com/pin/create/button/?url=https://techadvisorblog.com/about-us/
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/
https://techadvisorblog.com/what-is-world-wide-web-www/
https://techadvisorblog.com/best-free-password-manager-for-windows-10/
https://techadvisorblog.com/solved-failed-to-start-emulator-the-emulator-was-not-properly-closed/
https://techadvisorblog.com/is-telegram-safe/
https://techadvisorblog.com/will-technology-ever-rule-the-world/
https://techadvisorblog.com/category/android/
https://techadvisorblog.com/category/knowledge/basic-computer/
https://techadvisorblog.com/category/games/
https://techadvisorblog.com/category/knowledge/
https://techadvisorblog.com/category/security/
http://Techadvisorblog.com/
http://Techadvisorblog.com
None
None
None
None
None
>>>
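
The output above still contains None entries (anchors without an href), an in-page anchor (#content), duplicates, a protocol-relative link, and links to other sites. One possible way to reduce it to a clean list of same-site page URLs is sketched below. The helper names internal_urls and _site are hypothetical, not part of requests or BeautifulSoup, and treating "www." and non-"www." hosts as the same site is an assumption made here for this particular blog.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def _site(netloc):
    """Lower-case a host name and drop a leading "www." so both spellings compare equal."""
    host = netloc.lower()
    return host[4:] if host.startswith("www.") else host


def internal_urls(hrefs, base_url):
    """Return unique absolute same-site URLs from raw href values, in first-seen order."""
    base_site = _site(urlparse(base_url).netloc)
    seen = set()
    result = []
    for href in hrefs:
        # Skip <a> tags without an href (None), empty values, and in-page anchors.
        if not href or href.startswith("#"):
            continue
        # Resolve relative and protocol-relative links, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # e.g. mailto: or javascript: links
        if _site(parts.netloc) != base_site:
            continue  # link to another site
        # Normalise host case and an empty path so trivial variants collapse.
        parts = parts._replace(netloc=parts.netloc.lower(), path=parts.path or "/")
        normalised = parts.geturl()
        if normalised not in seen:
            seen.add(normalised)
            result.append(normalised)
    return result
```

With the soup object from the answer, it could be called as internal_urls((a.get('href') for a in soup.find_all('a')), base_url). URLs that differ only in scheme (http vs https) are still kept as separate entries in this sketch.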

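Answer 1 (score: 0)

I'm not sure why you are appending /a to the end of the URL, since that redirects to the About-us page. I also don't see any table / tr / td tags to use on either the base URL or the About-us page. If you instead mean to loop over the two (or more) listing pages that start at the base URL, you can do that by testing for the presence of a rel attribute with the value next. And yes, you do need a valid User-Agent header.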
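
The rel="next" approach can be sketched as follows. This is a minimal illustration, not the answerer's code: it swaps BeautifulSoup for the standard-library HTMLParser so that it runs without extra packages, the names NextLinkParser, find_next_url and walk_pages are hypothetical, and the page download is passed in as a fetch function so the header choice stays with the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Records the href of the first <link> or <a> tag whose rel contains "next"."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if self.next_href is not None or tag not in ("link", "a"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, e.g. "nofollow next".
        rel_values = (attr.get("rel") or "").lower().split()
        if "next" in rel_values and attr.get("href"):
            self.next_href = attr["href"]


def find_next_url(html, page_url):
    """Return the absolute URL of the rel="next" target, or None if there is none."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    return urljoin(page_url, parser.next_href)


def walk_pages(start_url, fetch, max_pages=50):
    """Follow rel="next" links from start_url; fetch(url) must return the page's HTML text."""
    visited = []
    url = start_url
    # Stop when there is no next link, on a loop, or at the safety limit.
    while url and url not in visited and len(visited) < max_pages:
        visited.append(url)
        url = find_next_url(fetch(url), url)
    return visited
```

With requests, fetch could be something like lambda url: requests.get(url, headers={"User-Agent": "XY"}).text, reusing the header from the first answer. With BeautifulSoup, the lookup in find_next_url would correspond roughly to soup.find('link', rel='next').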