I want a list of the URLs of all pages on a website. The following code returns nothing:
from bs4 import BeautifulSoup
import requests
base_url = 'http://www.techadvisorblog.com'
response = requests.get(base_url + '/a')
soup = BeautifulSoup(response.text, 'html.parser')
urls = []
for tr in soup.select('tbody tr'):
    urls.append(base_url + tr.td.a['href'])
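Answer 0 (score: 0)
The server responds with HTTP status 406 (Not Acceptable), so there is no usable HTML to parse. You can fix this by specifying a User-Agent header.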
>>> response = requests.get(base_url + '/a', headers={"User-Agent": "XY"})
See also the related question: Python Requests HTTP Response 406
You can then get the URLs:
>>> soup = BeautifulSoup(response.text, 'html.parser')
>>> for link in soup.find_all('a'):
...     print(link.get('href'))
...
#content
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://techadvisorblog.com
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/about-us/
https://techadvisorblog.com/disclaimer/
https://techadvisorblog.com/privacy-policy/
None
https://techadvisorblog.com/
https://www.instagram.com/techadvisorblog
//www.pinterest.com/pin/create/button/?url=https://techadvisorblog.com/about-us/
https://techadvisorblog.com/contact-us/
https://techadvisorblog.com/
https://techadvisorblog.com/what-is-world-wide-web-www/
https://techadvisorblog.com/best-free-password-manager-for-windows-10/
https://techadvisorblog.com/solved-failed-to-start-emulator-the-emulator-was-not-properly-closed/
https://techadvisorblog.com/is-telegram-safe/
https://techadvisorblog.com/will-technology-ever-rule-the-world/
https://techadvisorblog.com/category/android/
https://techadvisorblog.com/category/knowledge/basic-computer/
https://techadvisorblog.com/category/games/
https://techadvisorblog.com/category/knowledge/
https://techadvisorblog.com/category/security/
http://Techadvisorblog.com/
http://Techadvisorblog.com
None
None
None
None
None
>>>
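The raw output above still contains None entries, a fragment-only link, duplicates, protocol-relative URLs, and links to other sites. A minimal sketch of one way to tidy such a list follows. The helper names clean_links and _site_host are hypothetical, and treating the "www." prefix, letter case, scheme, and a trailing slash as insignificant is an assumption of this sketch, not something the site guarantees.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def _site_host(url):
    """Lower-cased host name without a leading 'www.' (assumed equivalent)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def clean_links(hrefs, base_url):
    """Return unique, absolute, same-site URLs in first-seen order."""
    base_host = _site_host(base_url)
    seen = set()
    result = []
    for href in hrefs:
        # Skip <a> tags without an href (None) and same-page fragment links.
        if not href or href.startswith("#"):
            continue
        # Resolve relative and protocol-relative links, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue
        host = _site_host(absolute)
        if host != base_host:
            continue  # external link (Instagram, Pinterest, ...)
        # Assumption: scheme and trailing slash do not distinguish pages.
        key = (host, parts.path.rstrip("/"), parts.query)
        if key not in seen:
            seen.add(key)
            result.append(absolute)
    return result
```

With the soup from the session above, it could be called as clean_links([a.get('href') for a in soup.find_all('a')], base_url).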
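Answer 1 (score: 0)
I'm not sure why you are appending /a to the end of the URL, since that redirects to the About-us page. Also, I can't see the table / tr / td tags you are selecting on either the base URL or the About-us page. If instead you intend to loop through the two (or more) pages of listings under the base URL, you can do that by testing for the presence of a rel attribute with the value next. And yes, you need a valid User-Agent header.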
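The pagination idea in this answer can be sketched as follows. To keep the example runnable without third-party packages it uses the standard library's html.parser rather than BeautifulSoup, which is a substitution on my part; the same test can be written against a BeautifulSoup tree. The names NextLinkParser, find_next_url, iter_pages, and the max_pages limit are hypothetical, and the sketch assumes the site marks its "next page" link with rel="next" on an a or link tag, which I have not verified for this site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first <a> or <link> tag whose rel contains 'next'."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if self.next_href is not None or tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="next nofollow".
        rel_values = (attr.get("rel") or "").lower().split()
        if "next" in rel_values and attr.get("href"):
            self.next_href = attr["href"]


def find_next_url(html, page_url):
    """Return the absolute URL of the rel="next" link, or None if there is none."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    return urljoin(page_url, parser.next_href)


def iter_pages(start_url, fetch, max_pages=50):
    """Yield (url, html) pairs, following rel="next" links from start_url.

    fetch is any callable that takes a URL and returns the page's HTML text.
    """
    url = start_url
    visited = set()
    # Stop on a missing next link, a loop, or the (arbitrary) page limit.
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        html = fetch(url)
        yield url, html
        url = find_next_url(html, url)
```

With requests, fetch could be lambda u: requests.get(u, headers={"User-Agent": "XY"}).text, reusing the header from the other answer. Each yielded html can then be parsed for article links as usual.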