Question

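I am new to web scraping and am doing this to learn. I want to find all the href links on https://retty.me/, but my code finds only one link. The page source in the browser contains many links that are never printed. I also printed the whole page as my script receives it, and it contains only that one link. What am I doing wrong?

Please correct me.

Here is my Python code: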

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
data=[]
html = urlopen('https://retty.me')
soup = BeautifulSoup(html,'lxml')
print(soup)
for link in soup.findAll('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])



file=open('scrapped_data.txt','w')
for item in data:
    file.write("%s\n"%item)
file.close()

Answer 1

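If you paste the message that appears in the returned HTML into Google Translate, it says "We apologize for your trouble". They do not want people scraping their site, so they filter requests based on the user agent. You just need to add a user agent that looks like a browser to the request headers.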

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re

data=[]

url = 'https://retty.me'
req = Request(
    url, 
    data=None, 
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
html = urlopen(req)
soup = BeautifulSoup(html,'lxml')
print(soup)
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])

for item in data:
    print(item)

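In fact, this particular site only requires that a User-Agent header be present, and it will accept any user agent, even an empty string. The requests library mentioned by Rishav sends a user agent by default, which is why it works without adding a custom header.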
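
To see which User-Agent urllib sends by default, and how an explicit header replaces it, here is a minimal offline sketch (nothing is actually requested). The shortened 'Mozilla/5.0' value is only a placeholder. Note that urllib does send a default User-Agent of its own, so the site's filter presumably rejects that particular value rather than a missing header; that is an inference, not something the site documents.

```python
from urllib.request import Request, build_opener

# Headers the default opener adds when the request does not set them itself.
default_headers = dict(build_opener().addheaders)
print(default_headers)  # e.g. {'User-agent': 'Python-urllib/3.x'}

# An explicit header on the Request takes the place of that default.
# 'Mozilla/5.0' is a placeholder, not a recommended value.
req = Request('https://retty.me', headers={'User-Agent': 'Mozilla/5.0'})

# urllib stores header names capitalized, hence 'User-agent'.
custom_ua = req.get_header('User-agent')
print(custom_ua)
```

Printing both values before hitting the site is a quick way to check what the server will actually see.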

Answer 2

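I don't know why the website returns different HTML when accessed with urllib, but you can use the excellent requests library, which is much easier to use than urllib.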

from bs4 import BeautifulSoup
import re
import requests

data = []
html = requests.get('https://retty.me').text
soup = BeautifulSoup(html, 'lxml')
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    data.append(link.attrs['href'])
print(data)
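
One plausible reason the requests version behaves differently is the set of default headers it sends. The sketch below inspects them offline and shows how a per-request header would override the default; the 'Mozilla/5.0' value is a placeholder, and nothing here is sent over the network.

```python
import requests

# A Session holds the same default headers that requests.get() sends.
session = requests.Session()
default_ua = session.headers['User-Agent']
print(default_ua)  # e.g. python-requests/2.x

# prepare_request() merges session and per-request headers without
# sending anything, so the result can be inspected offline.
prepared = session.prepare_request(
    requests.Request('GET', 'https://retty.me',
                     headers={'User-Agent': 'Mozilla/5.0'})  # placeholder value
)
override_ua = prepared.headers['User-Agent']
print(override_ua)
```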

Answer 3

See the official documentation for requests and Beautiful Soup for details.

import requests
from bs4 import BeautifulSoup

# your Response object called response
response = requests.get('https://retty.me')

# your html as string
html = response.text

#verify that you get the correct html code
print(html)

#make the html, a soup object
soup = BeautifulSoup(html, 'html.parser')

# initialization of your list
data = []

# append to your list all the URLs found within a page’s <a> tags
for link in soup.find_all('a'):
    data.append(link.get('href'))

#print your list items
print(data)
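
The loop above collects every href as written, which includes None for anchors without an href, and relative paths. A possible refinement, sketched on a small hypothetical HTML snippet rather than the real page, skips anchors without an href and resolves relative links against the base URL. The paths in the sample markup are made up for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://retty.me/'

# Hypothetical markup standing in for a downloaded page.
sample_html = """
<a href="https://retty.me/area/">absolute link</a>
<a href="/restaurant/">relative link</a>
<a name="top">anchor without href</a>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# href=True keeps only <a> tags that actually carry an href attribute;
# urljoin turns relative paths into absolute URLs and leaves absolute ones alone.
links = [urljoin(BASE_URL, a['href']) for a in soup.find_all('a', href=True)]
print(links)
```

Unlike the `^https://` regex filter used in the question, this keeps relative links instead of silently dropping them.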

Why can't I access the full HTML of this page using urllib and BeautifulSoup?
