Question

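I am testing the Beautiful Soup web scraping tool. The code below simply connects to a subreddit and tries to print the links to all images from user posts on the first page.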

import requests
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/pics"
r = requests.get(url)

if r.status_code != 200:
    print "failed to connect"
    exit()

sourcecode = r.text
soup = BeautifulSoup(sourcecode, "html.parser")

print soup

for tag in soup.find_all('a', {'class': 'title may-blank outbound srTagged'}):
    print "entered into for loop"

    if tag['href'].startswith('http'):
        print tag['href']

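This code prints a correct soup object, and I can see its contents. However, the soup.find_all('a', {'class': 'title may-blank outbound srTagged'}) call returns an empty list. There is no error, just an empty list, so the final for loop never runs.

I would like to know what is wrong here. I copied and pasted the class string, and I can see the links I am trying to print in the page source.

I am referring to this line: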

<a class = "title may-blank outbound srTagged" ...

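I copied and pasted it into my code to avoid typos, but still nothing happens. Any idea why the call returns an empty list?

I changed the for loop to for tag in soup.find_all('a', {'class': 'thumbnail may-blank outbound'}):, which uses a different class name, and that behaves as expected.

Has the website blocked Beautiful Soup from accessing that part of the source code entirely?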

Answer 1

First of all, you are running into the differences between parsers. Switch to the more lenient html5lib:

soup = BeautifulSoup(sourcecode, "html5lib")

This requires html5lib to be installed.
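If it is not certain that html5lib is available on the machine running the script, a guarded constructor is one option. This is only a minimal sketch: make_soup is a hypothetical helper name, and the one-line HTML string is made-up sample data rather than real subreddit markup. FeatureNotFound is the exception bs4 raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="html5lib"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper, not part of Beautiful Soup.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises this when the requested parser library is missing.
        return BeautifulSoup(markup, "html.parser")


# Made-up sample markup standing in for the downloaded page source.
sample = '<a class="title outbound" href="http://example.com/a.jpg">pic</a>'
soup = make_soup(sample)
print(soup.a["href"])
```

Note that the fallback only keeps the script running. If the built-in parser is what mishandles the page, installing html5lib (or lxml) is still the actual fix.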

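Also, you can simplify the way you locate the links:

use CSS selectors and check for the title and outbound classes only
don't check whether the href value starts with http, since the outbound class implies it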

Fixed version: loop over soup.select("a.title.outbound") and print each tag's href.
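A minimal sketch of the selector-based approach described above, run against a small made-up HTML fragment (the markup and URLs are assumptions for illustration, not the real page). It uses html.parser only so the snippet runs without extra installs. For the live page, the parser advice above still applies.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the subreddit page source.
html = """
<div>
  <a class="title may-blank outbound srTagged" href="http://i.imgur.com/abc.jpg">First</a>
  <a class="title may-blank" href="/r/pics/comments/xyz/">Self post</a>
  <a class="thumbnail may-blank outbound" href="http://i.imgur.com/def.jpg">thumb</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# a.title.outbound matches <a> tags that carry both classes, in any order,
# no matter which other classes are also present on the tag.
links = [a["href"] for a in soup.select("a.title.outbound")]
for link in links:
    print(link)
```

Matching on two classes instead of the full class string also means the loop keeps working if the site adds or reorders the other classes.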

Answer 2

html.parser is broken in Python 2, and the Beautiful Soup documentation mentions this:

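If you can, I recommend you install and use lxml for speed. If you're using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it's essential that you install lxml or html5lib. Python's built-in HTML parser is just not very good in older versions.

Note that if a document is invalid, different parsers will generate different Beautiful Soup trees for it. See Differences between parsers for details.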
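To see the point about invalid documents for yourself, something like the following can help. It is a rough sketch: parse_with_each is a hypothetical helper, and the tiny invalid fragment is just an illustration. Parsers that are not installed are reported as missing instead of crashing the script.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A stray closing tag makes this fragment invalid HTML.
invalid = "<a></p>"


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: serialized tree, or None if not installed}.

    parse_with_each is a hypothetical helper for comparison only.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The parser library is not available in this environment.
            results[name] = None
    return results


trees = parse_with_each(invalid)
for name, tree in trees.items():
    print(name, "->", tree if tree is not None else "not installed")
```

Running the same comparison on the actual page source is a quick way to check which parser keeps the tags you are searching for.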

Python 2.7 - Beautiful Soup web scraping: find_all command not working correctly
