Question

I am trying to build a web scraper for Reddit's /r/all that collects the links of the top posts. I have been following the first part of thenewboston's web crawler tutorial series on YouTube.

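In my code I removed the while loop, which in thenewboston's version limits the number of pages to crawl (I only want the first 25 posts of /r/all, which is a single page). I made these changes to suit the purpose of my own scraper.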

I also changed the URL variable to 'http://www.reddit.com/r/all/' (for obvious reasons) and the Soup.findAll iterable to Soup.findAll('a', {'class': 'title may-blank loggedin'}) (title may-blank loggedin is the class of post titles on Reddit).

Here is my code:

import requests
from bs4 import BeautifulSoup

def redditSpider():
    URL = 'http://www.reddit.com/r/all/'
    sourceCode = requests.get(URL)
    plainText = sourceCode.text
    Soup = BeautifulSoup(plainText)
    for link in Soup.findAll('a', {'class': 'title may-blank loggedin'}):
        href = 'http://www.reddit.com/r/all/' + link.get('href')
        print(href)

redditSpider()

I did some amateur debugging by putting print statements between the lines, and it seems the body of the for loop is never executed.

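To follow along with or compare against thenewboston's code, skip to the second part of his mini-series and find the point in the video where his code is shown.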

Edit: as requested, here is thenewboston's code:

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://buckysroom.org/trade/search.php?page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'item-name'}):
            href = 'http://buckysroom.org' + link.get('href')
            print(href)
        page += 1

trade_spider(1)

Answer 1

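This is not a direct answer to your question, but I thought I'd mention that there is a Reddit API wrapper for Python called PRAW (Python Reddit API Wrapper). You may want to look at it, since it can do what you want much more easily.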

Link: https://praw.readthedocs.org/en/v2.1.20/
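
For what it's worth, a minimal sketch of the PRAW route might look like the following. It assumes a current PRAW release (whose interface differs from the v2.1.20 docs linked above) and API credentials registered with Reddit. The helper names hot_links and make_reddit and the placeholder credential strings are mine, not part of PRAW.

```python
def hot_links(reddit, subreddit="all", limit=25):
    """Return (title, url) pairs for the current hot posts of a subreddit.

    `reddit` is expected to be a praw.Reddit instance.
    """
    listing = reddit.subreddit(subreddit).hot(limit=limit)
    return [(post.title, post.url) for post in listing]


def make_reddit():
    """Build a client; the placeholders must be replaced with real credentials."""
    import praw  # third-party package: pip install praw

    return praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="reddit-links-demo by u/your_username",
    )


# Usage (needs network access and valid credentials):
# for title, url in hot_links(make_reddit()):
#     print(title, url)
```

Keeping the listing logic in a function that takes the client as an argument makes it easy to try out without touching the network.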

Answer 2

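First, thenewboston appears to be a screencast, so it would help to have the code posted here.

Second, I suggest writing the response to a local file so you can open it in a browser and inspect it with the developer tools to find what you are after. I also suggest using ipython to experiment with BeautifulSoup on that local file, rather than fetching the page every time.

If you drop this into your function you can do that: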

plainText = sourceCode.text
# Save the fetched HTML so it can be inspected offline
with open('something.html', 'w', encoding='utf8') as f:
    f.write(plainText)
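
Once the page is saved, a quick way to see which class strings actually occur on the links is to tally them. Below is a minimal sketch using only the standard library's html.parser, as a stand-in for poking around with BeautifulSoup in ipython. The names AnchorClassCounter and anchor_classes are made up for this example, and something.html is the file written by the snippet above.

```python
from collections import Counter
from html.parser import HTMLParser


class AnchorClassCounter(HTMLParser):
    """Tally the raw class attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; value may be None
            class_value = dict(attrs).get("class") or ""
            self.counts[class_value] += 1


def anchor_classes(html):
    """Return a Counter mapping class strings to the number of <a> tags using them."""
    parser = AnchorClassCounter()
    parser.feed(html)
    return parser.counts


# Usage with the saved page:
# with open('something.html', encoding='utf8') as f:
#     for class_value, n in anchor_classes(f.read()).most_common(10):
#         print(n, repr(class_value))
```

Printing with repr makes trailing spaces in a class string visible, which matters when you match on the exact string.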

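When I ran your code, I first had to wait, because several times it gave me an error page saying I was making requests too often. That may be your first problem.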
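
If that error page is Reddit throttling the client, sending a descriptive User-Agent and backing off before retrying may help. Here is a sketch with the standard library's urllib instead of requests, assuming the server answers with HTTP 429 when it throttles. The function fetch_html, its parameters, and the User-Agent string are illustrative choices, not anything from the tutorial.

```python
import time
import urllib.error
import urllib.request


def fetch_html(url, retries=3, delay=2.0,
               opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a page, waiting and retrying when the server replies 429."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "reddit-links-demo/0.1 (personal script)"}
    )
    for attempt in range(retries):
        try:
            with opener(request) as response:
                return response.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as err:
            # Give up on other errors, or once the last attempt has failed
            if err.code != 429 or attempt == retries - 1:
                raise
            sleep(delay * (attempt + 1))  # wait a little longer each time


# Usage (performs a real request):
# html = fetch_html('http://www.reddit.com/r/all/')
```

The opener and sleep parameters exist only so the retry logic can be exercised without a real request.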

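When I did get the page, there were plenty of links, but none with your class. Without watching the whole YouTube series, I wasn't sure what 'title may-blank loggedin' was supposed to represent.

Now I see the problem.

It is the loggedin class: your scraper is not logged in, so that class never appears in the HTML it receives.

You shouldn't need to log in just to view /r/all, so simply use this instead: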

soup.findAll('a', {'class': 'title may-blank '})
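
More generally, the full class string depends on page state (loggedin is only present for logged-in sessions), so matching on the exact string is brittle. One option is to require only the classes you care about. Below is a minimal standard-library sketch of that idea; with BeautifulSoup, a CSS selector such as soup.select('a.title.may-blank') expresses the same thing. TitleLinkParser and title_links are hypothetical names, and the sample markup in the usage is made up.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags that carry all required classes."""

    def __init__(self, required=("title", "may-blank")):
        super().__init__()
        self.required = set(required)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        # Subset test: extra classes such as 'loggedin' do not matter
        if self.required <= classes and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def title_links(html):
    """Return the hrefs of all post-title links found in the given HTML."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.hrefs


# Usage:
# title_links('<a class="title may-blank loggedin" href="/r/pics/1">a</a>')
```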

Answer 3

You are not "logged in", so that class is never applied. This works without logging in:

import requests
from bs4 import BeautifulSoup

def redditSpider():
    URL = 'http://www.reddit.com/r/all'
    source = requests.get(URL)
    # Name the parser explicitly so the result does not depend on what is installed
    Soup = BeautifulSoup(source.text, 'html.parser')
    for link in Soup.findAll('a', attrs={'class': 'title may-blank '}):
        href = 'http://www.reddit.com/r/all/' + link.get('href')
        print(href)

redditSpider()
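
One more thing to watch in the snippets in this thread: prefixing every href with 'http://www.reddit.com/r/all/' gives broken URLs whenever the href is already absolute (links to external sites) or root-relative (for example /r/.../comments/... for self posts). A small sketch using urljoin from the standard library, which handles both cases. The helper absolute_link is illustrative and the sample hrefs are made up.

```python
from urllib.parse import urljoin

BASE = 'http://www.reddit.com/r/all/'


def absolute_link(href, base=BASE):
    """Resolve an href against the page URL, leaving absolute URLs untouched."""
    return urljoin(base, href)


# Root-relative href: resolved against the site root
print(absolute_link('/r/Python/comments/abc123/example/'))
# Already absolute href: returned unchanged
print(absolute_link('http://example.com/article'))
```

Inside the loop this would replace the string concatenation, i.e. href = absolute_link(link.get('href')).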

Python - Reddit web crawler using BeautifulSoup4 returns nothing
