I have a problem with my spider. I followed some tutorials to better understand Scrapy and extended the tutorial to also crawl subpages. The problem is that my spider scrapes only one element from the entry page instead of the 25 that should be on it.
I can't figure out where it fails. Maybe some of you can help me here:
from datetime import datetime as dt

import scrapy

from reddit.items import RedditItem


class PostSpider(scrapy.Spider):
    name = 'post'
    allowed_domains = ['reddit.com']

    def start_requests(self):
        reddit_urls = [
            ('datascience', 'week')
        ]
        for sub, period in reddit_urls:
            url = 'https://www.reddit.com/r/' + sub + '/top/?sort=top&t=' + period
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # get the subreddit from the URL
        sub = response.url.split('/')[4]

        # parse through each of the posts
        for post in response.css('div.thing'):
            item = RedditItem()
            item['title'] = post.css('a.title::text').extract_first()
            item['commentsUrl'] = post.css('a.comments::attr(href)').extract_first()

            ### scrape the comments page.
            request = scrapy.Request(url=item['commentsUrl'], callback=self.parse_comments)
            # pass the partially filled item along to the comments callback
            request.meta['item'] = item
            return request

    def parse_comments(self, response):
        item = response.meta['item']
        item['commentsText'] = response.css('div.comment div.md p::text').extract()
        self.logger.info('Got successful response from {}'.format(response.url))
        yield item
Thanks for your help. BR
Answer 0 (score: -1)

Thanks for the comments: indeed, I have to yield the request instead of returning it. Now it is working.
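For completeness, here is a minimal sketch of the corrected parse method, assuming the rest of the spider stays as posted in the question. The only change is replacing return request with yield request: return exits parse on the first iteration, while yield turns it into a generator, so Scrapy receives one comments-page request per post in the loop.

    def parse(self, response):
        # get the subreddit from the URL
        sub = response.url.split('/')[4]

        for post in response.css('div.thing'):
            item = RedditItem()
            item['title'] = post.css('a.title::text').extract_first()
            item['commentsUrl'] = post.css('a.comments::attr(href)').extract_first()

            request = scrapy.Request(url=item['commentsUrl'], callback=self.parse_comments)
            request.meta['item'] = item
            # yield, don't return: returning ends the method after the
            # first post, which is why only one element was scraped
            yield request

With yield, all 25 div.thing posts on the entry page produce a request, and each one is followed up by parse_comments as intended.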