Question

I am trying to scrape news articles from https://www.finextra.com/latest-news.

I have looked through similar Stack Overflow questions about Scrapy pagination, but none of them seem to match my problem.

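Everything in my code works except the part where I want to follow the "next_page" link. I wrote another spider for a different news site using exactly the same code (apart from the XPath selectors), and it works fine.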

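I have checked that the XPath selector extracts the link correctly, and I have commented out allowed_domains, because some answers suggested the offsite middleware could be the problem.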

Can anyone help me?

import scrapy


class FinextraSpider(scrapy.Spider):
    name = 'finextra'
    # allowed_domains = ["finextra.com"]
    start_urls = ["https://www.finextra.com/latest-news"]

    def parse(self, response):
        articles = response.xpath("//div[@class='module--story']")

        for article in articles:
            category = article.xpath("./div[@class='story--content']/h6/a/text()").get()
            category = category.replace("/", "")
            article_link = article.xpath("./div[@class='story--content']/h4/a/@href").get()
            title = article.xpath("./div[@class='story--content']/h4/a/text()").get()
            title = title.replace("'", "''")

            yield scrapy.Request(response.urljoin(article_link),
                                  cb_kwargs={'category': category,
                                             'article_link': article_link,
                                             'title': title},
                                  callback=self.parse_readmore)

        # DOESN'T WORK
        next_page = response.xpath("//div[@id='pagination']/a[last()-1]/@href")
        if next_page:
            yield response.follow(next_page,
                                  callback=self.parse)

Answer 1

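I found the problem. The script was failing at category = category.replace("/", "") because one of the articles had no category. In that case .get() returns None, so the call to .replace() raised an exception and parse() stopped before it ever reached the pagination code.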
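To illustrate why one bad article is enough to break pagination, here is a minimal sketch in plain Python with no Scrapy involved. The names parse, collect and the "NEXT_PAGE" marker are hypothetical stand-ins, not part of the spider above. It only shows the general generator behaviour: once an exception is raised inside the loop, nothing after the loop is ever yielded.

```python
def parse(categories):
    """Hypothetical stand-in for the spider's parse() callback."""
    for category in categories:
        # Raises AttributeError when category is None, which is what
        # .get() returns when the XPath matches nothing.
        yield category.replace("/", "")
    # Only reached if every iteration above succeeded.
    yield "NEXT_PAGE"


def collect(categories):
    """Consume the generator and record where it stops."""
    results = []
    try:
        for item in parse(categories):
            results.append(item)
    except AttributeError:
        results.append("ERROR")
    return results
```

With a list such as ["/Payments", None, "/Retail"], collect() gets the first item, hits the error on the second, and never sees "NEXT_PAGE". Items yielded before the failure still come through, which would explain why the scraping itself appeared to work.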

Adding an if/else check so that the loop carries on when the category is empty solved it.
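The check could look roughly like the sketch below. The helper name clean_text and its default argument are my own assumptions, not something from the original spider; the point is only to guard against None before calling .replace().

```python
from typing import Optional


def clean_text(value: Optional[str], old: str, new: str, default: str = "") -> str:
    """Replace `old` with `new` in value, or return `default` if value is None.

    Hypothetical helper: Selector .get() returns None when nothing matches,
    so the None case has to be handled before calling str.replace().
    """
    if value is None:
        return default
    return value.replace(old, new)


# Possible use inside parse(), with the selectors from the question:
#     category = clean_text(
#         article.xpath("./div[@class='story--content']/h6/a/text()").get(), "/", "")
#     title = clean_text(
#         article.xpath("./div[@class='story--content']/h4/a/text()").get(), "'", "''")
```

Since the same pattern is used for title, guarding both fields seems safer. Another option that should work is passing a default to the selector itself, e.g. .get(default=""), so the value is never None in the first place.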

Thanks to everyone who read this.

Scrapy not following pagination links
