Question

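I want to parse a website and scrape the link of every article, but Scrapy does not crawl all of the links: it seems to pick up only some of them, apparently at random.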

import scrapy

from tutorial.items import GouvItem

class GouvSpider(scrapy.Spider):
    name = "gouv"
    allowed_domains = ["legifrance.gouv.fr"]
    start_urls = [
        "http://www.legifrance.gouv.fr/affichCode.do?cidTexte=LEGITEXT000006069577&dateTexte=20160128"
    ]

    def parse(self, response):
        for href in response.xpath('//span/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        for art in response.xpath("//div[@class='corpsArt']"):
            item = GouvItem()
            item['article'] = art.xpath('p/text()').extract()
            yield item




# And this is the GouvItem:

import scrapy

class GouvItem(scrapy.Item):
    title1 = scrapy.Field()
    title2 = scrapy.Field()
    title3 = scrapy.Field()
    title4 = scrapy.Field()
    title5 = scrapy.Field()
    title6 = scrapy.Field()
    link = scrapy.Field()
    article = scrapy.Field()

In the resulting JSON file, some articles are missing while others appear many times.

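The problem is that each article of the law should appear exactly once. On the website, every article appears only once.

Thank you very much!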

Answer 1

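The links to the site's subpages contain a session ID. It looks like the server takes that session ID into account when answering requests, in a way that does not play well with Scrapy sending several requests concurrently.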

One way to fix this is to set CONCURRENT_REQUESTS to 1 in settings.py. With this setting, scraping will take longer.
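As a minimal sketch, the relevant line in the project's settings.py (here presumably tutorial/settings.py, judging by the import in the question) would look like this:

```python
# settings.py of the Scrapy project (the path is an assumption based on
# the "tutorial" package name used in the question).

# Scrapy's default is 16 concurrent requests. With 1, requests are sent
# strictly one after another, so responses tied to the session ID
# cannot get mixed up, at the cost of a slower crawl.
CONCURRENT_REQUESTS = 1
```

If you would rather not change the whole project, the same key can be placed in a `custom_settings` dict on the spider class.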

Another approach is to control the requests manually using a list. See this answer on SO.

To avoid empty results, use a relative XPath (with a leading dot) and extract all text nodes:

item['article'] = art.xpath('.//text()').extract()

Hope this helps.
