Question

What am I trying to scrape?

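I'm scraping a product website and collecting product information for each store (magasin). To do this, I send a POST request that selects my store (and gets the corresponding cookie), then a GET request for my category. Scrapy already has a built-in mechanism for sending cookies with requests. My problem is that the requests issued from parse are sometimes made with the same cookie, which is not what I want.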

I created parse_mag only to check which store the response belongs to.

import csv

import scrapy


class BricoMarcheSpider(scrapy.Spider):
    name = 'brico_marche'

    def start_requests(self):
        # full path to the CSV file listing the stores
        with open('file.csv') as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                # drop a single leading zero from the store id (safe on empty values)
                magasin_id = row['Id']
                if magasin_id.startswith('0'):
                    magasin_id = magasin_id[1:]
                formdata = {'city': row['City'], 'market': row['Brand'], 'idPdv': magasin_id}
                # POST that selects the store; the response sets the store cookie
                yield scrapy.FormRequest(url='http://www.bricomarche.com/bma_popin/Geolocalisation/choisirMagasin', formdata=formdata, dont_filter=True, callback=self.parse)

    def parse(self, response):
        # GET the category page once the store has been selected
        yield scrapy.Request('http://www.bricomarche.com/l/nos-produits/jardin/abri-garage-carport-et-rangement/abri-de-jardin/les-abris-bois-1121.html?limit=90', dont_filter=True, callback=self.parse_mag)

    def parse_mag(self, response):
        # report which store the category page was rendered for
        yield {"City": response.xpath('//div[@class="store-details"]/p/strong/text()').extract_first()}

Answer 1

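Your parse() method always makes exactly the same request to the same URL and passes the response to parse_mag().

So you are not getting several parse_mag() calls for a single POST request. You get one call per POST, each made with the same parameters and therefore returning the same result.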

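Use the cookiejar request meta key to keep a separate cookie session per store: https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#std:reqmeta-cookiejar

By default Scrapy uses a single cookie jar for the whole spider, so each store-selection POST overwrites the cookie set by the previous one. The cookiejar key is not sticky, so it has to be passed along on every follow-up request: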

def start_requests(self):
    # full path to the CSV file listing the stores
    with open('file.csv') as csvfile:
        reader = csv.DictReader(csvfile)
        for i, row in enumerate(reader):
            # drop a single leading zero from the store id (safe on empty values)
            magasin_id = row['Id']
            if magasin_id.startswith('0'):
                magasin_id = magasin_id[1:]
            formdata = {'city': row['City'], 'market': row['Brand'], 'idPdv': magasin_id}
            # the row index i names a separate cookie jar for each store
            yield scrapy.FormRequest(url='http://www.bricomarche.com/bma_popin/Geolocalisation/choisirMagasin', formdata=formdata, dont_filter=True, callback=self.parse, meta={'cookiejar': i})

def parse(self, response):
    # reuse the jar of the POST response so the GET carries that store's cookie
    yield scrapy.Request('http://www.bricomarche.com/l/nos-produits/jardin/abri-garage-carport-et-rangement/abri-de-jardin/les-abris-bois-1121.html?limit=90', dont_filter=True, callback=self.parse_mag, meta={'cookiejar': response.meta['cookiejar']})
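
To see why the shared session gives the same store every time, here is a rough toy model of per-key cookie sessions. It is only a sketch and not Scrapy's actual implementation: the CookieStore class, the store ids and the use of idPdv as a cookie name are all made up for illustration, and it assumes the worst case where both POST responses arrive before either GET is sent.

```python
from collections import defaultdict


class CookieStore:
    """Toy model of per-key cookie sessions (hypothetical, not Scrapy's code)."""

    def __init__(self):
        # One independent cookie dict per 'cookiejar' key.
        # The key None stands for the default shared session.
        self.jars = defaultdict(dict)

    def handle_response(self, meta, set_cookies):
        # Store the cookies set by a response in the jar named by the request meta.
        self.jars[meta.get("cookiejar")].update(set_cookies)

    def cookies_for(self, meta):
        # Cookies that would be sent with a request carrying this meta.
        return dict(self.jars[meta.get("cookiejar")])


stores = ["1234", "5678"]  # hypothetical store ids

# Shared session: no 'cookiejar' key, so every POST writes to the same jar.
shared = CookieStore()
for store_id in stores:
    shared.handle_response({}, {"idPdv": store_id})
shared_result = [shared.cookies_for({})["idPdv"] for _ in stores]

# Separate sessions: each POST/GET pair carries its own 'cookiejar' key.
isolated = CookieStore()
for i, store_id in enumerate(stores):
    isolated.handle_response({"cookiejar": i}, {"idPdv": store_id})
isolated_result = [
    isolated.cookies_for({"cookiejar": i})["idPdv"] for i in range(len(stores))
]

print(shared_result)    # ['5678', '5678']
print(isolated_result)  # ['1234', '5678']
```

With the shared jar the last store selected wins for every category request, which matches the symptom in the question. With one key per store, each GET sees only the cookie set by its own POST.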
