Organizing items with a Scrapy Pipeline

Date: 2018-11-06 12:18:12

Tags: python scrapy

I have a single spider and it works well. I can get a simple CSV export from the command line, with the output organized like this:

def parse(self, response):
    sel = Selector(response)
    price_list = sel.css("li.lvprice.prc span::text").extract()
    price_list = [itemprice.replace("\t","").replace("\n","").strip() for itemprice in price_list]
    desc_list = sel.css("h3.lvtitle a::text").extract()
    desc_list = [itemdesc.replace("\t","").replace("\n","").strip() for itemdesc in desc_list]
    for price, desc in zip(price_list, desc_list):
        yield {
            'ean': sel.css("span.kwcat b::text").extract_first(), 'price':price, 'desc':desc
        }

Sample output:

3596206198001,"4,43",Weleda - Savon Végétal au Calendula - 100 g

3596206198001,"4,08",WELEDA Savon Végétal au Calendula Bain et douche - 100 g

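But now I am trying to run several spiders at once from a single script (that part works) and to process the results through items and pipelines. Here is the code of the parse function: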

def parse(self, response):
    item = ScrapybotItem()
    item['ean'] = response.css("span.kwcat b::text").extract()
    price_list = response.css("li.lvprice.prc span::text").extract()
    item['price'] = [itemprice.replace("\t","").replace("\n","").strip() for itemprice in price_list]
    desc_list = response.css("h3.lvtitle a::text").extract()
    item['desc'] = [itemdesc.replace("\t","").replace("\n","").strip() for itemdesc in desc_list]

    return item

And then... the result from the CSV item exporter:

['EAN'],"['19,95', '', '1,00', '', 'à', '49,99', '', '1,00', '1,13', '19,95', '', '1,13', '', 'à', '205,56', '', '0,01', '', '1,13', '', 'à', '1\xa0370,47', '', '1,20', '1,00', '12,50', '1,13', '10,85', '34,90', '19,95', '19,95', '195,00', '17,13', '22,09', '33,09', '37,49', '485,00', '6,00', '19,95', '19,95', '26,95', '2,85', '29,95', '1,85', '39,00', '489,00', '1\xa0099,00', '1\xa0755,00', '1\xa0645,00', '', '1,14', '', 'à', '11,42', '', '755,00', '11,00', '15,49', '8,57', '14,99', '599,00', '12,90', '136,90', '4,45', '10,00', '3,29', '18,90', '18,90', '1,49', '2,97', '2,42', '12,99', '6,83', '2,97', '12,26', '49,50', 'Prix de mise en vente\xa0:', 'Prix de vente initial', '55,00 EUR']","['67811 Boondock Saints Movie ean Patrick Flanery FRAMED CANVAS PRINT Toile', 'EAN CODE', '15 EAN Code barres Barcodes chiffres codes barres pour Amazon', '50 UPC & EAN Code-barres codes chiffres bar code codes barres pour Amazon', '65518 Dr. No Movie ean Connery rsula Andress FRAMED CANVAS PRINT Toile', 'Code-Barres EAN 13 Upc Codes-barres Bar code chiffres pour Amazon et eBay 20 - 1...', 'EAN/UPC numéro/Bar Code QR pour Ebay et Amazon - 1p enchère (OS-016) C', 'UPC EAN chiffres des codes barres Bar code Amazon UK UE Garantie à vie' [...]

So how can I organize the CSV output through a pipeline? I would like each row to hold all the fields of one item, for example:

EAN, 19,95, 67811 Boondock Saints Movie ean Patrick Flanery FRAMED CANVAS PRINT Toile...

I searched but could not find a simple example of how to reorganize item output! Sorry if this is a dumb question :) I am learning Python at the same time as Scrapy!

1 answer:

Answer 0 (score: 0)

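I think you can get what you want with a simple loop that yields one item per price/description pair, instead of returning a single item that holds the whole lists: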

def parse(self, response):
    ean = response.css("span.kwcat b::text").extract_first()
    price_list = response.css("li.lvprice.prc span::text").extract()
    desc_list = response.css("h3.lvtitle a::text").extract()

    for price, desc in zip(price_list, desc_list):
        item = ScrapybotItem()

        item['ean'] = ean
        item['price'] = price.replace("\t","").replace("\n","").strip()
        item['desc'] = desc.replace("\t","").replace("\n","").strip()

        yield item
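
Since the question asks about pipelines: a pipeline receives items one at a time and returns one item (or drops it), so it cannot split a single list-holding item into several CSV rows. The split has to happen in parse, as above. Once each item holds one price and one description, the whitespace cleanup could move into a pipeline. The following is only a minimal sketch under assumptions: the class name CleanFieldsPipeline, the text_fields tuple, and the module path scrapybot.pipelines are hypothetical and not from the original post. It uses plain dicts in the demo so it runs without Scrapy installed; a Scrapy Item supports the same item[...] and .get access.

```python
class CleanFieldsPipeline:
    """Hypothetical pipeline: normalizes whitespace in the text fields of an item."""

    # Field names taken from the question's item; adjust to your own Item class.
    text_fields = ("price", "desc")

    def process_item(self, item, spider):
        for field in self.text_fields:
            value = item.get(field)
            if isinstance(value, str):
                # str.split() with no arguments splits on any run of whitespace
                # (tabs, newlines, non-breaking spaces), so joining with a single
                # space strips the ends and collapses the inner runs.
                item[field] = " ".join(value.split())
        # Returning the item passes it on to the next pipeline / the exporter.
        return item


# Enabling it would go in settings.py (the module path is an assumption):
# ITEM_PIPELINES = {"scrapybot.pipelines.CleanFieldsPipeline": 300}
# FEED_EXPORT_FIELDS = ["ean", "price", "desc"]  # fixes the CSV column order

if __name__ == "__main__":
    # Stand-alone demo with a plain dict in place of a ScrapybotItem.
    pipeline = CleanFieldsPipeline()
    raw = {"ean": "3596206198001", "price": "\n\t4,43 ", "desc": "Weleda - Savon\tVégétal "}
    cleaned = pipeline.process_item(raw, spider=None)
    print(cleaned)
```

With this in place, parse would only need to assign the raw price and desc strings to each item, and the replace/strip calls could be removed from the loop. Note that the sample output in the question contains empty strings and stray tokens such as 'à' in the price list, so the price and description lists may not line up one-to-one on every page; that is worth checking before relying on zip.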