How do I yield 2 items from 2 different methods in Scrapy?

Asked: 2018-06-10 03:57:30

Tags: python scrapy

I am new to Python and Scrapy. I yield two items from two different methods: the first for data from the first page, the second for data from the second page. I can't save the data in the right order; each second item is saved after the first item, but I need to save them together at once. Thanks in advance.

from datetime import datetime

from scrapy import signals
from scrapy.exporters import CsvItemExporter


class FirstPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        current_date = datetime.now().strftime("%Y%m%d")
        filename = 'License_Vehicle Inspection Stations_NY_CurationReady_' + current_date +'_v1.csv'
        self.file = open(filename, 'w+b')
        self.exporter = CsvItemExporter(self.file, delimiter = '|')

        self.exporter.csv_writer.writerow(["Premises Name","Principal's Name","","Trade Name","Zone","County","Address/zone","License Class","License Type Code","License Type","Expiration Date","License Status","Serial Number","Credit Group","Filing Date","Effective Date"," "," "," "," "])
        self.exporter.fields_to_export = ["company_name","mixed_name","mixed_subtype","dba_name","zone","county","location_address_string","licence_class","licence_type_code","permity_subtype","permit_lic_exp_date","permit_licence_status","permit_lic_no","credit_group","permit_lic_eff_date","permit_applied_date","permit_type","url","source_name","ingestion_timestamp"]
        self.exporter.start_exporting()


    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        print("got the item in pipeline")
        self.exporter.export_item(item)
        return item

class SecondPipeline(object):

    def process_item(self, item, spider):
        if isinstance(item, SecondItem):
            pass

        return item

1 Answer:

Answer 0 (score: 0)

Raj725, your question is a common one for beginners in Scrapy, and probably in Python too. I had the same question before I read the Scrapy documentation. You can't understand Scrapy without reading the docs. Start with the tutorial, then read the Items section and the Item Pipeline section.

Here is an example of how to yield several types of data.

1. Prepare the items you need in the items.py file:

from scrapy import Item, Field

class FirstItem(Item):
    field_one = Field()
    field_two = Field()

class SecondItem(Item):
    another_field_one = Field()
    another_field_two = Field()
    another_field_three = Field()

2. Now you can use the items in your Scrapy code. Yield an item anywhere you want to save data:

from ..items import FirstItem, SecondItem

    def parse(self, response):
        item = FirstItem(
            field_one=response.css("div.one span::text").extract_first(),
            field_two=response.css("div.two span::text").extract_first()
        )
        yield item

        item = SecondItem(
            another_field_one='some variable one',
            another_field_two='some variable two',
            another_field_three='some variable three'
        )
        yield item
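To keep related data from two pages together (the ordering problem in the question), a common Scrapy pattern is to build the item in the first callback, pass it to the second callback via `Request.meta` (or `cb_kwargs`), and yield it only once, fully filled. Below is a stdlib-only simulation of that hand-off; `parse_first`, `parse_second`, and the dict keys are hypothetical stand-ins, not Scrapy API:

```python
# Simulates carrying a half-built item from one callback to the next,
# the way Scrapy's Request.meta / cb_kwargs would, so only one complete
# item is emitted per logical record instead of two partial ones.

def parse_first(page_one_data):
    """First callback: fill what we know, hand the rest off."""
    item = {"field_one": page_one_data["one"], "field_two": page_one_data["two"]}
    # In real Scrapy this would be something like:
    #   yield scrapy.Request(next_url, callback=self.parse_second,
    #                        meta={"item": item})
    return {"item": item, "next_page": page_one_data["next"]}

def parse_second(meta):
    """Second callback: complete the carried item and emit it once."""
    item = meta["item"]
    item["another_field_one"] = meta["next_page"]["extra"]
    yield item  # single, fully populated item

request = parse_first({"one": "A", "two": "B", "next": {"extra": "C"}})
items = list(parse_second({"item": request["item"],
                           "next_page": request["next_page"]}))
print(items)  # one merged record combining both "pages"
```

With this shape, the pipeline receives one item per record and the ordering problem disappears, because nothing is yielded until the record is complete.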

3. An example of the pipelines.py file. Don't forget to check the item's type before saving it. At the end of process_item you must return the item:

from .items import FirstItem, SecondItem

class FirstPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, FirstItem):            
            # Save your data here. It's possible to save it to CSV file. Also you can put data to any database you need.
            pass

        return item

class SecondPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, SecondItem):            
            # Save your data here. It's possible to save it to CSV file. Also you can put data to any database you need.
            pass

        return item
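The "Save your data here" placeholders above can be filled in with the stdlib csv module. The sketch below uses plain dict subclasses to stand in for the Scrapy items so it runs standalone; in a real project the isinstance checks would use FirstItem/SecondItem imported from items.py, and the file would be opened in spider_opened:

```python
import csv
import io

# Stand-ins for the Scrapy Item classes so the sketch is self-contained.
class FirstItem(dict):
    pass

class SecondItem(dict):
    pass

class FirstPipeline:
    """Writes FirstItem rows to a CSV file object; other items pass through."""
    def __init__(self, fileobj):
        self.writer = csv.DictWriter(fileobj, fieldnames=["field_one", "field_two"])
        self.writer.writeheader()

    def process_item(self, item, spider=None):
        if isinstance(item, FirstItem):
            self.writer.writerow(item)
        return item  # always return, so later pipelines still see the item

buf = io.StringIO()
pipeline = FirstPipeline(buf)
pipeline.process_item(FirstItem(field_one="a", field_two="b"))
pipeline.process_item(SecondItem(another_field_one="x"))  # ignored here
print(buf.getvalue())
```

A SecondPipeline would look the same but check for SecondItem and write to its own file, which is how different item types end up in different CSV files.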

4. Don't forget to declare your pipelines in settings.py. Without this, Scrapy will not use your pipelines. The numbers are priorities: every item passes through each enabled pipeline in ascending order, so give each pipeline a distinct value.

ITEM_PIPELINES = {
     'scrapy_project.pipelines.FirstPipeline': 300,
     'scrapy_project.pipelines.SecondPipeline': 400,
}
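The values in ITEM_PIPELINES are priorities in the 0-1000 range, and every yielded item is threaded through each enabled pipeline in ascending order. A stdlib sketch of that chaining (TagPipeline and the names are hypothetical, not Scrapy classes):

```python
# Mimics how Scrapy sorts pipelines by their ITEM_PIPELINES value
# and passes each item through process_item in that order.

class TagPipeline:
    def __init__(self, name):
        self.name = name

    def process_item(self, item, spider=None):
        item.setdefault("seen_by", []).append(self.name)
        return item  # returning the item lets the next pipeline run

pipelines = {TagPipeline("second"): 400, TagPipeline("first"): 300}
ordered = sorted(pipelines, key=pipelines.get)  # lower number runs first

item = {}
for p in ordered:
    item = p.process_item(item)
print(item["seen_by"])  # ['first', 'second']
```

This is also why process_item must return the item: a pipeline that returns None (or raises DropItem) stops the chain for that item.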

I don't have ready-to-use code; these samples are meant to show how it all works. You can drop them into your code and make the changes you need. I didn't show how to save items to a CSV file: you can import the csv module, or use CsvItemExporter from scrapy.exporters in your pipelines.py. I provided links to examples of how to save different items to different CSV files.