I'm new to Python and Scrapy. I yield two items from two different methods: the first for first-page data, the second for second-page data. I can't save the data in the same order; the second item is saved after the first one, but I need to save them at once. Thanks in advance.
from datetime import datetime

from scrapy import signals
from scrapy.exporters import CsvItemExporter


class FirstPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        current_date = datetime.now().strftime("%Y%m%d")
        filename = 'License_Vehicle Inspection Stations_NY_CurationReady_' + current_date + '_v1.csv'
        self.file = open(filename, 'w+b')
        self.exporter = CsvItemExporter(self.file, delimiter='|')
        self.exporter.csv_writer.writerow(["Premises Name","Principal's Name","","Trade Name","Zone","County","Address/zone","License Class","License Type Code","License Type","Expiration Date","License Status","Serial Number","Credit Group","Filing Date","Effective Date"," "," "," "," "])
        self.exporter.fields_to_export = ["company_name","mixed_name","mixed_subtype","dba_name","zone","county","location_address_string","licence_class","licence_type_code","permity_subtype","permit_lic_exp_date","permit_licence_status","permit_lic_no","credit_group","permit_lic_eff_date","permit_applied_date","permit_type","url","source_name","ingestion_timestamp"]
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        print("got the item in pipeline")
        self.exporter.export_item(item)
        return item
class SecondPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, SecondItem):
            pass
        return item
Answer 0 (score: 0)
Raj725, your question is typical for a Scrapy beginner, and probably for a Python beginner too. I had the same question before I read the Scrapy documentation; you can't understand Scrapy without reading the docs. Start with the tutorial, then read the Items section and the Item Pipeline section.

Here is an example of how to yield several types of data.

1. Define the items you need in the items.py file:
from scrapy import Item, Field


class FirstItem(Item):
    field_one = Field()
    field_two = Field()


class SecondItem(Item):
    another_field_one = Field()
    another_field_two = Field()
    another_field_three = Field()
2. Now you can use the items in your spider code. Yield an item anywhere you want to save data:
from ..items import FirstItem, SecondItem

item = FirstItem(
    field_one=response.css("div.one span::text").extract_first(),
    field_two=response.css("div.two span::text").extract_first()
)
yield item

item = SecondItem(
    another_field_one='some variable one',
    another_field_two='some variable two',
    another_field_three='some variable three'
)
yield item
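Outside a running crawl, the two yield statements above behave like any Python generator: each yielded item reaches the pipelines in the order it was yielded. A minimal stand-in to illustrate (plain dicts replace the Scrapy items and there is no real response object, so this runs without Scrapy):

```python
# Minimal stand-in: a Scrapy callback is an ordinary generator, and every
# yielded item is handed to the item pipelines in yield order. Plain dicts
# replace FirstItem/SecondItem here purely for illustration.
def parse(response=None):
    # FirstItem stand-in (would normally be filled from response.css(...))
    yield {'field_one': 'value one', 'field_two': 'value two'}
    # SecondItem stand-in
    yield {
        'another_field_one': 'some variable one',
        'another_field_two': 'some variable two',
        'another_field_three': 'some variable three',
    }

items = list(parse())
print(len(items))  # 2 items, in the order they were yielded
```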
3. An example of the pipelines.py file. Don't forget to check the item's type before saving, and you must return the item at the end of process_item:
from .items import FirstItem, SecondItem


class FirstPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, FirstItem):
            # Save your data here. You can save it to a CSV file,
            # or put it into any database you need.
            pass
        return item


class SecondPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, SecondItem):
            # Save your data here. You can save it to a CSV file,
            # or put it into any database you need.
            pass
        return item
4. Don't forget to declare your pipelines in settings.py. Without this, Scrapy won't use the pipelines:
ITEM_PIPELINES = {
    'scrapy_project.pipelines.FirstPipeline': 300,
    'scrapy_project.pipelines.SecondPipeline': 400,
}
This isn't ready-to-use code; I provided the examples to show how it works. You can drop them into your code and make the changes you need. I didn't show how to save items to a CSV file: you can import the csv module, or use CsvItemExporter from scrapy.exporters in pipelines.py. I provided a link to an example of saving different items to different CSV files.
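As a sketch of that last point, here is one way to route each item type to its own CSV file using only the standard-library csv module. The CsvExportPipeline class, the file names, and the FirstItem/SecondItem stand-ins (plain dict subclasses instead of real Scrapy items) are illustrative assumptions, not part of the original answer:

```python
# Hedged sketch: one CSV file per item type, written with the stdlib csv
# module. In a real project FirstItem/SecondItem come from items.py; plain
# dict subclasses keep this sketch runnable without Scrapy installed.
import csv


class FirstItem(dict):
    pass


class SecondItem(dict):
    pass


class CsvExportPipeline:
    """Hypothetical pipeline that routes each item type to its own file."""

    def open_spider(self, spider):
        # One file and one writer per item type; fieldnames match the items.
        self.first_file = open('first_items.csv', 'w', newline='')
        self.first_writer = csv.DictWriter(
            self.first_file, fieldnames=['field_one', 'field_two'])
        self.first_writer.writeheader()

        self.second_file = open('second_items.csv', 'w', newline='')
        self.second_writer = csv.DictWriter(
            self.second_file,
            fieldnames=['another_field_one', 'another_field_two',
                        'another_field_three'])
        self.second_writer.writeheader()

    def process_item(self, item, spider):
        # The isinstance check decides which file the row goes to.
        if isinstance(item, FirstItem):
            self.first_writer.writerow(item)
        elif isinstance(item, SecondItem):
            self.second_writer.writerow(item)
        return item  # always return the item for any later pipeline

    def close_spider(self, spider):
        self.first_file.close()
        self.second_file.close()
```

In a real project Scrapy calls open_spider, process_item, and close_spider for you; you would only call them yourself when testing the pipeline in isolation.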