I have a Scrapy spider that works well:
```python
# -*- coding: utf-8 -*-
import scrapy


class AllCategoriesSpider(scrapy.Spider):
    name = 'vieles'
    allowed_domains = ['examplewiki.de']
    start_urls = ['http://www.exampleregelwiki.de/index.php/categoryA.html',
                  'http://www.exampleregelwiki.de/index.php/categoryB.html',
                  'http://www.exampleregelwiki.de/index.php/categoryC.html',]

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to den subpages
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.last.block').extract(),
        }
```
Running it with `scrapy runspider spider.py -o dat.json` saves all the scraped items to dat.json.
I would like one output file per start URL, i.e. categoryA.json, categoryB.json, and so on.

A similar question went unanswered, I could not reproduce this answer, and I was not able to learn from the suggestions there.

How can I achieve the goal of having several output files, one per start URL? I would like to run only one command / shell script / file to achieve this.
Answer (score: 1)
You don't use real URLs in your code, so I used my own pages to test. I had to change the CSS selectors, and I use different fields. I save the data as csv because it is easier to append data to; with JSON you would have to read all items from the file, add the new item, and save all of them back to the same file.
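Just to illustrate that read/append/rewrite cycle, here is a minimal sketch (`append_item_to_json` is a hypothetical helper, not part of the code below):

```python
import json
import os


def append_item_to_json(filename, item):
    # read the existing list of items (if any), add the new item,
    # then write the whole list back to the same file
    items = []
    if os.path.exists(filename):
        with open(filename) as f:
            items = json.load(f)
    items.append(item)
    with open(filename, 'w') as f:
        json.dump(items, f, ensure_ascii=False)
```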
I create an extra field `Category` so that I can use it later as the filename in the pipeline.
**items.py**
```python
import scrapy


class CategoryItem(scrapy.Item):
    Title = scrapy.Field()
    Date = scrapy.Field()
    # extra field use later as filename
    Category = scrapy.Field()
```
In the spider I get the category from the URL and send it to `parse_details` using `meta` in the `Request`. In `parse_details` I add the `category` to the `Item`.
**spiders/example.py**
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html',
                  'http://blog.furas.pl/category/html.html',
                  'http://blog.furas.pl/category/linux.html']

    def parse(self, response):
        # get category from url
        category = response.url.split('/')[-1][:-5]

        urls = response.css('article a::attr(href)').extract()  # links to den subpages
        for url in urls:
            # skip some urls
            if ('/tag/' not in url) and ('/category/' not in url):
                url = response.urljoin(url)
                # add category (as meta) to send it to callback function
                yield scrapy.Request(url=url, callback=self.parse_details, meta={'category': category})

    def parse_details(self, response):
        # get category
        category = response.meta['category']

        # get only first title (or empty string '') and strip it
        title = response.css('h1.entry-title a::text').extract_first('')
        title = title.strip()

        # get only first date (or empty string '') and strip it
        date = response.css('.published::text').extract_first('')
        date = date.strip()

        yield {
            'Title': title,
            'Date': date,
            'Category': category,
        }
```
In the pipeline I get the `category` and use it to open the file for appending, then save the item to it.
**pipelines.py**
```python
import csv


class CategoryPipeline(object):

    def process_item(self, item, spider):
        # get category and use it as filename
        filename = item['Category'] + '.csv'

        # open file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)

            # write only selected elements
            row = [item['Title'], item['Date']]
            writer.writerow(row)

            # write all data in row
            # warning: item is a dictionary so item.values() doesn't have to return values always in the same order
            #writer.writerow(item.values())

        return item
```
In the settings I had to uncomment the pipeline to activate it.
**settings.py**
```python
ITEM_PIPELINES = {
    'category.pipelines.CategoryPipeline': 300,
}
```
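With this project-based setup (items.py, pipelines.py, settings.py in a regular Scrapy project), the crawl is started with a single command from the project directory, `scrapy crawl example`; the pipeline itself creates python.csv, html.csv and linux.csv, so no `-o` option is needed.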
Full code on GitHub: output
BTW: I think you could write to the files directly in `parse_details`.
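A rough, untested sketch of that variant (the spider name `ExampleCsvSpider` is only illustrative; URLs and selectors are the same as above): the callback opens `<category>.csv` itself and appends the row, so no item pipeline or project settings are needed.

```python
import csv
import scrapy


class ExampleCsvSpider(scrapy.Spider):
    # same crawl as ExampleSpider above, but parse_details writes the
    # CSV rows directly instead of yielding items to a pipeline
    name = 'example_csv'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html',
                  'http://blog.furas.pl/category/html.html',
                  'http://blog.furas.pl/category/linux.html']

    def parse(self, response):
        category = response.url.split('/')[-1][:-5]
        for url in response.css('article a::attr(href)').extract():
            if ('/tag/' not in url) and ('/category/' not in url):
                yield scrapy.Request(response.urljoin(url),
                                     callback=self.parse_details,
                                     meta={'category': category})

    def parse_details(self, response):
        title = response.css('h1.entry-title a::text').extract_first('').strip()
        date = response.css('.published::text').extract_first('').strip()
        # append the row straight to <category>.csv
        with open(response.meta['category'] + '.csv', 'a') as f:
            csv.writer(f).writerow([title, date])
```

This keeps everything in a single file that can be run with `scrapy runspider`, at the cost of mixing extraction and file handling in the spider.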