I have some JSON files in a directory. Each of these files contains some information I need. The first property I need is the list of links to use as "start_urls" in scrapy.
Each file belongs to a different process, so its output must be kept separate. That means I can't put the links from all the JSON files into start_urls and run them together; I have to run the spider once per file.
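For reference, each file contains a "links" list, roughly like this (the URLs below are only placeholders, not real data):
{
    "links": [
        "https://lastsecond.ir/hotels/hotel-1",
        "https://lastsecond.ir/hotels/hotel-2"
    ]
}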
How can I do this? Here is my current code:
import scrapy
from os import listdir
from os.path import isfile, join
import json


class HotelInfoSpider(scrapy.Spider):
    name = 'hotel_info'
    allowed_domains = ['lastsecond.ir']

    # get start urls from links list of every file
    files = [f for f in listdir('lastsecond/hotels/') if
             isfile(join('lastsecond/hotels/', f))]

    with open('lastsecond/hotels/' + files[0], 'r') as hotel_info:
        hotel = json.load(hotel_info)

    start_urls = hotel["links"]

    def parse(self, response):
        print("all good")
Answer 0 (score: 0)
You can use a dict to manage all the files:
d_hotel_info = {}

# `files` is the list of file names built in the question's code
for file in files:
    with open('lastsecond/hotels/' + file, 'r') as hotel_info:
        hotel = json.load(hotel_info)

    d_hotel_info[file] = hotel
Then, when you want the output, look it up in d_hotel_info.
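For example, a minimal sketch of keeping each file's output separate, assuming d_hotel_info was filled as above (the result_<name> output names are only an illustration):

import json

# hypothetical follow-up: dump each file's data into its own output file,
# so every process keeps a separate result
for name, hotel in d_hotel_info.items():
    with open('result_' + name, 'w') as out:
        json.dump(hotel, out)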
Answer 1 (score: 0)
I see two approaches.
First

Run the spider several times with different arguments. This needs less code.

You can create a batch file with lines that manually pass the different arguments.

The first argument is the output file name, -o result1.csv, which scrapy will create automatically.

The second argument is the input file name with the links, -a filename=process1.csv.
scrapy crawl hotel_info -o result1.csv -a filename=process1.csv
scrapy crawl hotel_info -o result2.csv -a filename=process2.csv
scrapy crawl hotel_info -o result3.csv -a filename=process3.csv
...
You only have to read filename in __init__:
import scrapy
from os.path import isfile, join
import json


class HotelInfoSpider(scrapy.Spider):
    name = 'hotel_info'
    allowed_domains = ['lastsecond.ir']

    def __init__(self, filename, *args, **kwargs):  # <-- filename
        super().__init__(*args, **kwargs)

        filename = join('lastsecond/hotels/', filename)

        if isfile(filename):
            with open(filename) as f:
                data = json.load(f)
                self.start_urls = data['links']

    def parse(self, response):
        print('url:', response.url)

        yield {'url': response.url, 'other': ...}
You can also run the spider many times from a Python script, using CrawlerProcess.
from scrapy.crawler import CrawlerProcess
from os import listdir
from os.path import isfile, join

# adjust this import to the actual module path of HotelInfoSpider in your project
from your_project_name.spiders.hotel_info import HotelInfoSpider

files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)

    c = CrawlerProcess({'FEED_FORMAT': 'csv', 'FEED_URI': output_file})
    c.crawl(HotelInfoSpider, filename=input_file)  # i.e. input_file='process1.csv'
    c.start()
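One caveat: Twisted's reactor cannot be restarted inside a single process, so calling c.start() repeatedly in this loop may fail after the first crawl. A minimal sketch of a workaround, assuming HotelInfoSpider is importable as above, is to run each crawl in its own child process:

from multiprocessing import Process
from os import listdir
from os.path import isfile, join

from scrapy.crawler import CrawlerProcess
# assumes the same import path as in the previous snippet
from your_project_name.spiders.hotel_info import HotelInfoSpider


def run_crawl(input_file, output_file):
    # every child process gets its own fresh Twisted reactor
    c = CrawlerProcess({'FEED_FORMAT': 'csv', 'FEED_URI': output_file})
    c.crawl(HotelInfoSpider, filename=input_file)
    c.start()


if __name__ == '__main__':
    files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

    for i, input_file in enumerate(files):
        p = Process(target=run_crawl, args=(input_file, 'result{}.csv'.format(i)))
        p.start()
        p.join()  # wait, so the crawls run one at a time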
Or using scrapy.cmdline.execute():
import scrapy.cmdline
from os import listdir
from os.path import isfile, join

files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)

    scrapy.cmdline.execute(["scrapy", "crawl", "hotel_info", "-o", output_file, "-a", "filename=" + input_file])
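Note that scrapy.cmdline.execute() ends by calling sys.exit(), so this loop normally stops after the first file. A minimal sketch of an alternative, assuming the scrapy command is available on PATH, is to launch each crawl with subprocess instead:

import subprocess
from os import listdir
from os.path import isfile, join

files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)

    # each crawl runs in its own process, so the loop continues after it finishes
    subprocess.run(["scrapy", "crawl", "hotel_info",
                    "-o", output_file,
                    "-a", "filename=" + input_file])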
Second
This needs more code, because you have to create a pipeline exporter which uses different files to save the results.

You have to use start_requests() and Request(..., meta=...) to create the start requests with an extra value in meta, which you can later use to save to different files.

In parse() you have to read this extra value from meta and add it to the item.

In the pipeline exporter you have to read extra from the item and open a different file.
import scrapy
from os import listdir
from os.path import isfile, join
import json


class HotelInfoSpider(scrapy.Spider):
    name = 'hotel_info'
    allowed_domains = ['lastsecond.ir']

    def start_requests(self):
        # get start urls from links list of every file
        files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

        for i, filename in enumerate(files):
            with open('lastsecond/hotels/' + filename) as f:
                data = json.load(f)
                links = data["links"]

            for url in links:
                yield scrapy.Request(url, meta={'extra': i})

    def parse(self, response):
        print('url:', response.url)

        extra = response.meta['extra']
        print('extra:', extra)

        yield {'url': response.url, 'extra': extra, 'other': ...}
pipelines.py
import csv


class MyExportPipeline(object):
    def process_item(self, item, spider):
        # get extra and use it in filename
        filename = 'result{}.csv'.format(item['extra'])

        # open file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)

            # write only selected elements - skip `extra`
            row = [item['url'], item['other']]
            writer.writerow(row)

        return item
settings.py
ITEM_PIPELINES = {
    'your_project_name.pipelines.MyExportPipeline': 300,
}