Scraping multiple URLs from multiple files in Scrapy

Posted: 2018-01-05 22:18:30

Tags: python json file scrapy

I have several JSON files in a directory. Each of these files contains some information I need. The first attribute I need is the list of links to use as "start_urls" in Scrapy.

Each file belongs to a separate process, so its output must be kept separate. That means I cannot put the links from all of the JSON files into start_urls and run them together; I have to run the spider once per file.
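
For context, each JSON file is assumed to contain (among other things) a "links" attribute with the URLs to crawl. A hypothetical example file might look like this (only the "links" key is taken from the question; the other field and the URL paths are made-up placeholders):

{
    "hotel": "some-hotel",
    "links": [
        "https://lastsecond.ir/hotels/example-1",
        "https://lastsecond.ir/hotels/example-2"
    ]
}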

How can I do this? Here is my current code:

import scrapy
from os import listdir
from os.path import isfile, join
import json
class HotelInfoSpider(scrapy.Spider):
    name = 'hotel_info'
    allowed_domains = ['lastsecond.ir']
    # get start urls from links list of every file
    files = [f for f in listdir('lastsecond/hotels/')
             if isfile(join('lastsecond/hotels/', f))]
    with open('lastsecond/hotels/' + files[0], 'r') as hotel_info:
        hotel = json.load(hotel_info)
    start_urls = hotel["links"]

    def parse(self, response):
        print("all good")

2 Answers:

Answer 0 (score: 0)

You can manage all the files with a dict:

# map each JSON filename to its parsed content
d_hotel_info = {}
for file in files:
    with open('lastsecond/hotels/' + file, 'r') as hotel_info:
        hotel = json.load(hotel_info)
    d_hotel_info[file] = hotel

Then, when you want the output, refer to the keys of d_hotel_info.
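
A minimal sketch of how those keys could then be used so that each file's links, and its output, stay separate (the output naming scheme is only a placeholder):

# iterate the dict by key so each file's links and output stay separate
for file, hotel in d_hotel_info.items():
    links = hotel["links"]              # the start URLs for this file
    output_name = file + '.result.csv'  # hypothetical per-file output name
    # ... run one crawl over `links` and write its items to `output_name` ...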

Answer 1 (score: 0)

I see two approaches.

First

Run the spider several times with different arguments. This needs less code.

You can create a batch file with several lines, each adding different arguments by hand.

The first argument is the output filename, -o result1.csv, which Scrapy creates automatically. The second argument is the input filename containing the links, -a filename=process1.csv.

scrapy crawl hotel_info -o result1.csv -a filename=process1.csv
scrapy crawl hotel_info -o result2.csv -a filename=process2.csv
scrapy crawl hotel_info -o result3.csv -a filename=process3.csv
...

You only have to read filename in __init__():

import scrapy
from os.path import isfile, join
import json

class HotelInfoSpider(scrapy.Spider):

    name = 'hotel_info'

    allowed_domains = ['lastsecond.ir']

    def __init__(self, filename, *args, **kwargs): # <-- filename
        super().__init__(*args, **kwargs)

        filename = join('lastsecond/hotels/', filename) 

        if isfile(filename):
            with open(filename) as f:
                data = json.load(f)
                self.start_urls = data['links']

    def parse(self, response):
        print('url:', response.url)

        yield {'url': response.url, 'other': ...}

You can also run the spider several times from a Python script using CrawlerProcess.

from scrapy.crawler import CrawlerProcess
from os import listdir
from os.path import isfile, join

# import the spider class from the module that defines it
# (the module path is an assumption - adjust it to your project layout)
from your_project_name.spiders.hotel_info import HotelInfoSpider

files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)

    # note: Twisted's reactor cannot be restarted, so CrawlerProcess.start()
    # can only be called once per Python process - to crawl every file this
    # way, run the script once per file or launch separate processes
    c = CrawlerProcess({'FEED_FORMAT': 'csv', 'FEED_URI': output_file})
    c.crawl(HotelInfoSpider, filename=input_file)  # e.g. filename='process1.csv'
    c.start()

Or use scrapy.cmdline.execute():

import scrapy.cmdline
from os import listdir
from os.path import isfile, join

files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)
    # note: execute() does not return (it exits when the crawl finishes),
    # so in practice each call needs its own process
    scrapy.cmdline.execute(["scrapy", "crawl", "hotel_info", "-o", output_file, "-a", "filename=" + input_file])
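​
If everything has to be driven from one Python script, a minimal sketch using the standard library's subprocess module can launch each crawl as its own scrapy process (same command line as the batch file above; the directory and file naming follow the examples in this answer):

import subprocess
from os import listdir
from os.path import isfile, join

files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

for i, input_file in enumerate(files):
    output_file = 'result{}.csv'.format(i)
    # each crawl runs in its own process, so the reactor restriction does not apply
    subprocess.run(["scrapy", "crawl", "hotel_info",
                    "-o", output_file,
                    "-a", "filename=" + input_file],
                   check=True)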

Second

It needs more code, because you have to create a pipeline exporter that saves the results to different files.

You have to use start_requests() with Request(..., meta=...) instead of start_urls, creating requests that carry extra meta data which you can later use to save to different files.

In parse() you have to read this extra value from meta and add it to the item.

In the pipeline exporter you have to read extra from the item and open the corresponding file.

import scrapy
from os import listdir
from os.path import isfile, join
import json

class HotelInfoSpider(scrapy.Spider):

    name = 'hotel_info'

    allowed_domains = ['lastsecond.ir']

    def start_requests(self):

        # get start urls from links list of every file
        files = [f for f in listdir('lastsecond/hotels/') if isfile(join('lastsecond/hotels/', f))]

        for i, filename in enumerate(files):
            with open('lastsecond/hotels/' + filename) as f:
                data = json.load(f)
                links = data["links"]
                for url in links:
                    yield scrapy.Request(url, meta={'extra': i})

    def parse(self, response):
        print('url:', response.url)
        extra = response.meta['extra']
        print('extra:', extra)

        yield {'url': response.url, 'extra': extra, 'other': ...}

pipelines.py

import csv


class MyExportPipeline(object):

    def process_item(self, item, spider):

        # get extra and use it in filename
        filename = 'result{}.csv'.format(item['extra'])

        # open file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)

            # write only selected elements - skip `extra`
            row = [item['url'], item['other']]
            writer.writerow(row)

        return item

settings.py

ITEM_PIPELINES = {
   'your_project_name.pipelines.MyExportPipeline': 300,
}
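
With this pipeline enabled, the spider can be run just once (e.g. scrapy crawl hotel_info, without -o), and every item is appended to resultN.csv, where N is the index of the JSON file its start URL came from.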