Iterative scraping

Date: 2015-09-04 08:37:56

Tags: python-2.7 scrapy

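I have a few thousand pages on my localhost, named article1.html, article2.html, and so on. My goal is to scrape each of these pages individually and dump its content into a JSON file with the same name (the content of article1.html goes into article1.json, article2.html into article2.json, and so on). I tried to run through the pages with a simple loop and pass the counter integer to the parse function, but it does not seem to work. My code looks like this: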

class scraper0Spider(scrapy.Spider):
    name = "scraper0"
    allowed_domains = ["localhost"]
    start_urls = [
        "http://localhost/"
    ]

    def start_requests(self):
        for i in xrange(1, 1084):
            yield scrapy.Request("http://localhost/article%s.html" % i, self.parse)

    def parse(self, response):
        # grab relevant content and do other stuff, all the content will be in the variable fullstring

        with open("article%s.json" % i, 'w') as f:
            # f.write(stringjson)
            json.dump(fullstring, f)

Using a global variable did not help either. How should I pass i to the parse function?

1 answer:

Answer 0 (score: 1)

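Of course it does not work: the variable i is defined only within the scope of start_requests(), so the name does not exist inside parse().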
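To see the scoping problem in isolation, here is a minimal sketch without Scrapy. The class Demo and its string "responses" are made up for illustration. The point is only that a loop variable in one method is not a name in another. A global counter would likely not help either, since Scrapy runs callbacks later, after the loop has already moved on.

```python
class Demo(object):
    def start_requests(self):
        # `i` is a local variable of this generator only
        for i in range(1, 4):
            yield "http://localhost/article%s.html" % i

    def parse(self, response):
        # `i` was never defined in this scope, so this line raises NameError
        return "article%s.json" % i


demo = Demo()
urls = list(demo.start_requests())

try:
    demo.parse(urls[0])
    error = None
except NameError as exc:
    error = exc  # the lookup of `i` failed inside parse()
```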

If you want it to be available in parse(), pass it along with the request inside meta:

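import json

import scrapy


class scraper0Spider(scrapy.Spider):
    name = "scraper0"
    allowed_domains = ["localhost"]
    start_urls = [
        "http://localhost/"
    ]

    def start_requests(self):
        for i in xrange(1, 1084):
            # attach the counter to the request so it travels with the response
            yield scrapy.Request("http://localhost/article%s.html" % i, self.parse, meta={"index": i})

    def parse(self, response):
        # grab relevant content as in the question; the result goes into fullstring

        # response.meta holds the dict that was passed to the Request above
        with open("article%s.json" % response.meta["index"], 'w') as f:
            json.dump(fullstring, f)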
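
As an alternative to meta, the page number is already part of each URL, so parse() could also recover the output name from response.url. The following is a rough sketch using only the standard library. The helper name json_name_for is hypothetical, and it assumes every URL ends in articleN.html exactly as in the question.

```python
import posixpath
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def json_name_for(url):
    """Map .../article12.html to article12.json (hypothetical helper)."""
    # keep only the last path component, e.g. "article12.html"
    filename = posixpath.basename(urlparse(url).path)
    match = re.match(r"^(article\d+)\.html$", filename)
    if match is None:
        raise ValueError("unexpected URL: %r" % url)
    return match.group(1) + ".json"
```

Inside parse() this would be used as open(json_name_for(response.url), 'w'). Passing the index through meta is still the more direct route, since response.url can differ from the requested URL if the server redirects.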