Scraping multiple URLs with Scrapy

Asked: 2013-04-19 11:49:20

Tags: scrapy web-crawler

How can I scrape multiple URLs with Scrapy?

Am I forced to create multiple crawlers?

class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4),"http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = TravelItem()
        item['url'] = hxs.select('//a[@class="out"]/@href').extract()
        out = "\n".join(str(e) for e in item['url']);
        print out

Python says:

NameError: name 'i' is not defined

But it works fine when I use a single URL pattern:

   start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)]

3 Answers:

Answer 0 (score: 2)

You can initialize start_urls in the spider's __init__ method:

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class TravelItem(Item):
    url = Field()


class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]

    def __init__(self, name=None, **kwargs):
        # Build the URL list at instantiation time, where ordinary
        # list comprehensions work, instead of in the class body.
        self.start_urls = []
        self.start_urls.extend(["http://example.com/category/top/page-%d/" % i for i in xrange(4)])
        self.start_urls.extend(["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)])

        super(TravelSpider, self).__init__(name, **kwargs)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        item = TravelItem()
        item['url'] = hxs.select('//a[@class="out"]/@href').extract()
        print "\n".join(str(e) for e in item['url'])

Hope that helps.

Answer 1 (score: 2)

Your Python syntax is incorrect; try:

start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)] + \
    ["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]
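Concatenating the two comprehensions with + yields one flat list of 4 + 55 = 59 URLs; a quick check (written with Python 3's range in place of xrange):

```python
# Join two list comprehensions into a single start_urls list.
start_urls = (["http://example.com/category/top/page-%d/" % i for i in range(4)]
              + ["http://example.com/superurl/top/page-%d/" % i for i in range(55)])

print(len(start_urls))  # 59
print(start_urls[0])    # http://example.com/category/top/page-0/
```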

If you need code to generate the start requests, you can define a start_requests() method instead of using start_urls.
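start_requests() is just a generator method. A minimal sketch of the pattern, yielding plain URL strings as stand-ins for the scrapy.Request objects a real spider would yield (so it runs without Scrapy installed):

```python
def start_requests():
    # Generate requests lazily instead of building start_urls up front.
    # In a real spider this would be a method yielding scrapy.Request(url).
    for i in range(4):
        yield "http://example.com/category/top/page-%d/" % i
    for i in range(55):
        yield "http://example.com/superurl/top/page-%d/" % i

urls = list(start_requests())
print(len(urls))  # 59
```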

Answer 2 (score: 0)

Python has only four scopes (LEGB: local, enclosing, global, built-in). The local scope of a class body and the local scope of a list comprehension are not nested functions, so neither forms an enclosing scope for the other; they are two separate local scopes that cannot access each other.

Therefore, avoid referencing class-level names from inside a comprehension in the class body.
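A minimal illustration of that scoping rule under Python 3, where a comprehension in a class body gets its own scope (the class and the `prefix` name here are hypothetical):

```python
class Demo(object):
    prefix = "http://example.com/page-"
    try:
        # The comprehension body runs in its own scope, so it cannot
        # see the class-level name `prefix`; this raises NameError.
        # (The iterable range(4) is evaluated in the class scope and is fine.)
        urls = [prefix + str(i) for i in range(4)]
    except NameError:
        urls = None
```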