How can I scrape multiple URLs with scrapy?
Am I forced to make multiple spiders?
class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4), "http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = TravelItem()
        item['url'] = hxs.select('//a[@class="out"]/@href').extract()
        out = "\n".join(str(e) for e in item['url'])
        print out
Python says:
NameError: name 'i' is not defined
But it works fine when I use a single URL!
start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)]
Answer 0: (score: 2)
You can initialize start_urls in the __init__ method:
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class TravelItem(Item):
    url = Field()

class TravelSpider(BaseSpider):
    name = "speedy"
    allowed_domains = ["example.com"]

    def __init__(self, name=None, **kwargs):
        self.start_urls = []
        self.start_urls.extend(["http://example.com/category/top/page-%d/" % i for i in xrange(4)])
        self.start_urls.extend(["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)])
        super(TravelSpider, self).__init__(name, **kwargs)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = TravelItem()
        item['url'] = hxs.select('//a[@class="out"]/@href').extract()
        out = "\n".join(str(e) for e in item['url'])
        print out
Hope that helps.
Answer 1: (score: 2)
Your Python syntax is incorrect; try:
start_urls = ["http://example.com/category/top/page-%d/" % i for i in xrange(4)] + \
             ["http://example.com/superurl/top/page-%d/" % i for i in xrange(55)]
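The concatenation above can be checked in plain Python (range stands in for Python 2's xrange):

```python
# Quick check of the concatenated-comprehension syntax from this answer;
# range replaces Python 2's xrange.
start_urls = ["http://example.com/category/top/page-%d/" % i for i in range(4)] + \
             ["http://example.com/superurl/top/page-%d/" % i for i in range(55)]

print(len(start_urls))  # 4 + 55 = 59
```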
If you need to write code to generate the start requests, you can define a start_requests() method instead of using start_urls.
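A rough sketch of that suggestion follows. scrapy.Request is replaced here by a stand-in namedtuple so the logic runs without Scrapy installed; in a real spider you would yield scrapy.Request objects from a start_requests() method on the spider class.

```python
# Sketch of generating start requests programmatically instead of
# building a start_urls list. Request is a stand-in for scrapy.Request.
from collections import namedtuple

Request = namedtuple("Request", ["url"])

def start_requests():
    # Yield one request per page, mirroring the two URL patterns
    # from the question.
    for i in range(4):
        yield Request("http://example.com/category/top/page-%d/" % i)
    for i in range(55):
        yield Request("http://example.com/superurl/top/page-%d/" % i)

requests = list(start_requests())
print(len(requests))  # 59
```

Because start_requests() is a generator, the URLs are produced lazily rather than all materialized up front.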
Answer 2: (score: 0)
Python has only four scopes (LEGB). The local scope of a class body and the local scope of a list comprehension are not nested functions, so they form two separate local scopes that cannot access each other.
Therefore, do not reference a class variable from the 'for' clause of a comprehension inside the class body.
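A minimal Python 3 sketch of the scoping rule this answer describes: inside a class body, a comprehension's body cannot see other class attributes, because only its outermost iterable expression is evaluated in the class scope.

```python
# In Python 3, a comprehension runs in its own scope, which does not
# nest inside the class body's scope.
class Broken:
    n = 4
    try:
        # range(n) works: the outermost iterable is evaluated in the
        # class scope. The `n` inside the body raises NameError.
        urls = ["page-%d" % (i * n) for i in range(n)]
    except NameError as err:
        message = str(err)

print(Broken.message)
```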