For three days now I have been trying to save the corresponding start_url in the meta attribute so I can pass it along as an item through subsequent requests in scrapy; I could then use the start_url as the key into a dictionary that fills my output with additional data. It really should be straightforward, since it is explained in the documentation, there is a discussion about it in the Google scrapy group, and there is also a question here, but I cannot get it to run :(
I am new to scrapy and I think it is a great framework, but for my project I have to know the start_urls of all requests, and it looks complicated.
I would really appreciate some help!
At the moment my code looks like this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

class example(CrawlSpider):

    name = 'example'
    start_urls = ['http://www.example.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/blablabla/', )), callback='parse_item'),
    )

    def parse(self, response):
        # re-attach the start_url to every request the CrawlSpider generates
        for request_or_item in super(example, self).parse(response):
            if isinstance(request_or_item, Request):
                request_or_item = request_or_item.replace(meta={'start_url': response.meta['start_url']})
            yield request_or_item

    def make_requests_from_url(self, url):
        # tag the initial requests with their own start_url
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = testItem()  # testItem is the project's Item class (not shown)
        print response.request.meta, response.url
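For what it is worth, the meta hand-off itself works once every follow-up request copies the value along explicitly. Here is a minimal sketch with a plain BaseSpider (the spider name, the link XPath, and the parse_page callback are illustrative, not from the original code):

import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class MetaChainSpider(BaseSpider):

    name = 'meta_chain'
    start_urls = ['http://www.example.com']

    def make_requests_from_url(self, url):
        # tag every seed request with the start_url it belongs to
        return Request(url, dont_filter=True, meta={'start_url': url})

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # copy the tag onto every follow-up request so it survives the hop
        for href in hxs.select('//a/@href').extract():
            yield Request(urlparse.urljoin(response.url, href),
                          callback=self.parse_page,
                          meta={'start_url': response.meta['start_url']})

    def parse_page(self, response):
        # the original start_url is still available one request later
        print response.meta['start_url'], response.url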
Answer 0 (score: 2)
I wanted to delete this answer because it does not solve the OP's problem, but I would like to keep it as a scrapy example.
When writing crawl spider rules, avoid using parse as a callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
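In other words, keep the rule callback under any name other than parse and let the inherited parse method drive the rules. A minimal sketch of that shape (the names are illustrative):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SafeSpider(CrawlSpider):

    name = 'safe_example'
    start_urls = ['http://www.example.com']

    # the callback is deliberately NOT named parse, so the inherited
    # parse method keeps driving the rule machinery
    rules = (
        Rule(SgmlLinkExtractor(allow=('/blablabla/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('visited %s' % response.url)

This keeps the crawl working, but it does not by itself carry start_url into the rule-generated requests.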
Use BaseSpider instead:
from datetime import datetime
import urlparse

from scrapy.spider import BaseSpider

class Spider(BaseSpider):

    name = "domain_spider"

    def start_requests(self):
        last_domain_id = 0
        chunk_size = 10
        cursor = settings.db.cursor()  # settings.db is a project-specific DB connection, set up elsewhere

        while True:
            # fetch the next batch of domains that have not been started yet
            cursor.execute("""
                    SELECT domain_id, domain_url
                    FROM domains
                    WHERE domain_id > %s AND scraping_started IS NULL
                    LIMIT %s
                """, (last_domain_id, chunk_size))
            self.log('Requesting %s domains after %s' % (chunk_size, last_domain_id))
            rows = cursor.fetchall()

            if not rows:
                self.log('No more domains to scrape.')
                break

            for domain_id, domain_url in rows:
                last_domain_id = domain_id
                request = self.make_requests_from_url(domain_url)

                # build the item up front and ship it along with the request
                item = items.Item()  # items is the project's items module (not shown)
                item['start_url'] = domain_url
                item['domain_id'] = domain_id
                item['domain'] = urlparse.urlparse(domain_url).hostname
                request.meta['item'] = item

                # mark the domain as started so it is not picked up again
                cursor.execute("""
                        UPDATE domains
                        SET scraping_started = %s
                        WHERE domain_id = %s
                    """, (datetime.now(), domain_id))

                yield request

    ...
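The elided part would typically include the default parse callback, which pulls the item back off the request and completes it. A hypothetical sketch (the 'html' field is illustrative, not from the original answer):

    def parse(self, response):
        # the item travels inside the request meta; retrieve and complete it
        item = response.meta['item']
        item['html'] = response.body  # illustrative extra field
        return item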