I'm trying to get Scrapy to run a spider over a large number of URLs that I have stored in a database.
The spider itself works fine.
What I can't do is get Scrapy to "remember" which object it is processing. The code below matches results back to my Django database using the URL field.
The problem is that the URL often changes when you visit it in a browser (redirects), so Scrapy doesn't know where to put the data.
Ideally I could "tell" Scrapy the object's primary key and remove all room for error.
import sys, os, scrapy, django
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import DropItem

## Django init ##
sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))  ## direct to where manage.py is
os.environ['DJANGO_SETTINGS_MODULE'] = 'XYZDB.settings'
django.setup()
#################

## Settings ##
#queryset_chunksize = 1000
##############

from XYZ import models
from parsers import dj, asos, theiconic

stores = [dj, asos, theiconic]
parsers = dict((i.domain, i) for i in stores)

def urls():
    for i in models.Variation.objects.iterator():
        yield i.link_original if i.link_original else i.link

class Superspider(scrapy.Spider):
    name = 'Superspider'
    start_urls = urls()

    def parse(self, response):
        for i in parsers:
            if i in response.url:
                return parsers[i].parse(response)
## Reference - models
'''
Stock_CHOICES = (
(1, 'In Stock'),
(2, 'Low Stock'),
(3, 'Out of Stock'),
(4, 'Discontinued'),
)
'''
class ProductPipeline:
    def process_item(self, item, spider):
        var = models.Variation.objects.get(link_original=item['url'])
        size = models.Size.objects.get(variation=var)
        # Compare against the stored field values, not the model instances themselves
        if item['stock'] != models.Stock.objects.filter(size=size)[0].stock:
            models.Stock(size=size, stock=item['stock']).save()
        if int(item['price']) != int(models.Price.objects.filter(variation=var)[0].price):
            models.Price(variation=var, price=item['price']).save()
        return item
if __name__ == '__main__':
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'ITEM_PIPELINES': {'__main__.ProductPipeline': 1},
        'DOWNLOAD_DELAY': 0.4
    })
    process.crawl(Superspider)
    process.start()
Answer 0 (score: 0)
You can use Scrapy's response.meta attribute. Replace the start_urls definition with a start_requests(self) method in which you yield Request(url, meta={'pk': primary_key}). You can then access that metadata in your parse() routine with item['pk'] = response.meta['pk']. See the start_requests() docs.
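The point of the meta-passing pattern is that the primary key travels with the request, so it survives redirects that change the URL. Here is a minimal, runnable sketch of that idea; the Request/Response classes below are plain-Python stand-ins for Scrapy's (in the real spider you would yield scrapy.Request(url, meta={'pk': pk}) from start_requests), and the (pk, url) pairs stand in for rows from the Variation queryset:

```python
# Stand-ins for scrapy.Request / scrapy.Response so the pattern is
# demonstrable without Scrapy installed. In Scrapy, response.meta is
# copied from the originating request in exactly this way.

class Request:
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

class Response:
    def __init__(self, request, final_url):
        # The site may redirect, so final_url can differ from
        # request.url, but meta is carried over untouched.
        self.url = final_url
        self.meta = request.meta

def start_requests(rows):
    # rows: iterable of (pk, url) pairs, e.g. built from
    # models.Variation.objects.values_list('pk', 'link_original')
    for pk, url in rows:
        yield Request(url, meta={'pk': pk})

def parse(response):
    # The primary key survives the redirect, so a pipeline can fetch
    # the exact Variation row instead of matching on the URL.
    return {'pk': response.meta['pk'], 'url': response.url}

reqs = list(start_requests([(1, 'http://example.com/a')]))
item = parse(Response(reqs[0], 'http://example.com/a-redirected'))
# item['pk'] is still 1 even though the URL changed
```

With this in place, ProductPipeline can replace the fragile URL lookup with models.Variation.objects.get(pk=item['pk']).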