Hi, I have 2 different domains and I'm running 2 different parse methods in a single spider. I've tried the code below, but nothing works. What am I doing wrong?
import json
import scrapy

class SalesitemSpiderSpider(scrapy.Spider):
    name = 'salesitem_spider'
    allowed_domains = ['www2.hm.com', 'www.forever21.com']
    url = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20']
    # JSON payload code here

    def start_requests(self):
        for i in self.url:
            if i == 'https://www.forever21.com/eu/shop/Catalog/GetProducts':
                print("sample: " + i)
                payload = self.payload.copy()
                payload['page']['pageNo'] = 1
                yield scrapy.Request(
                    i, method='POST', body=json.dumps(payload),
                    headers={'X-Requested-With': 'XMLHttpRequest',
                             'Content-Type': 'application/json; charset=UTF-8'},
                    callback=self.parse_2, meta={'pageNo': 1})
            if i == 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20':
                yield scrapy.Request(i, callback=self.parse_1)

    def parse_1(self, response):
        # Some code for getting the item
        pass

    def parse_2(self, response):
        data = json.loads(response.text)
        for product in data['CatalogProducts']:
            item = GpdealsSpiderItem_f21()
            # item fields filled here
            yield item
        # simulate pagination if we are not at the end
        if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
            payload = self.payload.copy()
            payload['page']['pageNo'] = response.meta['pageNo'] + 1
            yield scrapy.Request(
                self.url, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_2, meta={'pageNo': payload['page']['pageNo']}
            )
I always get this error:

NameError: name 'url' is not defined
Answer 0 (score: 1)
You have two different spiders in the same class. For maintainability, I suggest keeping them in separate files.
If you really do want to keep them together, it's easier to split the URLs into two lists:
type1_urls = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', ]
type2_urls = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20', ]

def start_requests(self):
    for url in self.type1_urls:
        payload = self.payload.copy()
        yield Request(
            # ...
            callback=self.parse_1
        )
    for url in self.type2_urls:
        yield scrapy.Request(url, callback=self.parse_2)
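Filled in, the two loops might look like the following. This is only a minimal sketch: it assumes the same payload dict as in your original spider, and it keeps your original callback assignment (parse_2 for the Forever21 JSON endpoint, parse_1 for the H&M HTML page):

def start_requests(self):
    # POST the JSON payload to the Forever21 catalog endpoint
    for url in self.type1_urls:
        payload = self.payload.copy()
        payload['page']['pageNo'] = 1
        yield scrapy.Request(
            url, method='POST', body=json.dumps(payload),
            headers={'X-Requested-With': 'XMLHttpRequest',
                     'Content-Type': 'application/json; charset=UTF-8'},
            callback=self.parse_2, meta={'pageNo': 1})
    # plain GET for the H&M listing page
    for url in self.type2_urls:
        yield scrapy.Request(url, callback=self.parse_1)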
Answer 1 (score: 0)
You should iterate over self.url with a for loop, and then use the i variable inside the loop for the comparisons, request yields, and so on. Note that the list is a class attribute, so it must be accessed as self.url; referring to the bare name url inside a method is what raises the NameError.
def start_requests(self):
    for i in self.url:
        if i == 'https://www.forever21.com/eu/shop/Catalog/GetProducts':
            payload = self.payload.copy()
            payload['page']['pageNo'] = 1
            yield scrapy.Request(
                i, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_2, meta={'pageNo': 1})
        if i == 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20':
            yield scrapy.Request(i, callback=self.parse_1)
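Either way, note that the pagination request in your parse_2 has a related problem: it passes self.url (the whole list) to scrapy.Request, which expects a single URL string. A sketch of one possible fix, assuming the next page is fetched from the same endpoint that produced the response, is to reuse response.url:

# inside parse_2, when yielding the next page
yield scrapy.Request(
    response.url,  # the Forever21 endpoint itself, not the self.url list
    method='POST', body=json.dumps(payload),
    headers={'X-Requested-With': 'XMLHttpRequest',
             'Content-Type': 'application/json; charset=UTF-8'},
    callback=self.parse_2, meta={'pageNo': payload['page']['pageNo']})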