One spider with 2 different URLs and 2 parse methods using Scrapy

Date: 2019-04-19 12:01:11

Tags: python web-scraping scrapy

Hi, I have 2 different domains and want to run 2 different parse methods in a single spider. I have tried the code below, but nothing works. What am I doing wrong?

import json

import scrapy


class SalesitemSpiderSpider(scrapy.Spider):
    name = 'salesitem_spider'
    allowed_domains = ['www2.hm.com', 'www.forever21.com']
    url = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20']

    # JSON payload code here

    def start_requests(self):
        for i in self.url:
            if i == 'https://www.forever21.com/eu/shop/Catalog/GetProducts':
                print("sample: " + i)
                payload = self.payload.copy()
                payload['page']['pageNo'] = 1
                yield scrapy.Request(
                    i, method='POST', body=json.dumps(payload),
                    headers={'X-Requested-With': 'XMLHttpRequest',
                             'Content-Type': 'application/json; charset=UTF-8'},
                    callback=self.parse_2, meta={'pageNo': 1})

            if i == 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20':
                yield scrapy.Request(i, callback=self.parse_1)

    def parse_1(self, response):
        pass  # some code for getting the item

    def parse_2(self, response):
        data = json.loads(response.text)
        for product in data['CatalogProducts']:
            item = GpdealsSpiderItem_f21()  # item class from the project's items module
            # fill the item fields here
            yield item

        # simulate pagination if we are not at the end
        if len(data['CatalogProducts']) == self.payload['page']['pageSize']:
            payload = self.payload.copy()
            payload['page']['pageNo'] = response.meta['pageNo'] + 1
            yield scrapy.Request(
                response.url,  # re-request the same endpoint for the next page
                method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_2, meta={'pageNo': payload['page']['pageNo']})

I always get this error:

NameError: name 'url' is not defined

2 Answers:

Answer 0 (score: 1):

You have two different spiders in the same class. For maintainability, I suggest keeping them in separate files.

If you really want to keep them together, splitting the URLs into two lists makes things easier:

type1_urls = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', ]
type2_urls = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20', ]

def start_requests(self):
    for url in self.type1_urls:
        payload = self.payload.copy()
        yield scrapy.Request(
            # ...
            callback=self.parse_1
        )

    for url in self.type2_urls:
        yield scrapy.Request(url, callback=self.parse_2)
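
Put together, a minimal runnable sketch of this two-list layout could look as follows. The payload shape is inferred from the question's pagination check, and the item fields and CSS selectors are placeholder assumptions, not part of the real sites:

import json

import scrapy


class SalesitemSpiderSpider(scrapy.Spider):
    name = 'salesitem_spider'
    allowed_domains = ['www2.hm.com', 'www.forever21.com']

    type1_urls = ['https://www.forever21.com/eu/shop/Catalog/GetProducts', ]
    type2_urls = ['https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20', ]

    # assumed payload shape, inferred from the question's pagination logic
    payload = {'page': {'pageNo': 1, 'pageSize': 60}}

    def start_requests(self):
        for url in self.type1_urls:
            payload = self.payload.copy()
            yield scrapy.Request(
                url, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_1, meta={'pageNo': 1})

        for url in self.type2_urls:
            yield scrapy.Request(url, callback=self.parse_2)

    def parse_1(self, response):
        # JSON API endpoint: one item per product in the response body
        data = json.loads(response.text)
        for product in data['CatalogProducts']:
            yield {'name': product.get('DisplayName')}  # hypothetical field name

    def parse_2(self, response):
        # HTML listing page: the CSS selectors below are hypothetical
        for product in response.css('.product-item'):
            yield {'name': product.css('.item-heading a::text').get()}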

Answer 1 (score: 0):

The NameError suggests you referenced url somewhere instead of self.url. You should iterate over self.url with a for loop, then use the loop variable i inside it for the comparisons, request yielding, and so on:

def start_requests(self):
    for i in self.url:
        if i == 'https://www.forever21.com/eu/shop/Catalog/GetProducts':
            payload = self.payload.copy()
            payload['page']['pageNo'] = 1
            yield scrapy.Request(
                i, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_2, meta={'pageNo': 1})

        if i == 'https://www2.hm.com/en_us/sale/shopbyproductladies/view-all.html?sort=stock&image-size=small&image=stillLife&offset=0&page-size=20':
            yield scrapy.Request(i, callback=self.parse_1)
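
Note that comparing against full URL strings is brittle: any change to a query parameter silently breaks the match. A possible variation is to dispatch on the hostname instead; this sketch assumes the same self.url, self.payload, and callbacks as above:

from urllib.parse import urlparse

def start_requests(self):
    for i in self.url:
        host = urlparse(i).netloc
        if host == 'www.forever21.com':
            # JSON API endpoint: needs a POST carrying the paging payload
            payload = self.payload.copy()
            payload['page']['pageNo'] = 1
            yield scrapy.Request(
                i, method='POST', body=json.dumps(payload),
                headers={'X-Requested-With': 'XMLHttpRequest',
                         'Content-Type': 'application/json; charset=UTF-8'},
                callback=self.parse_2, meta={'pageNo': 1})
        else:
            # plain HTML listing page
            yield scrapy.Request(i, callback=self.parse_1)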