How should I be formatting my yield requests?

Asked: 2019-09-06 14:56:25

Tags: python-2.7 amazon-web-services amazon-ec2 scrapy

My Scrapy spider is confused, or maybe I am, but one of us isn't working as intended. My spider pulls start URLs from a file and is supposed to: start at an Amazon search page, scrape that page and grab the URL of every search result, follow each link to the item's page, scrape the item page for the item's information, and once all items on the first page have been crawled, follow the pagination up to page X, then rinse and repeat.

I'm using ScraperAPI and scrapy-user-agents to randomize my middlewares. I've given the requests in start_requests a priority based on each line's index in the file, so they should be crawled in order. I've also checked and confirmed that I'm successfully receiving 200 HTML responses from the Amazon pages, with the actual HTML in them. Here is the spider's code:

import re

import requests
import scrapy

from ..items import ScrapeAmazonItem  # assumed location of the item class; adjust to your project


class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    page_number = 2
    current_keyword = 0
    keyword_list = []

    # Quick sanity check that the ScraperAPI key works (runs once, at class
    # definition time).
    payload = {'api_key': 'mykey', 'url': 'https://httpbin.org/ip'}
    r = requests.get('http://api.scraperapi.com', params=payload)
    print(r.text)
#/////////////////////////////////////////////////////////////////////
    def start_requests(self):
        with open("keywords.txt") as f:
            for index, line in enumerate(f):
                try:
                    keyword = line.strip()
                    AmazonSpiderSpider.keyword_list.append(keyword)
                    formatted_keyword = keyword.replace(' ', '+')
                    url = "http://api.scraperapi.com/?api_key=mykey&url=https://www.amazon.com/s?k=" + formatted_keyword + "&ref=nb_sb_noss_2"
                    # The file index is stored in meta to try to keep the
                    # start URLs crawling in order.
                    yield scrapy.Request(url, meta={'priority': index})
                except Exception:
                    continue
#/////////////////////////////////////////////////////////////////////
    def parse(self, response):
        print("========== starting parse ===========")

        # Follow every search-result link on the page to its item page.
        for next_page in response.css("h2.a-size-mini a").xpath("@href").extract():
            if next_page is not None:
                if "https://www.amazon.com" not in next_page:
                    next_page = "https://www.amazon.com" + next_page
                yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + next_page, callback=self.parse_dir_contents)

        # Follow pagination up to page 3, then move on to the next keyword.
        second_page = response.css('li.a-last a').xpath("@href").extract_first()
        if second_page is not None and AmazonSpiderSpider.page_number < 3:
            AmazonSpiderSpider.page_number += 1
            yield scrapy.Request('http://api.scraperapi.com/?api_key=mykey&url=' + second_page, callback=self.parse_pagination)
        else:
            AmazonSpiderSpider.current_keyword += 1
#/////////////////////////////////////////////////////////////////////
    def parse_pagination(self, response):
        print("========== starting pagination ===========")

        for next_page in response.css("h2.a-size-mini a").xpath("@href").extract():
            if next_page is not None:
                if "https://www.amazon.com" not in next_page:
                    next_page = "https://www.amazon.com" + next_page
                yield scrapy.Request(
                    'http://api.scraperapi.com/?api_key=mykey&url=' + next_page,
                    callback=self.parse_dir_contents)

        second_page = response.css('li.a-last a').xpath("@href").extract_first()
        if second_page is not None and AmazonSpiderSpider.page_number < 3:
            AmazonSpiderSpider.page_number += 1
            yield scrapy.Request(
                'http://api.scraperapi.com/?api_key=mykey&url=' + second_page,
                callback=self.parse_pagination)
        else:
            AmazonSpiderSpider.current_keyword += 1
#/////////////////////////////////////////////////////////////////////
    def parse_dir_contents(self, response):
        items = ScrapeAmazonItem()

        print("============= parsing page ==============")

        # Title, price, and sales rank: join the text nodes, drop newlines,
        # and strip surrounding whitespace.
        temp = response.css('#productTitle::text').extract()
        product_name = ''.join(temp).replace('\n', '').strip()

        temp = response.css('#priceblock_ourprice::text').extract()
        product_price = ''.join(temp).replace('\n', '').strip()

        temp = response.css('#SalesRank::text').extract()
        product_score = ''.join(temp).strip()
        product_score = re.sub(r'\D', '', product_score)  # keep digits only

        product_ASIN = response.css('li:nth-child(2) .a-text-bold+ span').css('::text').extract()

        # Look up the keyword this item is assumed to belong to.
        keyword = AmazonSpiderSpider.keyword_list[AmazonSpiderSpider.current_keyword]

        items['product_keyword'] = keyword
        items['product_ASIN'] = product_ASIN
        items['product_name'] = product_name
        items['product_price'] = product_price
        items['product_score'] = product_score

        yield items
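
For reference (an aside, not part of the spider above): in Scrapy, scheduling order is controlled by the priority keyword argument of scrapy.Request; a value stored under meta['priority'] travels with the request but is not consulted by the scheduler. A minimal sketch, with hypothetical URLs:

import scrapy

class PriorityDemoSpider(scrapy.Spider):
    name = 'priority_demo'

    def start_requests(self):
        urls = ['https://example.com/a', 'https://example.com/b']
        for index, url in enumerate(urls):
            # Higher priority is dequeued first, so negating the index
            # makes earlier lines in the file crawl sooner.
            yield scrapy.Request(url, priority=-index, callback=self.parse)

    def parse(self, response):
        self.logger.info('crawled %s', response.url)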

For the first start URL, it crawls three or four items and then jumps to the SECOND start URL. It skips the rest of the items and the pagination and goes straight to the second start URL. For the second URL it crawls three or four items and then jumps again, to the THIRD start URL. It continues this way, grabbing three or four items and then jumping to the next URL, until it reaches the final start URL, where it collects ALL of the information completely. Occasionally the spider skips the first or second start URL entirely. This happens rarely, and I can't figure out what causes it.
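
One possible factor (a sketch of the general pattern, not a fix verified against this spider): Scrapy downloads many requests concurrently, so mutable class attributes such as current_keyword and page_number are shared by every in-flight request, and the value read in a callback may belong to a different keyword than the page being parsed. Carrying per-request state through meta avoids the shared counters; the example below uses hypothetical names:

import scrapy

class KeywordMetaSpider(scrapy.Spider):
    name = 'keyword_meta_demo'

    def start_requests(self):
        with open('keywords.txt') as f:
            for line in f:
                keyword = line.strip()
                url = 'https://www.amazon.com/s?k=' + keyword.replace(' ', '+')
                # Attach the keyword to the request itself; meta is copied
                # onto the response, so no shared class state is needed.
                yield scrapy.Request(url, callback=self.parse,
                                     meta={'keyword': keyword})

    def parse(self, response):
        # Always the keyword that generated this page, regardless of how
        # responses interleave.
        self.logger.info('parsing results for %r', response.meta['keyword'])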

My code for following the result-item URLs works fine, but I never get the "starting pagination" print statement, so it isn't following the pages correctly. Also, something odd is going on with the middlewares: it starts parsing before the middlewares are assigned.
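
If the middleware behavior is suspect, it's worth double-checking that scrapy-user-agents is wired up in settings.py the way its documentation describes: the built-in user-agent middleware disabled and the package's middleware registered (500 is the README's suggested priority, not a requirement):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's default user-agent middleware ...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # ... and enable the randomizing one from scrapy-user-agents.
    'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
}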

0 Answers:

There are no answers yet.