Python Scrapy not executing the parse function

Time: 2017-05-05 05:35:21

Tags: python web-scraping scrapy scrapy-spider

I have built an Amazon scraper that goes through Amazon links and collects the links of products that have more than 800 reviews.

When I set start_urls to a single URL it works fine, but when I set start_urls to the list of URLs extracted from a file, the parse function is not even executed. If parse executed, the URLs would be echoed to the screen, but they are not; even the statement print '\n\n', 'IAM ECECUTED' never runs.
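For reference, here is a minimal sketch of loading and sanity-checking the list before it becomes start_urls (illustrative only, assuming links.txt may contain blank lines or trailing whitespace; it is not the exact code shown below):

# Minimal sketch (illustrative, not the spider code below): read links.txt,
# drop blank lines and whitespace, and print what would end up in start_urls.
with open('links.txt', 'r') as linkfile:
    listoflinks = [line.strip() for line in linkfile if line.strip()]

print('loaded %d links' % len(listoflinks))
for url in listoflinks[:3]:
    print(url)  # spot-check the first few entries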

Here is my code. I have commented in the relevant region; it works when I start with the single URL that is commented out there. SEE THE SCRAPY DEBUG OUTPUT FROM MY TERMINAL further down.

# -*- coding: utf-8 -*-
import scrapy
from amazon.items import AmazonItem
from urlparse import urljoin
#co = 1
linkfile = open('links.txt', 'r')
listoflinks = [line.strip() for line in linkfile.readlines()]


class AmazonspiderSpider(scrapy.Spider):
    name = "amazonspider"
    DOWNLOAD_DELAY = 1
    #it works if start with one url

    #start_urls = ['https://www.amazon.com/s/ref=lp_165993011_nr_n_0?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A2514571011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011']

    start_urls = listoflinks


    def parse(self, response):



        # THIS PRINT STATEMENT IS NOT EVEN EXECUTING
        print '\n\n', 'IAM ECECUTED'


        SET_SELECTOR = '.s-item-container'
        for attr in response.css(SET_SELECTOR):

            item = AmazonItem()

            link_selector = '.a-link-normal.s-access-detail-page.s-color-twister-title-link.a-text-normal ::attr(href)'

            if attr.css(link_selector).extract_first():

                yield scrapy.Request(urljoin(response.url, attr.css(link_selector).extract_first()), callback=self.parse_link, meta={'item': item})  


        next_page = './/span[@class="pagnRA"]/a[@id="pagnNextLink"]/@href'
        next_page = response.xpath(next_page).extract_first()
        if next_page:
            yield scrapy.Request(
                urljoin(response.url, next_page),
                callback=self.parse
            )
    def parse_link(self, response):

        review_selector = './/span[@id="acrCustomerReviewText"]/text()'

        item = AmazonItem(response.meta['item'])
        if response.xpath(review_selector).extract_first():
            if response.xpath(review_selector).extract_first().split(" ")[0].isdigit():
                if int(response.xpath(review_selector).extract_first().split(" ")[0]) > 800:

                    catselector = '.a-unordered-list.a-horizontal.a-size-small li:nth-child(5) span a ::text'
                    defaultcatselector = '.nav-search-label ::text'
                    cat = response.css(catselector).extract_first()
                    item['LINK'] = response.url
                    if cat:
                        item['CATAGORY'] = cat
                    else:
                        item['CATAGORY'] = response.css(defaultcatselector).extract_first()
                    return item

The single start_urls that works on its own:

start_urls = ['https://www.amazon.com/s/ref=lp_165993011_nr_n_0?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A2514571011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011']

Some links from the links.txt file:

https://www.amazon.com/s/ref=lp_11057241_nr_n_3?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A11057451&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
https://www.amazon.com/s/ref=lp_11057241_nr_n_4?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A10666241011&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
https://www.amazon.com/s/ref=lp_11057241_nr_n_5?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A10898755011&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
https://www.amazon.com/s/ref=lp_11057241_nr_n_6?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A11057971&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
https://www.amazon.com/s/ref=lp_11057241_nr_n_7?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A11058091&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
https://www.amazon.com/s/ref=lp_11057241_nr_n_8?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A16236250011&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241

So what is going on here? Why is the parse function not even executing? If the parse function executed, debug output like the lines Scrapy shows below would be echoed to the screen:

2017-05-05 10:51:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Action-Toy-Figures/b?ie=UTF8&node=2514571011> from <GET https://www.amazon.com/s/ref=lp_165993011_nr_n_0?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A2514571011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011>
2017-05-05 10:51:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Action-Figure-Vehicles-Playsets/b?ie=UTF8&node=7620514011> from <GET https://www.amazon.com/s/ref=lp_165993011_nr_n_1?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A7620514011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011>
2017-05-05 10:51:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Statue-Maquette-Bust-Action-Figures/b?ie=UTF8&node=166026011> from <GET https://www.amazon.com/s/ref=lp_165993011_nr_n_2?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A166026011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011>
2017-05-05 10:51:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Action-Toy-Figure-Accessories/b?ie=UTF8&node=165994011> from <GET https://www.amazon.com/s/ref=lp_165993011_nr_n_3?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A165994011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011>
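One way to narrow this down would be a tiny diagnostic spider like the sketch below (hypothetical and separate from my spider above; it only logs what actually reaches parse, and dont_filter=True is just a guess to rule out the duplicate filter dropping requests):

import scrapy

class LinkCheckSpider(scrapy.Spider):
    # hypothetical diagnostic spider, not part of AmazonspiderSpider above
    name = "linkcheck"

    def start_requests(self):
        with open('links.txt') as f:
            for line in f:
                url = line.strip()
                if url:
                    # dont_filter=True bypasses the duplicate filter, in case
                    # redirected start URLs are being silently dropped
                    yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.logger.info('parse reached: %s (status %s)', response.url, response.status)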

If I do this instead, making start_urls a list that contains only the first URL from the file's link list, then

print '\n\n', 'IAM ECECUTED'

does run and it works as it should (here), and this is the output when I print listoflinks (here).
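In other words, roughly this (an illustrative one-liner; the exact edit may differ):

start_urls = listoflinks[:1]  # keep only the first URL read from links.txt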

What am I doing wrong here?

0 Answers:

There are no answers yet.