I made an Amazon scraper that goes through Amazon listing links and copies the links of products that have more than 800 reviews. When I make start_urls equal to a single URL it works fine, but when I make start_urls equal to the list of URLs extracted from a file, it does not even execute the parse function. If the parse function executed, the URLs would be echoed to the screen, but it never even runs print '\n\n', 'IAM ECECUTED'.
Here is the code; I have commented the relevant area (see the Scrapy debug output from my terminal further below). It works when I do this:

start_urls = ['https://www.amazon.com/s/ref=lp_165993011_nr_n_0?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A2514571011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011']
# -*- coding: utf-8 -*-
import scrapy
from amazon.items import AmazonItem
from urlparse import urljoin

#co = 1
linkfile = open('links.txt', 'r')
listoflinks = [line.strip() for line in linkfile.readlines()]


class AmazonspiderSpider(scrapy.Spider):
    name = "amazonspider"
    DOWNLOAD_DELAY = 1
    #it works if start with one url
    #start_urls = ['https://www.amazon.com/s/ref=lp_165993011_nr_n_0?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A2514571011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011']
    start_urls = listoflinks

    def parse(self, response):
        # THIS PRINT STATEMENT IS NOT EVEN EXECUTING
        print '\n\n', 'IAM ECECUTED'

        SET_SELECTOR = '.s-item-container'
        for attr in response.css(SET_SELECTOR):
            item = AmazonItem()
            link_selector = '.a-link-normal.s-access-detail-page.s-color-twister-title-link.a-text-normal ::attr(href)'
            if attr.css(link_selector).extract_first():
                yield scrapy.Request(urljoin(response.url, attr.css(link_selector).extract_first()), callback=self.parse_link, meta={'item': item})

        next_page = './/span[@class="pagnRA"]/a[@id="pagnNextLink"]/@href'
        next_page = response.xpath(next_page).extract_first()
        if next_page:
            yield scrapy.Request(
                urljoin(response.url, next_page),
                callback=self.parse
            )

    def parse_link(self, response):
        review_selector = './/span[@id="acrCustomerReviewText"]/text()'
        item = AmazonItem(response.meta['item'])
        if response.xpath(review_selector).extract_first():
            if response.xpath(review_selector).extract_first().split(" ")[0].isdigit():
                if int(response.xpath(review_selector).extract_first().split(" ")[0]) > 800:
                    catselector = '.a-unordered-list.a-horizontal.a-size-small li:nth-child(5) span a ::text'
                    defaultcatselector = '.nav-search-label ::text'
                    cat = response.css(catselector).extract_first()

                    item['LINK'] = response.url
                    if cat:
                        item['CATAGORY'] = cat
                    else:
                        item['CATAGORY'] = response.css(defaultcatselector).extract_first()

                    return item
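For reference, here is a quick way I can check what listoflinks actually contains before starting the crawl. This is only a diagnostic sketch, run on its own outside the spider, and it assumes links.txt sits in the directory scrapy is launched from:

# diagnostic only: confirm the file is readable and every line is a clean URL
with open('links.txt', 'r') as linkfile:
    listoflinks = [line.strip() for line in linkfile.readlines()]

print len(listoflinks)      # number of lines read from the file
for url in listoflinks:
    print repr(url)         # repr() exposes blank lines or stray whitespace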
Some links from the links.txt file:
https://www.amazon.com/s/ref=lp_11057241_nr_n_3?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A11057451&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
https://www.amazon.com/s/ref=lp_11057241_nr_n_4?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A10666241011&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
https://www.amazon.com/s/ref=lp_11057241_nr_n_5?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A10898755011&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
https://www.amazon.com/s/ref=lp_11057241_nr_n_6?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A11057971&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
https://www.amazon.com/s/ref=lp_11057241_nr_n_7?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A11058091&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
https://www.amazon.com/s/ref=lp_11057241_nr_n_8?fst=as%3Aoff&rh=n%3A3760911%2Cn%3A%2111055981%2Cn%3A11057241%2Cn%3A16236250011&bbn=11057241&ie=UTF8&qid=1493793266&rnid=11057241
And here is the debug output Scrapy shows:

2017-05-05 10:51:10 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Action-Toy-Figures/b?ie=UTF8&node=2514571011> from <GET https://www.amazon.com/s/ref=lp_165993011_nr_n_0?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A2514571011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011>
2017-05-05 10:51:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Action-Figure-Vehicles-Playsets/b?ie=UTF8&node=7620514011> from <GET https://www.amazon.com/s/ref=lp_165993011_nr_n_1?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A7620514011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011>
2017-05-05 10:51:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Statue-Maquette-Bust-Action-Figures/b?ie=UTF8&node=166026011> from <GET https://www.amazon.com/s/ref=lp_165993011_nr_n_2?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A166026011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011>
2017-05-05 10:51:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.amazon.com/Action-Toy-Figure-Accessories/b?ie=UTF8&node=165994011> from <GET https://www.amazon.com/s/ref=lp_165993011_nr_n_3?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A165994011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011>
So what is happening here? Why is the parse function not even executing? If the parse function executed, print '\n\n', 'IAM ECECUTED' would be echoed to the screen; it does get printed when I make start_urls a list containing only the first URL from the file's list of links. What am I doing wrong here?
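In case it is relevant, this is the kind of variant I have been considering instead of assigning start_urls at module level. It is only a sketch, not something I have verified fixes anything: start_requests() is Scrapy's standard hook for producing the initial requests, and skipping blank lines here is my own guess at what might differ from the file-based list:

# sketch: put this method inside AmazonspiderSpider instead of start_urls = listoflinks
def start_requests(self):
    with open('links.txt', 'r') as linkfile:
        for line in linkfile:
            url = line.strip()
            if url:  # skip blank lines in links.txt
                yield scrapy.Request(url, callback=self.parse)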