I am trying to crawl amazon.com with Scrapy and collect all of the brands listed there. Here is my Scrapy script:
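from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# DmozItem is not defined in this post; it presumably comes from the project's items module.


class StayuncleCrawlerSpider(CrawlSpider):
    name = 'amazon_crawler'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/gp/search/other/ref=sr_in_a_V?rh=i%3Aelectronics%2Cn%3A172282&pickerToList=brandtextbin&indexField=a&ie=UTF8&qid=1466664617']
    # Wait 2 seconds between requests (was: CrawlSpider.DOWNLOAD_DELAY=2;)
    custom_settings = {'DOWNLOAD_DELAY': 2}
    # Follow every link whose URL matches the brand index pages
    rules = [Rule(LinkExtractor(allow=(r'/gp/search/other/ref',)), callback='parse_item', follow=True)]
    # Per-spider page counter; stands in for the undefined global i
    page_count = 0

    def parse_item(self, response):
        self.page_count += 1
        body = response.xpath('//body//div[@id="center"]')
        # Note: this extracts whole <span> elements, markup included
        texts = body.xpath('.//span').extract()
        print(texts)
        ptext = "/Users/Nand/crawledData/html/" + response.url.split("/")[-2] + str(self.page_count) + '.txt'
        for text in texts:
            if text:
                text = text.rstrip()
                print(text)
                # The file is opened in binary append mode, so write bytes
                with open(ptext, 'ab') as f:
                    f.write(text.encode('utf-8'))
                    f.write(b"\n")
        item = DmozItem()
        yield item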
This is the start URL:
https://www.amazon.com/gp/search/other/ref=sr_in_a_V?rh=i%3Aelectronics%2Cn%3A172282&pickerToList=brandtextbin&indexField=a&ie=UTF8&qid=1466664617
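For reference, the query string of this URL can be decoded with the standard library to see what it asks for. The sketch below does that, and also shows one possible way to build the URL for another letter. The helper names `decode_query` and `with_index_field` are made up for illustration, and the idea that the letter pages differ only in `indexField` is an assumption; the `ref=` path segment may change per letter as well, so following the links found on the page is probably more reliable than building URLs by hand.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

START_URL = ("https://www.amazon.com/gp/search/other/ref=sr_in_a_V"
             "?rh=i%3Aelectronics%2Cn%3A172282&pickerToList=brandtextbin"
             "&indexField=a&ie=UTF8&qid=1466664617")


def decode_query(url):
    """Return the query parameters of url as a dict of decoded strings."""
    query = urlsplit(url).query
    return {key: values[0] for key, values in parse_qs(query).items()}


def with_index_field(url, letter):
    """Return url with indexField set to letter.

    Assumption: letter pages differ only in the indexField parameter.
    """
    parts = urlsplit(url)
    params = decode_query(url)
    params["indexField"] = letter
    return urlunsplit(parts._replace(query=urlencode(params)))


params = decode_query(START_URL)
print(params["rh"])          # i:electronics,n:172282
print(params["indexField"])  # a
print(with_index_field(START_URL, "b"))
```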
This is the part of the HTML I want to scrape:
<div class="a-row a-spacing-none pagn">
<span class="pagnLead">Viewing:</span>
<span class="pagnLink"><a href="/gp/search/other/ref=sr_in_-2_A?rh=i%3Aelectronics%2Cn%3A172282&pickerToList=brandtextbin&ie=UTF8&qid=1466668789">Top Brands</a>
</span>
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
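To make the link-following step concrete, here is a minimal sketch that pulls the links out of the `pagnLink` spans in the snippet above, using only the standard library so it can be run without a crawl. `PagnLinkParser` and `BASE_URL` are names invented for this example, and the closing `</div>` was added to the snippet so that it is complete. Inside a Scrapy callback, something like `response.css('span.pagnLink a::attr(href)')` together with `response.urljoin` would likely do the same job.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.com"  # used to resolve relative hrefs

SNIPPET = """
<div class="a-row a-spacing-none pagn">
<span class="pagnLead">Viewing:</span>
<span class="pagnLink"><a href="/gp/search/other/ref=sr_in_-2_A?rh=i%3Aelectronics%2Cn%3A172282&pickerToList=brandtextbin&ie=UTF8&qid=1466668789">Top Brands</a>
</span>
</div>
"""


class PagnLinkParser(HTMLParser):
    """Collect (label, absolute URL) pairs for <a> tags inside span.pagnLink."""

    def __init__(self):
        super().__init__()
        self.in_pagn_link = False
        self.current_href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "pagnLink" in classes:
            self.in_pagn_link = True
        elif tag == "a" and self.in_pagn_link and attrs.get("href"):
            self.current_href = urljoin(BASE_URL, attrs["href"])

    def handle_data(self, data):
        # The first non-blank text after an <a> is taken as its label
        if self.current_href and data.strip():
            self.links.append((data.strip(), self.current_href))
            self.current_href = None

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_pagn_link = False
        elif tag == "a":
            self.current_href = None


parser = PagnLinkParser()
parser.feed(SNIPPET)
for label, url in parser.links:
    print(label, url)
```

Each collected URL could then be requested in turn; with a `CrawlSpider`, the `Rule` and its link extractor normally take care of that step.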
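I am trying to follow all of the links defined by the alphabet listed under the div with id indexBarHeader, and to print every brand listed on each page, for example: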
A & I Products
A & L Engraving
and so on..
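One likely reason the output is not a clean list of names like the above is that extracting `.//span` returns whole elements, tags included. A rough sketch of turning such fragments into plain brand names follows. The fragments and their class names are hypothetical stand-ins, since the real markup of the brand list is not shown here, and `span_to_text` is a helper invented for this example. In Scrapy itself, selecting `.//span/text()` instead would probably avoid the tag stripping altogether.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def span_to_text(fragment):
    """Strip tags from one extracted <span> fragment and decode HTML entities."""
    text = TAG_RE.sub("", fragment)
    # Collapse runs of whitespace left behind by the removed markup
    return " ".join(html.unescape(text).split())


# Hypothetical fragments, shaped like what body.xpath('.//span').extract() returns
fragments = [
    '<span class="refinementLink">A &amp; I Products</span>',
    '<span class="refinementLink">A &amp; L Engraving</span>',
    '<span class="narrowValue"> </span>',
]

# Drop fragments that contain no visible text
brands = [name for name in map(span_to_text, fragments) if name]
print(brands)  # ['A & I Products', 'A & L Engraving']
```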
Can someone help me correct my script?