Question

I am trying to crawl amazon.com with Scrapy and collect all of the brands listed there. This is the Scrapy script:

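from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StayuncleCrawlerSpider(CrawlSpider):
    name = 'amazon_crawler'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/gp/search/other/ref=sr_in_a_V?rh=i%3Aelectronics%2Cn%3A172282&pickerToList=brandtextbin&indexField=a&ie=UTF8&qid=1466664617']
    # Wait two seconds between requests (per-spider setting).
    custom_settings = {'DOWNLOAD_DELAY': 2}
    # SgmlLinkExtractor is deprecated; LinkExtractor takes the same allow pattern.
    rules = [Rule(LinkExtractor(allow=(r'/gp/search/other/ref',)), callback='parse_item', follow=True)]
    # Counter used in output file names (replaces the undefined global i).
    page_count = 0

    def parse_item(self, response):
        self.page_count += 1
        body = response.xpath('//body//div[@id="center"]')
        # Note: this extracts the full <span> markup, not just the text nodes.
        texts = body.xpath('.//span').extract()
        print(texts)
        ptext = '/Users/Nand/crawledData/html/' + response.url.split('/')[-2] + str(self.page_count) + '.txt'
        # Open the output file once per page and append one line per span.
        with open(ptext, 'ab') as f:
            for text in texts:
                text = text.rstrip()
                if text:
                    print(text)
                    f.write(text.encode('utf-8') + b'\n')
        # DmozItem was never imported or populated, so yield a plain dict instead.
        yield {'url': response.url, 'spans': texts}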

This is the start URL:

 https://www.amazon.com/gp/search/other/ref=sr_in_a_V?rh=i%3Aelectronics%2Cn%3A172282&pickerToList=brandtextbin&indexField=a&ie=UTF8&qid=1466664617

This is the part of the HTML I want to scrape:

 <div class="a-row a-spacing-none pagn">
<span class="pagnLead">Viewing:</span>
<span class="pagnLink"><a href="/gp/search/other/ref=sr_in_-2_A?rh=i%3Aelectronics%2Cn%3A172282&amp;pickerToList=brandtextbin&amp;ie=UTF8&amp;qid=1466668789">Top Brands</a>
                                    </span>

# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

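I am trying to follow all of the links defined by the alphabet listed under the div with id indexBarHeader, and to print every brand listed on each of those pages, for example: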

A & I Products
A & L Engraving
and so on..
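
To make the link-following step concrete, here is a minimal sketch that uses only the Python standard library rather than Scrapy. It collects the anchors inside `span.pagnLink` elements from the HTML fragment shown above and resolves them against the site root. The names `PagnLinkParser`, `extract_index_links`, `BASE_URL` and `SAMPLE_HTML` are hypothetical and exist only for this illustration. It also assumes the alphabet links use the same `pagnLink` markup as the "Top Brands" link, which the fragment above does not confirm.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed site root used to resolve the relative hrefs.
BASE_URL = "https://www.amazon.com/"

# The fragment quoted in the question, closed so it parses cleanly.
SAMPLE_HTML = """
<div class="a-row a-spacing-none pagn">
<span class="pagnLead">Viewing:</span>
<span class="pagnLink"><a href="/gp/search/other/ref=sr_in_-2_A?rh=i%3Aelectronics%2Cn%3A172282&amp;pickerToList=brandtextbin&amp;ie=UTF8&amp;qid=1466668789">Top Brands</a>
</span>
</div>
"""


class PagnLinkParser(HTMLParser):
    """Collect (label, absolute URL) pairs for anchors inside span.pagnLink.

    Nested spans are not handled; this is enough for the flat markup above.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._in_pagn_span = False
        self._href = None
        self._label = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            classes = (attrs.get("class") or "").split()
            self._in_pagn_span = "pagnLink" in classes
        elif tag == "a" and self._in_pagn_span and attrs.get("href"):
            # HTMLParser already unescapes &amp; in attribute values.
            self._href = urljoin(self.base_url, attrs["href"])
            self._label = []

    def handle_data(self, data):
        if self._href is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._label).strip(), self._href))
            self._href = None
        elif tag == "span":
            self._in_pagn_span = False


def extract_index_links(html, base_url=BASE_URL):
    """Return a list of (label, absolute URL) tuples found in html."""
    parser = PagnLinkParser(base_url)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    for label, url in extract_index_links(SAMPLE_HTML):
        print(label, url)
```

In a Scrapy spider the same selection would more likely be written as an XPath or CSS query on the response, with each resulting URL scheduled as a new request. The standard-library version is used here only so the selection logic can be checked without a crawl.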

Can someone help me correct my script?

How to extract the list of all brands from Amazon using Scrapy

0 answers: