How to extract the list of all brands from Amazon using Scrapy

Time: 2016-06-23 08:30:06

Tags: scrapy scrapy-spider

I am trying to crawl amazon.com with Scrapy and collect all the brands listed there. Here is the Scrapy script:

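import itertools

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StayuncleCrawlerSpider(CrawlSpider):

    name = 'amazon_crawler'

    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/gp/search/other/ref=sr_in_a_V?rh=i%3Aelectronics%2Cn%3A172282&pickerToList=brandtextbin&indexField=a&ie=UTF8&qid=1466664617']
    # Scrapy reads the delay from settings; a CrawlSpider.DOWNLOAD_DELAY attribute is ignored.
    custom_settings = {'DOWNLOAD_DELAY': 2}
    # LinkExtractor replaces the deprecated SgmlLinkExtractor.
    rules = [Rule(LinkExtractor(allow=(r'/gp/search/other/ref',)), callback='parse_item', follow=True)]

    # Replaces the undefined global counter `i` used to number the output files.
    page_counter = itertools.count()

    def parse_item(self, response):
        body = response.xpath('//body//div[@id="center"]')
        # Note: this returns the full markup of every <span>, not only its text.
        texts = body.xpath('.//span').extract()
        ptext = '/Users/Nand/crawledData/html/' + response.url.split('/')[-2] + str(next(self.page_counter)) + '.txt'
        with open(ptext, 'a', encoding='utf-8') as f:
            for text in texts:
                text = text.rstrip()
                if text:
                    print(text)
                    f.write(text + '\n')
                    # DmozItem was never defined or imported; a plain dict is yielded instead.
                    yield {'text': text}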

This is the start URL:

 https://www.amazon.com/gp/search/other/ref=sr_in_a_V?rh=i%3Aelectronics%2Cn%3A172282&pickerToList=brandtextbin&indexField=a&ie=UTF8&qid=1466664617
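To see what the start URL actually carries, its query string can be decoded with the standard library. This is only a small sketch: `decode_query` is a hypothetical helper name, and what each parameter means to Amazon (for instance whether `indexField` selects the letter) is not documented here and would be a guess.

```python
from urllib.parse import urlsplit, parse_qs

START_URL = (
    "https://www.amazon.com/gp/search/other/ref=sr_in_a_V"
    "?rh=i%3Aelectronics%2Cn%3A172282&pickerToList=brandtextbin"
    "&indexField=a&ie=UTF8&qid=1466664617"
)


def decode_query(url):
    """Return the query parameters of `url` as a dict of decoded strings."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; each key appears once here, so keep the first value.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = decode_query(START_URL)
for key, value in params.items():
    print(key, "=", value)
```

The percent-encoded `rh` value decodes to `i:electronics,n:172282`, which is easier to read when comparing the start URL with the links found on the page.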

This is the part of the HTML I want to scrape:

 <div class="a-row a-spacing-none pagn">
<span class="pagnLead">Viewing:</span>
<span class="pagnLink"><a href="/gp/search/other/ref=sr_in_-2_A?rh=i%3Aelectronics%2Cn%3A172282&amp;pickerToList=brandtextbin&amp;ie=UTF8&amp;qid=1466668789">Top Brands</a>
                                    </span>

# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
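As a rough illustration of pulling the links out of this markup, here is a sketch that uses only the standard library on the fragment quoted above. It assumes the letter links share the `span.pagnLink` structure of the "Top Brands" link, which the fragment does not confirm; `PagnLinkParser` and `BASE_URL` are names made up for the example. Inside a spider, an XPath such as `//span[@class="pagnLink"]/a/@href` would express the same selection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.com"  # assumed base for resolving relative hrefs

# The fragment quoted in the question, copied as-is (it is cut off before </div>).
SAMPLE_HTML = """
<div class="a-row a-spacing-none pagn">
<span class="pagnLead">Viewing:</span>
<span class="pagnLink"><a href="/gp/search/other/ref=sr_in_-2_A?rh=i%3Aelectronics%2Cn%3A172282&amp;pickerToList=brandtextbin&amp;ie=UTF8&amp;qid=1466668789">Top Brands</a>
</span>
"""


class PagnLinkParser(HTMLParser):
    """Collect (label, absolute URL) pairs for anchors inside span.pagnLink."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_pagn_span = False
        self._href = None
        self._label = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "pagnLink":
            self._in_pagn_span = True
        elif tag == "a" and self._in_pagn_span and "href" in attrs:
            # Attribute values arrive with entities such as &amp; already decoded.
            self._href = attrs["href"]
            self._label = []

    def handle_data(self, data):
        if self._href is not None:
            self._label.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._label).strip()
            self.links.append((label, urljoin(BASE_URL, self._href)))
            self._href = None
        elif tag == "span":
            self._in_pagn_span = False


parser = PagnLinkParser()
parser.feed(SAMPLE_HTML)
links = parser.links
for label, url in links:
    print(label, url)
```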

I am trying to follow all the links defined by the alphabet row listed below the div with id indexBarHeader, and print all the brands listed:

A & I Products
A & L Engraving
and so on..
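One thing that may matter here: calling `.extract()` on `.//span` returns each span's full markup, not just its text. Below is a minimal sketch of reducing such strings to plain brand names with the standard library. The sample strings and the `refinementLink` class are invented for illustration, since the markup of the brand list itself is not shown above; `span_to_text` is likewise a hypothetical helper. In the spider, selecting `.//span/text()` instead may avoid the need for this step.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def span_to_text(span_html):
    """Strip tags from one extracted <span> string and decode HTML entities."""
    text = TAG_RE.sub("", span_html)
    # Collapse whitespace runs left over from the page layout.
    return " ".join(html.unescape(text).split())


# Hypothetical strings as `.extract()` might return them; the real markup may differ.
extracted = [
    '<span class="refinementLink">A &amp; I Products</span>',
    '<span class="refinementLink">A &amp; L Engraving</span>',
    '<span>   </span>',
]

# Drop entries that are empty once the markup and whitespace are removed.
brands = [name for name in map(span_to_text, extracted) if name]
print(brands)
```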

Can someone help me correct my script?

0 answers:

No answers