Scraping Amazon deals for the first time - possibly AJAX?

Time: 2014-11-29 19:05:33

Tags: python ajax scrapy amazon

As a first-time Scrapy user, I would like to scrape deal information on Amazon.com, more specifically this page: http://www.amazon.com/Cyber-Monday/b/ref=sv_gb_2?ie=UTF8&node=5550342011&gb_hero_f_100=p:1,c:all,s:missed

Sorry, I wish I could post a screenshot here, but I don't have the reputation.

I want to extract all the deal item information (title, price, and discount, for each deal as well as the further deals reached by clicking the "Next" button on the page) under both the "Upcoming" and "Missed Deals" sections. I tried Scrapy with my code below, but had no luck. My thinking on the potential problems is:

(1) I defined the wrong XPath in "rules" or "parse_items" (this is possible but unlikely, since I copied the XPaths with the Chrome developer tools);

(2) The site loads its content with AJAX, in which case it would prompt me to use Selenium, as suggested in other threads. A quick check for this is sketched below.
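To test hypothesis (2), I figure one can fetch the page without a browser and check whether the deal markup appears in the raw HTML; if it is absent, the deals are injected client-side and plain Scrapy will never see them. A minimal sketch (the 'dealTitle' marker is just taken from my own XPaths below, so it is an assumption):

import urllib2

url = 'http://www.amazon.com/Cyber-Monday/b/ref=sv_gb_2?ie=UTF8&node=5550342011&gb_hero_f_100=p:1,c:all,s:missed'
# Amazon tends to reject the default urllib2 user agent, so spoof a browser
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
print 'dealTitle' in html  # False would suggest the deal list is built by AJAX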

Here is my code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import Selector, HtmlXPathSelector
from selenium import selenium
from deal.items import DealItem

class Dealspider(CrawlSpider):
    name = 'deal'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/b/ref=br_imp_ara-1?_encoding=UTF8&node=5550342011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-hero-2&pf_rd_r=16WPRNKJ91B97JW7TQ27&pf_rd_t=36701&pf_rd_p=1990071642&pf_rd_i=desktop']
    rules = (
        Rule(SgmlLinkExtractor(allow=('//td[@id="missed_filter"]'),
                               restrict_xpaths=('//a[starts-with(@title,"Next ")]',)),
             callback='parse_items'),
        Rule(SgmlLinkExtractor(allow=('//td[@id="upcoming_filter"]'),
                               restrict_xpaths=('//a[starts-with(@title,"Next ")]',)),
             callback='parse_items_2'),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        self.selenium = selenium("localhost", 4444, "*chrome", "http://www.amazon.com")
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()
        print self.verificationErrors
        CrawlSpider.__del__(self)

    # parse missed deals

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        pdt = hxs.select('//ul[@class="ulResized pealdshoveler"]')
        sel = self.selenium
        sel.open(response.url)  # I don't know where the url is
        items = []
        for t in pdt:
            item = DealItem()
            # relative XPaths (.//) so each field is read from this deal only
            item["missedproduct"] = t.select('.//li[contains(@id,"dealTitle")]/a/@title').extract()
            item["price"] = t.select('.//li[contains(@id,"dealDealPrice")]/b').extract()
            item["percentoff"] = t.select('.//li[contains(@id,"dealPercentOff")]/span').extract()
            items.append(item)
        return items

    # parse upcoming deals

    def parse_items_2(self, response):
        hxs = HtmlXPathSelector(response)
        pdt = hxs.select('//ul[@class="ulResized pealdshoveler"]')
        itemscurrent = []
        for t in pdt:
            item = DealItem()
            item["c_product"] = t.select('.//li[contains(@id,"dealTitle")]/a/text()').extract()
            item["c_price"] = t.select('.//li[contains(@id,"dealDealPrice")]/b').extract()
            item["c_percentoff"] = t.select('.//li[contains(@id,"dealPercentOff")]/span').extract()
            itemscurrent.append(item)
        return itemscurrent

At this point, Scrapy collects nothing. I am desperate to solve this myself, and I hope all of you smart people can help me.

Whatever insight you have, please post it here; it will be greatly appreciated! =) Thanks!

1 Answer:

Answer 0 (score: 0):

I can confirm that Selenium is a way to scrape this page.

Here is a partial solution you can build on; it finds the deals and prints their titles:

from scrapy.contrib.spiders import CrawlSpider
from selenium import webdriver

class AmazonSpider(CrawlSpider):
    name = "amazon"
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/b/ref=br_imp_ara-1?_encoding=UTF8&node=5550342011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-hero-2&pf_rd_r=16WPRNKJ91B97JW7TQ27&pf_rd_t=36701&pf_rd_p=1990071642&pf_rd_i=desktop']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # let a real browser render the JavaScript-built deal list
        self.driver.get(response.url)
        for element in self.driver.find_elements_by_css_selector('a.titleLink'):
            print element.text
        self.driver.close()
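One note on the design: closing the driver at the end of parse() shuts the browser after the first response, which will break if the crawl visits more pages. A sketch of a more robust cleanup, assuming the scrapy.contrib/pydispatch signal API that was current when this was asked:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class AmazonSpider(CrawlSpider):
    # name, allowed_domains, start_urls as above

    def __init__(self):
        self.driver = webdriver.Firefox()
        # close the browser once the whole crawl finishes, not per page
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()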

The result will be:

*Up to 50% off select Hasbro toys

Canon PowerShot S110 digital camera at over 45% off

Kids' digital cameras as low as 60% off

"Dragon Age Inquisition"*

I suggest you read the Selenium documentation to simulate the user pressing the "next" link (http://selenium-python.readthedocs.org/en/latest/api.html#module-selenium.webdriver.common.action_chains). A sketch is below.
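For example, a minimal sketch of paging through the deals by clicking that link, reusing the //a[starts-with(@title,"Next ")] XPath from the question (whether that locator and the a.titleLink selector still match the live page is an assumption):

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get('http://www.amazon.com/Cyber-Monday/b/ref=sv_gb_2?ie=UTF8&node=5550342011&gb_hero_f_100=p:1,c:all,s:missed')

while True:
    # scrape whatever deals are currently rendered
    for element in driver.find_elements_by_css_selector('a.titleLink'):
        print element.text
    try:
        # the "Next" locator is borrowed from the question's rules
        driver.find_element_by_xpath('//a[starts-with(@title,"Next ")]').click()
    except NoSuchElementException:
        break  # no more pages
    time.sleep(2)  # crude wait for the AJAX refresh; WebDriverWait would be more robust

driver.close()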