As a first-time Scrapy user, I want to scrape deal information from Amazon.com, specifically this page: http://www.amazon.com/Cyber-Monday/b/ref=sv_gb_2?ie=UTF8&node=5550342011&gb_hero_f_100=p:1,c:all,s:missed.
Sorry, I wish I could post a screenshot here, but I don't have the reputation.
I want to extract all of the deal item information (title, price, percent off for each deal, plus the further deals reached by clicking the "Next" button on the page) under both the "upcoming" and "missed deals" sections. I tried Scrapy with the code below, but had no luck. My thoughts on the potential problems are:
(1) I defined a wrong XPath in "rules" or "parse_items" (possible, but unlikely, since I copied the XPaths with Chrome's developer tools), or
(2) the site loads its content via AJAX, which would push me toward using Selenium, as suggested in other threads (a quick check for this is sketched below).
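To test hypothesis (2), a rough check is to fetch the raw HTML without a browser and look for the markup my XPaths target; if those class names never appear in the static source, the deal list must be injected by AJAX. (The User-Agent header below is only a guess at what Amazon needs in order not to block the request.)

import urllib2

# Fetch the start page the way Scrapy would: no JavaScript gets executed.
url = ('http://www.amazon.com/b/ref=br_imp_ara-1?_encoding=UTF8&node=5550342011'
       '&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-hero-2&pf_rd_r=16WPRNKJ91B97JW7TQ27'
       '&pf_rd_t=36701&pf_rd_p=1990071642&pf_rd_i=desktop')
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()

# If these markers are absent, the deal <ul> is built client-side.
print 'ulResized' in html
print 'dealTitle' in html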
Here is my code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import Selector, HtmlXPathSelector
from selenium import selenium

from deal.items import DealItem

class Dealspider(CrawlSpider):  # rules only work on a CrawlSpider subclass
    name = 'deal'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/b/ref=br_imp_ara-1?_encoding=UTF8&node=5550342011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-hero-2&pf_rd_r=16WPRNKJ91B97JW7TQ27&pf_rd_t=36701&pf_rd_p=1990071642&pf_rd_i=desktop']

    rules = (
        Rule(SgmlLinkExtractor(allow=('//td[@id="missed_filter"]',),
                               restrict_xpaths=('//a[starts-with(@title,"Next ")]',)),
             callback='parse_items'),
        Rule(SgmlLinkExtractor(allow=('//td[@id="upcoming_filter"]',),
                               restrict_xpaths=('//a[starts-with(@title,"Next ")]',)),
             callback='parse_items_2'),
    )

    def __init__(self):
        CrawlSpider.__init__(self)
        self.verificationErrors = []
        # a Selenium RC server must already be running on localhost:4444
        self.selenium = selenium("localhost", 4444, "*chrome", "http://www.amazon.com")
        self.selenium.start()

    def __del__(self):
        self.selenium.stop()
        print self.verificationErrors

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        pdt = hxs.select('//ul[@class="ulResized pealdshoveler"]')
        sel = self.selenium
        sel.open(response.url)  # I don't know where the url is
        items = []
        for t in pdt:
            item = DealItem()
            # the leading "." keeps each XPath relative to the current <ul>
            item["missedproduct"] = t.select('.//li[contains(@id,"dealTitle")]/a/@title').extract()
            item["price"] = t.select('.//li[contains(@id,"dealDealPrice")]/b').extract()
            item["percentoff"] = t.select('.//li[contains(@id,"dealPercentOff")]/span').extract()
            items.append(item)
        return items

    def parse_items_2(self, response):
        hxs = HtmlXPathSelector(response)
        pdt = hxs.select('//ul[@class="ulResized pealdshoveler"]')
        itemscurrent = []
        for t in pdt:
            item = DealItem()
            item["c_product"] = t.select('.//li[contains(@id,"dealTitle")]/a/text()').extract()
            item["c_price"] = t.select('.//li[contains(@id,"dealDealPrice")]/b').extract()
            item["c_percentoff"] = t.select('.//li[contains(@id,"dealPercentOff")]/span').extract()
            itemscurrent.append(item)
        return itemscurrent
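One more thing I'm not sure about: re-reading the docs, SgmlLinkExtractor's allow= argument takes regular expressions matched against link URLs, not XPath expressions (XPaths belong only in restrict_xpaths), so my rules may never match anything. If that's right, a rule would look more like this (the URL regex is only my guess at what the pagination links contain):

# allow= filters candidate link URLs by regex; restrict_xpaths limits where
# on the page links are harvested from. The regex is a guess at the
# pagination URLs.
Rule(SgmlLinkExtractor(allow=(r'node=5550342011',),
                       restrict_xpaths=('//a[starts-with(@title,"Next ")]',)),
     callback='parse_items', follow=True)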
At this point, Scrapy catches nothing, and I'm desperate to figure this out myself. I hope you smart people can help me.
Whatever insight you have, please post it here; it will be much appreciated! =) Thanks!
Answer 0 (score: 0)
I can confirm that Selenium is one way to scrape this page.
Here is a partial solution you can build on; it finds the deals and prints their titles:
from scrapy.contrib.spiders import CrawlSpider
from selenium import webdriver

class AmazonSpider(CrawlSpider):
    name = "amazon"
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/b/ref=br_imp_ara-1?_encoding=UTF8&node=5550342011&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-hero-2&pf_rd_r=16WPRNKJ91B97JW7TQ27&pf_rd_t=36701&pf_rd_p=1990071642&pf_rd_i=desktop']

    def __init__(self):
        CrawlSpider.__init__(self)
        self.driver = webdriver.Firefox()  # a real browser, so AJAX content gets rendered

    def parse(self, response):
        self.driver.get(response.url)
        # once the page has rendered, every deal title is an <a class="titleLink">
        for element in self.driver.find_elements_by_css_selector('a.titleLink'):
            print element.text
        self.driver.close()
The result will be:
Up to 50% Off Select Hasbro Toys
Over 45% Off the Canon PowerShot S110 Digital Camera
Up to 60% Off Kids' Digital Cameras
"Dragon Age Inquisition"
I suggest you read the Selenium documentation to simulate a user pressing the "next" link (http://selenium-python.readthedocs.org/en/latest/api.html#module-selenium.webdriver.common.action_chains).
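For instance, something along these lines (untested; the partial link text "Next" is an assumption about how the paging control is labelled) would walk through every page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('http://www.amazon.com/b/ref=br_imp_ara-1?_encoding=UTF8&node=5550342011'
           '&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=desktop-hero-2&pf_rd_r=16WPRNKJ91B97JW7TQ27'
           '&pf_rd_t=36701&pf_rd_p=1990071642&pf_rd_i=desktop')

while True:
    # print the deal titles on the page currently rendered
    for element in driver.find_elements_by_css_selector('a.titleLink'):
        print element.text
    try:
        # wait until a "Next" link is clickable, then click it like a user would
        next_link = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'Next')))
        next_link.click()
    except Exception:
        break  # no clickable "Next" link; assume this was the last page
driver.quit()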