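I'm fairly new to Python and Scrapy. I'm trying to scrape the links in the "Customers Who Bought This Item Also Bought" section. For example: http://www.amazon.com/Confessions-Economic-Hit-John-Perkins-ebook/dp/B001AFF266/. There are 17 pages of "Customers Who Bought This Item Also Bought". If I have Scrapy scrape that URL, it only scrapes the first page (6 items). How can I make Scrapy press the "next" button so that it scrapes all the items on the 17 pages? Sample code (just the important part of crawler.py) would be greatly appreciated. Thank you for your time!
OK, here is my code. As I said, I'm new to Python, so the code may look silly, but it does scrape the first page (6 items). I mostly work in Fortran or Matlab. I'd like to learn Python systematically when I have time.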
# Code of my crawler.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from beta.items import BetaItem

class AlphaSpider(CrawlSpider):
    name = 'alpha'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/s/ref=lp_4366_nr_p_n_publication_date_0?rh=n%3A283155%2Cn%3A%211000%2Cn%3A4366%2Cp_n_publication_date%3A1250226011&bbn=4366&ie=UTF8&qid=1384729756&rnid=1250225011']
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//h3/a',)), callback='parse_item'), )

    def parse_item(self, response):
        sel = Selector(response)
        stuff = BetaItem()
        isbn10R = sel.xpath('//li[b[contains(text(),"ISBN-10:")]]/text()').extract()
        isbn10 = []
        if len(isbn10R) > 0:
            isbn10 = [(isbn10R[0].split(' '))[1]]
        stuff['isbn10'] = isbn10
        starsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/span/@title').extract()
        stars = []
        if len(starsR) > 0:
            stars = [(starsR[0].split(' '))[0]]
        stuff['stars'] = stars
        reviewsR = sel.xpath('//div[contains(@id,"averageCustomerReviews")]/a[contains(@href,"showViewpoints=1")]/text()').extract()
        reviews = []
        if len(reviewsR) > 0:
            reviews = [(reviewsR[0].split(' '))[0]]
        stuff['reviews'] = reviews
        copsR = sel.xpath('//a[@class="sim-img-title"]/@href').extract()
        ncops = len(copsR)
        cops = [None] * ncops
        if ncops > 0:
            for idx, cop in enumerate(copsR):
                cops[idx] = ((cop.split('dp/'))[1].split('/ref'))[0]
        stuff['cops'] = cops
        return stuff
Answer 0 (score: 2)
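So I understand you are already able to scrape the details of these "Customers Who Bought This Item Also Bought" products. As you may have seen, they sit in a ul inside a div with the class "shoveler-content":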
<div id="purchaseButtonWrapper" class="shoveler-button-wrapper">
<a class="back-button" onclick="return false;" style="" href="#Back">
<div class="shoveler-content">
<ul tabindex="-1">
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">
<div id="purchase_B003LSTK8G" class="new-faceout p13nimp" data-ref="pd_sim_kstore_1" data-asin="B003LSTK8G">
...
</div>
</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
<li class="shoveler-cell" style="margin-left: 16px; margin-right: 16px;">...</li>
</ul>
</div>
<a class="next-button" onclick="return false;" style="" href="#Next">
<span class="auiTestSprite s_shvlNext">...</span>
</a>
</div>
</div>
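When you inspect your browser's network activity (with Firebug or the Chrome Inspect tool) and click the "next" button to see the next suggested products, you will see an AJAX query to a URL like this: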
http://www.amazon.com
/gp/product/features/similarities/shoveler/cell-render.html/ref=pd_sim_kstore?
id=B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG
&pos=7&refTag=pd_sim_kstore&wdg=ebooks_display_on_website
&shovelerName=purchase
(I'm using this product page: http://www.amazon.com/Boomerang-Travels-New-Third-World-ebook/dp/B005CRQ2OE)
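The id query parameter holds a list of ASINs, which are the next suggested products. Why 12 ASINs when only 6 are displayed at a time? Probably in-page caching for the next "next" click the user is likely to make.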
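As a minimal sketch of how such a URL could be assembled: the path and the parameter names (id, pos, refTag, wdg, shovelerName) are copied from the query observed above, while the function name `build_shoveler_url` and its default values are hypothetical. Whether the endpoint still accepts this form is an assumption.

```python
from urllib.parse import urlencode

# Path copied from the AJAX query seen in the browser's network tab.
BASE = ("http://www.amazon.com/gp/product/features/similarities/"
        "shoveler/cell-render.html/ref=pd_sim_kstore")


def build_shoveler_url(asins, pos, ref_tag="pd_sim_kstore",
                       wdg="ebooks_display_on_website",
                       shoveler_name="purchase"):
    """Build the AJAX URL for one batch of ASINs (hypothetical helper).

    `pos` is the value of the pos parameter; the defaults mirror the
    values seen in the observed query and may differ on other pages.
    """
    query = urlencode(
        {
            "id": ",".join(asins),  # comma-separated ASIN list
            "pos": pos,
            "refTag": ref_tag,
            "wdg": wdg,
            "shovelerName": shoveler_name,
        },
        safe=",",  # keep the commas unescaped, as in the observed URL
    )
    return BASE + "?" + query
```

Calling `build_shoveler_url(["B00261OOWQ", "B003XQEVUI"], 7)` gives a URL of the same shape as the one shown above, just with a shorter id list.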
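What do you get back from this AJAX query? Still in your browser's inspection tool, you'll see that the response has type application/json, and the response data is a JSON array of 12 elements, each of which is an HTML snippet like this: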
<div class="new-faceout p13nimp" id="purchase_B00261OOWQ" data-asin="B00261OOWQ" data-ref="pd_sim_kstore_7">
<a href="/Home-Game-Accidental-Guide-Fatherhood-ebook/dp/B00261OOWQ/ref=pd_sim_kstore_7" class="sim-img-title" >
<div class="product-image">
<img src="http://ecx.images-amazon.com/images/I/51ZBpvGgsUL._SL500_PIsitb-sticker-arrow-big,TopRight,35,-73_OU01_SS100_.jpg" width="100" alt="" height="100" border="0" />
</div> Home Game: An Accidental Guide to Fatherhood
</a>
<div class="byline">
<span class="carat">›</span>
<a href="http://www.amazon.com/Michael-Lewis/e/B000APZ33E/ref=pd_sim_kstore_bl_7">Michael Lewis</a>
</div>
<div class="rating-price">
<span class="rating-stars">
<span class="crAvgStars" style="white-space:no-wrap;">
<span class="asinReviewsSummary" name="B00261OOWQ">
<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_img_7">
<span class="auiTestSprite s_star_4_0 " title="4.1 out of 5 stars" >
<span>4.1 out of 5 stars</span>
</span>
</a>
</span>
(<a href="http://www.amazon.com/Home-Game-Accidental-Guide-Fatherhood-ebook/product-reviews/B00261OOWQ/ref=pd_sim_kstore_cm_cr_acr_txt_7">99</a>)
</span>
</span>
</div>
<div class="binding-platform"> Kindle Edition </div>
<div class="pricetext"><span class="price" style="margin-right:5px">$11.36</span></div>
</div>
So you basically get what was inside each <li> of <div class="shoveler-content"><ul> for the earlier suggested products.
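Decoding such a response could look roughly like the sketch below. It assumes each element of the JSON array is a plain string holding one HTML snippet shaped like the one above; the class name `FaceoutParser` and the helper functions are hypothetical, and only the standard library is used.

```python
import json
from html.parser import HTMLParser


class FaceoutParser(HTMLParser):
    """Collect the ASIN, link and title from one HTML snippet."""

    def __init__(self):
        super().__init__()
        self.asin = None
        self.href = None
        self._title_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and "data-asin" in attrs and self.asin is None:
            self.asin = attrs["data-asin"]
        elif tag == "a" and attrs.get("class") == "sim-img-title":
            self.href = attrs.get("href")
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self._title_parts.append(data)

    @property
    def title(self):
        # Collapse the whitespace left over from the markup layout.
        return " ".join("".join(self._title_parts).split())


def parse_fragment(fragment):
    parser = FaceoutParser()
    parser.feed(fragment)
    parser.close()
    return {"asin": parser.asin, "href": parser.href, "title": parser.title}


def parse_shoveler_response(body):
    """Decode the JSON array and parse every HTML snippet in it."""
    return [parse_fragment(fragment) for fragment in json.loads(body)]
```

The same fields could equally be pulled out with Scrapy's Selector and the XPath expressions already used in the question; the standard-library parser is used here only to keep the sketch self-contained.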
But how do you get the ASIN codes to put in the id parameter of the AJAX query?
Well, in the product page you'll notice this section:
<div id="purchaseSimsData"
class="sims-data" style="display:none"
data-baseAsin="B005CRQ2OE" data-featureId="pd_sim"
data-pageId="B005CRQ2OEr_sim_2" data-reftag="pd_sim_kstore"
data-wdg="ebooks_display_on_website" data-widgetName="purchase">
B003LSTK8G,B000VKVZR6,B003E20ZRY,B000RH0C9A,B000RH0CA4,B000YMDQRS,
B00261OOWQ,B003XQEVUI,B001NLL5WC,B000FC1KZC,B005G5PPGS,B0043RSJB8,
B004TSBWYC,B000RH0C8G,B0035IID08,B002AQRVXQ,B005DIAUN6,B000FC10QG,
B0018QQQKS,B002OTKEP6,B005PUWUKS,B007V65R54,B00B3VOTTI,B004EYT932,
B002UBRFFU,B000WJSB50,B000RH0DYE,B004JXXKWY,B003E8AJXI,B008TRU7PE,
B00555X8OA,B007OSIOWM,B00DLJIA54,B00139XTG4,B0058Z4NR8,B00ALBR6JG,
B004H0M8QS,B003F3PL7Q,B008UX8YPC,B000U913GG,B003HOXLVQ,B000VWM0MI,
B000SEIU28,B006VE7YS0,B008KPMBIG,B003CIQ57E,B0064EHZY0,B008UX3ITE,
B001NLKY38,B003VIWK4C,B005GSYZRA,B007YGGOVM,B004H4X84K,B00B5ZQ72Y,
B000R1BAH4,B008W02TIG,B000W8HC8I,B0036QVOKU,B000VRBBDC,B00APDGFOC,
B00EOAS0EK,B000QCS888,B001QIGZEK,B0074B55IK,B000FC12C8,B00AP2XVJ0,
B000FCK5YE,B006ID6UAW,B001FA0W5W,B005HFI0X2,B006ZOYM9K,B003SNJZ3Y,
B00C1N5WOI,B008EKORIY,B00C4GRK4W,B004V3WRNU,B00BV6RTUG,B001AFF266,
B00DUM1W3E,B00APDGGCS,B008WOUFIS,B008EKOO46,B008JHXO6S,B005AJM3U6,
B00BKRW6GI,B00CDUVSQ0,B00A287PG2,B009H679WA,B000VDUWMC,B009NF6IRW
</div>
This looks like the ASINs of all the suggested products.
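A rough sketch of pulling that list out of the page and splitting it into batches follows. Two things here are inferences rather than documented behaviour: that the first 6 ASINs are the ones already rendered in the page, and that pos is the 1-based index of the first ASIN in each batch (the observed query had pos=7 and started with the 7th ASIN of the list). The function names are hypothetical, and a regular expression is used only for brevity; a real HTML parser would be more robust.

```python
import re

# Matches the hidden div that carries the comma-separated ASIN list.
SIMS_RE = re.compile(
    r'<div\s+id="purchaseSimsData"[^>]*>(.*?)</div>', re.DOTALL
)


def extract_asins(page_html):
    """Return the ASINs listed in the purchaseSimsData div, in order."""
    match = SIMS_RE.search(page_html)
    if match is None:
        return []
    return [asin.strip() for asin in match.group(1).split(",") if asin.strip()]


def batches(asins, already_shown=6, size=12):
    """Yield (pos, chunk) pairs for the ASINs not yet displayed.

    Assumes pos is the 1-based index of the first ASIN in the chunk.
    """
    for start in range(already_shown, len(asins), size):
        yield start + 1, asins[start:start + size]
```

With the 90 ASINs listed above, this would produce 7 batches: six full batches of 12 starting at pos=7, plus a final batch holding the remaining 12.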
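So I suggest you emulate the successive AJAX queries to fetch the suggested products, 12 ASINs at a time, decode each response with the json package, and then parse each HTML snippet to extract the product information you want.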
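The loop itself might be sketched as below, outside Scrapy and with the standard library only. The URLs are assumed to be built elsewhere (one per batch of 12 ASINs); `http_get`, `iter_fragments`, the User-Agent header and the delay parameter are all hypothetical choices, and the response is assumed to be a JSON array of HTML strings as described above.

```python
import json
import time
from urllib.request import Request, urlopen


def http_get(url):
    """Fetch a URL and return its body as text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def iter_fragments(urls, fetch=http_get, delay=0.0):
    """Query each AJAX URL in turn and yield every HTML snippet.

    `fetch` is injectable so the loop can be exercised without network
    access; `delay` is an optional pause in seconds between queries.
    """
    for url in urls:
        for fragment in json.loads(fetch(url)):
            yield fragment
        if delay:
            time.sleep(delay)
```

Inside a Scrapy spider the equivalent would be to yield one Request per AJAX URL from parse_item, with a callback that decodes the JSON body and fills in the item.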
Answer 1 (score: 0)
I'd suggest avoiding Scrapy for now, since you're a beginner. Use the excellent Requests module to download pages (https://github.com/kennethreitz/requests) and BeautifulSoup to parse them (http://www.crummy.com/software/BeautifulSoup/).