I want to retrieve mobile phone price information from http://www.bigcmobiles.in/categories/Mobile-Phones-Smart-Phones/cid-CU00091056.aspx. I am using hxs.select('.//div[1]/div/div[1]/div/span/label[2]').extract(), but it gives me an empty list.
Can you explain why?
Answer 0 (score: 1)
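The problem is that the products (mobile phones) on this site are loaded dynamically through an XHR request, so they are not present in the HTML of the category page itself. You have to simulate that request in Scrapy to get the data you need. For more information on the topic, see:
Here is a spider for your case. Note that I got the URL from the Network tab of the Chrome developer tools: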
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class BigCMobilesItem(Item):
    title = Field()
    price = Field()


class BigCMobilesSpider(BaseSpider):
    name = "bigcmobile_spider"
    allowed_domains = ["bigcmobiles.in"]
    # XHR endpoint that returns the product listing for page 1
    start_urls = [
        "http://www.bigcmobiles.in/Handler/ProductShowcaseHandler.ashx?ProductShowcaseInput={%22PgControlId%22:1152173,%22IsConfigured%22:true,%22ConfigurationType%22:%22%22,%22CombiIds%22:%22%22,%22PageNo%22:1,%22DivClientId%22:%22ctl00_ContentPlaceHolder1_ctl00_ctl07_Showcase%22,%22SortingValues%22:%22%22,%22ShowViewType%22:%22%22,%22PropertyBag%22:null,%22IsRefineExsists%22:true,%22CID%22:%22CU00091056%22,%22CT%22:0,%22TabId%22:0}&_=1369724967084"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Each product sits in its own <div class="bucket">
        mobiles = hxs.select("//div[@class='bucket']")
        print mobiles
        for mobile in mobiles:
            item = BigCMobilesItem()
            item['title'] = mobile.select('.//h4[@class="mtb-title"]/text()').extract()[0]
            try:
                # The price is the second text node of the offer label
                item['price'] = mobile.select('.//span[@class="mtb-price"]/label[@class="mtb-ofr"]/text()').extract()[1].strip()
            except IndexError:
                # No second text node, so no price is listed for this product
                item['price'] = 'n/a'
            yield item
Save it as spider.py and run it with scrapy runspider spider.py -o output.json. In output.json you will then see:
{"price": "13,999", "title": "Samsung Galaxy S Advance i9070"}
{"price": "9,999", "title": "Micromax A110 Canvas 2"}
{"price": "25,990", "title": "LG Nexus 4 E960"}
{"price": "39,500", "title": "Samsung Galaxy S4 I9500 - Black"}
...
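These are the products from the first page. To get mobiles from the other pages, look at the XHR request the site is using: it has a PageNo parameter, which looks like exactly what you need.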
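As a rough sketch of that idea, the snippet below takes the first-page URL from the spider and swaps in a different PageNo value. The helper name page_url and the MAX_PAGES constant are made up for illustration, and the real number of pages is not known here. It also assumes the server accepts other PageNo values with the rest of the query unchanged, which I have not verified.

```python
import re

# First-page XHR URL, copied from start_urls in the spider above.
FIRST_PAGE_URL = (
    "http://www.bigcmobiles.in/Handler/ProductShowcaseHandler.ashx"
    "?ProductShowcaseInput={%22PgControlId%22:1152173,"
    "%22IsConfigured%22:true,%22ConfigurationType%22:%22%22,"
    "%22CombiIds%22:%22%22,%22PageNo%22:1,"
    "%22DivClientId%22:%22ctl00_ContentPlaceHolder1_ctl00_ctl07_Showcase%22,"
    "%22SortingValues%22:%22%22,%22ShowViewType%22:%22%22,"
    "%22PropertyBag%22:null,%22IsRefineExsists%22:true,"
    "%22CID%22:%22CU00091056%22,%22CT%22:0,%22TabId%22:0}"
    "&_=1369724967084"
)

# Hypothetical upper bound; check the site for the real page count.
MAX_PAGES = 3


def page_url(page_no, first_page_url=FIRST_PAGE_URL):
    """Return the XHR URL with its PageNo value replaced by page_no."""
    new_url, count = re.subn(
        r"(%22PageNo%22:)\d+",
        lambda match: match.group(1) + str(page_no),
        first_page_url,
    )
    if count != 1:
        raise ValueError("expected exactly one PageNo parameter in the URL")
    return new_url


# One URL per page; this list could be used as start_urls in the spider.
page_urls = [page_url(n) for n in range(1, MAX_PAGES + 1)]
```

Assigning page_urls to start_urls would make the spider request each page and run the same parse method on every response.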
Hope that helps.