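I want to scrape the prices and seller names with Scrapy, but I can't work out the correct XPath expressions to iterate over them. How do I get the right XPath so that I can retrieve the sellers and all the prices?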
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from scrapy.contrib.linkextractors import LinkExtractor

class mspItem(scrapy.Item):
    model_name = scrapy.Field()
    price = scrapy.Field()
    seller = scrapy.Field()

class criticspider(CrawlSpider):
    name = "msp_specs"
    allowed_domains = ["mysmartprice.com/"]
    #### Give array of URLS here, it will generate specs.json, run clean.py on it, mentioning words to include and remove ####
    start_urls = ["http://www.mysmartprice.com/mobile/microsoft-lumia-535-msp5042"]

    def parse(self, response):
        sites = response.xpath('//div[@id="pricetable"]//div[@class="store_pricetable"]')
        items = []
        item = mspItem()
        item['model_name'] = response.xpath('//h2[contains(@class,"priceindia")]/text()').extract()
        for site in sites:
            #item["seller"] = site.xpath("/@data-storename").extract()[0]
            item['price'] = site.xpath('//div[store_price_out]/text()').extract()
            items.append(item)
        return items
Updated code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin
from scrapy.contrib.linkextractors import LinkExtractor

class mspItem(scrapy.Item):
    model_name = scrapy.Field()
    price = scrapy.Field()
    seller = scrapy.Field()

class criticspider(CrawlSpider):
    name = "msp_specs"
    allowed_domains = ["mysmartprice.com/"]
    #### Give array of URLS here, it will generate specs.json, run clean.py on it, mentioning words to include and remove ####
    start_urls = ["http://www.mysmartprice.com/mobile/microsoft-lumia-535-msp5042"]

    def parse(self, response):
        sites = response.xpath('//div[contains(@class,"store_pricetable")]')
        items = []
        for site in sites:
            item = mspItem()
            item['model_name'] = response.xpath('//h2[contains(@class,"priceindia")]/text()').extract()
            item['price'] = site.xpath('.//div[@class="store_price"]/text()').extract()
            items.append(item)
        return items
Answer 0 (score: 0)
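I think the first XPath, the one for sites, is wrong. According to the page source, the div with class 'store_pricetable' is not a child of the div with id 'pricetable'.
Also, some of the divs have the class 'store_pricetable featured_seller', so you can use contains() here to get all divs whose class includes 'store_pricetable'.
The XPath for price is also wrong. //div[store_price_out] does not check what you intend: it selects div elements that have a child element named store_price_out, not div elements with that class.
You should use item['price'] = site.xpath('.//div[@class="store_price"]/text()').extract() instead, with the dot at the start to make the XPath relative to the current element.
Also, create a new item on each loop iteration. Do not reuse the same item object again and again, because each iteration overwrites the values set by the previous one.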
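To see why reusing one item object is a problem, here is a minimal sketch. It uses plain dicts in place of scrapy.Item (an assumption made so that it runs without Scrapy; the aliasing behaviour is the same for any mutable object), and the price values are made up.

```python
# Plain dicts stand in for scrapy.Item; the price strings are made-up samples.
prices = ["8,499", "8,650", "8,799"]

# Problem: one object is created before the loop and appended repeatedly.
shared = {}
reused = []
for p in prices:
    shared["price"] = p
    reused.append(shared)  # appends another reference to the same dict

# Fix: create a new object on every iteration.
fresh = []
for p in prices:
    item = {"price": p}
    fresh.append(item)

print([i["price"] for i in reused])  # every entry shows the last price
print([i["price"] for i in fresh])   # one entry per price
```

In the first loop the list ends up holding three references to a single object, so only the last assignment survives.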
Example: