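I am scraping a site through its search page and then looping over all of the results. However, it only seems to return the first result from each page. I also don't think it is picking up the results on the start page.
Secondly, the price comes back with some sort of Unicode in it (the £ symbol). How can I remove it entirely and keep just the price?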
'regular_price': [u'\xa38.59'],
Here is the HTML: http://pastebin.com/F8Lud0hu
Here is the spider:
import scrapy
import random
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from cdl.items import candleItem


class cdlSpider(CrawlSpider):
    name = "cdl"
    allowed_domains = ["www.xxxx.co.uk"]
    start_urls = ['https://www.xxxx.co.uk/advanced_search_result.php']
    rules = [
        Rule(LinkExtractor(
            allow=['advanced_search_result\.php\?sort=2a&page=\d*']),
            callback='parse_listings',
            follow=True)
    ]

    def parse_listings(self, response):
        sel = Selector(response)
        urls = sel.css('a.product_img')
        for url in urls:
            url = url.xpath('@href').extract()[0]
            return scrapy.Request(url,callback=self.parse_item)

    def parse_item(self, response):
        candle = candleItem()
        n = response.css('.prod_info_name h1')
        candle['name'] = n.xpath('.//text()').extract()[0]
        if response.css('.regular_price'):
            candle['regular_price'] = response.css('.regular_price').xpath('.//text()').extract()
        else:
            candle['was_price'] = response.css('.was_price strong').xpath('.//text()').extract()
            candle['now_price'] = response.css('.now_price strong').xpath('.//text()').extract()
        candle['referrer'] = response.request.headers.get('Referer', None)
        candle['url'] = response.request.url
        yield candle
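Answer 0 (score: 2)
Yes, it only returns the first result, and the cause is your parse_listings method: you return the first URL, which ends the method on the first loop iteration, when you should yield each one. I would do something like this: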
def parse_listings(self, response):
    # Yield one request per product link instead of returning after the first.
    for url in response.css('a.product_img::attr(href)').extract():
        yield scrapy.Request(url, callback=self.parse_item)
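The difference between return and yield inside a loop can be seen without Scrapy at all. The following is a minimal sketch in plain Python; the function names and the sample links are made up for illustration and are not part of the spider:

```python
def first_only(urls):
    """Mimics the original parse_listings: return exits on the first iteration."""
    for url in urls:
        return url


def all_urls(urls):
    """Mimics the fixed version: yield hands back every item, one at a time."""
    for url in urls:
        yield url


# Hypothetical product links, standing in for the hrefs found on a results page.
links = ["/product/1", "/product/2", "/product/3"]

print(first_only(links))       # only the first link
print(list(all_urls(links)))   # every link
```

Scrapy iterates over whatever the callback produces, so a generator lets it schedule a request for every product rather than just one.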
In this case, I would even go a step further and let the rules do the work:
class CdlspiderSpider(CrawlSpider):
    name = 'cdlSpider'
    allowed_domains = ['www.xxxx.co.uk']
    start_urls = ['https://www.xxxx.co.uk/advanced_search_result.php']
    rules = [
        # Follow pagination links; with no callback, follow defaults to True.
        Rule(LinkExtractor(allow=r'advanced_search_result\.php\?sort=2a&page=\d*')),
        # Send every product link to parse_item.
        Rule(LinkExtractor(restrict_css='a.product_img'), callback='parse_item')
    ]

    def parse_item(self, response):
        ...
        # re_first returns only the numeric part, which drops the £ sign.
        if response.css('.regular_price'):
            candle['regular_price'] = response.css('.regular_price::text').re_first(r'\d+\.?\d*')
        else:
            candle['was_price'] = response.css('.was_price strong::text').re_first(r'\d+\.?\d*')
            candle['now_price'] = response.css('.now_price strong::text').re_first(r'\d+\.?\d*')
        ...
        return candle
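To check what the pattern passed to re_first picks out, it can be tried on its own with the standard re module. This is only a sketch: extract_price is a hypothetical helper, and it uses re.search as a stand-in for the first-match behaviour rather than calling Scrapy itself:

```python
import re

# Same pattern as in the answer: digits, an optional dot, then optional digits.
PRICE_PATTERN = re.compile(r'\d+\.?\d*')


def extract_price(text):
    """Return the first numeric run in text, or None if there is none."""
    match = PRICE_PATTERN.search(text)
    return match.group(0) if match else None


print(extract_price(u'\xa38.59'))   # the value from the question
print(extract_price(u'no price'))   # no digits present
```

Note that this pattern would stop at a thousands separator, so a price such as £1,299.00 would need a different expression.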
Answer 1 (score: 1)
To remove the £, just replace it with an empty string, like this:
pricewithpound = u'\xa38.59'
price = pricewithpound.replace(u'\xa3', '')
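If the price is going to be used as a number later on, the cleaned string could then be converted. A small sketch, assuming every price looks like the one in the question (a £ sign followed by a plain decimal); parse_price is a hypothetical helper name:

```python
from decimal import Decimal


def parse_price(raw):
    """Strip the pound sign and surrounding whitespace, then convert to Decimal."""
    cleaned = raw.replace(u'\xa3', '').strip()
    return Decimal(cleaned)


price = parse_price(u'\xa38.59')
print(price)
```

Decimal is used here rather than float so that currency amounts are not subject to binary rounding.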
To investigate the Scrapy issue, could you provide the HTML source?