I am trying to isolate an integer from some HTML, for example "5,500 miles".
import scrapy

class AlfaShortSpider(scrapy.Spider):
    name = 'alfashort'

    def start_requests(self):
        yield scrapy.Request(url='https://www.pistonheads.com/classifieds/used-cars/alfa-romeo/giulia',
                             callback=self.parse_data)

    def parse_data(self, response):
        advert = response.xpath('//*[@class="ad-listing"]')
        title = advert.xpath('.//*[@class="listing-headline"]//h3/text()').extract()
        price = advert.xpath('.//*[@class="price"]/text()').extract()
        mileage = advert.xpath('.//*[@class="specs"]//li[1]/text()').extract()
        mileage = [item.strip() for item in mileage]
        mileage = [item.replace(',', '') for item in mileage]
        mileage = [item.replace(' miles', '') for item in mileage]
        for item in zip(title, price, mileage):
            price_data = {
                'title': item[0],
                'price': item[1],
                'mileage': item[2]
            }
            yield price_data
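My code successfully removes the comma and " miles", but my CSV output has unwanted blank entries in this column, which I think are caused by carriage returns in the original source. My CSV looks like this:
So the title and price columns are fine. The mileage column, however, is where the error is.
Is there something wrong with my strip() call?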
Answer 0 (score: 1)
Just change the XPath for mileage
from
mileage = advert.xpath( './/*[@class="specs"]//li[1]/text()' ).extract()
to
mileage = advert.xpath( './/*[@class="specs"]//li[1]/text()[2]' ).extract()
This selects only the second text node of each li instead of all of them, and you will get the correct output:
title,price,mileage
ALFA ROMEO GIULIA (0) V6 BITURBO QUADRIFOGLIO 2018 (2018),"£48,500",5500
ULEZ CHARGE EXEMPT! EURO 6 (2017),"£25,695",11450
ALFA ROMEO GIULIA (0) V6 BITURBO QUADRIFOGLIO NRING 2019 (2019),"£83,500",100
ALFA ROMEO GIULIA (0) TD SPECIALE 2017 (2017),"£22,500",23700
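
As for why strip() did not help: it removes whitespace inside each string, but it never removes a string from the list. If the li contains a whitespace-only text node ahead of the mileage text, that node becomes an empty string and still occupies a slot when the lists are zipped. The sketch below illustrates this in plain Python. The raw_nodes values are made up to mimic what .extract() might return (I have not copied them from the page), and parse_mileage is a hypothetical helper, not part of Scrapy.

```python
import re

# Hypothetical text nodes, as .extract() might return them for two adverts
# when each <li> holds a whitespace-only node before the mileage text.
raw_nodes = ['\r\n        ', '\r\n        5,500 miles\r\n    ',
             '\r\n        ', '\r\n        11,450 miles\r\n    ']

# strip() empties the whitespace-only nodes but does not drop them.
stripped = [node.strip() for node in raw_nodes]

def parse_mileage(text):
    """Return the mileage as an int, or None if the text has no digits."""
    match = re.search(r'\d[\d,]*', text)
    return int(match.group().replace(',', '')) if match else None

# Keep only the nodes that actually contain a number.
mileage = [m for m in map(parse_mileage, raw_nodes) if m is not None]
print(mileage)
```

Filtering like this is an alternative to text()[2] that does not depend on the position of the text node, at the cost of silently skipping any advert whose first spec has no digits, which would misalign the zip with title and price.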