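I am trying to parse various data items for every advert on pages such as https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750.
My code captures most items correctly. However, I have two problems:
1. The output in the Year column is the same for every row, even though its XPath is exactly the same as the one used for the title column, which works fine.
2. The values for Transmission are incorrect, because not all adverts have this field populated.
General comments on my code are also appreciated. Perhaps I should be using ItemLoaders for this? (I haven't learned how they work yet.)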
    import scrapy
    from datetime import date

    class SuperScraper(scrapy.Spider):
        name = 'ss22'

        def start_requests(self):
            urls = 'https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750'
            yield scrapy.Request(urls, callback = self.parse_data)

        def parse_data( self, response ):
            advert = response.xpath( '//*[@class="ad-listing"]')
            title = advert.xpath( './/*[@class="listing-headline"]//h3/text()' ).extract()
            year = advert.xpath( './/*[@class="listing-headline"]//h3/text()' ).extract()
            price = advert.xpath( './/*[@class="price"]/text()' ).extract()
            mileage = advert.xpath( './/*[contains(@class, "flaticon solid gauge-1")]/following-sibling::text()' ).extract()
            mileage = [item.strip() for item in mileage]
            mileage = [item.replace(',','') for item in mileage]
            mileage = [item.replace(' miles','') for item in mileage]
            timestamp = str(date.today()).split('.')[0]
            timestamps = [timestamp for i in range(len(title))]
            model = response.xpath('//head/title/text()').extract()
            model = [item.replace("Used ","") for item in model]
            model = [item.replace(" cars for sale with PistonHeads","") for item in model]
            models = [model for i in range(len(title))]
            transmission = advert.xpath('.//*[contains(@class, "flaticon solid location-pin-4")]/following-sibling::text()').extract()
            transmission = [item.strip() for item in transmission]
            link = advert.xpath( './/*[@class="listing-headline"]/a/@href' ).extract()
            link = ['https:\\www.pistonheads.com' + i for i in link]
            for item in zip(timestamps,link,models,title,year,price,mileage,transmission):
                price_data = {
                    'timestamp' : item[0],
                    'link' :item[1],
                    'model' : item[2],
                    'title' : item[3],
                    'year' : year[4],
                    'price' : item[5],
                    'mileage' : item[6],
                    'transmission' :item[7]
                }
                yield price_data
Answer 0 (score: 2)
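You have 'year' : year[4] in your dict. That indexes the whole year list rather than the current tuple produced by zip (which would be item[4]), so yes, it will always give you the same value.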
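To make the difference concrete, here is a minimal sketch with made-up data (the titles and years below are placeholders, not values from the site). It only shows how indexing the list differs from indexing the zipped tuple:

```python
# Made-up sample data standing in for the scraped columns.
titles = ["Car A", "Car B", "Car C"]
years = ["2015", "2017", "2019"]

# Buggy: years[1] looks at the list itself, so every row gets the same value.
buggy = [{"title": item[0], "year": years[1]} for item in zip(titles, years)]

# Fixed: item[1] looks at the current tuple, so each row gets its own value.
fixed = [{"title": item[0], "year": item[1]} for item in zip(titles, years)]

print([row["year"] for row in buggy])
print([row["year"] for row in fixed])
```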
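As for the transmission: you have 70 transmission values but 73 items, so zip pairs transmissions with the wrong items (it matches by position and stops at the shortest list). So I would suggest this approach instead: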
    import scrapy
    from datetime import date


    class SuperScraper(scrapy.Spider):
        name = 'ss22'

        def start_requests(self):
            url = 'https://www.pistonheads.com/classifieds?Category=used-cars&M=1044&ResultsPerPage=750'
            yield scrapy.Request(url, self.parse_data)

        def parse_data(self, response):
            # The model comes from the page title, so it is the same for every advert.
            model = response.xpath('//head/title/text()').get('')
            model = model.replace("Used ", "").replace(" cars for sale with PistonHeads", "")

            # Iterate advert by advert; .get('') falls back to '' when a field is missing.
            for row in response.xpath('//*[@class="ad-listing"]'):
                transmission = row.xpath('.//*[contains(@class, "flaticon solid location-pin-4")]/following-sibling::text()').get('')
                mileage = row.xpath('.//*[contains(@class, "flaticon solid gauge-1")]/following-sibling::text()').get('')
                price_data = {
                    'timestamp': str(date.today()),
                    'link': 'https://www.pistonheads.com' + row.xpath('.//*[@class="listing-headline"]/a/@href').get(''),
                    'model': model,
                    'title': row.xpath('.//*[@class="listing-headline"]//h3/text()').get('').strip(),
                    # Same XPath as 'title', as in the question.
                    'year': row.xpath('.//*[@class="listing-headline"]//h3/text()').get(''),
                    'price': row.xpath('.//*[@class="price"]/text()').get('').strip(),
                    'mileage': mileage.replace(',', '').replace(' miles', '').strip(),
                    'transmission': transmission.strip(),
                }
                yield price_data
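Here we iterate item by item, so each advert's transmission is looked up inside that advert only. If an advert has no transmission, it just gets an empty string instead of shifting the values of the adverts after it.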
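If it helps to see the misalignment without running a spider, here is a small standalone sketch. It uses the standard library's xml.etree.ElementTree instead of Scrapy, and the markup is a hypothetical, heavily simplified stand-in for the real page (the transmission class name is invented for the example). The second advert deliberately has no transmission:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: "Car B" has no transmission element.
HTML = """
<div>
  <div class="ad-listing"><h3>Car A</h3><span class="transmission">Manual</span></div>
  <div class="ad-listing"><h3>Car B</h3></div>
  <div class="ad-listing"><h3>Car C</h3><span class="transmission">Automatic</span></div>
</div>
"""

root = ET.fromstring(HTML.strip())

# Column-wise: two independent lists, which end up with different lengths.
titles = [h3.text for h3 in root.findall(".//div[@class='ad-listing']/h3")]
transmissions = [s.text for s in root.findall(".//span[@class='transmission']")]

# zip() pairs by position and stops at the shortest list, so "Car B"
# receives the transmission of "Car C", and "Car C" is dropped entirely.
by_column = list(zip(titles, transmissions))

# Row-wise: query inside each advert and default to '' when a field is absent.
by_row = []
for ad in root.findall(".//div[@class='ad-listing']"):
    title = ad.findtext("h3", default="")
    transmission = ad.findtext("span[@class='transmission']", default="")
    by_row.append((title, transmission))

print(by_column)
print(by_row)
```

The row-wise loop plays the same role as the for row in response.xpath(...) loop in the spider, with findtext(..., default="") standing in for .get('').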