XPath returns an empty value, which messes up the lists

Time: 2018-05-02 23:39:40

Tags: python web-scraping scrapy

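I am using the following code to scrape the dealer name, address, and number of cars from a web page.

However, every so often the number of cars comes back empty. Suppose that in the example below the 8th dealer returns no car count, so the returned lists look like this: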

names = a,b,c,d,e,f,g,h,i,j

addresses = aa,bb,cc,dd,ee,ff,gg,hh,ii,jj

cars = 1,2,3,4,5,6,7,9,10

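Dealer a at address aa has 1 car, dealer b at address bb has 2 cars, and so on. But because dealer h at address hh has an empty car count, that value is skipped: the code thinks dealer h has 9 cars and dealer i at address ii has 10 cars, and dealer j at address jj is dropped because the cars list has run out.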
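
The misalignment can be reproduced without Scrapy. Here is a minimal sketch that uses the letters and numbers from the example as stand-in data. `zip` pairs elements purely by position and stops at the shortest list, so one missing count shifts every later dealer and drops the last one.

```python
# Stand-in data from the example above; dealer h has no car count,
# so the flat cars list is one element short.
names = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
addresses = ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh", "ii", "jj"]
cars = ["1", "2", "3", "4", "5", "6", "7", "9", "10"]

# zip pairs by position and stops at the shortest input.
rows = list(zip(names, addresses, cars))

for name, address, count in rows:
    print(name, address, count)
```

The loop prints nine rows instead of ten: dealer h is paired with the count that belongs to dealer i, and dealer j never appears.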

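So, if the code returns an empty value for cars, how can I replace it with 0? In the example above, dealer h at address hh would then have 0 cars, dealer i at address ii would have 9, and dealer j at address jj would have 10.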

import scrapy

from autotrader.items import AutotraderItem

class AutotraderSpider(scrapy.Spider):
    name = "autotrader"
    allowed_domains = ["autotrader.co.uk"]

    start_urls = ["https://www.autotrader.co.uk/car-dealers/search?advertising-location=at_cars&postcode=m43aq&radius=1500&forSale=on&toOrder=on&sort=with-retailer-reviews&page=822"]

    def parse(self, response):
        for sel in response.xpath('//ul[@class="dealerList__container"]'):
            names = sel.xpath('.//*[@itemprop="legalName"]/text() ').extract()
            names = [name.strip() for name in names]
            addresses = sel.xpath('.//li/article/a/div/p[@itemprop="address"]/text()').extract()
            addresses = [address.strip() for address in addresses]
            carss = sel.xpath('.//li/article/a/div/p[@class="dealerList__itemCount"]/span/text()').extract() 
            carss = [cars.strip() for cars in carss]
            result = zip(names, addresses, carss)
            for name, address, cars in result:
                item = AutotraderItem()
                item['name'] = name
                item['address'] = address
                item['cars'] = cars
                yield item

2 answers:

Answer 0 (score: 1)

Your selector loop is a bit confusing.

Here you iterate over the unordered list (ul), of which there is only one per page:

for sel in response.xpath('//ul[@class="dealerList__container"]'):

What you want is to iterate over every list item:

for sel in response.xpath('//li[@class="dealerList__itemContainer"]'):

If you loop this way, you can get the name and address of each list item individually:

for sel in response.xpath('//li[@class="dealerList__itemContainer"]'):
    # Each dealer is handled on its own, so a missing field cannot shift the others.
    name = sel.xpath('.//*[@itemprop="legalName"]/text()').extract_first(default='').strip()
    # The address may span several text nodes, so join all of them.
    addresses = sel.xpath('.//article/a/div/p[@itemprop="address"]/text()').extract()
    address = ' '.join(part.strip() for part in addresses)
    # Fall back to '0' when the dealer has no car count.
    cars = sel.xpath('.//article/a/div/p[@class="dealerList__itemCount"]/span/text()').extract_first(default='0').strip()
    item = AutotraderItem()
    item['name'] = name
    item['address'] = address
    item['cars'] = cars
    yield item
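
The per-item idea can be tried outside a Scrapy project. The sketch below is only an illustration: it uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy selectors, and the `SAMPLE` markup and the `parse_dealers` helper are hypothetical, not taken from the real page. In Scrapy, the corresponding fallback would be `extract_first(default="0")` on the relative XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: the second dealer has no count element.
SAMPLE = """
<ul class="dealerList__container">
  <li class="dealerList__itemContainer">
    <span itemprop="legalName"> Dealer G </span>
    <p class="dealerList__itemCount"><span>7</span></p>
  </li>
  <li class="dealerList__itemContainer">
    <span itemprop="legalName"> Dealer H </span>
  </li>
  <li class="dealerList__itemContainer">
    <span itemprop="legalName"> Dealer I </span>
    <p class="dealerList__itemCount"><span>9</span></p>
  </li>
</ul>
"""


def parse_dealers(markup):
    """Return one dict per list item, using '0' when the count is missing."""
    root = ET.fromstring(markup)
    dealers = []
    for li in root.findall(".//li[@class='dealerList__itemContainer']"):
        # Queries are relative to this <li>, so fields stay aligned per dealer.
        name = li.findtext(".//span[@itemprop='legalName']", default="").strip()
        cars = li.findtext(".//p[@class='dealerList__itemCount']/span", default="0").strip()
        dealers.append({"name": name, "cars": cars})
    return dealers


dealers = parse_dealers(SAMPLE)
for dealer in dealers:
    print(dealer)
```

Because every lookup is scoped to one list item, the dealer without a count gets the default instead of borrowing the next dealer's value.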

Answer 1 (score: 0)

Try this to get the results. You can use XPaths in your Scrapy project as shown below:

import scrapy

class AutotraderSpider(scrapy.Spider):
    name = "autotrader"
    allowed_domains = ["autotrader.co.uk"]

    start_urls = ["https://www.autotrader.co.uk/car-dealers/search?advertising-location=at_cars&postcode=m43aq&radius=1500&forSale=on&toOrder=on&sort=with-retailer-reviews&page=822"]

    def parse(self, response):
        for items in response.xpath("//article[@class='dealerList__item']"):
            name = items.xpath(".//span[@itemprop='legalName']/text()").extract_first()
            address = ' '.join([' '.join(item.split()) for item in items.xpath(".//p[@class='dealerList__itemAddress']/text()").extract()])
            cars = items.xpath(".//span[@class='dealerList__itemCountNumber']/text()").extract_first()
            yield {"Name":name,"Address":address,"Cars":cars}

Partial output:

Midland Motors Leicester Street, Burton-On-Trent, Staffordshire DE14 3BA 2
Ns Cars 69 Eldon Street, Burton-On-Trent, Staffordshire DE15 0LT 1
RS Sales Nottingham Ltd Unit 1 TRINITY PARK, RANDALL PARK WAY, Retford, Nottinghamshire DN22 7WF 1
Adc Ltd Unit 3 HUCKNALL LANE, Nottingham, Nottinghamshire NG6 8AJ 5
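
When a dealer has no count element, `extract_first()` returns `None`, so the `Cars` field can end up empty rather than 0. One possible way to normalize it is a small conversion step before yielding. The helper name `to_car_count` below is hypothetical, and treating any non-numeric text as 0 is an assumption about what the caller wants.

```python
def to_car_count(text):
    """Convert scraped count text to an int; missing or non-numeric text becomes 0."""
    if text is None:
        return 0
    # Drop surrounding whitespace and thousands separators before checking.
    cleaned = text.strip().replace(",", "")
    return int(cleaned) if cleaned.isdecimal() else 0


# Example values as they might come back from a selector.
counts = [to_car_count(value) for value in ["2", " 5 ", None, ""]]
print(counts)
```

In the spider this could be applied as `"Cars": to_car_count(cars)`, which keeps the stored value numeric for every dealer.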