How to use XPath with Scrapy

Date: 2019-07-04 08:56:12

Tags: xpath web-scraping scrapy web-crawler

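I'm trying to extract a list of all the image URLs from https://www.rawson.co.za/property/for-sale/cape-town. However, the images are all on separate pages rather than on the main page. I've been using XPath to retrieve the other fields I need.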

I'm not sure how to retrieve all the URLs from those subpages into a list. Here is what I've tried:


    import scrapy


    class PropDataSpider(scrapy.Spider):
        name = "rawson"
        start_urls = ['https://www.rawson.co.za/property/for-sale/cape-town']


        def parse(self, response):
            properties = response.xpath("//div[@class='card__main']")
            for prop in properties:
                title = prop.xpath(
                    "./div[@class='card__body']/h3[@class='card__title']/a/text()").extract_first()
                price = prop.xpath(
                    "./div[@class='card__body']/div[@class='card__footer card__footer--primary']/div[@class='card__price']/text()").extract_first()
                description = prop.xpath(
                    "./div[@class='card__body']/div[@class='card__synopsis']/p/text()").extract_first()
                bedrooms = prop.xpath(
                    "./div[@class='card__body']/div[@class='card__footer card__footer--primary']/div[@class='features features--inline']/ol[@class ='features__list']/li[@class ='features__item'][1]/div[@class='features__label']/text()").extract_first()

    ...



                images = ['https://' + img for img in prop.xpath(
                    "main[@class='l-main']/section[@class='l-section']/div[@class='l-wrapper']/div[@class='l-section__main']/div[@class ='content-block content-block--flat']/div[@class ='gallery gallery--flat js-lightbox']/div[@ class ='row row--flat']/div[@class ='col']/a[@class ='gallery__link js-lightbox-image']/img/@src")]

                yield {'title': title, 'price':price, "description": description, 'bedrooms': bedrooms, 'bathrooms': bathrooms, 'garages': garages, 'images':images}

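However, this code returns None for the images, which makes sense, but I'm not sure how to handle it. Any suggestions would be greatly appreciated. Thanks in advance!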

1 answer:

Answer 0 (score: 0)

You need to use a separate request for each property's detail page and extract the image URLs in a second callback, because the gallery markup is not part of the listing page.
