Question

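I'm trying to scrape https://celulares.mercadolibre.com.ar/ with Scrapy 1.4.0. What I want is a list containing each product's description together with that product's img src. The problem is that when I run my spider, it returns the first 4 items correctly (description + corresponding img src), while the rest of the list has only the description and "None" for the img src. Looking at the page source, the only difference I can see between the first 5 items and the rest is that the first ones have the class attribute "lazy-load", while the others have a specific id such as "ML2178321". But since I don't reference the class name in my spider code, I don't understand why the behavior changes for the later items. I suspect some jQuery/JS thing I'm not aware of. Here is the code for one of the first item containers: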


<div class="image-content">

 <a href="https://articulo.mercadolibre.com.ar/MLA-644049024-samsung-galaxy-j7-prime-lector-de-huella16gb3gb-ram-_JM" class="figure item-image item__js-link"> 
 
 <img alt="Samsung Galaxy J7 Prime Lector De Huella+16gb+3gb Ram" src="https://http2.mlstatic.com/samsung-celulares-smartphones-D_Q_NP_771296-MLA25977210113_092017-X.jpg" class="lazy-load" srcset="https://http2.mlstatic.com/samsung-celulares-smartphones-D_Q_NP_771296-MLA25977210113_092017-X.jpg 1x, https://http2.mlstatic.com/samsung-celulares-smartphones-D_NQ_NP_771296-MLA25977210113_092017-V.jpg 2x" width="160" height="160"> 
 
 </a> 

</div>


And here is the container code for one of the later images (the ones that return "None" for img src):


 <div class="image-content">
 
 <a href="https://articulo.mercadolibre.com.ar/MLA-643729195-motorola-moto-g4-4ta-gen-4g-lte-16gb-ram-2gb-libre-gtia-_JM" class="figure item-image item__js-link"> 
 
 <img alt="Motorola Moto G4 4ta Gen 4g Lte 16gb Ram 2gb Libre Gtia" id="MLA643729195-I" srcset="https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.jpg 1x, https://http2.mlstatic.com/motorola-celulares-smartphones-D_NQ_NP_765168-MLA26028117832_092017-V.jpg 2x" src="https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.jpg" width="160" height="160"> 
 
 </a> 
 
 </div>


Finally, this is the code I'm running:

import scrapy
import time


class MlarSpider(scrapy.Spider):
    name = "mlar"
    allowed_domains = ["mercadolibre.com.ar"]
    start_urls = ['https://celulares.mercadolibre.com.ar/']

    def parse(self, response):
        # Each product listing on the results page
        SET_SELECTOR = '.results-item'
        for item in response.css(SET_SELECTOR):

            PRODUCTO_SELECTOR = '.item__info-title span ::text'
            IMAGEN_SELECTOR = '.image-content a img'

            yield {
                'producto': item.css(PRODUCTO_SELECTOR).extract_first(),
                'imagen': item.css(IMAGEN_SELECTOR).xpath("@src").extract_first(),
            }

        # Follow the pagination link, if there is one
        NEXT_PAGE_SELECTOR = '.pagination__next a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

I've implemented Barmar's suggestion and it works like a charm. I just added these lines to my spider:

        IXPATH= '@src'
        if item.css(IMAGEN_SELECTOR).xpath(IXPATH).extract_first() is None:
            IXPATH = '@data-src'
        yield {
            'producto': item.css(PRODUCTO_SELECTOR).extract_first(),
            'imagen': item.css(IMAGEN_SELECTOR).xpath(IXPATH).extract_first(),
        }

Answer 1

The later images don't have a src attribute. This is the code for that image:

<img width='160' height='160' alt='Motorola Moto G4 4ta Gen 4g Lte 16gb Ram 2gb Libre Gtia' id='MLA643729195-I' class='loading' title='https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.webp' data-src='https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.webp' data-srcset='https://http2.mlstatic.com/motorola-celulares-smartphones-D_Q_NP_765168-MLA26028117832_092017-X.webp 1x, https://http2.mlstatic.com/motorola-celulares-smartphones-D_NQ_NP_765168-MLA26028117832_092017-V.webp 2x' />

The image URL is in the data-src attribute, not src.

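The site is using a lazy-loading plugin, which waits until the user scrolls an image into the viewport before setting its src. At that point it copies the data-src attribute to src. What you posted is evidently the element after that has happened, not the original HTML source that Scrapy sees.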
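One way to confirm this is to inspect the attributes of the raw markup rather than the element shown in the browser's inspector. The sketch below uses only Python's standard-library html.parser on a shortened copy of the img tag quoted above; the names ImgAttrCollector, img_attributes and RAW_IMG are made up for illustration and are not part of Scrapy.

```python
from html.parser import HTMLParser

# Shortened copy of the raw <img> tag quoted above (as the server sends it,
# before the lazy-loading script has run).
RAW_IMG = (
    "<img width='160' height='160' id='MLA643729195-I' class='loading' "
    "data-src='https://http2.mlstatic.com/motorola-celulares-smartphones-"
    "D_Q_NP_765168-MLA26028117832_092017-X.webp' />"
)


class ImgAttrCollector(HTMLParser):
    """Hypothetical helper: collects the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <img ... />
        if tag == 'img':
            self.images.append(dict(attrs))


def img_attributes(html):
    """Return a list with one attribute dict per <img> tag in the HTML."""
    parser = ImgAttrCollector()
    parser.feed(html)
    parser.close()
    return parser.images


attrs = img_attributes(RAW_IMG)[0]
print('src' in attrs)        # is there a src attribute in the raw markup?
print('data-src' in attrs)   # is there a data-src attribute?
print(attrs.get('data-src'))
```

Running the same kind of check against response.text inside the spider (or in scrapy shell) should show which attributes are really present in the HTML that Scrapy downloads.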

You should be able to simply change your script to look for the data-src attribute if it can't find src.
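A minimal sketch of that fallback, assuming img is the selector the spider already builds with item.css('.image-content a img'). The helper names first_non_empty and image_url are hypothetical; the only selector calls used are .xpath() and .extract_first(), which the question's spider already relies on.

```python
def first_non_empty(*candidates):
    """Return the first candidate that is neither None nor empty, else None."""
    for value in candidates:
        if value:
            return value
    return None


def image_url(img):
    """Pick the image URL from an <img> selector.

    Prefers src (present on images that are already loaded) and falls back
    to data-src (used by the lazy-loading plugin for the remaining images).
    """
    return first_non_empty(
        img.xpath('@src').extract_first(),
        img.xpath('@data-src').extract_first(),
    )
```

In the spider, the 'imagen' field would then be image_url(item.css(IMAGEN_SELECTOR)). Trying src first keeps the existing behavior for the first few items and only changes the result where it was None.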

Why does getting img src with Scrapy give strange results?
