Scrapy spider: download all images from img src

Time: 2021-02-01 14:56:30

Tags: python scrapy web-crawler

I am scraping some links from a website with a Scrapy spider.

# image urls
look_inside_image_urls = response.xpath('//ul[@class="list-unstyled pages"]/li').extract_first()

for i in look_inside_image_urls:
    print("============> look_inside_image_urls ===============>", i)

But I get a None-type value. The li elements can hold any number of image links, and I want to download them all in a loop.

Here is my HTML:

<div class="lookInsideDiv" style="display: block;">
                <div class="exitBtn"><i class="ion-close-round"></i></div>
                <div class="pagesArea">
                    <ul class="list-unstyled pages">
                        
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg"></li>
                        
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/11f94595e_117698-2.jpg"></li>
                        
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/555959ec2_117698-3.jpg"></li>
                        
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/81b071d0c_117698-4.jpg"></li>
                        
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/30ef8b806_117698-5.jpg"></li>
                        
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/6cb40391f_117698-6.jpg"></li>
                        
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/a41c97880_117698-7.jpg"></li>
                        
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/d1a4bff6e_117698-8.jpg"></li>
                        
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/9503cfda1_117698-9.jpg"></li>
                        
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/54f1774ee_117698-10.jpg"></li>
                        
                    </ul>
                </div>
            </div>

I just want to get all the links inside the li elements, like this:

https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg
https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg
https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg
https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg

1 answer:

Answer 0 (score: 1)

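Try this: to extract all the images, use extract() (which returns a list of every match) instead of extract_first() (which returns only the first match), and select the src attribute of each img rather than the li element itself.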

look_inside_image_urls = response.xpath('//ul[@class="list-unstyled pages"]/li/img/@src').extract()

for i in look_inside_image_urls:
    print("============> look_inside_image_urls ===============>", i)

Edit

from scrapy.selector import Selector

html ="""<div class="lookInsideDiv" style="display: block;">
                <div class="exitBtn"><i class="ion-close-round"></i></div>
                <div class="pagesArea">
                    <ul class="list-unstyled pages">
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg"></li>
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/11f94595e_117698-2.jpg"></li>
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/555959ec2_117698-3.jpg"></li>
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/81b071d0c_117698-4.jpg"></li>
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/30ef8b806_117698-5.jpg"></li>
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/6cb40391f_117698-6.jpg"></li>
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/a41c97880_117698-7.jpg"></li>
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/d1a4bff6e_117698-8.jpg"></li>
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/9503cfda1_117698-9.jpg"></li>
                            <li><img src="https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/54f1774ee_117698-10.jpg"></li>
                    </ul>
                </div>
            </div>"""


data = Selector(text=html)
look_inside_image_urls = data.xpath('//*/ul[@class="list-unstyled pages"]/li/img/@src').extract()
for i in look_inside_image_urls:
    print("============> look_inside_image_urls ===============>", i)

Output:


============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/fc955fd4b_117698-1.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/11f94595e_117698-2.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/555959ec2_117698-3.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/81b071d0c_117698-4.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/30ef8b806_117698-5.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/6cb40391f_117698-6.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/a41c97880_117698-7.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/d1a4bff6e_117698-8.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/9503cfda1_117698-9.jpg
============> look_inside_image_urls ===============> https://s3-ap-southeast-1.amazonaws.com/rokomari110/LookInside20190827/54f1774ee_117698-10.jpg
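
With the URLs in a list, downloading them in a loop, as the question describes, could be sketched with the standard library alone. This bypasses Scrapy's scheduler and settings, so it is only a rough illustration; the helper names `filename_from_url` and `download_all` and the default `images` folder are hypothetical.

```python
import posixpath
import urllib.request
from pathlib import Path
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


def download_all(urls, dest_dir="images"):
    """Download each URL into dest_dir and return the list of saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        target = dest / filename_from_url(url)
        # Blocking request; this sketch has no retries or throttling.
        with urllib.request.urlopen(url) as resp:
            target.write_bytes(resp.read())
        saved.append(target)
    return saved
```

Calling `download_all(look_inside_image_urls)` would then save each file under its original name, for example `fc955fd4b_117698-1.jpg`.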