Question

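It appears that Scrapy downloads images in a random order, and I have been trying to find a way to order the images in one of two ways:

Download them in the order of the URL list
Sort the files (possibly by using metadata?) into the order in which they appear in the URL list

I would like to do this in the most efficient way, but right now I cannot figure out how to do it with either approach. I looked into the Scheduler, but I don't think it has any option to change this.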

Answer 1

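A solution that may or may not work, depending on the page, is to parse the HTML with lxml and build your own tree structure for the images. You traverse the HTML tree, look up the level at which each image sits, and build your own tree from that. Pretend you have this page: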

 |x|  |x|  |x|
 |x|  |x|  |x|
 |x|  |x|  |x|

where each x corresponds to an image. The structure of the parsed HTML document might look something like

<HTML>
     <Table>     
           <Column 1>
               Pic 1
               Pic 2
               Pic 3

           <Column 2>
               Pic 1
               Pic 2
               Pic 3

           <Column 3>
               Pic 1
               Pic 2
               Pic 3
    </Table>
</HTML>

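If you walk the tree that lxml creates and assign a depth to the images and their parents, you can build this structure, which tells you the order of the images: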

Depth 1       Column 1          Column 2            Column 3
Depth 2 Pic       1                  1                   1               
Depth 3 Pic       2                  2                   2        
Depth 4 Pic       3                  3                   3

This is just an idea, and it may not work for pages that are badly ordered and/or badly formatted.
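To make the depth idea concrete, here is a minimal sketch. It uses the standard library's `html.parser` in place of lxml so that it runs without extra packages; the same bookkeeping applies when walking an lxml tree. The class name `ImageOrderParser`, the helper `image_order`, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class ImageOrderParser(HTMLParser):
    """Record every <img> with its position and nesting depth, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.images = []  # list of (position, depth, src)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            self.images.append((len(self.images), self.depth + 1, src))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


def image_order(html):
    """Return (position, depth, src) tuples for all images in the page."""
    parser = ImageOrderParser()
    parser.feed(html)
    return parser.images


# Hypothetical page with two columns of two images each.
page = """
<html><body><table><tr>
  <td><img src="a1.jpg"><img src="a2.jpg"></td>
  <td><img src="b1.jpg"><img src="b2.jpg"></td>
</tr></table></body></html>
"""
order = image_order(page)
for position, depth, src in order:
    print(position, depth, src)
```

The position gives the order of the images in the document, and the depth lets you group images that sit at the same level, as in the table above.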

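I ran into this problem too. A quick workaround: once the links have been queued for scraping (basically when you call the main function or return the deeper requests), write the links to a file, so that the file lists them in the order in which you scrape.

Sorry, I'm at home right now, so I don't have access to the machine with the code. So you have a parse function, and I assume you are following links. Here is a rough sketch: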

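import itertools
import scrapy

class OrderedSpider(scrapy.Spider):
    name = "ordered"
    start_urls = ["https://example.com/"]  # placeholder start link
    ids = itertools.count(1)  # source of sequential numbers

    def parse(self, response):
        currentlink = response.url
        uniqueid = next(self.ids)  # uniqueid identifies the starting link
        with open("mylog.txt", "a") as f:
            f.write(currentlink + "\t" + str(uniqueid) + "\n")
        # (whatever your logic is for your start link)
        # Follow links; the selector below is a placeholder for your own logic.
        for href in response.css("a::attr(href)").getall():
            # Pass uniqueid along in request.meta.
            yield response.follow(href, callback=self.otherfn, meta={"uniqueid": uniqueid})

    def otherfn(self, response):
        picturelink = response.url  # the current link
        uniqueid = response.meta["uniqueid"]  # the id created in parse
        with open("mylog.txt", "a") as f:
            f.write(picturelink + "\t" + str(uniqueid) + "\n")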

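This is a rough outline, and there are many possible variations. I don't know whether it is the best solution, but it takes practically no run time and, assuming you are not going through a huge number of images/links, not much space either.

With two keys, callerid and sequentialid, you can recover the true order: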

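import itertools
import scrapy

class OrderedSpider(scrapy.Spider):
    name = "ordered"
    start_urls = ["https://example.com/"]  # placeholder start link
    callerids = itertools.count(1)  # source of sequential numbers

    def parse(self, response):
        currentlink = response.url
        callerid = next(self.callerids)  # callerid refers to the starting link
        with open("mylog.txt", "a") as f:
            f.write(f"{currentlink}\t{callerid}\n")
        # (whatever your logic is for your start link)
        # Follow links; the selector below is a placeholder for your own logic.
        links = response.css("a::attr(href)").getall()
        # sequentialid tells you the order in which the requests were issued.
        for sequentialid, href in enumerate(links, start=1):
            meta = {"callerid": callerid, "sequentialid": sequentialid}
            yield response.follow(href, callback=self.otherfn, meta=meta)

    def otherfn(self, response):
        picturelink = response.url  # the current link
        callerid = response.meta["callerid"]  # the ids created in parse
        sequentialid = response.meta["sequentialid"]
        with open("mylog.txt", "a") as f:
            f.write(f"{picturelink}\t{callerid}\t{sequentialid}\n")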
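Once the crawl has finished, the log can be sorted to recover the order. The following is only a sketch, assuming each picture line in the log holds the link, callerid, and sequentialid separated by tabs, one record per line. The helper name `read_order` and the sample URLs are made up for illustration.

```python
import csv
import io


def read_order(log_file):
    """Return picture links sorted by (callerid, sequentialid)."""
    rows = []
    for fields in csv.reader(log_file, delimiter="\t"):
        # Start-link lines carry only two fields, so they are skipped here.
        if len(fields) == 3:
            link, callerid, sequentialid = fields
            rows.append((int(callerid), int(sequentialid), link))
    return [link for _, _, link in sorted(rows)]


# Sample log in the assumed format, written in arrival order rather than request order.
sample = io.StringIO(
    "https://example.com/page1\t1\n"
    "https://example.com/b.jpg\t1\t2\n"
    "https://example.com/page2\t2\n"
    "https://example.com/c.jpg\t2\t1\n"
    "https://example.com/a.jpg\t1\t1\n"
)
ordered = read_order(sample)
print(ordered)
```

For a real run you would pass `open("mylog.txt", newline="")` instead of the in-memory sample. The sorted list can then be used to rename or number the downloaded files.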
