Question

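I've just started working through the Scrapy documentation, and I was wondering if someone could give me a proper line-by-line explanation of the following code: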

 def parse(self, response):
     filename = response.url.split("/")[-2] + '.html'
     with open(filename, 'wb') as f:
         f.write(response.body)

Answer 1

Have you seen http://doc.scrapy.org/en/stable/intro/tutorial.html#our-first-spider?

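parse(): a method of the spider that is called with the downloaded Response object of each start URL. The response is passed to the method as its first and only argument.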

# a method called parse that takes one argument: response 
def parse(self, response):
   # get the URL (string) from the response object [1]
   # split [2] the string on the "/" character
   # generate a filename from the list of split strings
   filename = response.url.split("/")[-2] + '.html'
   # open [3] a file called filename and write [4] into it the body
   # of the response (i.e. the contents of the scraped page) 
   with open(filename, 'wb') as f:
       f.write(response.body)

[1] http://doc.scrapy.org/en/stable/topics/request-response.html#scrapy.http.Response

[2] https://docs.python.org/2/library/stdtypes.html#str.split

[3] https://docs.python.org/2/library/functions.html#open

[4] https://docs.python.org/2/library/stdtypes.html#file.write
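If you want to try the same logic without running a crawl, here is a minimal sketch. FakeResponse, the out_dir parameter, and the example.com URL are my own stand-ins for illustration, not part of Scrapy; the real Response object has many more attributes.

```python
import os
import tempfile


class FakeResponse:
    """Hypothetical stand-in for scrapy.http.Response (illustration only)."""

    def __init__(self, url, body):
        self.url = url    # str: address of the downloaded page
        self.body = body  # bytes: raw content of the page


def parse(response, out_dir):
    # Same steps as the spider's parse(), without self and with an
    # explicit output directory (an assumption made for this sketch).
    filename = response.url.split("/")[-2] + ".html"
    path = os.path.join(out_dir, filename)
    with open(path, "wb") as f:
        f.write(response.body)
    return path


out_dir = tempfile.mkdtemp()
response = FakeResponse("http://www.example.com/Books/", b"<html>hi</html>")
saved = parse(response, out_dir)
print(os.path.basename(saved))  # Books.html
```

In a real spider Scrapy builds the Response and calls parse for you; the only part you write is the body of the method.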

Answer 2

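You have a spider that downloads a web page and saves the response to a file. The spider uses the parse method you defined as the callback for each response it receives: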

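Line 1: defines the parse method, which receives the response as its argument. The response is what you got back from the web server.

Line 2: defines the name of the file that will hold the response data. The URL is split on the "/" character and the second-to-last piece is taken ([-2]); when the URL ends with "/", that piece is the last path segment. Then .html is appended to form the filename.

Line 3: opens the file with that name for writing in binary mode ('wb').

Line 4: writes the HTML data taken from response.body to the file.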
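To see why [-2] is used and why the file is opened in 'wb', here is a small sketch you can run in a plain Python 3 interpreter. The URLs and the body value are made-up samples, and I'm assuming a current Scrapy on Python 3, where response.body is a bytes object.

```python
import os
import tempfile

# Line 2: split the URL on "/" and take the second-to-last piece.
with_slash = "http://www.example.com/Python/Books/"    # assumed sample URL
without_slash = "http://www.example.com/Python/Books"  # same URL, no trailing "/"

parts = with_slash.split("/")
print(parts)                         # ['http:', '', 'www.example.com', 'Python', 'Books', '']
print(parts[-2])                     # Books
print(without_slash.split("/")[-2])  # Python

# Lines 3-4: the body is bytes, so the file is opened in binary mode.
body = "<html>caf\u00e9</html>".encode("utf-8")  # stand-in for response.body
path = os.path.join(tempfile.mkdtemp(), parts[-2] + ".html")
with open(path, "wb") as f:
    f.write(body)  # bytes are written exactly as given, no re-encoding
```

Note that the trailing "/" matters: without it, [-2] picks the parent segment ("Python" here) rather than the last one.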
