Scrapy save HTML as a temporary file

Time: 2017-07-11 19:29:03

Tags: python amazon-s3 scrapy web-crawler

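I'm writing a Scrapy web crawler that saves the HTML of the pages I visit and uploads it to S3. Since the pages are being uploaded to S3, there's no need to keep a local copy.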

Spider class

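from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# MyItem is the project's Item class with 'url' and 'html' fields (definition not shown).

class MySpider(CrawlSpider):
    name = 'my name'
    start_urls = ['my url']
    allowed_domains = ['my domain']
    # Follow every extracted link and pass each response to parse_item.
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        item['html'] = response.body  # raw page bytes
        return item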

pipelines.py

import os

save_path = 'My path'

if not os.path.exists(save_path):
    os.makedirs(save_path)

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        # Name the local file after the last segment of the URL path.
        page = item['url'].split('/')[-1]
        filename = '%s.html' % page
        with open(os.path.join(save_path, filename), 'wb') as f:
            f.write(item['html'])
        self.UploadtoS3(filename)
        return item  # Scrapy expects process_item to return the item or raise DropItem

    def UploadtoS3(self, filename):
        ...

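I read in the Python documentation that I can create a NamedTemporaryFile: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile

I'm a bit unclear on when it gets deleted. If I use a NamedTemporaryFile, how do I delete the file after a successful upload to S3?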

1 Answer:

Answer 0 (score: 3)

Expanding on my comment:

Instead of saving, reading, and deleting a file, you can use io.StringIO to create an in-memory text buffer. (Since response.body is bytes, io.BytesIO is the matching buffer type for the item['html'] value stored by the spider in the question.)
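
A minimal sketch of the in-memory approach, under a few assumptions: upload_to_s3 and HtmlBufferPipeline are hypothetical names introduced for illustration, the upload function is a stand-in that only reads the buffer (the question's UploadtoS3 body is not shown), and io.BytesIO is used rather than io.StringIO because item['html'] holds bytes.

```python
import io


def upload_to_s3(fileobj, key):
    """Hypothetical stand-in for the question's UploadtoS3 method.

    A real implementation would hand the file object to an S3 client;
    here it only reads the buffer so the sketch runs on its own.
    """
    data = fileobj.read()
    return key, len(data)


class HtmlBufferPipeline:
    """Sketch of a pipeline that uploads from memory instead of from disk."""

    def process_item(self, item, spider):
        page = item['url'].split('/')[-1]
        key = '%s.html' % page
        # response.body is bytes, so BytesIO is used; StringIO would fit
        # if the item held decoded text instead.
        with io.BytesIO(item['html']) as buffer:
            upload_to_s3(buffer, key)
        # Nothing to delete: the buffer is released when the block exits.
        return item
```

If the upload happens to be done with boto3, the client's upload_fileobj method accepts a binary file-like object such as this buffer; which library the question actually uses is not stated.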

Documentation: https://docs.python.org/3/library/io.html
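
On the question's original point about when a NamedTemporaryFile is removed: with the default delete=True, the file is deleted as soon as it is closed, so wrapping the upload in a with block removes it afterwards without any explicit cleanup. The sketch below illustrates this; upload_to_s3 and save_and_upload are hypothetical names, and the upload is a stand-in that only checks the file size.

```python
import os
import tempfile


def upload_to_s3(path):
    """Hypothetical stand-in for the question's UploadtoS3 method."""
    return os.path.getsize(path)


def save_and_upload(html_bytes):
    # delete=True is the default: the file is removed when it is closed,
    # which happens on leaving the with block.
    with tempfile.NamedTemporaryFile(suffix='.html') as tmp:
        tmp.write(html_bytes)
        tmp.flush()  # make sure the bytes are on disk before uploading
        size = upload_to_s3(tmp.name)
        path = tmp.name
    return path, size


path, size = save_and_upload(b'<html></html>')
print(os.path.exists(path))  # the temporary file is gone after the with block
```

Note that the file is also removed if the upload raises an exception, so this is "delete on close" rather than "delete on success". The tempfile documentation also cautions that whether the name can be used to open the file a second time while it is still open varies across platforms.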