I'm writing a Scrapy web scraper that saves the HTML of each page I visit and uploads it to S3. Since the pages are uploaded to S3, there's no need to keep a local copy.
Spider class
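from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# MyItem is assumed to be defined in the project's items.py (not shown)

class MySpider(CrawlSpider):
    name = 'my name'
    start_urls = ['my url']
    allowed_domains = ['my domain']
    # Follow every extracted link and pass each response to parse_item
    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        item['html'] = response.body  # raw bytes of the page
        return item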
pipelines.py
import os

save_path = 'My path'
if not os.path.exists(save_path):
    os.makedirs(save_path)

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        # Name the file after the last path segment of the URL
        page = item['url'].split('/')[-1]
        filename = '%s.html' % page
        with open(os.path.join(save_path, filename), 'wb') as f:
            f.write(item['html'])
        self.UploadtoS3(filename)
        return item  # Scrapy expects process_item to return the item

    def UploadtoS3(self, filename):
        ...
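I read in the Python docs that I can create a NamedTemporaryFile: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile
I'm a bit unclear about when it gets deleted. If I use a NamedTemporaryFile, how do I delete the file once it has been uploaded to S3 successfully?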
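For reference, here is a minimal sketch of when a NamedTemporaryFile goes away. With the default delete=True, the file is removed as soon as it is closed, which happens on leaving the with block. The upload_stub function is a hypothetical stand-in for the S3 upload and is not part of the original code.

```python
import os
import tempfile


def upload_stub(path):
    # Hypothetical stand-in for the S3 upload; it only reports the file size.
    return os.path.getsize(path)


with tempfile.NamedTemporaryFile(suffix='.html') as tmp:
    tmp.write(b'<html></html>')
    tmp.flush()  # push buffered bytes to disk before anything reads the file
    temp_path = tmp.name
    existed_during = os.path.exists(temp_path)
    size = upload_stub(temp_path)

# Leaving the with block closed the file, and closing it deleted it.
existed_after = os.path.exists(temp_path)
```

Doing the upload inside the with block means no explicit cleanup call is needed afterwards. Note that the tempfile documentation warns that whether the name can be used to open the file a second time while it is still open varies across platforms.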
Answer 0 (score: 3)
Expanding on my comment:
Instead of saving, reading, and deleting a file, you can use io.StringIO to create an in-memory text buffer and upload its contents directly. Since response.body is bytes, io.BytesIO is the matching buffer type for this item.
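A minimal Python sketch of the in-memory buffer approach, under a few assumptions: the class name InMemoryHtmlPipeline, the upload callable, and fake_upload are all hypothetical and not from the original post. It uses io.BytesIO rather than io.StringIO because the item holds bytes.

```python
import io


class InMemoryHtmlPipeline:
    """Sketch: hand the page body to an uploader without touching the disk."""

    def __init__(self, upload):
        # `upload` is a hypothetical callable taking (file_like, key).
        # With boto3 it could wrap client.upload_fileobj(file_like, bucket, key).
        self.upload = upload

    def process_item(self, item, spider):
        page = item['url'].split('/')[-1]
        key = '%s.html' % page
        # The body is bytes, so wrap it in a BytesIO buffer.
        with io.BytesIO(item['html']) as buffer:
            self.upload(buffer, key)
        return item


# Usage with a stand-in uploader that records what it receives.
uploaded = {}


def fake_upload(file_like, key):
    uploaded[key] = file_like.read()


pipeline = InMemoryHtmlPipeline(fake_upload)
result = pipeline.process_item(
    {'url': 'http://example.com/about', 'html': b'<html></html>'},
    spider=None,
)
```

Scrapy instantiates pipelines itself, so in a real project the uploader would have to be created inside __init__ or a from_crawler class method rather than passed in by hand. Nothing is written to disk, so there is nothing to delete after the upload.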