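I'm trying to download links listed in a .csv file and store the downloaded files in a folder. I'm using a multithreaded download library for this, mt-files-downloader. The files download correctly, but downloading about 313 files takes far too long. The largest of these files is about 400 KB. When I tried a regular download with Node, I could fetch them all in a minute or two. With this library the download should be faster, since it is multithreaded, yet it takes much longer. My code is below; any help would be appreciated. Thanks!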
var rec;
csv
    .fromStream(stream, { headers: ["Recording", , , , , , , ,] })
    .on("data", function (records) {
        rec = records.Recording;
        //console.log(rec);
        download(rec);
    })
    .on("end", function () {
        console.log('Reading complete')
    });

function download(rec) {
    var filename = rec.replace(/\//g, '');
    var filePath = './recordings/'+filename;
    var downloadPath = path.resolve(filePath)
    var fileUrl = 'http:' + rec;
    var downloader = new Downloader();
    var dl = downloader.download(fileUrl, downloadPath);
    dl.start();
    dl.on('error', function(dl) {
        var dlUrl = dl.url;
        console.log('error downloading = > '+dl.url+' restarting download....');
        if(!dlUrl.endsWith('.wav') && !dlUrl.endsWith('Recording')){
            console.log('resuming file download => '+dlUrl);
            dl.resume();
        }
    });
}
Answer 0 (score: 0)
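You're right, downloading 313 files of 400 kB each should not take long, and I don't think this is caused by your code. Maybe the connection is bad? Have you tried downloading a single file with curl?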
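Either way, I see two problems in your approach that I can help with: every download is started at the same time, and the error handler retries immediately, with no delay and no retry limit.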
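The main drawback of consuming a stream through on('data') events is that the handler runs for every chunk more or less synchronously as the chunks are read. This means your code executes whatever is in the on('data') handler without ever waiting for your downloads to complete. The only limiting factor is then how fast the CSV can be read, and I'd expect millions of lines per second to be normal.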
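To see this effect in isolation, here is a minimal sketch that uses only Node's built-in stream module. `fakeDownload` and `demo` are hypothetical names made up for this illustration, and the timer merely stands in for a real download. All three "start" entries should be logged before the first "done", because the stream never waits for the promise created in the handler.

```javascript
const { Readable } = require("stream");

// Hypothetical stand-in for a download: resolves after a short delay.
function fakeDownload(name, log) {
  log.push(`start ${name}`);
  return new Promise((resolve) =>
    setTimeout(() => {
      log.push(`done ${name}`);
      resolve();
    }, 50)
  );
}

// Feeds three rows through a 'data' handler and records the order of events.
function demo() {
  const log = [];
  const pending = [];
  return new Promise((resolve) => {
    Readable.from(["a", "b", "c"])
      .on("data", (row) => {
        // The stream does not wait for this promise before emitting the next row.
        pending.push(fakeDownload(row, log));
      })
      .on("end", () => Promise.all(pending).then(() => resolve(log)));
  });
}

demo().then((log) => console.log(log.join("\n")));
```

With 313 rows instead of three, the same pattern starts 313 downloads almost simultaneously.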
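From the server's point of view, you are simply requesting 313 files at once. Without speculating about the server's actual internal mechanisms, this will leave some of those requests waiting and interfering with one another.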
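The general remedy is to cap how many requests are in flight at any moment. As a rough sketch of that idea with plain promises and no library, assuming a hypothetical helper named `runLimited` and a timer-based `fakeTask` in place of a real download:

```javascript
// Runs `worker` over `items` with at most `limit` calls in flight.
// `runLimited` is an illustrative helper, not part of any library.
async function runLimited(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the list is exhausted.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) runners.push(runner());
  await Promise.all(runners);
  return results;
}

// Demo with a fake task that tracks how many calls overlap.
async function demo() {
  let active = 0;
  let peak = 0;
  async function fakeTask(n) {
    active++;
    peak = Math.max(peak, active);
    await new Promise((resolve) => setTimeout(resolve, 10));
    active--;
    return n * 2;
  }
  const results = await runLimited([1, 2, 3, 4, 5, 6, 7], 3, fakeTask);
  return { results, peak };
}

demo().then(({ results, peak }) => console.log(results, "peak concurrency:", peak));
```

This sketch has no retry or error handling: if one worker rejects, `runLimited` rejects while the other runners keep going. A streaming framework takes care of this bookkeeping for you.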
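This can be solved with a streaming framework such as scramjet, event-stream or highland. I'm the author of the first one, and IMHO it is the easiest to use in this case, but you can use any of them by adjusting the code a little to match its API; they are all pretty similar anyway.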
Here is heavily commented code that runs several downloads in parallel:
const path = require("path");
const {StringStream} = require("scramjet");
const sleep = require("sleep-promise");
const Downloader = require('mt-files-downloader');
const downloader = new Downloader();

// First we create a StringStream from your csv stream
StringStream.from(csvStream)
    // we parse it as CSV without columns
    .CSVParse({header: false})
    // we set the limit of parallel operations, it will get propagated.
    .setOptions({maxParallel: 16})
    // now we extract the first column as `recording` and create a
    // download request.
    .map(([recording]) => {
        // here's the first part of your code
        const filename = recording.replace(/\//g, '');
        const filePath = './recordings/' + filename;
        const downloadPath = path.resolve(filePath);
        const fileUrl = 'http:' + recording;

        // at this point we return the dl object so we can keep these
        // parts separate.
        // see that the download hasn't been started yet
        return downloader.download(fileUrl, downloadPath);
    })
    // what we get is a stream of not started download objects
    // so we run this asynchronous function. Since it returns a Promise,
    // the stream waits for it.
    .map(
        async (dl) => new Promise((res, rej) => {
            // let's assume a couple retries we allow
            let retries = 10;

            dl.on('error', async (dl) => {
                try {
                    // here we reject if the download fails too many times.
                    if (retries-- === 0) throw new Error(`Download of ${dl.url} failed too many times`);

                    var dlUrl = dl.url;
                    console.log('error downloading = > ' + dl.url + ' restarting download....');

                    if (!dlUrl.endsWith('.wav') && !dlUrl.endsWith('Recording')) {
                        console.log('resuming file download => ' + dlUrl);
                        // let's wait half a second before retrying
                        await sleep(500);
                        dl.resume();
                    } else {
                        // we don't retry these URLs, so we give up on the file;
                        // otherwise this promise would never settle.
                        throw new Error(`Download of ${dlUrl} failed and will not be retried`);
                    }
                } catch(e) {
                    // here we call the `reject` function - meaning that
                    // this file wasn't downloaded despite retries.
                    rej(e);
                }
            });

            // here we call `resolve` function to confirm that the file was
            // downloaded.
            dl.on('end', () => res());

            // the handlers are attached, so now we actually start the download
            dl.start();
        })
    )
    // we log some message and ignore the result in case of an error
    .catch(e => {
        console.error('An error occurred:', e.message);
        return;
    })
    // Every stream must have some sink to flow to, the `run` method runs
    // every operation above.
    .run();
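You could also use the stream to push some kind of log messages and finish with pipe(process.stderr) instead of using those console.logs. See the scramjet documentation for more information, and the Mozilla doc on async functions.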
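As a rough illustration of the logging idea, the sketch below uses Node's built-in stream module rather than scramjet. The `toLogLines` helper and the `{ url, ok }` result shape are assumptions made up for this example. Each result object is turned into one line of text, and the final pipe sends those lines to stderr.

```javascript
const { Readable, Transform } = require("stream");

// Converts result objects into plain-text log lines.
// The { url, ok } shape is a hypothetical example, not an API of any library.
function toLogLines() {
  return new Transform({
    writableObjectMode: true, // objects in, text out
    transform(result, _encoding, callback) {
      const status = result.ok ? "downloaded" : "failed";
      callback(null, `${status} ${result.url}\n`);
    },
  });
}

// Usage: results flow through the formatter and end up on stderr.
Readable.from([
  { url: "http://example.com/a.wav", ok: true },
  { url: "http://example.com/b.wav", ok: false },
])
  .pipe(toLogLines())
  .pipe(process.stderr);
```

In a scramjet pipeline the same effect would come from mapping each download result to a string before the final pipe.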