Question

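I'm trying to download the links listed in a .csv file and store the downloaded files in a folder. I'm using a multithreaded library for this, mt-files-downloader. The files download fine, but downloading about 313 files takes far too long. The files are at most about 400 KB each. When I do a regular download with Node, I can fetch them all in a minute or two. With this library it should be faster, since it is multithreaded, yet it takes a lot of time. My code is below; any help would be appreciated. Thanks!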

var rec;

csv
    .fromStream(stream, { headers: ["Recording", , , , , , , ,] })
    .on("data", function (records) {
        rec = records.Recording;
        //console.log(rec);
        download(rec);
    })
    .on("end", function () {
        console.log('Reading complete')
    });

function download(rec) {
    var filename = rec.replace(/\//g, '');
    var filePath = './recordings/'+filename;
    var downloadPath = path.resolve(filePath)
    var fileUrl = 'http:' + rec;

    var downloader = new Downloader();
    var dl = downloader.download(fileUrl, downloadPath);
    dl.start();

    dl.on('error', function(dl) {
        var dlUrl = dl.url;
        console.log('error downloading = > '+dl.url+' restarting download....');

        if(!dlUrl.endsWith('.wav') && !dlUrl.endsWith('Recording')){
            console.log('resuming file download => '+dlUrl);
            dl.resume();
        }
    });
}

Answer 1

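You're right, downloading 313 files of 400 kB each shouldn't take long, and I don't think this has to do with your code. Maybe the connection is bad? Have you tried downloading a single file with curl?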

Anyway, I see two problems in your approach that I can help with:

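First, you download all the files at the same time (which may put some overhead on the server).
Second, your error handling runs in a loop, without waiting and without checking the actual file, so if there's a 404 you'll flood the server with requests.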
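The second point (retrying without any limit or pause) can be illustrated with a small helper. This is only a sketch in plain Node.js: `retryWithDelay`, `delay` and the option names are hypothetical and are not part of mt-files-downloader.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: calls `task` until it succeeds, making at most
// `retries` extra attempts and pausing `delayMs` between attempts.
async function retryWithDelay(task, { retries = 3, delayMs = 500 } = {}) {
    let lastError;
    for (let attempt = 0; attempt <= retries; attempt++) {
        try {
            // `task` receives the attempt number and must return a promise.
            return await task(attempt);
        } catch (err) {
            lastError = err;
            // Pause before the next attempt, but not after the last one.
            if (attempt < retries) await delay(delayMs);
        }
    }
    // Every attempt failed: give up and report the last error.
    throw lastError;
}
```

A download wrapped in a promise could then be passed in as `task`, so a permanent failure such as a 404 stops after a few attempts instead of looping forever.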

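The main drawback of using streams with an on('data') event is that all the chunks are handled more or less synchronously as they are read. This means your code executes whatever is in the on('data') handler without ever waiting for the downloads to complete. The only limiting factor now is how fast the CSV can be read, and I'd expect millions of lines per second to be normal.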
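To make this behaviour visible, here is a minimal sketch that uses only Node's built-in stream module. `fakeDownload` and `measurePeakConcurrency` are hypothetical names; the timer merely stands in for a real download.

```javascript
const { Readable } = require("stream");

// Hypothetical stand-in for a download: resolves after `ms` milliseconds.
const fakeDownload = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Starts one simulated download per item from a 'data' handler and reports
// the highest number of downloads that were in flight at the same time.
// Assumes `items` is a non-empty array.
function measurePeakConcurrency(items) {
    return new Promise((resolve, reject) => {
        let inFlight = 0;
        let peak = 0;
        let finished = 0;

        Readable.from(items)
            .on("data", () => {
                inFlight++;
                peak = Math.max(peak, inFlight);
                // Nothing waits for this promise, so the stream keeps
                // emitting rows while earlier "downloads" are still running.
                fakeDownload(100).then(() => {
                    inFlight--;
                    if (++finished === items.length) resolve(peak);
                });
            })
            .on("error", reject);
    });
}
```

Called with a short list, the reported peak equals the number of rows, because every row is handed to the handler long before the first simulated download finishes.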

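From the server's perspective, you are simply requesting 313 files at once, which, without wanting to speculate on the server's actual technical mechanisms, leaves some of those requests waiting and interfering with each other.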
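The usual remedy is to cap how many requests are in flight at once. As a rough, library-free sketch of that idea (the name `mapWithLimit` is hypothetical, and `worker` is assumed to return a promise):

```javascript
// Runs `worker(item, index)` for every item, with at most `limit` calls
// in flight at any moment. Results keep the order of `items`.
async function mapWithLimit(items, limit, worker) {
    const results = new Array(items.length);
    let next = 0;

    // Each runner claims the next unprocessed index until none are left.
    async function runner() {
        while (next < items.length) {
            const index = next++;
            results[index] = await worker(items[index], index);
        }
    }

    // Start `limit` runners (or fewer when there are fewer items).
    const count = Math.min(limit, items.length);
    await Promise.all(Array.from({ length: count }, () => runner()));
    return results;
}
```

With a limit of, say, 16, the server sees at most 16 simultaneous requests instead of 313, and a new request starts only when an earlier one has finished.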

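This can be solved by using a streaming framework such as scramjet, event-stream or highland. I'm the author of the first one, and IMHO it's the easiest in this case, but you can use any of them with some changes to the code to match their API. They are all pretty similar anyway.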

Here is heavily commented code that runs several downloads in parallel:

const path = require("path");
const {StringStream} = require("scramjet");
const sleep = require("sleep-promise");
const Downloader = require('mt-files-downloader');

const downloader = new Downloader();

// First we create a StringStream class from your csv stream
// (`csvStream` is the readable stream of the CSV file, called `stream` in the question)
StringStream.from(csvStream)
    // we parse it as CSV without columns
    .CSVParse({header: false})
    // we set the limit of parallel operations, it will get propagated.
    .setOptions({maxParallel: 16})
    // now we extract the first column as `recording` and create a
    // download request.
    .map(([recording]) => {
        // here's the first part of your code
        const filename = recording.replace(/\//g, '');
        const filePath = './recordings/'+filename;
        const downloadPath = path.resolve(filePath);
        const fileUrl = 'http:' + recording;

        // at this point we return the dl object so we can keep these
        // parts separate.
        // see that the download hasn't been started yet
        return downloader.download(fileUrl, downloadPath);
    })
    // what we get is a stream of not started download objects
    // so we run this asynchronous function. If this returns a Promise
    // it will wait
    .map(
        async (dl) => new Promise((res, rej) => {
            // let's assume a couple retries we allow
            let retries = 10;

            dl.on('error', async (dl) => {
                try {
                    // here we reject if the download fails too many times.
                    if (retries-- === 0) throw new Error(`Download of ${dl.url} failed too many times`);

                    var dlUrl = dl.url;
                    console.log('error downloading = > '+dl.url+' restarting download....');

                    if(!dlUrl.endsWith('.wav') && !dlUrl.endsWith('Recording')){
                        console.log('resuming file download => '+dlUrl);
                        // lets wait half a second before retrying
                        await sleep(500);
                        dl.resume();
                    }
                } catch(e) {
                    // here we call the `reject` function - meaning that 
                    // this file wasn't downloaded despite retries.
                    rej(e);
                }
            });
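            // here we call `resolve` function to confirm that the file was
            // downloaded.
            dl.on('end', () => res());
            // the handlers are attached, so now we actually start the download
            // (the first `map` only created the `dl` object).
            dl.start();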
        })
    )
    // we log some message and ignore the result in case of an error
    .catch(e => {
        console.error('An error occurred:', e.message);
        return;
    })
    // Every stream must have some sink to flow to, the `run` method runs
    // every operation above.
    .run();

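You could also use the stream to push some kind of log messages and end with pipe(process.stderr) instead of those console.logs. Please check the scramjet documentation for more information, and see the Mozilla doc on async functions.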
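As a rough illustration of the logging idea, here is a sketch that uses Node's built-in PassThrough stream rather than scramjet. `createLogger` is a hypothetical helper, and the destination defaults to process.stderr.

```javascript
const { PassThrough } = require("stream");

// Hypothetical helper: returns a function that writes one log line per call
// into a stream piped to `destination` (process.stderr by default).
function createLogger(destination = process.stderr) {
    const log = new PassThrough();
    // `end: false` leaves the destination open when the log stream ends.
    log.pipe(destination, { end: false });
    return (message) => log.write(`${message}\n`);
}
```

Calls such as `const log = createLogger(); log('resuming file download');` could then replace the console.log calls, keeping all diagnostics on stderr.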

How to download multiple links from a .csv file using multithreading in node.js?