Question

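I am building a web crawler in Node.js. Among other things, it can download files (images, etc.). I have put the code that handles downloads in a separate module.

I quickly realized that a file's URL sometimes does not end with an actual extension. This is a problem because I need to know the extension before writing the file to disk.

What I do is rely on the extension if it exists and, if it does not, fall back on the Content-Type header (which is not always present and is not consistently formatted).

Here is the code used to determine the file name: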

getFileName() {
    // Drop any query string or fragment so it is not mistaken for part of the extension.
    const urlWithoutQuery = this.url.split(/[?#]/)[0];
    const extension = path.extname(urlWithoutQuery); // Gets the extension, including the leading dot.
    const extensionWithoutDot = extension.slice(1);
    // Checks if the extension length makes sense. Pure hack: some "extensions" might not be actual ones.
    const urlEndsWithValidExtension = extensionWithoutDot.length >= 2 && extensionWithoutDot.length <= 4;
    const baseName = path.basename(urlWithoutQuery);
    console.log('extension', extension);
    let fileName = "";
    if (urlEndsWithValidExtension) { // If it makes sense, I treat it normally.
        fileName = sanitize(baseName);
    } else { // If not, I rely on the content type.
        // The header may be missing, or may carry parameters such as "; charset=utf-8".
        const contentType = this.response.headers['content-type'] || '';
        const subtype = (contentType.split(';')[0].split('/')[1] || '').trim();
        // Only append an extension when the header actually provided one.
        fileName = subtype ? `${sanitize(baseName)}.${subtype}` : sanitize(baseName);
    }

    const fileProcessor = new FileProcessor({ fileName, path: this.dest });
    if (this.clone) {
        fileName = fileProcessor.getAvailableFileName();
    }
    return fileName;
}

This code determines the file name before the actual stream begins (or before the ArrayBuffer is written to disk).
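Since the data is in hand before anything is written, one possible complement to the URL and header checks would be to look at the leading bytes of the content itself. The following is only a minimal sketch under that assumption: `sniffExtension` and the `SIGNATURES` table are hypothetical names, not part of the code above, and the table covers just a few well-known formats.

```javascript
// Hypothetical helper: guesses an extension from the first bytes of a file.
// Only a handful of well-known signatures are listed; this is not exhaustive.
const SIGNATURES = [
  { ext: 'png', bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { ext: 'jpg', bytes: [0xff, 0xd8, 0xff] },
  { ext: 'gif', bytes: [0x47, 0x49, 0x46, 0x38] }, // "GIF8"
  { ext: 'pdf', bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
];

function sniffExtension(data) {
  // Accepts a Buffer, a Uint8Array, or a raw ArrayBuffer.
  const view = data instanceof ArrayBuffer ? new Uint8Array(data) : data;
  for (const { ext, bytes } of SIGNATURES) {
    const matches =
      view.length >= bytes.length && bytes.every((b, i) => view[i] === b);
    if (matches) {
      return ext;
    }
  }
  return null; // Unknown: the caller falls back to the URL or Content-Type.
}

// Example usage with in-memory data.
console.log(sniffExtension(Buffer.from('GIF89a')));
console.log(sniffExtension(Buffer.from('plain text')));
```

A result of `null` simply means no listed signature matched, so the URL and Content-Type logic would still be needed as a fallback.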

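From my experience "playing around" with many different sites, I have learned to expect all sorts of "surprises". This code handles most of them, but obviously not all.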
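To make the Content-Type "surprises" concrete: the header can be missing, can carry parameters, can differ in letter case, or can have a subtype such as `svg+xml` that is not a usable extension. Below is a minimal sketch of a more defensive lookup. The name `extensionFromContentType` and the `KNOWN_TYPES` table are hypothetical and purely illustrative, and the table is far from complete.

```javascript
// Hypothetical lookup table from MIME type to file extension (illustrative only).
const KNOWN_TYPES = new Map([
  ['image/jpeg', 'jpg'],
  ['image/png', 'png'],
  ['image/gif', 'gif'],
  ['image/svg+xml', 'svg'],
  ['application/pdf', 'pdf'],
  ['text/html', 'html'],
]);

function extensionFromContentType(headerValue) {
  // The header is not always present.
  if (typeof headerValue !== 'string') {
    return null;
  }
  // Drop parameters such as "; charset=utf-8" and normalize case and whitespace.
  const mimeType = headerValue.split(';')[0].trim().toLowerCase();
  return KNOWN_TYPES.has(mimeType) ? KNOWN_TYPES.get(mimeType) : null;
}

// Example usage with a few header shapes.
console.log(extensionFromContentType('image/jpeg; charset=binary')); // jpg
console.log(extensionFromContentType('IMAGE/PNG'));                  // png
console.log(extensionFromContentType('image/svg+xml'));              // svg
console.log(extensionFromContentType(undefined));                    // null
```

Splitting on "/" alone would turn `image/svg+xml` into the extension `svg+xml`, which is why this sketch uses an explicit table instead.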

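Is there some solid, reliable way to get the real file extension? Any suitable npm module would also be fine (I could not find one).

What is the most comprehensive and reliable way to detect the extension of a downloaded file in Node.js?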

0 answers: