Question

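I am downloading a large number of images with wget in the terminal.

Example - $ wget -i images.txt

I have all the image URLs in the images.txt file.

However, the image URLs tend to look like example.com/uniqueNumber/images/main_250.jpg

which means that all the images end up being named main_250.jpg

What I really need is for each image to be saved under its entire URL, so that the "unique number" is part of the file name.

Any suggestions?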

Answer 1

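Assuming the image URLs are in a text file named images.txt, one URL per line, you can run
cat images.txt | sed 'p;s/\//-/g' | sed 'N;s/\n/ -O /' | xargs -n 3 wget
to download each image under a file name formed from its URL.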

Now for the explanation:

In this example I will use

https://www.newton.ac.uk/files/covers/968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY

as images.txt (you can add as many images to the file as you like, as long as they are in the same format).

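cat images.txt pipes the contents of the file to standard output.

sed 'p;s/\//-/g' prints the file to stdout with the URL on one line and the intended file name on the next line, like so:

https://www.newton.ac.uk/files/covers/968361.jpg
https:--www.newton.ac.uk-files-covers-968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY
https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY

sed 'N;s/\n/ -O /' joins the two lines for each image (the URL and the intended file name) into one line and puts the -O option between them (so that wget knows the second argument is the intended file name). The result of this step looks like this:

https://www.newton.ac.uk/files/covers/968361.jpg -O https:--www.newton.ac.uk-files-covers-968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY -O https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY

Finally, xargs passes each line to wget as its arguments. It has to invoke wget once per line, which is what -n 3 does (three arguments per call: the URL, -O, and the file name); a bare xargs wget would put every line into a single wget call, where only the last -O takes effect. The end result in this example is two images in the current directory, named https:--www.newton.ac.uk-files-covers-968361.jpg and https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY respectively.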
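The pipeline above relies on xargs splitting each line on whitespace, so it can misbehave if a URL contains spaces or quote characters. A minimal sketch of a loop-based alternative follows. The function names url_to_name and fetch_all and the DRY_RUN variable are made up for illustration and are not part of wget.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a URL into a flat file name by replacing
# every "/" with "-", the same mapping as sed 's/\//-/g' above.
url_to_name() {
    printf '%s\n' "$1" | tr '/' '-'
}

# Hypothetical helper: read URLs from a file (one per line) and download
# each one under its URL-derived name. With DRY_RUN=1 the wget commands
# are only printed, so the names can be checked before downloading.
fetch_all() {
    local list=$1 url name
    while IFS= read -r url || [ -n "$url" ]; do
        [ -z "$url" ] && continue          # skip blank lines
        name=$(url_to_name "$url")
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf 'wget -O %s %s\n' "$name" "$url"
        else
            wget -O "$name" "$url"
        fi
    done < "$list"
}
```

After sourcing the functions, DRY_RUN=1 fetch_all images.txt previews the commands and fetch_all images.txt runs the downloads. Calling wget once per URL is slower than a single wget -i run, but every file name stays under your control.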

Answer 2

With GNU Parallel you can do the following:

cat images.txt | parallel wget -O '{= s:/:-:g; =}' {}

Answer 3

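I have a rather inelegant solution that may not work everywhere.

As you may know, if your URL ends in a query, wget uses that query in the file name. For example, if you have http://domain/page?q=blabla, you end up with a file named page?q=blabla after the download. Usually this is annoying, but you can turn it to your advantage.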

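Say you want to download some index.html pages and want to keep track of their origin, as well as avoid ending up with index.html, index.html.1, index.html.2, and so on. Your input file urls.txt may look something like this:

https://google.com/
https://bing.com/
https://duckduckgo.com/

If you call wget -i urls.txt, you end up with those numbered index.html files. But if you "doctor" your URLs with a fake query, you get useful file names.

Write a script that appends each URL to itself as a query, e.g.

https://google.com/?url=https://google.com/
https://bing.com/?url=https://bing.com/
https://duckduckgo.com/?url=https://duckduckgo.com/

Looks cheesy, right? But if you now run wget -i urls.txt, you get the following files:

index.html?url=https:%2F%2Fbing.com%2F
index.html?url=https:%2F%2Fduckduckgo.com%2F
index.html?url=https:%2F%2Fgoogle.com%2F

instead of nondescript numbered index.html files. Sure, they look ugly, but you can clean up the file names, and voilà! Each file carries its origin in its name.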
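The "write a script" step and the later file name cleanup could look roughly like the following. This is only a sketch under assumptions: the function names doctor_urls and clean_names are invented here, and the cleanup assumes wget saved the files in exactly the form listed in this answer (index.html?url=... with slashes escaped as %2F), which may differ depending on the wget version and options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append each URL to itself as a fake "url" query.
# Reads URLs from stdin (one per line), drops blank lines, and writes
# the doctored URLs to stdout.
doctor_urls() {
    sed -e '/^$/d' -e 's|.*|&?url=&|'
}

# Hypothetical helper: in the given directory (default: current one),
# rename files such as
#   index.html?url=https:%2F%2Fbing.com%2F
# to
#   https:--bing.com-
clean_names() {
    local dir=${1:-.} f base new
    for f in "$dir"/index.html\?url=*; do
        [ -e "$f" ] || continue            # glob matched nothing
        base=${f##*/}
        # Strip the "index.html?url=" prefix, then turn each %2F into "-".
        new=$(printf '%s\n' "$base" | sed -e 's/^index\.html?url=//' -e 's/%2F/-/g')
        mv -- "$f" "$dir/$new"
    done
}
```

A possible sequence would be doctor_urls < urls.txt > doctored.txt, then wget -i doctored.txt, then clean_names in the download directory. The doctored list goes into a separate file here (doctored.txt, also an assumed name) so that the original urls.txt stays untouched.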

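The approach probably has some limitations, for example if the site you are downloading from actually evaluates the query and parses its parameters.

Otherwise, you will have to solve the file name/source URL problem outside of wget, either with a bash script or in another programming language.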
