Parallel wget downloads do not exit cleanly

Asked: 2018-02-06 19:42:23

Tags: bash wget background-process

I am trying to download files from a file (test.txt) that contains links (more than 15,000 of them).

I have this script:

#!/bin/bash

function download {
    FILE=$1

    while read -r line; do
        url=$line

        wget -nc -P ./images/ "$url"

        # download images which are not in test.txt,
        # by guessing names: 12345_001.jpg, 12345_002.jpg .. 12345_005.jpg etc.
        wget -nc -P ./images/ "${url%.jpg}"_{001..005}.jpg
    done < "$FILE"
}

# test.txt contains the URLs
split -l 1000 ./temp/test.txt ./temp/split

# read the split files and pass each one to the download function
for f in ./temp/split*; do
    download "$f" &
done

test.txt:

http://xy.com/12345.jpg
http://xy.com/33442.jpg
...

I split the file into chunks and background the wget processes (download $f &) so the script can move on to the next split file of links.

The script runs, but it does not exit at the end; I have to press Enter when it finishes. If I remove the & from the line download $f &, it works normally, but then I give up the parallel downloading.
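A minimal sketch of one standard bash pattern for this situation (my own illustration, not taken from the answers): launch the jobs with & and then collect them with a single wait, so the script blocks until every background job has finished. Here sleep stands in for the real download function:

```shell
#!/bin/bash

# sleep stands in for the question's `download "$f" &` calls
for i in 1 2 3; do
    sleep 0.1 &
done

wait    # block until all background jobs have finished

done_msg="all downloads finished"
echo "$done_msg"
```

With wait in place, the script only returns to the prompt after the last background job exits.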

EDIT

I found out that this is not the best way to parallelize wget downloads. Using GNU Parallel would be great.


4 answers:

Answer 0 (score: 3)

The script is exiting, but the backgrounded wget processes produce output after the script exits, and it gets printed after the shell prompt. That is why you need to press Enter to get another prompt.

Use wget's -q option to turn off that output.

while read line; do
        url=$line
        wget -ncq -P ./images/ "$url"
        wget -ncq  -P ./images/ "${url%.jpg}"_{001..005}.jpg
done < "$FILE"

Answer 1 (score: 1)

@Barmar's answer is correct. However, I would like to propose a different, more efficient solution: you could consider using Wget2.

Wget2 is the next major version of GNU Wget. It comes with many new features, including multi-threaded downloading. So with GNU Wget2, all you need to do is pass the --max-threads option and choose the number of parallel threads to spawn.
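A hypothetical invocation might look like the following. The command is only built as a string here (wget2 may not be installed), and the exact flag set, --max-threads combined with the classic -nc/-P/-i options, is my assumption rather than something the answer shows:

```shell
#!/bin/bash
# hypothetical: assumes wget2 is installed and URLs.txt exists
cmd="wget2 --max-threads=32 -nc -P ./images/ -i URLs.txt"
echo "$cmd"
```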

You can compile it very easily from the git repository. There are also packages for Arch Linux in the AUR, and for Debian.

EDIT: Full disclosure: I am one of the maintainers of GNU Wget and GNU Wget2.

Answer 2 (score: 1)

May I recommend GNU Parallel to you?

parallel --dry-run -j32 -a URLs.txt 'wget -ncq -P ./images/ {}; wget -ncq  -P ./images/ {.}_{001..005}.jpg'

I can only guess what your input file URLs.txt looks like:

http://somesite.com/image1.jpg
http://someothersite.com/someotherimage.jpg
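The {.} replacement string in GNU Parallel passes the input line with its last extension removed; in plain bash the equivalent is ${var%.*}. A small demonstration (the URL is just an example value):

```shell
#!/bin/bash
url="http://somesite.com/image1.jpg"

# GNU Parallel's {.}: strip the last extension
base="${url%.*}"
echo "$base"    # http://somesite.com/image1

# the filenames the second wget in the parallel command would try
echo "${url%.jpg}"_{001..005}.jpg
```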

Or, using your own approach with a function:

#!/bin/bash

# define and export a function for "parallel" to call
doit(){
   wget -ncq -P ./images/ "$1"
   # quote only the variable; brace expansion does not happen inside double quotes
   wget -ncq -P ./images/ "$2"_{001..005}.jpg
}
export -f doit
export -f doit

parallel --dry-run  -j32 -a URLs.txt doit {} {.}
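Note that brace expansion does not happen inside double quotes, so the pattern only expands when the braces are left outside the quoted part (e.g. "$2"_{001..005}.jpg). A minimal demonstration, with stem as an example value:

```shell
#!/bin/bash
stem="12345"

# braces inside double quotes stay literal
quoted="${stem}_{001..005}.jpg"
echo "$quoted"    # 12345_{001..005}.jpg

# braces outside the quoted variable expand as intended
unquoted=$(echo "${stem}"_{001..005}.jpg)
echo "$unquoted"
```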

Answer 3 (score: 0)

  1. Please read the wget man page/help.
  2. Logging and input file:

    -i,  --input-file=FILE     download URLs found in local or external FILE.

      -o,  --output-file=FILE    log messages to FILE.
      -a,  --append-output=FILE  append messages to FILE.
      -d,  --debug               print lots of debugging information.
      -q,  --quiet               quiet (no output).
      -v,  --verbose             be verbose (this is the default).
      -nv, --no-verbose          turn off verboseness, without being quiet.
           --report-speed=TYPE   Output bandwidth as TYPE.  TYPE can be bits.
      -i,  --input-file=FILE     download URLs found in local or external FILE.
      -F,  --force-html          treat input file as HTML.
      -B,  --base=URL            resolves HTML input-file links (-i -F)
                                 relative to URL.
           --config=FILE         Specify config file to use.
    

    Download:

    -nc, --no-clobber          skip downloads that would download to
                               existing files (overwriting them).

      -t,  --tries=NUMBER            set number of retries to NUMBER (0 unlimits).
           --retry-connrefused       retry even if connection is refused.
      -O,  --output-document=FILE    write documents to FILE.
      -nc, --no-clobber              skip downloads that would download to
                                     existing files (overwriting them).
      -c,  --continue                resume getting a partially-downloaded file.
           --progress=TYPE           select progress gauge type.
      -N,  --timestamping            don't re-retrieve files unless newer than
                                     local.
      --no-use-server-timestamps     don't set the local file's timestamp by
                                     the one on the server.
      -S,  --server-response         print server response.
           --spider                  don't download anything.
      -T,  --timeout=SECONDS         set all timeout values to SECONDS.
           --dns-timeout=SECS        set the DNS lookup timeout to SECS.
           --connect-timeout=SECS    set the connect timeout to SECS.
           --read-timeout=SECS       set the read timeout to SECS.
      -w,  --wait=SECONDS            wait SECONDS between retrievals.
           --waitretry=SECONDS       wait 1..SECONDS between retries of a retrieval.
           --random-wait             wait from 0.5*WAIT...1.5*WAIT secs between retrievals.
           --no-proxy                explicitly turn off proxy.
      -Q,  --quota=NUMBER            set retrieval quota to NUMBER.
           --bind-address=ADDRESS    bind to ADDRESS (hostname or IP) on local host.
           --limit-rate=RATE         limit download rate to RATE.
           --no-dns-cache            disable caching DNS lookups.
           --restrict-file-names=OS  restrict chars in file names to ones OS allows.
           --ignore-case             ignore case when matching files/directories.
      -4,  --inet4-only              connect only to IPv4 addresses.
      -6,  --inet6-only              connect only to IPv6 addresses.
           --prefer-family=FAMILY    connect first to addresses of specified family,
                                     one of IPv6, IPv4, or none.
           --user=USER               set both ftp and http user to USER.
           --password=PASS           set both ftp and http password to PASS.
           --ask-password            prompt for passwords.
           --no-iri                  turn off IRI support.
           --local-encoding=ENC      use ENC as the local encoding for IRIs.
           --remote-encoding=ENC     use ENC as the default remote encoding.
           --unlink                  remove file before clobber.       
    
    3. See "how to wait wget finished" to get more resources.
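Putting a few of the options above together, a hypothetical invocation (built only as a string here, since it assumes the question's ./temp/test.txt and a writable log file) might look like:

```shell
#!/bin/bash
# hypothetical command combining -nc, -q, -a and -i from the listing above
cmd="wget -nc -q -a ./wget.log -P ./images/ -i ./temp/test.txt"
echo "$cmd"
```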