I am new to bash.
I want to fetch some resources in parallel.
What is wrong with the following code:
for item in $list
do
if [ $i -le 10 ];then
wget -b $item
let "i++"
else
wait
i=1
fi
When I execute this shell script, it throws an error:
fork: Resource temporarily unavailable
My question is how to use wget correctly for this.
The problem is that there are about four thousand URLs to download; if I let all of those jobs run in parallel, "fork: Resource temporarily unavailable" is thrown. I don't know how to control the number of jobs running in parallel.
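For reference, here is a minimal sketch of what the loop above probably intended (an assumption, not the original code): close the loop with done, and start wget as a shell background job with & rather than -b, so that wait can actually block on the batch:
#!/bin/bash
i=0
for item in $list; do     ## $list is assumed to hold whitespace-separated URLs
  wget "$item" &          ## plain &, not -b, so the shell can wait on the job
  ((i++))
  if [ "$i" -ge 10 ]; then
    wait                  ## block until the current batch of 10 finishes
    i=0
  fi
done
wait                      ## wait for the last, possibly partial, batch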
Answer 0 (score: 4)
Check the background jobs with jobs|grep:
#!/bin/bash
urls=('www.cnn.com' 'www.wikipedia.org')   ## input data
for ((i=-1;++i<${#urls[@]};)); do
  curl -L -s ${urls[$i]} >file-$i.html &   ## background jobs
done

until [[ -z `jobs|grep -E -v 'Done|Terminated'` ]]; do
  sleep 0.05; echo -n '.'                  ## do something while waiting
done
echo; ls -l file*\.html                    ## list downloaded files
Result:
............................
-rw-r--r-- 1 xxx xxx 155421 Jan 20 00:50 file-0.html
-rw-r--r-- 1 xxx xxx 74711 Jan 20 00:50 file-1.html
Another variant, running simple tasks in parallel:
#!/bin/bash
urls=('www.yahoo.com' 'www.hotmail.com' 'stackoverflow.com')

_task1(){                      ## task 1: download files
  for ((i=-1;++i<${#urls[@]};)); do
    curl -L -s ${urls[$i]} >file-$i.html &
  done; wait
}
_task2(){ echo hello; }        ## task 2: a fake task
_task3(){ echo hi; }           ## task 3: a fake task

_task1 & _task2 & _task3 &     ## run them in parallel
wait                           ## and wait for them
ls -l file*\.html              ## list results of all tasks
echo done                      ## and do something
Result:
hello
hi
-rw-r--r-- 1 xxx xxx 320013 Jan 20 02:19 file-0.html
-rw-r--r-- 1 xxx xxx 3566 Jan 20 02:19 file-1.html
-rw-r--r-- 1 xxx xxx 253348 Jan 20 02:19 file-2.html
done
An example that limits the number of parallel downloads running at a time (max = 3):
#!/bin/bash
m=3   ## max jobs (downloads) at a time
t=4   ## retries for each download

_debug(){   ## list jobs to see (debug)
  printf ":: jobs running: %s\n" "$(echo `jobs -p`)"
}

## sample input data
## is redirected to filehandle=3
exec 3<<-EOF
www.google.com google.html
www.hotmail.com hotmail.html
www.wikipedia.org wiki.html
www.cisco.com cisco.html
www.cnn.com cnn.html
www.yahoo.com yahoo.html
EOF

## read data from filehandle=3, line by line
while IFS=' ' read -u 3 -r u f || [[ -n "$f" ]]; do
  [[ -z "$f" ]] && continue                  ## ignore empty input line
  while [[ $(jobs -p|wc -l) -ge "$m" ]]; do  ## while $m or more jobs are running
    _debug                                   ## list jobs to see (debug)
    wait -n                                  ## and wait for some job(s) to finish
  done
  curl --retry $t -Ls "$u" >"$f" &           ## download in background
  printf "job %d: %s => %s\n" $! "$u" "$f"   ## print job info to see (debug)
done

_debug; wait; ls -l *\.html                  ## see final results
Output:
job 22992: www.google.com => google.html
job 22996: www.hotmail.com => hotmail.html
job 23000: www.wikipedia.org => wiki.html
:: jobs running: 22992 22996 23000
job 23022: www.cisco.com => cisco.html
:: jobs running: 22996 23000 23022
job 23034: www.cnn.com => cnn.html
:: jobs running: 23000 23022 23034
job 23052: www.yahoo.com => yahoo.html
:: jobs running: 23000 23034 23052
-rw-r--r-- 1 xxx xxx 61473 Jan 21 01:15 cisco.html
-rw-r--r-- 1 xxx xxx 155055 Jan 21 01:15 cnn.html
-rw-r--r-- 1 xxx xxx 12514 Jan 21 01:15 google.html
-rw-r--r-- 1 xxx xxx 3566 Jan 21 01:15 hotmail.html
-rw-r--r-- 1 xxx xxx 74711 Jan 21 01:15 wiki.html
-rw-r--r-- 1 xxx xxx 319967 Jan 21 01:15 yahoo.html
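For the original problem (roughly four thousand URLs), the same pattern can presumably read the URLs from a plain file instead of the here-document. A minimal sketch, assuming a hypothetical urls.txt with one URL per line:
#!/bin/bash
m=10                                         ## max parallel downloads (assumption; tune for your machine)
i=0
while read -r u || [[ -n "$u" ]]; do
  [[ -z "$u" ]] && continue                  ## ignore empty input line
  while [[ $(jobs -p|wc -l) -ge "$m" ]]; do
    wait -n                                  ## wait for at least one download to finish (bash >= 4.3)
  done
  curl --retry 4 -Ls "$u" >"file-$i.html" &  ## download in background
  ((i++))
done < urls.txt                              ## urls.txt: hypothetical input file
wait                                         ## wait for the remaining downloads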
After reading the updated question, I think it would be easier to use lftp, which can log and download (automatically follow links + retry + resume downloads); you never need to worry about job/fork resources because you only run a few lftp commands. Just split the download list into a few smaller lists (one way to generate them is sketched after the output below), and lftp will download them for you:
$ cat downthemall.sh
#!/bin/bash
## run: lftp -c 'help get'
## to see how to use lftp to download files
## with automatic retry+continue
p=()                                      ## pid list
for l in *\.lst; do
  lftp -f "$l" >/dev/null &               ## run processes in parallel
  p+=("--pid=$!")                         ## record pid
done
until [[ -f d.log ]]; do sleep 0.5; done  ## wait for the log file
tail -f d.log ${p[@]}                     ## print results while downloading
Output:
$ cat 1.lst
set xfer:log true
set xfer:log-file d.log
get -c http://www.microsoft.com -o micro.html
get -c http://www.cisco.com -o cisco.html
get -c http://www.wikipedia.org -o wiki.html
$ cat 2.lst
set xfer:log true
set xfer:log-file d.log
get -c http://www.google.com -o google.html
get -c http://www.cnn.com -o cnn.html
get -c http://www.yahoo.com -o yahoo.html
$ cat 3.lst
set xfer:log true
set xfer:log-file d.log
get -c http://www.hp.com -o hp.html
get -c http://www.ibm.com -o ibm.html
get -c http://stackoverflow.com -o stack.html
$ rm *log *html;./downthemall.sh
2018-01-22 02:10:13 http://www.google.com.vn/?gfe_rd=cr&dcr=0&ei=leVkWqiOKfLs8AeBvqBA -> /tmp/1/google.html 0-12538 103.1 KiB/s
2018-01-22 02:10:13 http://edition.cnn.com/ -> /tmp/1/cnn.html 0-153601 362.6 KiB/s
2018-01-22 02:10:13 https://www.microsoft.com/vi-vn/ -> /tmp/1/micro.html 0-129791 204.0 KiB/s
2018-01-22 02:10:14 https://www.cisco.com/ -> /tmp/1/cisco.html 0-61473 328.0 KiB/s
2018-01-22 02:10:14 http://www8.hp.com/vn/en/home.html -> /tmp/1/hp.html 0-73136 92.2 KiB/s
2018-01-22 02:10:14 https://www.ibm.com/us-en/ -> /tmp/1/ibm.html 0-32700 131.4 KiB/s
2018-01-22 02:10:15 https://vn.yahoo.com/?p=us -> /tmp/1/yahoo.html 0-318657 208.4 KiB/s
2018-01-22 02:10:15 https://www.wikipedia.org/ -> /tmp/1/wiki.html 0-74711 60.7 KiB/s
2018-01-22 02:10:16 https://stackoverflow.com/ -> /tmp/1/stack.html 0-253033 180.8
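With roughly four thousand URLs, the *.lst files above would presumably be generated rather than written by hand. A minimal sketch of one way to do that (an assumption, not part of the original answer), again given a hypothetical urls.txt with one URL per line:
#!/bin/bash
n=100                            ## URLs per .lst file (assumption)
i=0; part=0
while read -r u; do
  [[ -z "$u" ]] && continue      ## ignore empty input line
  if (( i % n == 0 )); then      ## start a new .lst file every n URLs
    ((part++))
    { echo 'set xfer:log true'
      echo 'set xfer:log-file d.log'; } > "$part.lst"
  fi
  ## URLs in urls.txt are assumed to already include the scheme (http://...)
  printf 'get -c %s -o file-%d.html\n' "$u" "$i" >> "$part.lst"
  ((i++))
done < urls.txt                  ## urls.txt: hypothetical input file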
Answer 1 (score: 2)
With the updated question, here is an updated answer.
The script below starts 10 wget processes in the background (this can be changed to any number) and monitors them. As soon as one of them finishes, it picks the next URL from the list and tries to keep the same $maxn (10) processes running in the background, until it runs out of URLs in the list ($urlfile). There are inline comments to help you understand it.
The script:
$ cat wget.sh
#!/bin/bash

wget_bg()
{
  > ./wget.pids                       # Start with empty pidfile
  urlfile="$1"
  maxn=$2
  cnt=0;
  while read -r url
  do
    if [ $cnt -lt $maxn ] && [ ! -z "$url" ]; then  # Only maxn processes will run in the background
      echo -n "wget $url ..."
      wget "$url" &>/dev/null &
      pidwget=$!                      # This gets the backgrounded pid
      echo "$pidwget" >> ./wget.pids  # fill pidfile
      echo "pid[$pidwget]"
      ((cnt++));
    fi
    while [ $cnt -eq $maxn ]          # Start monitoring as soon as maxn processes are running
    do
      while read -r pids
      do
        if ps -p $pids > /dev/null; then   # Check whether the pid is still running
          :
        else
          sed -i "/$pids/d" wget.pids      # If not, remove it from the pidfile
          ((cnt--));                       # decrement counter
        fi
      done < wget.pids
    done
  done < "$urlfile"
}

# This runs 10 wget processes at a time in the bg. Modify for more or less.
wget_bg ./test.txt 10
Answer 2 (score: -2)
Add this in the if statement:
until wget -b $item; do
  printf '.'
  sleep 2
done
The loop will wait for the process to finish and print '.' every 2 seconds.