所以我一直在努力完成这项任务,但仍然没有出错。该程序似乎没有下载任何pdf。与此同时,我检查了存储最终链接的文件 - 所有内容都正确存储。 $ PDFURL也检查,存储正确的值。任何bash粉丝准备好帮忙吗?
#!/bin/sh
#create a temporary directory where all the work will be conducted
TMPDIR=`mktemp -d /tmp/chiheisen.XXXXXXXXXX`
echo $TMPDIR
#no arguments given - error
if [ "$#" == "0" ]; then
exit 1
fi
# argument given, but wrong format
URL="$1"
#URL regex
URL_REG='(https?|ftp|file)://[-A-Za-z0-9\+&@#/%?=~_|!:,.;]*[-A-Za-z0-9\+&@#/%=~_|]'
if [[ ! $URL =~ $URL_REG ]]; then
exit 1
fi
# go to directory created
cd $TMPDIR
#download the html page
curl -s "$1" > htmlfile.html
#grep only links into temp.txt
cat htmlfile.html | grep -o -E 'href="([^"#]+)\.pdf"' | cut -d'"' -f2 > temp.txt
# iterate through lines in the file and try to download
# the pdf files that are there
cat temp.txt | while read PDFURL; do
#if this is an absolute URL, download the file directly
if [[ $PDFURL == *http* ]]
then
curl -s -f -O $PDFURL
err="$?"
if [ "$err" -ne 0 ]
then
echo ERROR "$(basename $PDFURL)">&2
else
echo "$(basename $PDFURL)"
fi
else
#update url - it is always relative to the first parameter in script
PDFURLU="$1""/""$(basename $PDFURL)"
curl -s -f -O $PDFURLU
err="$?"
if [ "$err" -ne 0 ]
then
echo ERROR "$(basename $PDFURLU)">&2
else
echo "$(basename $PDFURLU)"
fi
fi
done
#delete the files
rm htmlfile.html
rm temp.txt
P.S。我刚刚发现的另一个小问题。也许问题出在if in regex?我非常希望看到那样的东西:
if [[ $PDFURL =~ (https?|ftp|file):// ]]
但这不起作用。我没有不必要的括号,为什么?
P.P.S。我还在以http开头的URL上运行此脚本,程序提供了所需的输出。但是,它仍未通过测试。