Popup window blocks batch download of PDFs from a website with wget

Time: 2018-04-30 11:55:46

Tags: pdf download batch-processing wget

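I'd like to use wget to download some freely available PDFs (copies of an old newspaper) from the website of the Austrian National Library, using the following bash script:

    for year in {14..57}; do
      for month in `seq -w 1 12`; do  # -w for leading zero
        for day in `seq -w 1 31`; do
          wget -A pdf -nc -E -nd --no-check-certificate --content-disposition http://anno.onb.ac.at/pdfs/ONB_lzg_18$year$month$day.pdf
        done
      done
    done

Apart from the fact that some newspaper issues are not available, I cannot download any issue with the script, even when it exists. For the existing issue of 30 June 1814, for example, I get an error like this:

    http://anno.onb.ac.at/pdfs/ONB_lzg_18140630.pdf
    Aufl"osen des Hostnamens anno.onb.ac.at (anno.onb.ac.at)... 193.170.112.230
    Verbindungsaufbau zu anno.onb.ac.at (anno.onb.ac.at)|193.170.112.230|:80 ... verbunden.
    HTTP-Anforderung gesendet, auf Antwort wird gewartet ... 404 Not Found
    FEHLER 404: Not Found.

However, if you download the respective PDF manually from the website (see the top right of the page), you have to press "OK" in a popup confirmation. Once I have done this, I can download that issue with wget without any problems.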

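How can I tell wget to confirm this prompt from the command line (the prompt appears as soon as you try to download the PDF; see the screenshot below)? Is there a wget option for this?

[Screenshot: popup confirmation shown before the PDF download]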

1 answer:

Answer 0: (score: 1)

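There are two problems with your code.

  1. The lzg newspaper is not available for every date.
  2. The PDF is not always already generated and cached at the URL you are using. You need to request another URL first to make sure the PDF gets generated.

Below is updated code that should work: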

    #!/bin/bash
    
    for year in {14..57}; do
      # Collect the dates of all issues listed on this year's overview page
      # (the 3-argument match() requires gawk)
      DATES=$(curl -sS "http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=18$year&zoom=33" | gawk 'match($0, /datum=([^&]+)/, ary) {print ary[1]}' | xargs echo)
    
      for date in $DATES
      do
          echo "Downloading for $date"
    
          # Request the PDF-generation URL first so that the PDF is generated and cached
          curl "http://anno.onb.ac.at/cgi-content/anno_pdf.pl?aid=lzg&datum=$date" -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' -H 'DNT: 1' -H "Referer: http://anno.onb.ac.at/cgi-content/anno?aid=lzg&datum=$date" -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.9' --compressed
    
          # Now fetch the cached PDF
          wget -A pdf -nc -E -nd --no-check-certificate --content-disposition http://anno.onb.ac.at/pdfs/ONB_lzg_$date.pdf
      done
    done
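
The date-extraction step relies on gawk's three-argument `match()`, which is a GNU extension. If gawk is not installed, something along the following lines might work instead. This is only a sketch: the helper name `extract_dates` and the sample markup are made up for illustration, and it assumes that each issue link carries an 8-digit `datum=YYYYMMDD` parameter. `grep -o` is not POSIX, but both GNU and BSD grep support it.

```bash
#!/bin/bash

# Hypothetical helper: read HTML on stdin and print every 8-digit
# datum=YYYYMMDD value, one per line, sorted and de-duplicated.
extract_dates() {
  grep -oE 'datum=[0-9]{8}' | cut -d= -f2 | sort -u
}

# Made-up sample markup for illustration; the real overview page may differ.
sample='<a href="/cgi-content/anno?aid=lzg&datum=18140630&zoom=33">30</a>
<a href="/cgi-content/anno?aid=lzg&datum=18140103&zoom=33">3</a>
<a href="/cgi-content/anno?aid=lzg&datum=18140630">30</a>
<a href="/cgi-content/anno?aid=lzg&datum=1814&zoom=33">1814</a>'

printf '%s\n' "$sample" | extract_dates
```

In the script above, the pipeline after the first `curl` could then be replaced with `| extract_dates | xargs echo`. Requiring exactly eight digits skips year-only links such as `datum=1814`, and `sort -u` avoids requesting the same issue twice.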