Question

I want to use this Wikipedia page: http://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives

It contains links to a number of .jpg images, and I want to download all of them into one folder. I'm on a Mac.

I've tried using wget, but so far I haven't been able to get it to work.

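Edit: To clarify, I want a script that follows each link on the page and then downloads the image from the page it lands on. This is because I first need to be redirected to that page.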

Answer 1

You can use xmlstarlet for this:

xmlstarlet sel --net --html -t -m "//img" -v "@src" -n 'http://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives'

will give you the src field of every img tag in the page http://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives.

You'll notice that the output lines are missing the leading http:, so we have to add it.

Then:

while IFS= read -r line; do
    [[ $line = //* ]] && line="http:$line"
    wget "$line"
done < <(
    xmlstarlet sel --net --html -t -m "//img" -v "@src" -n 'http://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives'
)

should retrieve the image files.
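The prefixing step in the loop above can also be pulled out into a small helper. This is only a sketch: the function name normalize_src is hypothetical, and it assumes that site-relative paths (a single leading slash) belong to en.wikipedia.org, which the original loop does not handle at all.

```shell
#!/bin/bash

# Hypothetical helper: turn a src value printed by xmlstarlet into a full URL.
# Assumption: site-relative paths belong to en.wikipedia.org.
normalize_src() {
    local src=$1
    local host="en.wikipedia.org"
    case $src in
        //*) printf 'http:%s\n' "$src" ;;              # protocol-relative: add the scheme
        /*)  printf 'http://%s%s\n' "$host" "$src" ;;  # site-relative: add scheme and host
        *)   printf '%s\n' "$src" ;;                   # anything else is passed through unchanged
    esac
}
```

Inside the loop, wget "$(normalize_src "$line")" would then replace the two lines that test and rewrite $line.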

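Following your comment, I now understand what you're after: you want the href field of every a node that contains an img node. The xpath that matches this is: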

//a[img]

So,

xmlstarlet sel --net --html -t -m "//a[img]" -v "@href" -n 'http://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives'

will give you these hrefs.
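Not every href returned this way necessarily points at an image description page. A small filter, sketched here under the assumption that those pages always start with /wiki/File:, keeps only the relevant ones. The function name keep_file_pages is hypothetical.

```shell
#!/bin/bash

# Hypothetical filter: read hrefs on stdin, print only the ones that
# look like image description pages (assumed prefix: /wiki/File:).
keep_file_pages() {
    local href
    while IFS= read -r href; do
        [[ $href = /wiki/File:* ]] && printf '%s\n' "$href"
    done
    return 0
}
```

The output of the xmlstarlet command above could be piped straight into keep_file_pages.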

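Now, the URLs retrieved this way are not the images you want to download; each one is another HTML page that contains a link to the desired image. I've used the following xpath to select the image in each of these pages: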

//div[@class='fullImageLink']/a

i.e., the a nodes inside div nodes with class="fullImageLink". As a heuristic, this seems to work fine.

Then, this should do:

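#!/bin/bash

base="http://en.wikipedia.org"

# Fetch one /wiki/File:... page and download the full-size image it links to.
get_image() {
   local url=$base$1
   local imglink
   printf "*** %s: " "$url"
   # Only the first matching href is read.
   IFS= read -r imglink < <(xmlstarlet sel --net --html -t -m "//div[@class='fullImageLink']/a" -v "@href" -n "$url")
   if [[ -z $imglink ]]; then
      echo " ERROR ***"
      return 1
   fi
   # The href is expected to lack the leading http:, so add it.
   imglink="http:$imglink"
   echo " Downloading"
   wget -q "$imglink" &   # download in the background
}

# Walk every link that wraps an image, keeping only the File: pages.
while IFS= read -r url; do
   [[ $url = /wiki/File:* ]] || continue
   get_image "$url"
done < <(
   xmlstarlet sel --net --html -t -m "//a[img]" -v "@href" -n "$base/wiki/Current_members_of_the_United_States_House_of_Representatives"
)

wait   # let the background downloads finish before the script exits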

You'll get a bit more than you asked for, but it's a good base :).
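Since the script starts every wget in the background at once, a page with hundreds of images means hundreds of simultaneous downloads. One way to tame that, sketched here with a hypothetical helper called run_batched, is to start the jobs in batches and wait for each batch to finish. It avoids wait -n so that it should also work with the older bash shipped on a Mac.

```shell
#!/bin/bash

# Hypothetical helper: run "$@" once per line of stdin, with the line as the
# last argument, starting at most $1 background jobs before waiting.
run_batched() {
    local max=$1
    shift
    local count=0 line
    while IFS= read -r line; do
        "$@" "$line" &
        count=$((count + 1))
        if (( count % max == 0 )); then
            wait   # let the current batch finish before starting the next
        fi
    done
    wait           # wait for the last, possibly partial, batch
}
```

For example, a list of image URLs could be fed to run_batched 4 wget -q -P images, where images is an assumed folder name and wget's -P option sets the directory the files are saved into, which also covers the "all in one folder" part of the question.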

Download all linked files from a Wikipedia page
