Question

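I have a small script that collects data from a list of websites. I have been using lynx, but after looking through the data I noticed that some sites return no results.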

#!/bin/bash

[ "$1" ] || exit 1

tmp=$(mktemp "${1}_XXXXXXXXX")

cat <<EOF > "$tmp"
https://google.com/search?q=${1}
https://duckduckgo.com/?q=${1}
https://www.bing.com/search?q=${1}
EOF

while read -r; do

    lynx -nonumbers -dump -hiddenlinks=merge -listonly "$REPLY" | \
    grep -i "${1}" | awk '!x[$0]++' >> file.txt

done < "$tmp"

rm "$tmp"

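It turned out to be a certificate validation problem. Apparently lynx has no flag to ignore validation. While I understand that validation is in everyone's best interest, I need to be able to extract data from every website on the list.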

So I am considering Python and BeautifulSoup instead. From this answer I can extract the links from a single URL, and from this answer I can ignore certificate validation.

Using Python 3.6, this is what I have so far:

from bs4 import BeautifulSoup
import urllib.request
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

resp = urllib.request.urlopen('https://google.com', context=ctx)
soup = BeautifulSoup(resp, "lxml")

for link in soup.find_all('a', href=True):
    print(link['href'])

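I would like to pass the same list from the bash script to the Python script, so that it extracts the links from every URL in the list. Basically, each line of this list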

https://google.com/search?q=${1}
https://duckduckgo.com/?q=${1}
https://www.bing.com/search?q=${1}

would be passed as URLS to resp = urllib.request.urlopen('URLS', context=ctx)

How can I do this?

Answer 1

Try Python string formatting.

'https://google.com/search?q=%s' % ('text',) produces 'https://google.com/search?q=text', if that is what you are looking for.
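As a minimal sketch of that idea applied to the three URLs from the question: the names TEMPLATES and build_urls are made up for illustration, and the call to quote_plus is an addition not in the original answer, there so that search terms containing spaces or characters like & still give a valid query string.

```python
from urllib.parse import quote_plus

# URL templates taken from the question; %s marks where the search term goes.
TEMPLATES = [
    "https://google.com/search?q=%s",
    "https://duckduckgo.com/?q=%s",
    "https://www.bing.com/search?q=%s",
]


def build_urls(term):
    """Return one search URL per template for the given term (hypothetical helper)."""
    encoded = quote_plus(term)  # e.g. spaces become '+', '&' becomes '%26'
    return [template % (encoded,) for template in TEMPLATES]
```

Calling build_urls(sys.argv[1]) would then play the role of the here-document in the bash script.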

Answer 2

Read the site names from, say, a list, iterate over them, send a request to each one, and parse the response.

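from bs4 import BeautifulSoup
import urllib.request

# URLs to scrape; replace with your own list
site_list = ['http://example.com', 'https://google.com']

for site in site_list:
    # Fetch the page; add context=ctx (from the question) to skip certificate checks
    resp = urllib.request.urlopen(site)
    soup = BeautifulSoup(resp, "lxml")

    # Print the target of every anchor that has an href attribute
    for link in soup.find_all('a', href=True):
        print(link['href'])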
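To connect this back to the bash script, here is a rough sketch that reads the URL list from the temporary file, keeps only links containing the search term (like grep -i), and drops duplicates per page (like awk '!x[$0]++'). To keep it free of third-party packages, it swaps BeautifulSoup for the standard library's html.parser; that swap, the file name links.py, and the names LinkParser, extract_links, fetch and main are all assumptions for illustration, not part of the original answer.

```python
import ssl
import sys
import urllib.request
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, term):
    """Return unique hrefs containing term (case-insensitive), in page order."""
    parser = LinkParser()
    parser.feed(html)
    matches = [link for link in parser.links if term.lower() in link.lower()]
    return list(dict.fromkeys(matches))  # dedupe, keeping the first occurrence


def fetch(url, ctx):
    """Download a page and decode it leniently as UTF-8."""
    with urllib.request.urlopen(url, context=ctx) as resp:
        return resp.read().decode("utf-8", errors="replace")


def main(term, url_file):
    # Certificate checks are disabled on purpose, as in the question.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    with open(url_file) as fh:
        urls = [line.strip() for line in fh if line.strip()]

    for url in urls:
        try:
            html = fetch(url, ctx)
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            print(f"skipping {url}: {exc}", file=sys.stderr)
            continue
        for link in extract_links(html, term):
            print(link)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

In the bash script, the while loop could then be replaced by something like python3 links.py "$1" "$tmp" >> file.txt, since the here-document has already expanded ${1} in each URL.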
