Question

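The scraper is supposed to download documents from a list of URLs that were scraped earlier.

It runs without problems on my office network, but when I run it at home over my home wifi, it keeps failing with the same error.

I tried some of the suggestions from another post, namely setting a timeout: Python HTTPConnectionPool Failed to establish a new connection: [Errno 11004] getaddrinfo failed

But that did not solve the problem.

I would appreciate an explanation as well as a solution. I don't know much about networking issues. Thanks.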

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

doc_urls = [
'http://www.ha.org.hk/haho/ho/bssd/19d079Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/18S065Pg.htm',
'http://www.ha.org.hk/haho/ho/bssd/19d080Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/NTECT6AT003Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/19D093Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/19d098Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/19d103Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/18G044Pe.htm',
'http://www.ha.org.hk/haho/ho/bssd/19d104Pa.htm',
]

base_url = "http://www.ha.org.hk"

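# Reuse one session (and its connection pool) for every request
with requests.Session() as session:
    for page_url in doc_urls:
        print('Visiting:', page_url)
        r = session.get(page_url)
        # get all document links on the page
        links = BeautifulSoup(r.text, "html.parser").select("a[href]")
        # "link" instead of "doc", so the outer loop variable is not shadowed
        for link in links:
            href = link.attrs["href"]
            name = link.text
            print(f">>> Downloading file name: {name}, href: {href}")
            # open document page
            r = session.get(href)
            # get file path from the window.open('...', call in the page source
            match = re.search(r"(?<=window\.open\(')(.*)(?=',)", r.text)
            if match is None:
                # no download link on this page; skip it instead of crashing
                print(f"No file path found at {href}, skipping")
                continue
            file_path = match.group(0)
            file_name = file_path.split("/")[-1]
            # get file and save
            r = session.get(f"{base_url}/{file_path}")
            with open('C:\\Users\\Desktop\\tender_documents\\' + file_name, 'wb') as f:
                f.write(r.content)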

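As mentioned above, the scraper works fine on my office network. It fails when I use my own wifi, and also on my mother-in-law's wifi. My mother-in-law and I use the same wifi provider, in case that helps.

Failed to establish a new connection: [Errno 11001] getaddrinfo failed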
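
For context, getaddrinfo is the step that turns a hostname into an IP address, and on Windows error 11001 means the host was not found. So the failure happens during DNS lookup, before any HTTP request is sent. The sketch below is one possible way to check this separately from the scraper. It uses only the standard library; the helper name check_dns and its resolver parameter are hypothetical and not part of the original code.

```python
import socket
from urllib.parse import urlsplit


def check_dns(url, resolver=socket.getaddrinfo):
    """Try to resolve the host part of `url`.

    Returns (hostname, sorted list of addresses, error or None).
    `resolver` is injectable only so the helper can be exercised offline.
    """
    host = urlsplit(url).hostname
    try:
        infos = resolver(host, 80)
    except socket.gaierror as exc:
        # This is the same class of failure that requests reports
        # as "getaddrinfo failed".
        return host, [], exc
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in infos})
    return host, addresses, None
```

Calling check_dns(doc_urls[0]) on the office network and again on the home wifi would show whether www.ha.org.hk resolves on each. If the home result carries an error while other hostnames resolve fine, the DNS server handed out by the home provider is a likely suspect, though this is only a diagnostic and does not confirm the cause.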

0 answers: