Question

I want to extract a list of documents from certain pages.

When I loop over a list of URLs, I run into this error:

Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

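The cause of this kind of error has been clearly explained.

If I run the code against a single page URL, there is no problem.

I have separate code that uses Selenium / WebDriver, but the problem with Selenium is that the download behavior differs by file type.

For example, if the URL leads to a PDF file, it opens a new page that displays the whole PDF. If the URL links to an Excel file, the behavior is different.

For more details, see: How do I control Selenium PDF and Excel files download behavior?

I ended up with the suggested code below. It does not use Selenium, but it can fetch all the files.

Thanks!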

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

doc_urls = ['http://www.ha.org.hk/haho/ho/bssd/18G042Pc.htm'
'http://www.ha.org.hk/haho/ho/bssd/HKWCT03018A2Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/19D070Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/NTECT6AQ011Pa.htm',
'http://www.ha.org.hk/haho/ho/bssd/T18G052Pa.htm',
]

base_url = "http://www.ha.org.hk"


for doc in doc_urls:
    with requests.Session() as session:
        r = session.get(doc)
        # get all documents links
        docs = BeautifulSoup(r.text, "html.parser").select("a[href]")
        print('Visiting:',doc)
        for doc in docs:
            href = doc.attrs["href"]
            name = doc.text
            print(f">>> Downloading file name: {name}, href: {href}")
            # open document page
            r = session.get(href)
            # get file path
            # check for attribute, if not, file doesn't exist: contact admin. but how to contact the hospital admin?
            if hasattr(re.search("(?<=window.open\\(')(.*)(?=',)", r.text), 'group'):
                file_path = re.search("(?<=window.open\\(')(.*)(?=',)", r.text).group(0)
                print(file_path)
                file_name = file_path.split("/")[-1]
                # get file and save
                r = session.get(f"{base_url}/{file_path}")
                with open('C:\\Users\\tender_documents\\'+ today_yyMMddhh + '\\' + file_name, 'wb') as f:
                    f.write(r.content)
            else:
                print(f">>> File name: {name}, href: {href}", " is missing")
                continue

Answer 1

It's just a typo. You are trying to use the whole regex match:

        r = session.get(f"{base_url}/{file_path}")

It should be

        r = session.get(f"{base_url}/{file_name}")
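To compare the two forms without touching the network, here is a minimal sketch. The `sample_html` string and the `example.pdf` path are made up for illustration, and `extract_target` is a hypothetical helper that wraps the regex from the question (with the dot escaped). The real pages may pass something different to `window.open`, so treat this as a way to inspect the URL you are about to request rather than as a definitive fix. The last part shows that adjacent string literals with no comma between them are joined into one string, which is what happens to the first two entries of `doc_urls` in the question; I can't confirm whether that contributes to the error.

```python
import re
from urllib.parse import urljoin

# Pattern from the question, compiled once, with the dot in "window.open" escaped.
WINDOW_OPEN = re.compile(r"(?<=window\.open\(')(.*)(?=',)")


def extract_target(html):
    """Return the first argument of window.open('...', ...), or None if absent."""
    match = WINDOW_OPEN.search(html)
    return match.group(0) if match else None


# Hypothetical page body; the real document pages may look different.
sample_html = "<script>window.open('/haho/ho/bssd/example.pdf','_blank');</script>"

base_url = "http://www.ha.org.hk"
file_path = extract_target(sample_html)
file_name = file_path.split("/")[-1]

# Print each candidate URL before requesting it, so a bad one is easy to spot.
print("question's form:", f"{base_url}/{file_path}")
print("answer's form:  ", f"{base_url}/{file_name}")
print("urljoin form:   ", urljoin(base_url, file_path))

# Adjacent string literals are concatenated when the comma is missing.
urls_missing_comma = [
    "http://www.ha.org.hk/haho/ho/bssd/18G042Pc.htm"
    "http://www.ha.org.hk/haho/ho/bssd/HKWCT03018A2Pa.htm",
    "http://www.ha.org.hk/haho/ho/bssd/19D070Pa.htm",
]
print(len(urls_missing_comma), "entries instead of 3")
```

Printing the final URL right before each `session.get` call in the loop is usually the quickest way to find out which request triggers the `getaddrinfo` failure.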

Errno 11001 getaddrinfo failed when looping over URLs
