Question

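I have a Python script that uses pd.read_html to process DataFrame data contained in a web page. The script also loops different dates into the URL so that I can read several days of data. I know the script's syntax is correct, but it fails when I run it at my company, which sits behind a proxy. Here is the specific URL and the line that fails behind the proxy: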

url = r'https://services.tcpl.ca/cor/public/gdsr/GdsrNGTLImperial20191216.htm'

df = pd.read_html(url)

I think I need to give the script the proxy information.

I have used the following to get other scripts through the proxy, but it does not work for my pandas scraping:

import os

proxy = "http://proxy-xxxx-xxx:85"

os.environ['http_proxy'] = proxy

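I have also used the following in a requests script, but it does not carry over to pandas. I don't think pandas.read_html() has a parameter for passing proxies the way requests does: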

import requests

http_proxy = 'http://proxy-xxxx-xxx:85'
https_proxy = 'https://proxy-xxxx-xxx:85'

proxy_Dict = { 'http' : http_proxy,
               'https' : https_proxy,
             }

url = (r'http://www.tccustomerexpress.com/alberta/dashboard/ngtldash7days.csv')

r = requests.get(url, proxies=proxy_Dict).text

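I'm fairly new to how proxies and pandas work, so I appreciate any information. I don't know whether pandas uses requests or urllib3 under the hood, but is there some way to first "handshake" with the website through the proxy and then use pandas.read_html()?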

Thank you for your time!

Answer 1

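You have to fetch the page yourself with

requests.get

passing your proxies as you already do in your requests script, and then complete the code with:

df = pd.read_html(StringIO(r))

Here r is the response text and StringIO comes from the io module. Note that pd.read_html returns a list of DataFrames, one per table found in the page.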
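Putting the pieces together, a minimal sketch could look like the following. The proxy addresses are the placeholders from the question, and the helper names (build_url, read_tables, read_days) as well as the 30-second timeout are my own assumptions for illustration. The date pattern in the URL is inferred from the single example URL in the question, so adjust it if the site uses a different scheme.

```python
from datetime import date, timedelta
from io import StringIO

import pandas as pd
import requests

# Placeholder proxy values copied from the question; replace with your real proxy.
PROXIES = {
    "http": "http://proxy-xxxx-xxx:85",
    "https": "https://proxy-xxxx-xxx:85",
}

# Assumed URL pattern: the question's URL with its date (20191216) made variable.
BASE_URL = "https://services.tcpl.ca/cor/public/gdsr/GdsrNGTLImperial{:%Y%m%d}.htm"


def build_url(day):
    """Return the report URL for the given date (hypothetical helper)."""
    return BASE_URL.format(day)


def read_tables(url, proxies, timeout=30):
    """Download the page through the proxy, then let pandas parse the HTML text."""
    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error page
    # read_html returns a list of DataFrames, one per <table> found
    return pd.read_html(StringIO(response.text))


def read_days(start, n_days, proxies):
    """Read the tables for n_days consecutive dates, keyed by date."""
    tables_by_day = {}
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        tables_by_day[day] = read_tables(build_url(day), proxies)
    return tables_by_day
```

For example, read_days(date(2019, 12, 16), 3, PROXIES) would request the pages for December 16 through 18 and return a dict mapping each date to its list of DataFrames. Since requests does the download, pandas never has to open a network connection itself, which is why the proxy settings only need to be known to requests.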
