Question

I am trying to programmatically access a CSV at the following URL: http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate=20180627&reportType=F&productId=425

I have tried two approaches. The first was simply passing the URL to data_sheet = pd.read_csv(sheet_url). With this approach I get an HTTP Error 403: Forbidden exception.

def get_sheet(self):
        # Accesses CME direct URL (at the moment...will add functionality for ICE later)
        # Gets sheet and puts it in dataframe
        #Returns dataframe sheet

        sheet_url = "http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate="+str(self.date_of_report)+"&reportType="\
        + str(self.report_type)+"&productId=" + str(self.product)

        header = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest"
        }

        data_sheet = pd.read_csv(sheet_url)

        return data_sheet

I also tried pretending to be a browser, thinking the site does not allow the CSV to be requested directly, but then I got an Invalid file path or buffer object type: <class 'requests.models.Response'> exception.

def get_sheet(self):
        # Accesses CME direct URL (at the moment...will add functionality for ICE later)
        # Gets sheet and puts it in dataframe
        #Returns dataframe sheet

        sheet_url = "http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate="+str(self.date_of_report)+"&reportType="\
        + str(self.report_type)+"&productId=" + str(self.product)

        header = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest"
        }

        req = requests.get(url = sheet_url, headers = header)

        data_sheet = pd.read_csv(req)

        return data_sheet

My end goal is simply to retrieve the CSV at that URL and return a DataFrame. What am I missing?

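Update: I experimented a bit and just printed req, which gave Response [200]. From the HTTP documentation, that means the server is receiving my request. Does anyone know whether the problem is that I am directly hitting a URL that serves a CSV? Normally, if you click the button associated with that URL, the file downloads automatically. When I check my downloads folder, I don't see any download of the file. So while the server may be receiving a valid request, I may not be handling the URL's behavior correctly. Any ideas?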

Answer 1

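Your code has two problems:

1. You are passing the response object itself to pandas: data_sheet = pd.read_csv(req), whereas the actual CSV data is in req.content.
2. pd.read_csv does not parse CSV data out of a Response object or a raw string (a plain string is treated as a path or URL); it expects a file path, a URL, or a file-like object. So to read the downloaded content you either need to write it to a physical file first, or wrap it in an in-memory file object with io.StringIO(response.content.decode('utf-8')).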

An example using the io module:

import requests
import io
import pandas as pd

response = requests.get('http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv')

file_object = io.StringIO(response.content.decode('utf-8'))
pd.read_csv(file_object)
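
Applied to a function like the one in the question, the same idea might look roughly like the sketch below. The names csv_text_to_frame and fetch_csv and the timeout value are assumptions made for illustration, not part of the original code, and the sketch assumes the server really returns CSV text.

```python
import io

import pandas as pd
import requests


def csv_text_to_frame(text):
    """Parse CSV text that is already in memory into a DataFrame."""
    # StringIO gives pandas the file-like object it expects.
    return pd.read_csv(io.StringIO(text))


def fetch_csv(url, headers=None, timeout=30):
    """Download a CSV over HTTP and return it as a DataFrame (hypothetical helper)."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an HTTPError for 4xx/5xx statuses such as 403
    # instead of trying to parse an error page as CSV.
    response.raise_for_status()
    # response.text is the body decoded with the encoding requests detected.
    return csv_text_to_frame(response.text)


if __name__ == "__main__":
    # Small in-memory sample so the parsing step can be tried without a network call.
    sample = "product,volume\nCL,100\nNG,250\n"
    print(csv_text_to_frame(sample))
```

Keeping the parsing step separate from the download makes it easy to check the pandas side on a small string before involving the network.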

Answer 2

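You can simply use requests with headers to avoid the 403 Forbidden error, and then pass skiprows when reading the Excel file so that the image in the file does not cause problems when importing into Python.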

import io

import pandas as pd
import requests

# Browser-like headers; change the browser version accordingly
hdr = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
}

resp = requests.get('http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate=20180627&reportType=F&productId=425', headers=hdr)

# Wrap the downloaded bytes in a file-like object and skip the first five rows
pd.read_excel(io.BytesIO(resp.content), skiprows=range(0, 5))
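
If the trade date, report type, and product vary, as in the question's get_sheet method, the query string can be built from a dict rather than by string concatenation. The following is only a minimal sketch: build_voi_url and fetch_voi_sheet are hypothetical names, the timeout is an assumption, and the five skipped rows are taken from the answer above rather than checked against the current file layout. Reading an .xls file with pandas also requires an Excel engine such as xlrd to be installed.

```python
import io
from urllib.parse import urlencode

import pandas as pd
import requests

BASE_URL = "http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl"


def build_voi_url(trade_date, report_type, product_id):
    """Build the export URL from its query parameters (hypothetical helper)."""
    query = urlencode({
        "media": "xls",
        "tradeDate": trade_date,
        "reportType": report_type,
        "productId": product_id,
    })
    return f"{BASE_URL}?{query}"


def fetch_voi_sheet(trade_date, report_type, product_id, headers):
    """Download the report and return it as a DataFrame (hypothetical helper)."""
    url = build_voi_url(trade_date, report_type, product_id)
    resp = requests.get(url, headers=headers, timeout=30)
    # Fail loudly on 403 and other error statuses.
    resp.raise_for_status()
    # BytesIO wraps the raw bytes in the file-like object read_excel expects.
    return pd.read_excel(io.BytesIO(resp.content), skiprows=range(0, 5))
```

A call such as fetch_voi_sheet("20180627", "F", 425, hdr) would then request the same URL as the snippet above.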

Receiving HTTP Error 403: Forbidden on CSV download
