Question

I am trying to retrieve a gzip-compressed file from our server and load it into a pandas DataFrame.

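According to the pandas documentation, pandas.read_csv accepts valid URL schemes such as http, ftp, s3, and file. The link I am using is https, which I don't think should cause a problem.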

I have tried two approaches to get this working.

Method 1:

import pandas as pd


print "Downloading file" 
link = 'https://myserver/logfile.csv.gz'

df = pd.read_csv(link, compression='gzip', header=0, sep=',', quotechar='"')

print df

This did not work; I got the following error.

Traceback (most recent call last):
  File "download.py", line 14, in <module>
    df = pd.read_csv(link, compression='gzip', header=0, sep=',', quotechar='"')
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_79_g9e4e447-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 470, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_79_g9e4e447-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 246, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_79_g9e4e447-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 562, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_79_g9e4e447-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 699, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_79_g9e4e447-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1066, in __init__
    self._reader = _parser.TextReader(src, **kwds)
  File "pandas/parser.pyx", line 509, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4722)
  File "pandas/parser.pyx", line 624, in pandas.parser.TextReader._get_header (pandas/parser.c:6111)
  File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
  File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

After some googling, I decided to try the following approach.

import urllib
import gzip
import StringIO
import pandas as pd
import requests

print "Downloading file" 
link = 'https://myserver/logfile.csv.gz'

r = requests.get(link)
gz = gzip.GzipFile(StringIO.StringIO(r.content))

df = pd.read_csv(gz, compression='gzip', header=0, sep=',', quotechar='"')

print df

I get the following error:

Traceback (most recent call last):
  File "download.py", line 12, in <module>
    gz = gzip.GzipFile(StringIO.StringIO(r.content))
  File "/usr/lib/python2.7/gzip.py", line 89, in __init__
    fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: coercing to Unicode: need string or buffer, instance found
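
A note on the second traceback: it fails inside gzip.GzipFile at the call to open(filename, ...), which is consistent with the StringIO object having been passed as the first positional argument. That parameter is the filename, while the file-like object goes in fileobj=. The following is only a minimal sketch of that adjustment, written for Python 3 (so io.BytesIO stands in for StringIO.StringIO); the helper names gunzip_to_text and load_dataframe are hypothetical, and it assumes the server returns the raw gzip bytes and that the CSV is UTF-8 encoded. It has not been run against the server in the question.

```python
import gzip
import io


def gunzip_to_text(raw_bytes, encoding="utf-8"):
    """Decompress gzip-compressed bytes and return a readable text stream."""
    # fileobj= is passed by keyword: the first positional parameter of
    # GzipFile is a filename, not a file-like object.
    gz = gzip.GzipFile(fileobj=io.BytesIO(raw_bytes), mode="rb")
    # Assumption: the CSV is UTF-8 text; pass another encoding if not.
    return io.TextIOWrapper(gz, encoding=encoding)


def load_dataframe(raw_bytes):
    """Parse gzip-compressed CSV bytes into a pandas DataFrame."""
    import pandas as pd  # imported here so gunzip_to_text needs only the stdlib

    # The stream is already decompressed at this point, so no
    # compression='gzip' argument is passed to read_csv.
    return pd.read_csv(gunzip_to_text(raw_bytes), header=0, sep=",", quotechar='"')


# Hypothetical usage with the link from the question:
# import requests
# r = requests.get("https://myserver/logfile.csv.gz")
# r.raise_for_status()
# df = load_dataframe(r.content)
```

Because GzipFile already yields decompressed data, also passing compression='gzip' to read_csv, as in the second attempt, would likely ask pandas to decompress a second time; the sketch leaves that argument out.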

Pandas error opening a gzip file from a URL

0 answers: