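I am trying to retrieve a gzip-compressed file from our server and load it into a pandas DataFrame.
According to the pandas documentation, pandas.read_csv accepts valid URL schemes such as http, ftp, s3, and file. The link I am using is https, which I don't think should cause a problem.
I tried two approaches to get this working.
Method 1: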
import pandas as pd
print "Downloading file"
link = 'https://myserver/logfile.csv.gz'
df = pd.read_csv(link, compression='gzip', header=0, sep=',', quotechar='"')
print df
This did not work, and I got the following error.
Traceback (most recent call last):
File "download.py", line 14, in <module>
df = pd.read_csv(link, compression='gzip', header=0, sep=',', quotechar='"')
File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_79_g9e4e447-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 470, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_79_g9e4e447-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 246, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_79_g9e4e447-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 562, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_79_g9e4e447-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 699, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python2.7/dist-packages/pandas-0.16.0_79_g9e4e447-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1066, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 509, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4722)
File "pandas/parser.pyx", line 624, in pandas.parser.TextReader._get_header (pandas/parser.c:6111)
File "pandas/parser.pyx", line 820, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8142)
File "pandas/parser.pyx", line 1758, in pandas.parser.raise_parser_error (pandas/parser.c:20728)
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
Then, after some googling, I decided to try the following approach.
import urllib
import gzip
import StringIO
import pandas as pd
import requests
print "Downloading file"
link = 'https://myserver/logfile.csv.gz'
r = requests.get(link)
gz = gzip.GzipFile(StringIO.StringIO(r.content))
df = pd.read_csv(gz, compression='gzip', header=0, sep=',', quotechar='"')
print df
This time I get the following error:
Traceback (most recent call last):
File "download.py", line 12, in <module>
gz = gzip.GzipFile(StringIO.StringIO(r.content))
File "/usr/lib/python2.7/gzip.py", line 89, in __init__
fileobj = self.myfileobj = __builtin__.open(filename, mode or 'rb')
TypeError: coercing to Unicode: need string or buffer, instance found
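The TypeError above is consistent with how gzip.GzipFile reads its arguments: the first positional parameter is filename, so the StringIO object is handed to open() as if it were a path. Below is a minimal sketch of passing the buffer through the fileobj keyword instead. It is only an illustration under a few assumptions: it is written for Python 3 (io.BytesIO and gzip.compress), the function name read_gzipped_csv is hypothetical, and the in-memory payload is a made-up stand-in for r.content, not the real log file. compression='gzip' is left out of read_csv because GzipFile already decompresses the stream.

```python
import gzip
import io

import pandas as pd


def read_gzipped_csv(raw_bytes):
    """Decompress gzip bytes in memory and parse them as CSV.

    Hypothetical helper for illustration; raw_bytes plays the role of
    r.content from requests.get(link).
    """
    buffer = io.BytesIO(raw_bytes)
    # fileobj= tells GzipFile to read from the buffer instead of
    # treating its first argument as a filename to open().
    with gzip.GzipFile(fileobj=buffer) as gz:
        # No compression='gzip' here: gz already yields decompressed data.
        return pd.read_csv(gz, header=0, sep=',', quotechar='"')


# Made-up sample data standing in for the downloaded file (assumption).
sample_csv = b"a,b\n1,2\n3,4\n"
payload = gzip.compress(sample_csv)

df = read_gzipped_csv(payload)
print(df)
```

With a real download, the call would presumably look like read_gzipped_csv(requests.get(link).content), assuming the server returns the raw .gz bytes unchanged.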