I'm using pandas 0.18 with Python 2.7.9 on SUSE Linux Enterprise 11.
I have a file that contains multiple tables:
TABLE_A
col1,col2,...,col8
...
TABLE_B
col1,col2,...,col7
...
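TABLE_A has about 7300 rows and TABLE_B has about 100. I first scan the file to find where each table starts and ends. Then I call pandas read_csv() with the skiprows and nrows options to read the table I want into memory. I'm using engine='c'.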
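For reference, the scan step looks roughly like the sketch below. This is a simplified illustration, not my exact code: the helper name `find_table_bounds` is made up for this post, and it assumes each table name sits alone on its own line, followed by one header line, with no blank lines in between.

```python
import io


def find_table_bounds(handle, table_names):
    """Return {name: (skiprows, nrows)} for each table marker found.

    skiprows counts every line up to and including the marker line, so the
    next line read is the table's header. nrows counts data rows only.
    """
    markers = []  # (table name, zero-based index of its marker line)
    total = 0     # total number of lines seen
    for i, line in enumerate(handle):
        total = i + 1
        name = line.strip()
        if name in table_names:
            markers.append((name, i))

    bounds = {}
    for k, (name, start) in enumerate(markers):
        # A table ends where the next marker begins, or at end of file.
        end = markers[k + 1][1] if k + 1 < len(markers) else total
        # Subtract the marker line and the header line to get data rows.
        bounds[name] = (start + 1, end - start - 2)
    return bounds


# Tiny stand-in for the real file (hypothetical contents).
sample = io.StringIO(u"TABLE_A\na,b\n1,2\n3,4\nTABLE_B\nx,y,z\n5,6,7\n")
bounds = find_table_bounds(sample, {"TABLE_A", "TABLE_B"})
print(bounds)
```

The resulting pair for a table is what I pass to read_csv() as skiprows and nrows.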
I'm seeing strange behavior with engine='c'. I can read the first 4552 rows of TABLE_A without any problem, but if I try to read 4553 rows I get an error:
>>> df = pd.read_csv(f,engine='c',skiprows=1,nrows=4552)
>>> df.shape
(4552, 7)
>>> df = pd.read_csv(f,engine='c',skiprows=1,nrows=4553)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/python_pkgs/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/python_pkgs/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 301, in _read
    return parser.read(nrows)
  File "/python_pkgs/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 763, in read
    ret = self._engine.read(nrows)
  File "/python_pkgs/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1213, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
  File "pandas/parser.pyx", line 800, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8444)
  File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)
  File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 7421, saw 8
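Judging from the error message, the C parser seems to keep reading well past the requested rows and runs into TABLE_B, which has only 7 columns (TABLE_A has 8).
However, reading with engine='python' works fine: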
>>> df = pd.read_csv(f,engine='python',skiprows=1,nrows=6000)
>>> df.shape
(6000, 7)
>>>
Is this a bug or a feature/limitation? Perhaps the C parser works by reading in chunks? Thanks.