pandas: problem with read_csv() and engine='c' (bug or feature?)

Asked: 2016-05-05 01:05:03

Tags: csv pandas

I am using pandas 0.18 with Python 2.7.9 on SUSE Linux Enterprise 11.

I have a file that contains multiple tables:

TABLE_A
col1,col2,...,col8
...


TABLE_B
col1,col2,...,col7
...
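Table A has about 7300 rows and table B about 100. I first scan the file to find where each table starts and ends. Then I call read_csv() in pandas with the skiprows and nrows options to read the corresponding table into memory. I use engine='c'.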
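To make the scanning step concrete, here is a minimal sketch of how the table boundaries could be located. It is not my actual code: the sample data, the `find_tables` helper, and the rule that a table starts at a line beginning with `TABLE_` and ends at a blank line are all assumptions for illustration.

```python
import io

# Hypothetical miniature of the layout described above: a title line,
# a header line, data rows, then blank lines between tables.
SAMPLE = """TABLE_A
a1,a2,a3
1,2,3
4,5,6


TABLE_B
b1,b2
7,8
"""


def find_tables(lines):
    """Return {table_name: (header_line_index, n_data_rows)}.

    Assumes each table starts with a line beginning with 'TABLE_'
    and ends at the first blank line or at end of file.
    """
    tables = {}
    name = None
    header_idx = None
    n_rows = 0
    for i, raw in enumerate(lines):
        line = raw.strip()
        if line.startswith("TABLE_"):
            # The header row is the line right after the title line.
            name, header_idx, n_rows = line, i + 1, 0
        elif name is not None and line == "":
            # A blank line closes the current table.
            tables[name] = (header_idx, n_rows)
            name = None
        elif name is not None and i > header_idx:
            # Any non-blank line after the header is a data row.
            n_rows += 1
    if name is not None:
        # The last table runs to the end of the file.
        tables[name] = (header_idx, n_rows)
    return tables


layout = find_tables(io.StringIO(SAMPLE))
print(layout)
```

The two numbers per table map onto the read_csv() arguments: the header line index is what I pass as skiprows, and the data row count is what I pass as nrows.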

I see strange behavior when using engine='c'. I can read the first 4552 rows of TABLE_A without any problem, but if I try to read 4553 rows I get an error:

>>> df = pd.read_csv(f,engine='c',skiprows=1,nrows=4552)
>>> df.shape
(4552, 7)

>>> df = pd.read_csv(f,engine='c',skiprows=1,nrows=4553)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/python_pkgs/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 529, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/python_pkgs/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 301, in _read
    return parser.read(nrows)
  File "/python_pkgs/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 763, in read
    ret = self._engine.read(nrows)
  File "/python_pkgs/lib/python2.7/site-packages/pandas-0.18.0-py2.7-linux-x86_64.egg/pandas/io/parsers.py", line 1213, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
  File "pandas/parser.pyx", line 800, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8444)
  File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)
  File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
  File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Expected 7 fields in line 7421, saw 8

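Judging from the error message, the C parser seems to keep reading well past the requested number of rows and runs into TABLE_B, which has only 7 columns (TABLE_A has 8).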

However, reading with engine='python' works fine.

>>> df = pd.read_csv(f,engine='python',skiprows=1,nrows=6000)
>>> df.shape
(6000, 7)
>>> 

Is this a bug or a feature/limitation? Perhaps the C parser works by reading in chunks? Thanks.
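Whatever the answer turns out to be, one workaround I can think of is to hand the parser only the lines that belong to a single table, so there is nothing beyond the table for it to read. The sketch below is untested against my real file and does not explain the behavior above. The sample data and the `read_table` helper are hypothetical, and the line offsets are assumed to come from an earlier scan of the file.

```python
import io
from itertools import islice

import pandas as pd

# Hypothetical miniature of the multi-table file.
SAMPLE = """TABLE_A
a1,a2,a3
1,2,3
4,5,6


TABLE_B
b1,b2
7,8
"""


def read_table(source, header_idx, n_rows):
    """Parse one table by copying only its lines into a private buffer.

    header_idx is the 0-based line index of the header row and n_rows
    the number of data rows (both assumed known from a prior scan).
    """
    # Take the header line plus n_rows data lines and nothing beyond.
    wanted = islice(source, header_idx, header_idx + 1 + n_rows)
    return pd.read_csv(io.StringIO("".join(wanted)), engine="c")


table_a = read_table(io.StringIO(SAMPLE), header_idx=1, n_rows=2)
table_b = read_table(io.StringIO(SAMPLE), header_idx=7, n_rows=1)
print(table_a.shape, table_b.shape)
```

This costs an extra copy of each table's text in memory, which should be acceptable for tables of a few thousand rows.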

0 answers:

No answers yet