熊猫没有标题读取csv(可能在那里)

时间:2015-04-09 13:22:46

标签: python numpy pandas

我正在尝试以块(python-engine)读取.csv文件并跳过标题(或以注释字符开头的任何行)。如果文件有标题,则不知道先验,因此不可能只跳过第一行,因为它可能已经是数据行。

设置header=None确实解决了问题。如果我调用get_chunk并想要行值,我仍然会得到标题/或注释行。

所需的输出与numpy.loadtxt()

相同

下面的代码演示了正在发生的事情:

import numpy as np
from pandas.io.parsers import TextFileReader
fn = '/tmp/test.csv'
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense")
print np.loadtxt(fn).shape # output (100,3)

reader = TextFileReader(fn, chunksize=10, header=None)
reader.get_chunk().values

# output
array([['#', 'makes', 'no', 'sense'],
       ['0.000000000000000000e+00', '1.000000000000000000e+00',
        '2.000000000000000000e+00', None],
       ['3.000000000000000000e+00', '4.000000000000000000e+00',
        '5.000000000000000000e+00', None],
       ['6.000000000000000000e+00', '7.000000000000000000e+00',
        '8.000000000000000000e+00', None],
       ['9.000000000000000000e+00', '1.000000000000000000e+01',
        '1.100000000000000000e+01', None],
       ['1.200000000000000000e+01', '1.300000000000000000e+01',
        '1.400000000000000000e+01', None],
       ['1.500000000000000000e+01', '1.600000000000000000e+01',
        '1.700000000000000000e+01', None],
       ['1.800000000000000000e+01', '1.900000000000000000e+01',
        '2.000000000000000000e+01', None],
       ['2.100000000000000000e+01', '2.200000000000000000e+01',
        '2.300000000000000000e+01', None],
       ['2.400000000000000000e+01', '2.500000000000000000e+01',
        '2.600000000000000000e+01', None]], dtype=object)

如果我通过

指定评论字符
   reader = TextFileReader(fn, chunksize=10, header=None, comment='#')

我得到一个例外:

In [99]: reader = pandas.io.parsers.TextFileReader('/tmp/test.csv', chunksize=10, header=None, index_col=False, comment="#")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-99-64b1c0bce4ef> in <module>()
----> 1 reader = pandas.io.parsers.TextFileReader('/tmp/test.csv', chunksize=10, header=None, index_col=False, comment="#")

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    560             self.options['has_index_names'] = kwds['has_index_names']
    561 
--> 562         self._make_engine(self.engine)
    563 
    564     def _get_options_with_defaults(self, engine):

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_engine(self, engine)
    703             elif engine == 'python-fwf':
    704                 klass = FixedWidthFieldParser
--> 705             self._engine = klass(self.f, **self.options)
    706 
    707     def _failover_to_python(self):

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in __init__(self, f, **kwds)
   1400         # Set self.data to something that can read lines.
   1401         if hasattr(f, 'readline'):
-> 1402             self._make_reader(f)
   1403         else:
   1404             self.data = f

/home/marscher/anaconda/lib/python2.7/site-packages/pandas/io/parsers.pyc in _make_reader(self, f)
   1505                 self.pos += 1
   1506                 self.line_pos += 1
-> 1507                 sniffed = csv.Sniffer().sniff(line)
   1508                 dia.delimiter = sniffed.delimiter
   1509                 if self.encoding is not None:

/home/marscher/anaconda/lib/python2.7/csv.pyc in sniff(self, sample, delimiters)
    180 
    181         quotechar, doublequote, delimiter, skipinitialspace = \
--> 182                    self._guess_quote_and_delimiter(sample, delimiters)
    183         if not delimiter:
    184             delimiter, skipinitialspace = self._guess_delimiter(sample,

/home/marscher/anaconda/lib/python2.7/csv.pyc in _guess_quote_and_delimiter(self, data, delimiters)
    221                       '(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
    222             regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
--> 223             matches = regexp.findall(data)
    224             if matches:
    225                 break

TypeError: expected string or buffer

编辑此错误是由于未将评论包装在列表中引起的。

1 个答案:

答案 0 :(得分:1)

我知道这已经超级老了,而且我从来没有弄清楚你的评论错误发生了什么(你对问题的澄清并没有为我解决,但我认为它有一些东西给我与调用类而不是函数有关),但是一些修改提供了我认为你正在寻找的输出。

首先,如果您告诉读者没有标题,它会将任何标题行解释为数据,同时确定读入的数据的形状和类型(例如,数字的字符串格式)。它可以推断是否有标题,不会搞砸形状,将评论作为一个单独的问题。

import numpy as np
from pandas.io.parsers import TextFileReader
fn = '/tmp/test.csv'
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense")
np.loadtxt(fn).shape # output (100,3)

reader = TextFileReader(fn, chunksize=10, header='infer')
reader.get_chunk().values

#output, just inferring headers
array([[  0.,   1.,   2.,  nan],
   [  3.,   4.,   5.,  nan],
   [  6.,   7.,   8.,  nan],
   [  9.,  10.,  11.,  nan],
   [ 12.,  13.,  14.,  nan],
   [ 15.,  16.,  17.,  nan],
   [ 18.,  19.,  20.,  nan],
   [ 21.,  22.,  23.,  nan],
   [ 24.,  25.,  26.,  nan],
   [ 27.,  28.,  29.,  nan]])

nan来自将注释行解释为标题(虽然它也被注释掉了),它有四个部分。

您可以通过更改保存文本的方式来删除标题上的注释标记。

np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no      sense",comments=None)
reader = TextFileReader(fn, chunksize=10, header='infer')
reader.get_chunk().values
#output, without true header commented out
array([[  0.,   1.,   2.],
   [  3.,   4.,   5.],
   [  6.,   7.,   8.],
   [  9.,  10.,  11.],
   [ 12.,  13.,  14.],
   [ 15.,  16.,  17.],
   [ 18.,  19.,  20.],
   [ 21.,  22.,  23.],
   [ 24.,  25.,  26.],
   [ 27.,  28.,  29.]])

这消除了注释掉标题的问题,但没有帮助推断出正确的形状,或者如果你有真正的评论,你也想忽略。

如果你想推断是否有标题,也忽略任何注释行,我只能通过调用函数来弄清楚如何做到这一点。

import pandas
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense")
reader = pandas.read_csv(fn,chunksize=10,header='infer',comment="#")
reader.get_chunk().values
#output, treating the header as a comment, so shape is decided by first data line
array([[ '3.000000000000000000e+00 4.000000000000000000e+00 5.000000000000000000e+00'],
   [ '6.000000000000000000e+00 7.000000000000000000e+00 8.000000000000000000e+00'],
   [ '9.000000000000000000e+00 1.000000000000000000e+01 1.100000000000000000e+01'],
   [ '1.200000000000000000e+01 1.300000000000000000e+01 1.400000000000000000e+01'],
   [ '1.500000000000000000e+01 1.600000000000000000e+01 1.700000000000000000e+01'],
   [ '1.800000000000000000e+01 1.900000000000000000e+01 2.000000000000000000e+01'],
   [ '2.100000000000000000e+01 2.200000000000000000e+01 2.300000000000000000e+01'],
   [ '2.400000000000000000e+01 2.500000000000000000e+01 2.600000000000000000e+01'],
   [ '2.700000000000000000e+01 2.800000000000000000e+01 2.900000000000000000e+01'],
   [ '3.000000000000000000e+01 3.100000000000000000e+01 3.200000000000000000e+01']], dtype=object)

#Or, without the commented out header
np.savetxt(fn, np.arange(300).reshape(100,3), header="makes no sense",comments='')
reader = pandas.read_csv(fn,chunksize=10,header='infer',comment="#")
reader.get_chunk().values
#output, treating the header as a header to determine shape, but comments would also be ignored
array([[ '0.000000000000000000e+00 1.000000000000000000e+00 2.000000000000000000e+00'],
   [ '3.000000000000000000e+00 4.000000000000000000e+00 5.000000000000000000e+00'],
   [ '6.000000000000000000e+00 7.000000000000000000e+00 8.000000000000000000e+00'],
   [ '9.000000000000000000e+00 1.000000000000000000e+01 1.100000000000000000e+01'],
   [ '1.200000000000000000e+01 1.300000000000000000e+01 1.400000000000000000e+01'],
   [ '1.500000000000000000e+01 1.600000000000000000e+01 1.700000000000000000e+01'],
   [ '1.800000000000000000e+01 1.900000000000000000e+01 2.000000000000000000e+01'],
   [ '2.100000000000000000e+01 2.200000000000000000e+01 2.300000000000000000e+01'],
   [ '2.400000000000000000e+01 2.500000000000000000e+01 2.600000000000000000e+01'],
   [ '2.700000000000000000e+01 2.800000000000000000e+01 2.900000000000000000e+01']], dtype=object)