Reading a CSV with a tab delimiter produces an error

Date: 2016-01-16 17:18:11

Tags: python csv numpy io tab-delimited

I have a CSV file that uses a tab ('\t') as its delimiter. It contains 5 columns. I tried this:

import numpy as np 
#b=np.loadtxt(r'train_set.csv',dtype=str,delimiter=' ')
my_data = np.genfromtxt('train_set.csv', delimiter='\t')
print my_data

But I get the following error:

Traceback (most recent call last):
  File "./wordCloud.py", line 7, in <module>
    my_data = np.genfromtxt('train_set.csv', delimiter='\t')
  File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 1667, in genfromtxt
    raise ValueError(errmsg)
ValueError: Some errors were detected !
    Line #14 (got 4 columns instead of 5)
    Line #21 (got 4 columns instead of 5)
    Line #135 (got 4 columns instead of 5)

Any ideas? I don't know Python very well (yet :))!

The dataset (which I am also checking now) was shown in a screenshot that is not reproduced here.

Edit:

If I do this instead:

my_data = np.genfromtxt('train_set.csv', delimiter='    ')

then I get no error, but the output is:

[ nan  nan  nan ...,  nan  nan  nan]

The code from the answer gives these warnings:

...
    Line #26310 (got 4 columns instead of 5)
    Line #26383 (got 4 columns instead of 5)
    Line #26448 (got 4 columns instead of 5)
    Line #26489 (got 4 columns instead of 5)
    Line #26589 (got 4 columns instead of 5)
    Line #26593 (got 4 columns instead of 5)
    Line #26888 (got 4 columns instead of 5)
    Line #27002 (got 4 columns instead of 5)
    Line #27065 (got 4 columns instead of 5)
    Line #27234 (got 3 columns instead of 5)
    Line #27327 (got 4 columns instead of 5)
    Line #27418 (got 4 columns instead of 5)
    Line #27594 (got 4 columns instead of 5)
    Line #27827 (got 4 columns instead of 5)
    Line #27944 (got 4 columns instead of 5)
    Line #28074 (got 4 columns instead of 5)
    Line #28102 (got 4 columns instead of 5)
    Line #28147 (got 4 columns instead of 5)
    Line #28224 (got 4 columns instead of 5)
    Line #28264 (got 4 columns instead of 5)
    Line #28344 (got 4 columns instead of 5)
    Line #28484 (got 4 columns instead of 5)
  warnings.warn(errmsg, ConversionWarning)

The output also contains some odd characters, such as:

costing at least \xc2\xa3429

instead of costing at least £429

2 answers:

Answer 0: (score: 1)

Can you check lines 14, 21 and 135 of the CSV file? As the error states, these lines do not contain 5 columns (each of them contains 4).

If the 5th column is supposed to be blank, just insert a \t character at the end of each of those lines.
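
Before editing the file by hand, it may help to list the short rows and pad them programmatically. The following is a minimal Python 3 sketch, not part of the original answer. The helper names find_short_rows and pad_rows and the sample data are made up for illustration, and the sketch assumes that fields never contain embedded tabs or newlines.

```python
def find_short_rows(lines, expected, delimiter="\t"):
    """Return (line_number, field_count) for rows with fewer fields than expected."""
    short = []
    for number, line in enumerate(lines, start=1):
        count = len(line.rstrip("\n").split(delimiter))
        if count < expected:
            short.append((number, count))
    return short


def pad_rows(lines, expected, delimiter="\t"):
    """Append empty trailing fields so every row has at least `expected` fields."""
    padded = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        fields += [""] * (expected - len(fields))
        padded.append(delimiter.join(fields))
    return padded


# Hypothetical sample: the second row is missing its fifth field.
sample = ["a\tb\tc\td\te\n", "1\t2\t3\t4\n", "5\t6\t7\t8\t9\n"]
short_rows = find_short_rows(sample, expected=5)
fixed = pad_rows(sample, expected=5)
```

For a real file, the lines could come from something like open('train_set.csv').readlines(), and the padded rows could be written back out with a newline after each one.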

Looking at your data, this may be what you want:

my_data = np.genfromtxt('train_set.csv', delimiter='\t',
                        invalid_raise=False, skip_header=1,
                        dtype=None)

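invalid_raise: when set to False, lines with the wrong number of columns (here #14, #21 and #135) are skipped with a warning instead of raising an error. Please recheck those lines. (In LibreOffice: use 'Save As'.)

skip_header: the number of lines to skip at the start of the file; here it skips the single header row.

dtype: should be None so that the data type of each column is inferred from that column's contents.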
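
On the odd characters mentioned in the question's edit: the traceback shows Python 2.7, where text columns read this way come back as byte strings, and printing an array shows their escaped form. The sequence \xc2\xa3 is the UTF-8 encoding of £, so the data is most likely intact and only displayed as raw bytes. A minimal Python 3 sketch, assuming the file is UTF-8 encoded:

```python
# Bytes as they appeared in the printed output (UTF-8 encoding of the text).
raw = b"costing at least \xc2\xa3429"

# Decoding as UTF-8 recovers the pound sign; this assumes the file is UTF-8.
text = raw.decode("utf-8")
print(text)
```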

Answer 1: (score: 1)

I had the same problem. My data was correct (see below), but numpy reported errors like these:

Line #11787 (got 4 columns instead of 11)
Line #11838 (got 3 columns instead of 11)

I loaded the data with plain Python and then converted it to a numpy array. So instead of

tabOryg = numpy.genfromtxt(fn, dtype='str', delimiter='\t')

I did this:

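    # Strip the trailing newline so it does not end up in the last field.
    with open(fn) as f:
        datas = [line.rstrip('\n').split('\t') for line in f]
    tabOryg = numpy.array(datas, dtype='str')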

It works. I still wonder what is wrong with genfromtxt.
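
One possible explanation, offered as an assumption since the answer does not show its data: genfromtxt treats '#' as a comment character by default (comments='#'), so any line containing '#' is cut off at that point and appears to have too few columns. A plain split('\t') does not do this, which would explain why the manual approach works. A minimal Python 3 sketch with made-up sample data:

```python
import io

import numpy as np

# Hypothetical tab-separated data; the second line has '#' inside a field.
text = "id\ttitle\tscore\n1\tissue #42 fixed\t0.5\n2\tplain title\t0.7\n"

# Default comments='#': line 2 is truncated at '#', leaving 2 columns.
try:
    np.genfromtxt(io.StringIO(text), delimiter="\t", dtype=str, encoding="utf-8")
    default_failed = False
except ValueError:
    default_failed = True

# comments=None disables comment handling, so '#' is kept as ordinary data.
table = np.genfromtxt(
    io.StringIO(text), delimiter="\t", dtype=str, comments=None, encoding="utf-8"
)
```

If the real file has no '#' characters, other causes (such as stray tabs or missing trailing fields) would need to be checked instead.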