Reading CSV files into scipy/numpy in Python

Asked: 2010-05-18 17:05:19

Tags: python csv numpy matplotlib scipy

I am having trouble reading a tab-delimited CSV file in Python. I use the following function:

# assumes: from numpy import genfromtxt, array (and a parse_header helper defined elsewhere)
def csv2array(filename, skiprows=0, delimiter='\t', raw_header=False, missing=None, with_header=True):
    """
    Parse a file name into an array. Return the array and additional header lines. By default,
    parse the header lines into dictionaries, assuming the parameters are numeric,
    using 'parse_header'.
    """
    f = open(filename, 'r')
    skipped_rows = []
    for n in range(skiprows):
        header_line = f.readline().strip()
        if raw_header:
            skipped_rows.append(header_line)
        else:
            skipped_rows.append(parse_header(header_line))
    f.close()
    if missing:
        # note: in newer numpy, 'missing' and 'skiprows' are named
        # 'missing_values' and 'skip_header'
        data = genfromtxt(filename, dtype=None, names=with_header,
                          deletechars='', skiprows=skiprows, missing=missing)
    else:
        if delimiter != '\t':
            data = genfromtxt(filename, dtype=None, names=with_header, delimiter=delimiter,
                              deletechars='', skiprows=skiprows)
        else:
            data = genfromtxt(filename, dtype=None, names=with_header,
                              deletechars='', skiprows=skiprows)
    if data.ndim == 0:
        data = array([data.item()])
    return (data, skipped_rows)

The problem is that genfromtxt complains about my files, e.g. with the error:

Line #27100 (got 12 columns instead of 16)

I'm not sure where these errors come from. Any ideas?

Here is an example file that causes the problem:

#Gene   120-1   120-3   120-4   30-1    30-3    30-4    C-1 C-2 C-5 genesymbol  genedesc
ENSMUSG00000000001  7.32    9.5 7.76    7.24    11.35   8.83    6.67    11.35   7.12    Gnai3   guanine nucleotide binding protein alpha
ENSMUSG00000000003  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Pbsn    probasin
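One possible cause worth checking (my observation, not from the original post): when `delimiter` is `'\t'`, the `csv2array` function above never passes a `delimiter` argument to `genfromtxt`, so `genfromtxt` splits on arbitrary whitespace, and a multi-word field like "guanine nucleotide binding protein alpha" inflates the column count on some lines. Passing the tab delimiter explicitly keeps it as one field; a minimal sketch on an inlined two-row sample:

```python
import io

import numpy as np

sample = (
    "#Gene\t120-1\t120-3\tgenesymbol\tgenedesc\n"
    "ENSMUSG00000000001\t7.32\t9.5\tGnai3\tguanine nucleotide binding protein alpha\n"
    "ENSMUSG00000000003\t0.0\t0.0\tPbsn\tprobasin\n"
)

# With an explicit tab delimiter, the multi-word description stays a single
# field; without it, genfromtxt splits on any whitespace and the column
# count varies from row to row.
data = np.genfromtxt(io.StringIO(sample), delimiter='\t', dtype=None,
                     names=True, deletechars='', encoding='utf-8')
```

`names=True` reads the field names from the commented `#Gene ...` header line, and `deletechars=''` preserves the hyphens in names like `120-1`.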

Is there a better way to write a generic csv2array function? Thanks.

5 answers:

Answer 0 (score: 6)

Check out the Python csv module: http://docs.python.org/library/csv.html

import csv
reader = csv.reader(open("myfile.csv", "rb"), 
                    delimiter='\t', quoting=csv.QUOTE_NONE)

header = []
records = []
fields = 16

if thereIsAHeader: header = reader.next()

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)

# do numpy stuff.
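The trailing "# do numpy stuff." step might, for a mixed-type file like the sample in the question, amount to converting just the numeric columns (a sketch; the records and the column slice here are illustrative assumptions):

```python
import numpy as np

# hypothetical cleaned records: gene id, numeric values, annotation
records = [
    ['ENSMUSG00000000001', '7.32', '9.5', 'Gnai3'],
    ['ENSMUSG00000000003', '0.0', '0.0', 'Pbsn'],
]

# pull out the numeric middle columns as a float array
values = np.array([row[1:3] for row in records], dtype=float)
```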

Answer 1 (score: 2)

May I ask why you aren't using the built-in csv reader? http://docs.python.org/library/csv.html

I've used it very effectively with numpy/scipy. I would share my code, but unfortunately it is owned by my employer; writing your own should be very straightforward, though.
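Since the question asks for a generic csv2array, a minimal version built on the csv module might look like this (a sketch under my own assumptions: it returns strings and leaves type conversion to the caller, and the header-parsing of the original function is omitted):

```python
import csv

import numpy as np

def csv2array(filename, skiprows=0, delimiter='\t'):
    """Read a delimited text file into a NumPy array of strings.

    Returns (data, skipped_rows); the first `skiprows` rows are
    returned separately rather than parsed. A minimal sketch, not
    the poster's original function.
    """
    skipped_rows = []
    with open(filename, newline='') as f:
        reader = csv.reader(f, delimiter=delimiter)
        for _ in range(skiprows):
            skipped_rows.append(next(reader))
        data = np.array(list(reader), dtype=object)
    return data, skipped_rows
```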

Answer 2 (score: 0)

It probably comes from line 27100 of your data file... which has 12 columns instead of 16. That is, it has:

separator,1,2,3,4,5,6,7,8,9,10,11,12,separator

while genfromtxt was expecting something like this:

separator,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,separator

I'm not sure how you want to convert your data, but if you have irregular line lengths, the easiest approach is something like this:

lines = f.read().split('someseparator')  # f is an open file object
for line in lines:
    splitline = line.split(',')
    #do something with splitline

Answer 3 (score: 0)

I have used two approaches successfully: (1) if I just need to read an arbitrary CSV, I use the csv module (as other users have pointed out), and (2) if I need to repeatedly process a known CSV (or any) format, I write a simple parser.

Your problem seems to fall into the second category, and the parser should be very simple:

f = open('file.txt', 'r').readlines()
for line in f:
    tokens = line.strip().split('\t')
    gene = tokens[0]
    vals = [float(k) for k in tokens[1:10]]
    stuff = tokens[10:]
    # do something with gene, vals, and stuff

You can add a line to the reader to skip comment lines (`if tokens[0].startswith('#'): continue`) or to handle blank lines (`if tokens == ['']: continue`). You get the idea.
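Putting the parser and the two suggested guard lines together (a sketch: the 9-value slice follows the sample file's layout, and `parse_tsv` is a name of my own):

```python
def parse_tsv(path):
    """Parse a tab-separated file like the one in the question,
    skipping comment lines and blank lines."""
    results = []
    with open(path) as f:
        for line in f:
            tokens = line.strip().split('\t')
            if tokens == ['']:             # blank line
                continue
            if tokens[0].startswith('#'):  # comment line
                continue
            gene = tokens[0]
            vals = [float(k) for k in tokens[1:10]]
            stuff = tokens[10:]
            results.append((gene, vals, stuff))
    return results
```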

Answer 4 (score: 0)

I think Nick T's approach is the better way to go, with one change. I would replace the following code:

for row, record in enumerate(reader):
    if len(record) != fields:
        print "Skipping malformed record %i, contains %i fields (%i expected)" % \
            (row, len(record), fields)
    else:
        records.append(record)

import numpy as np

rows = list(reader)  # materialize first: an iterator does not have a length like a list or tuple
records = np.asarray([row for row in rows if len(row) == fields])
print('Number of skipped records: %i' % (len(rows) - len(records)))

The list comprehension, wrapped in np.asarray, produces a numpy array and takes advantage of precompiled libraries, which should speed things up considerably. Also, I would recommend using print() as a function rather than the print "" statement, since the former is the Python 3 standard and most likely the future, and I would use logging instead of printing.
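The two closing suggestions (print() as a function, and logging instead of print) might be combined like this sketch, with hypothetical rows standing in for the parsed CSV:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)

# hypothetical parsed rows; the middle one is malformed (wrong field count)
rows = [['g1', '1.0'], ['g2'], ['g3', '3.0']]
fields = 2

records = [row for row in rows if len(row) == fields]
# lazy %-style formatting defers string building until the message is emitted
log.info('Number of skipped records: %i', len(rows) - len(records))
```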