I have a tab-separated file with 1 billion lines (imagine 200+ columns rather than 3):
abc -0.123 0.6524 0.325
foo -0.9808 0.874 -0.2341
bar 0.23123 -0.123124 -0.1232
If the number of columns is unknown, how do I find the number of columns in a tab-separated file?
I tried this:
import io
with io.open('bigfile', 'r') as fin:
    num_columns = len(fin.readline().split('\t'))
And this (from @EdChum, Read a tab separated file with first column as key and the rest as values):
import pandas as pd
num_columns = pd.read_csv('bigfile', sep=r'\s+', nrows=1).shape[1]
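How else can I get the number of columns? Which method is the most efficient? (Imagine I suddenly receive a file with an unknown number of columns, say more than 1 million.)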
Answer 0 (score: 2)
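Here are some timings for a file with 100000 columns. count seems fastest, but it is off by one (it returns the number of tabs, which is one fewer than the number of columns):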
In [14]: %%timeit
with open("test.csv") as f:
    r = csv.reader(f, delimiter="\t")
    len(next(r))
....:
10 loops, best of 3: 88.7 ms per loop
In [15]: %%timeit
with open("test.csv") as f:
    next(f).count("\t")
....:
100 loops, best of 3: 11.9 ms per loop
with io.open('test.csv', 'r') as fin:
    num_columns = len(next(fin).split('\t'))
....:
10 loops, best of 3: 133 ms per loop
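To make the off-by-one correction explicit, here is a minimal sketch that wraps the count approach in a function. The name count_columns and its delimiter parameter are illustrative and not part of the original answer; it assumes the first line of the file is representative of every row.

```python
def count_columns(path, delimiter="\t"):
    """Return the number of delimiter-separated fields on the first line."""
    with open(path, "r") as f:
        first_line = f.readline()
    if not first_line:
        # Empty file: there is no first line, so report zero columns.
        return 0
    # N delimiters separate N + 1 fields, hence the + 1.
    return first_line.count(delimiter) + 1
```

Only the first line is read, so the cost depends on the width of one row and not on the billion rows that follow.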
Using str.translate actually seems to be the fastest, although again you need to add 1:
In [5]: %%timeit
with open("test.csv") as f:
    n = next(f)
    (len(n) - len(n.translate(None, "\t")))
...:
100 loops, best of 3: 9.9 ms per loop
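Note that n.translate(None, "\t") relies on the Python 2 str.translate signature. In Python 3, str.translate takes a single mapping table, so a rough equivalent might look like the sketch below. The function name count_columns_translate is illustrative, and no timing is claimed for this variant.

```python
def count_columns_translate(path):
    """Count columns on the first line by measuring how many tabs get deleted."""
    with open(path) as f:
        # next(f, "") returns an empty string instead of raising on an empty file.
        line = next(f, "")
    # In Python 3, mapping a code point to None deletes that character.
    stripped = line.translate({ord("\t"): None})
    # The length difference is the number of tabs; columns = tabs + 1.
    return len(line) - len(stripped) + 1
```

For an empty file this sketch returns 1, because it cannot distinguish "no line" from "one line without tabs"; add a check on line if that matters.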
The pandas solution gives me an error:
in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7977)()
StopIteration:
Using readline adds more overhead:
In [19]: %%timeit
with open("test.csv") as f:
    f.readline().count("\t")
....:
10 loops, best of 3: 28.9 ms per loop
In [30]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(fin.readline().split('\t'))
....:
10 loops, best of 3: 136 ms per loop
The results are different with Python 3.4:
In [7]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(next(fin).split('\t'))
...:
10 loops, best of 3: 102 ms per loop
In [8]: %%timeit
with open("test.csv") as f:
    f.readline().count("\t")
...:
100 loops, best of 3: 12.7 ms per loop
In [9]: %%timeit
with open("test.csv") as f:
    next(f).count("\t")
...:
100 loops, best of 3: 11.5 ms per loop
In [10]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(next(fin).split('\t'))
....:
10 loops, best of 3: 89.9 ms per loop
In [11]: %%timeit
with io.open('test.csv', 'r') as fin:
    num_columns = len(fin.readline().split('\t'))
....:
10 loops, best of 3: 92.4 ms per loop
In [13]: %%timeit
with open("test.csv") as f:
    r = csv.reader(f, delimiter="\t")
    len(next(r))
....:
10 loops, best of 3: 176 ms per loop
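Since the numbers above vary between Python versions, it may be worth re-running the comparison on your own machine. The following is a sketch of how that could be done outside IPython with the standard timeit module. The helper names (by_count, by_split, by_csv, make_test_file, benchmark) and the generated file contents are assumptions for illustration; the original test.csv is not available, so absolute timings will differ.

```python
import csv
import os
import tempfile
import timeit

def by_count(path):
    with open(path) as f:
        return next(f).count("\t") + 1  # tabs + 1 = columns

def by_split(path):
    with open(path) as f:
        return len(next(f).split("\t"))

def by_csv(path):
    with open(path, newline="") as f:
        return len(next(csv.reader(f, delimiter="\t")))

def make_test_file(n_columns, n_rows=3):
    """Write a small tab-separated file with n_columns fields per row."""
    fd, path = tempfile.mkstemp(suffix=".tsv")
    row = "\t".join(["0.5"] * n_columns)
    with os.fdopen(fd, "w") as f:
        for _ in range(n_rows):
            f.write(row + "\n")
    return path

def benchmark(n_columns=100000, repeat=10):
    """Return the average seconds per call for each method."""
    path = make_test_file(n_columns)
    try:
        results = {}
        for func in (by_count, by_split, by_csv):
            # Check correctness before timing anything.
            assert func(path) == n_columns
            total = timeit.timeit(lambda: func(path), number=repeat)
            results[func.__name__] = total / repeat
        return results
    finally:
        os.remove(path)

if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds * 1e3:.2f} ms per call")
```

Each method is verified against the known column count before it is timed, so a fast but wrong result (such as a missing + 1) is caught immediately.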
Answer 1 (score: 0)
There is the str.count() method:
h = open('path', 'r')  # use the built-in open(); there is no file.open()
columns = h.readline().count('\t') + 1  # N tabs separate N + 1 columns
h.close()
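If the first line is extremely wide (the question mentions more than 1 million columns), one possible variation is to count tabs in fixed-size binary chunks, so the whole line never has to be held in memory as a decoded string. This is a sketch rather than a measured improvement. The name count_columns_chunked and the chunk_size default are illustrative, and it assumes an ASCII-compatible encoding such as UTF-8, where the tab and newline bytes never occur inside a multi-byte character.

```python
def count_columns_chunked(path, chunk_size=1 << 20):
    """Count columns on the first line by scanning raw bytes chunk by chunk."""
    tabs = 0
    seen_data = False
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file reached before any newline
            seen_data = True
            newline_at = chunk.find(b"\n")
            if newline_at != -1:
                # Count only the tabs that come before the end of the first line.
                tabs += chunk.count(b"\t", 0, newline_at)
                break
            tabs += chunk.count(b"\t")
    # An empty file has zero columns; otherwise columns = tabs + 1.
    return tabs + 1 if seen_data else 0
```

The function stops reading as soon as it sees the first newline, so the remaining rows of the file are never touched.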