Say I have a text file like this:
A 12 16 91
A 22 56 31
A 17 25 22
B 34,543,683,123 34
A 19 27 32
B 45,48,113,523 64
A 11 24 72
C asd,asd,qwe ewr 123
Using pandas read_csv I can do:
import pandas as pd
from_csv = pd.read_csv('test.txt', sep=' ', header=None, names=['a','s','d','f'])
from_csv.head()
This works fine as long as the lines starting with B or C are not present.
How can I tell read_csv to read only the lines that start with A?
Answer 0 (score: 1)
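I agree with the other answer that you can do the filtering yourself, but I think it will be faster to read the file in chunks, keep only the lines you want, and then hand the result to a single pandas reader (instead of creating one per line):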
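from io import StringIO

def read_buffered(fle, keep):
    READ_SIZE = 10000  # size hint, in bytes, for each batch of lines
    with open(fle) as f:
        buff = StringIO()
        while True:
            # readlines(hint) returns whole lines totalling roughly READ_SIZE bytes
            readBuffer = f.readlines(READ_SIZE)
            if not readBuffer:
                break
            # keep only the lines whose first character matches `keep`
            buff.writelines([x for x in readBuffer if x[0] == keep])
        buff.seek(0)  # rewind so pandas reads from the start
    return buff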
You can then pass the returned object to pandas as if it were a file:
from_csv = pd.read_csv(read_buffered('test.txt', 'A'),
                       sep=' ', header=None, names=['a','s','d','f'])
from_csv.head()
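In my tests this was about twice as fast as the accepted solution (though that may depend on the fraction of lines you filter out and on whether two copies of the data fit in memory):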
In [128]: timeit pd.read_csv(read_buffered("test.txt","A"), sep=' ', header=None, names=['a','s','d','f'])
10 loops, best of 3: 22 ms per loop
In [129]: timeit read_only_csv("test.txt", "A", 0, sep=" ", columns=['a', 's', 'd', 'f'])
10 loops, best of 3: 45.7 ms per loop
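A simpler variant of the same idea, as a sketch only: iterating over a file object already reads it through an internal buffer, so the filter can be a single generator expression. The helper name `filter_lines` and the small sample file are made up for illustration, and I have not timed this against the versions above.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

# Hypothetical sample data, shaped like the file in the question.
SAMPLE = """A 12 16 91
A 22 56 31
B 34,543,683,123 34
A 19 27 32
C asd,asd,qwe ewr 123
"""


def filter_lines(path, prefix):
    """Return a StringIO holding only the lines of `path` that start with `prefix`."""
    with open(path) as f:
        # Iterating over the file yields one line at a time from a buffered read.
        return StringIO("".join(line for line in f if line.startswith(prefix)))


# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(SAMPLE)

try:
    df = pd.read_csv(filter_lines(tmp.name, "A"),
                     sep=" ", header=None, names=["a", "s", "d", "f"])
finally:
    os.remove(tmp.name)

print(df)
```

As with read_buffered, the filtered text is held in memory before pandas parses it, so the same memory caveat applies.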
Answer 1 (score: 0)
You can do the filtering yourself:
import pandas as pd
import csv

def read_only_csv(fle, keep, col, sep=",", **kwargs):
    # Parse each line with csv.reader and keep only rows whose field `col` equals `keep`
    with open(fle) as f:
        return pd.DataFrame.from_records(
            (r for r in csv.reader(f, delimiter=sep) if r[col] == keep),
            **kwargs)

df = read_only_csv("test.txt", "A", 0, sep=" ", columns=['a', 's', 'd', 'f'])
Which will give you:
   a   s   d   f
0  A  12  16  91
1  A  22  56  31
2  A  17  25  22
3  A  19  27  32
4  A  11  24  72
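For a file of about 80k lines, using read_csv and then filtering is still faster; the only benefit of this approach is that you won't use as much memory.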
In [24]: %%timeit
df = pd.read_csv('out.txt', sep=' ', header=None, names=['a','s','d','f'])
df = df[df["a"] == "A"]
....:
10 loops, best of 3: 31.8 ms per loop
In [25]: timeit read_only_csv("out.txt", "A", 0, sep=" ", columns=['a', 's', 'd', 'f'])
10 loops, best of 3: 41.1 ms per loop
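If memory is the concern but you still want read_csv to do the parsing, one possible middle ground is to read in chunks and filter each chunk before concatenating. This is a sketch under assumptions, not one of the timed solutions: the function name `read_filtered_in_chunks`, the tiny `chunksize`, and the in-memory sample are illustrative. Passing `dtype=str` is also my assumption, made so that column types stay the same across chunks; the numeric columns are converted afterwards. Rows with more fields than `names` may still trip the parser.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data, shaped like the file in the question.
SAMPLE = """A 12 16 91
A 22 56 31
B 34,543,683,123 34
A 19 27 32
C asd,asd,qwe ewr 123
"""


def read_filtered_in_chunks(source, keep, column="a", chunksize=2, **kwargs):
    """Read `source` chunk by chunk, keeping rows where `column` equals `keep`."""
    # With chunksize set, read_csv returns an iterator of DataFrames.
    reader = pd.read_csv(source, chunksize=chunksize, dtype=str, **kwargs)
    # Only the filtered rows of each chunk are retained.
    parts = [chunk[chunk[column] == keep] for chunk in reader]
    return pd.concat(parts, ignore_index=True)


df = read_filtered_in_chunks(StringIO(SAMPLE), "A",
                             sep=" ", header=None, names=["a", "s", "d", "f"])

# Everything was read as text, so convert the numeric columns explicitly.
numeric_cols = ["s", "d", "f"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df)
```

In real use the chunk size would be much larger (tens of thousands of rows); it is set to 2 here only so the small sample spans several chunks.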