I'm reading a text file with Pandas and trimming the data with read_csv. I'd like to speed the program up by stopping read_csv once a certain string value is reached, but I can't seem to do that while processing in chunks. My data follows a regular pattern like this:
v 2298995.721525 14888281.709655 4538.717779 0.015686 0.035294 0.019608
v 2298996.930769 14888284.103022 4538.596748 0.023529 0.031373 0.027451
v 2299001.331951 14888295.376948 4538.696330 0.027451 0.043137 0.031373
... (about 4.5 million lines of this)
f 155739//155739 157296//157296 156114//156114
f 157296//157296 160780//160780 156113//156113
f 159990//159990 157296//157296 155739//155739
... (about 10 million lines of this)
I can read and output the data, but I could save a lot of processing time if read_csv stopped as soon as the string 'f' is detected in the first column. Here is my current code:
import pandas as pd
import sys
#assign names to columns
colnames = ['ID', 'X', 'Y', 'Z']
#assign chunk size
c_size=200000
#read input file with space separated columns, strip header, and strip extra columns
for obj_chunk in pd.read_csv(sys.argv[1], sep='\s+', header=3, usecols=[0,1,2,3], chunksize=c_size):
    dtype={'ID':str,'X':int, 'Y':int, 'Z':int}
    obj_chunk.columns = colnames
    obj_chunk = obj_chunk[~obj_chunk.ID.str.contains('f')]
    obj_chunk.to_csv(sys.argv[2], index=False, header=None, columns=['X','Y','Z'], mode='a')
I've tried a few 'if'/'break' statements, but I think I'm heading in the wrong direction:
if obj_chunk[obj_chunk.ID.str.contains('f').any]: break
Thanks for your help!
Answer 0 (score: 0)
One idea: if every row of a chunk DataFrame is kept, it means there was no 'f' in that chunk's first column. Conversely, if some rows were filtered out (or the end of the file was reached), the remaining row count will be less than the chunk size. So I would test the number of rows inside the loop:
#assign chunk size
c_size = 200000

for obj_chunk in pd.read_csv(sys.argv[1], sep='\s+', header=3, usecols=[0,1,2,3], chunksize=c_size):
    dtype={'ID':str,'X':int, 'Y':int, 'Z':int}
    obj_chunk.columns = colnames
    obj_chunk = obj_chunk[~obj_chunk.ID.str.contains('f')]
    obj_chunk.to_csv(sys.argv[2], index=False, header=None, columns=['X','Y','Z'], mode='a')
    #test number of lines kept from the chunk
    if obj_chunk['ID'].count() < c_size:
        break
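If you would rather stop on the chunk that first contains an 'f' (instead of inferring it from the row count), the same loop can test the ID column directly with .any(). Below is an untested sketch that reuses the file arguments, header offset, and column layout assumed from the question; breaking out of the chunk loop stops further parsing, so most of the ~10 million 'f' lines are never read:

import sys
import pandas as pd

colnames = ['ID', 'X', 'Y', 'Z']
c_size = 200000

for obj_chunk in pd.read_csv(sys.argv[1], sep='\s+', header=3,
                             usecols=[0, 1, 2, 3], chunksize=c_size):
    obj_chunk.columns = colnames
    # True as soon as this chunk contains a face ('f') row
    has_f = obj_chunk.ID.str.contains('f').any()
    # keep only the vertex ('v') rows and append them to the output file
    obj_chunk = obj_chunk[~obj_chunk.ID.str.contains('f')]
    obj_chunk.to_csv(sys.argv[2], index=False, header=None,
                     columns=['X', 'Y', 'Z'], mode='a')
    if has_f:
        # every later chunk holds only 'f' rows, so stop reading here
        break

This still parses the one chunk in which the 'v' rows end, but it avoids relying on a partially filled final chunk to trigger the break.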