Break out of pandas read_csv when a condition is met while using chunks

Time: 2019-02-20 16:57:16

Tags: python pandas python-2.7 csv

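I am using pandas to read a text file and trim the data with read_csv. I would like to speed the program up by stopping read_csv once a certain string value is encountered, but I can't seem to do that when working with chunks. My data follows a general pattern like this: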

v 2298995.721525 14888281.709655 4538.717779 0.015686 0.035294 0.019608
v 2298996.930769 14888284.103022 4538.596748 0.023529 0.031373 0.027451
v 2299001.331951 14888295.376948 4538.696330 0.027451 0.043137 0.031373
... (about 4.5 million lines of this)

f 155739//155739 157296//157296 156114//156114
f 157296//157296 160780//160780 156113//156113
f 159990//159990 157296//157296 155739//155739
... (about 10 million lines of this)

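I can read and output the data, but I would save a lot of processing time if read_csv stopped as soon as the string 'f' is detected in the first column. Here is my current code: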

import pandas as pd
import sys

#assign names to columns
colnames = ['ID', 'X', 'Y', 'Z']

#assign chunk size
c_size = 200000

#read input file with space separated columns, strip header, and strip extra columns
for obj_chunk in pd.read_csv(sys.argv[1], sep=r'\s+', header=3, usecols=[0,1,2,3], chunksize=c_size):
    #note: this dict is never passed to read_csv, so it has no effect
    dtype = {'ID':str, 'X':int, 'Y':int, 'Z':int}
    obj_chunk.columns = colnames
    #drop the face rows, i.e. rows whose first column contains 'f'
    obj_chunk = obj_chunk[~obj_chunk.ID.str.contains('f')]
    #append the remaining X, Y, Z values to the output file
    obj_chunk.to_csv(sys.argv[2], index=False, header=False, columns=['X','Y','Z'], mode='a')

I have tried a few "if ... break" statements, but I think I am going about it the wrong way:

if obj_chunk[obj_chunk.ID.str.contains('f').any]: break

Thanks for your help!

1 answer:

Answer 0: (score: 0)

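One idea: if every row of the chunk DataFrame is kept after filtering, there was no 'f' in the first column. Otherwise some rows were dropped, and the row count is smaller than the chunk size (or the end of the file was reached). So I would test the number of rows inside the loop: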

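import pandas as pd
import sys

#assign names to columns (same as in the question)
colnames = ['ID', 'X', 'Y', 'Z']

#assign chunk size
c_size = 200000

for obj_chunk in pd.read_csv(sys.argv[1], sep=r'\s+', header=3, usecols=[0,1,2,3], chunksize=c_size):
    #note: this dict is never passed to read_csv, so it has no effect
    dtype = {'ID':str, 'X':int, 'Y':int, 'Z':int}
    obj_chunk.columns = colnames
    obj_chunk = obj_chunk[~obj_chunk.ID.str.contains('f')]
    obj_chunk.to_csv(sys.argv[2], index=False, header=False, columns=['X','Y','Z'], mode='a')
    #test number of rows: fewer than c_size means 'f' rows were dropped (or the file ended)
    if obj_chunk['ID'].count() < c_size:
        break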
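
A variation on the same idea, offered only as a sketch: instead of inferring the stop condition from the row count, build the boolean mask once and ask it directly with .any() (note the parentheses, which the attempt in the question leaves out). The function name iter_vertex_chunks and its skiprows parameter are made up for illustration; the sketch also assumes the first column is exactly 'v' or 'f' and reads everything as strings so the numbers are written back unchanged.

```python
import io
import pandas as pd

def iter_vertex_chunks(src, chunk_size=200000, skiprows=0):
    """Yield the 'v' rows chunk by chunk; stop after the first chunk holding an 'f' row.

    Hypothetical helper, not part of pandas. skiprows is the number of
    leading header lines to discard (assumed 0 here).
    """
    reader = pd.read_csv(src, sep=r'\s+', header=None, skiprows=skiprows,
                         usecols=[0, 1, 2, 3], names=['ID', 'X', 'Y', 'Z'],
                         dtype=str, chunksize=chunk_size)
    for chunk in reader:
        is_face = chunk['ID'] == 'f'   # boolean mask, one entry per row
        yield chunk[~is_face]          # keep only the vertex rows
        if is_face.any():              # an 'f' row was seen: later chunks are never parsed
            return

# Small in-memory demo; the values are invented sample data.
sample = io.StringIO(
    "v 1.5 2.5 3.5 0.1 0.2 0.3\n"
    "v 4.5 5.5 6.5 0.1 0.2 0.3\n"
    "\n"
    "f 1//1 2//2 3//3\n"
    "f 2//2 3//3 1//1\n"
)
vertices = pd.concat(iter_vertex_chunks(sample, chunk_size=2))
```

To write a file as in the question, the caller would loop over iter_vertex_chunks(sys.argv[1]) and call to_csv(sys.argv[2], index=False, header=False, columns=['X','Y','Z'], mode='a') on each chunk. Because read_csv with chunksize parses lazily, at most one chunk containing 'f' rows is read before the loop stops.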