Sample CSV
time,type,-1,
time,type,0,w
time,type,1,a,12,b,13,c,15,name,apple
time,type,5,r,2,s,43,t,45,u,67,style,blue,font,13
time,type,11,a,12,c,15
time,type,5,r,2,s,43,t,45,u,67,style,green,font,15
time,type,1,a,12,b,13,c,15,name,apple
time,type,11,a,12,c,15
time,type,5,r,2,s,43,t,45,u,67,style,green,font,15
time,type,1,a,12,b,13,c,15,name,apple
time,type,5,r,2,s,43,t,45,u,67,style,yellow,font,9
time,type,19,b,12
type,19,b,42
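I want to filter each of "type,1", "type,5", "type,11" and "type,19" into its own pandas DataFrame for further analysis. What is the best way to do this? [Also, I will be ignoring "type,0" and "type,-1".]
Sample code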
import pandas as pd
type1_header = ['type','a','b','c','name']
type5_header = ['type','r','s','t','u','style','font']
type11_header = ['type','a','c']
type19_header = ['type','b']
type1_data = pd.read_csv(file_path_to_csv, usecols=[2,4,6,8,10] , names=type1_header)
type5_data = pd.read_csv(file_path_to_csv, usecols=[2,4,6,8,10,12,14] , names=type5_header)
Answer 0 (score: 1)
import pandas as pd

# Column names and value positions for each record type of interest
headers = {1: ['a', 'b', 'c', 'name'],
           5: ['r', 's', 't', 'u', 'style', 'font'],
           }
usecols = {1: [4, 6, 8, 10],
           5: [4, 6, 8, 10, 12, 14],
           }

# One empty frame per known type
frames = {}
for h in headers:
    frames[h] = pd.DataFrame(columns=headers[h])

count = 0
with open('irreg.csv') as f:
    for line in f:
        # Drop the trailing newline so the last field stays clean
        row = line.rstrip('\n').split(',')
        count += 1
        ID = int(row[2])
        row_subset = []
        if ID in frames:
            for col in usecols[ID]:
                row_subset.append(row[col])
            frames[ID].loc[len(frames[ID])] = row_subset
        else:
            print('WARNING: line %d: type %s not found' % (count, row[2]))
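A possible variation on the loop above, as a minimal sketch rather than a definitive implementation: collect the rows for each type in plain lists and build every DataFrame once at the end, which avoids growing a frame row by row. The function name `split_by_type` is hypothetical, and the `HEADERS` and `USECOLS` entries for types 11 and 19 are assumptions read off the sample rows. Lines whose third field is not an integer (such as the last sample row) or that are too short are skipped silently here.

```python
import pandas as pd

# Types 1 and 5 follow the answer; types 11 and 19 are assumptions
# inferred from the sample rows in the question.
HEADERS = {1: ['a', 'b', 'c', 'name'],
           5: ['r', 's', 't', 'u', 'style', 'font'],
           11: ['a', 'c'],
           19: ['b']}
USECOLS = {1: [4, 6, 8, 10],
           5: [4, 6, 8, 10, 12, 14],
           11: [4, 6],
           19: [4]}


def split_by_type(lines):
    """Return {type_id: DataFrame} built from an iterable of CSV lines."""
    rows = {type_id: [] for type_id in HEADERS}
    for line in lines:
        fields = line.rstrip('\n').split(',')
        try:
            type_id = int(fields[2])
        except (IndexError, ValueError):
            continue  # third field missing or not an integer
        cols = USECOLS.get(type_id)
        if cols is None or len(fields) <= cols[-1]:
            continue  # ignored type (e.g. 0, -1) or truncated row
        rows[type_id].append([fields[c] for c in cols])
    # Build each frame once; all values are kept as strings
    return {t: pd.DataFrame(rows[t], columns=HEADERS[t]) for t in HEADERS}
```

Usage would look like `frames = split_by_type(open('irreg.csv'))`, after which `frames[5]` holds the type 5 rows. Values stay as strings, so numeric columns would still need converting, for example with `pd.to_numeric`.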
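This works, but how often do you need to do it, and how often does the data change? For a one-off, it is probably easiest to split the incoming CSV file, e.g. with
grep type,19 irreg.csv > 19.csv
on the command line, and then import each CSV with its own headers and usecols.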
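The same split-then-import idea can also be sketched in Python without the shell step. This is only an illustration under stated assumptions: `read_one_type` is a hypothetical helper, and it matches the third field exactly rather than searching for a substring, so a row of type 190 would not be picked up when asking for type 19. The kept lines are handed to `pd.read_csv` with `header=None`, `usecols` and `names`.

```python
import io

import pandas as pd


def read_one_type(lines, type_id, usecols, names):
    """Parse only the lines whose third field equals type_id."""
    wanted = str(type_id)
    kept = []
    for line in lines:
        text = line.rstrip('\n')
        # [2:3] yields an empty list instead of raising on short lines
        if text.split(',')[2:3] == [wanted]:
            kept.append(text)
    if not kept:
        # Nothing matched: return an empty frame with the expected columns
        return pd.DataFrame(columns=names)
    buffer = io.StringIO('\n'.join(kept))
    return pd.read_csv(buffer, header=None, usecols=usecols, names=names)
```

For example, `read_one_type(open('irreg.csv'), 19, [4], ['b'])` would return the type 19 rows with a single column `b`. Because `read_csv` does the parsing, numeric columns get numeric dtypes.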