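I want to write a Python program where I can enter a dataset number and it retrieves the table with that number. I have a working solution for a section of the data, but my final CSV file is 3 GB, and using that file causes a memory error. So what I am trying to build is an auxiliary CSV file that stores the positions of the start and end of each table.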
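import pandas as pd

# Columns C and D are not in the file; they are filled in below.
bl = pd.read_csv(r"C:\Users\K\csvsectionofdata.csv", names=["A", "B", "C", "D"])
bl.loc[0:3, 'D'] = [1, 1, 1, 1]
bl.loc[0:3, 'C'] = [0, 1, 2, 3]
# C = row position within the current table, D = running table number.
for i in range(1, len(bl)):
    if bl.loc[i, "B"] == "intensity":
        # A header row starts a new table.
        bl.loc[i, 'C'] = 0
        bl.loc[i, 'D'] = bl.loc[i-1, 'D'] + 1
    else:
        bl.loc[i, 'C'] = bl.loc[i-1, 'C'] + 1
        bl.loc[i, 'D'] = bl.loc[i-1, 'D']

sl = []
s = int(input()) - 1
print("--------")
# top is the header row of the requested table, btm the header row of the next one.
top = int(bl.D.searchsorted(s, side='right'))
btm = int(bl.D.searchsorted(s + 1, side='right'))
# Skip the header row and the "," separator row.
sl = bl.iloc[(top + 1):(btm - 1), :]
print(sl)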
This is what "bl" looks like:
m/z,intensity
1,5
2,250
,
m/z,intensity
12,10
13,27
14,51
15,222
16,250
17,4
,
m/z,intensity
12,2
13,7
14,19
15,189
16,250
17,3
,
m/z,intensity
12,7
The resulting CSV would then look like:
Start,End
0,2
4,10
12,18
20.......
Is there a more Pythonic way to do this, other than loading the whole dataset into memory?
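For reference, a Start,End index like the one above could probably be built in a single pass that reads one line at a time, so the 3 GB file never has to fit in memory. This is only a minimal sketch: the function name build_table_index and its parameters are made up for illustration, and it assumes the file is comma-separated, that every table begins with a row whose second field is "intensity", and that tables are separated by an empty "," row, as in the sample shown.

```python
import io

def build_table_index(lines, header_marker="intensity", delimiter=","):
    """Return (start, end) row numbers per table.

    start is the header row, end is the last data row (inclusive).
    `lines` can be any iterable of text lines, e.g. an open file.
    """
    index = []
    start = None
    last_data = None
    for row_number, line in enumerate(lines):
        fields = line.strip().split(delimiter)
        if len(fields) > 1 and fields[1] == header_marker:
            # A header row closes the previous table and opens a new one.
            if start is not None:
                index.append((start, last_data))
            start = row_number
            last_data = row_number
        elif start is not None and any(fields):
            # Non-empty data row: extend the current table.
            last_data = row_number
    if start is not None:
        index.append((start, last_data))
    return index

# The sample from the question, used here in place of the real file.
SAMPLE = """\
m/z,intensity
1,5
2,250
,
m/z,intensity
12,10
13,27
14,51
15,222
16,250
17,4
,
m/z,intensity
12,2
13,7
14,19
15,189
16,250
17,3
,
m/z,intensity
12,7
"""

index = build_table_index(io.StringIO(SAMPLE))
print(index)  # [(0, 2), (4, 10), (12, 18), (20, 21)]
```

With a real file, the same function could be called as build_table_index(open(path)) and the pairs written out with csv.writer under a Start,End header.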
Answer 0: (score: 0)
I actually figured it out on my own! Here's my solution:
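import pandas as pd

bl = pd.read_csv(r"C:\Users\list.csv", names=["A","B","C","D"], sep=";")
x = int(input())  # zero-based table number
# Row positions of every header row (the rows where column B is "intensity").
numbr = pd.DataFrame({'A':bl.query('B == "intensity"').index})
start = numbr.iat[x,0]
# The last table has no following header, so it runs to the end of the data.
end = numbr.iat[(x + 1),0] if x + 1 < len(numbr) else len(bl)
print('---------')
print(start)
print(end)
print('----')
# Rows from this header up to (not including) the next header.
print(bl.iloc[start:end,0:2])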
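Note that the snippet above still reads the whole file with read_csv, so it may hit the same memory limit on the 3 GB file. If a (Start, End) pair for a table is already known, only those rows need to be kept in memory. The following is a minimal sketch of that idea using the standard library; read_table is a hypothetical helper, and it assumes Start is the header row and End is the last data row (inclusive), as in the Start,End example in the question, with a comma as the delimiter.

```python
import csv
import io
from itertools import islice

def read_table(lines, start, end, delimiter=","):
    """Return the data rows of one table as lists of strings.

    start is the header row number, end the last data row (inclusive).
    Lines before the table are read and discarded one at a time, never stored.
    """
    wanted = islice(lines, start + 1, end + 1)  # skip the header row itself
    return list(csv.reader(wanted, delimiter=delimiter))

# First two tables of the sample from the question, in place of the real file.
SAMPLE = """\
m/z,intensity
1,5
2,250
,
m/z,intensity
12,10
13,27
14,51
15,222
16,250
17,4
"""

rows = read_table(io.StringIO(SAMPLE), 4, 10)
print(rows)
```

This still scans the file from the top on every lookup; storing byte offsets instead of row numbers and using seek() would avoid that, at the cost of a slightly more involved index.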