我有300条带有df的df,这些行并不总是均匀分布的。他们看起来像这样:
Lags Rep 1 Rep 2 Rep 3
12.500000000E-9 7671.039418 6605.763724 10144.873125
25.000000000E-9 -1.000000 -0.479659 1.454251
37.500000000E-9 31.978402 23.456005 29.678136
50.000000000E-9 5.315013 4.723746 0.227125
62.500000000E-9 1.806673 2.642384 2.681376
75.000000000E-9 NaN NaN NaN
83.500000000E-9 NaN NaN NaN
Time PhtA count 1 PhtA count 2 PhtA count 3
0.000000000E+0 42.743683 10.890961 12.454987
2.428800000E-3 14.533997 8.125305 7.534027
4.857600000E-3 8.621216 7.686615 7.133484
7.286400000E-3 5.779266 10.147095 6.561279
9.715200000E-3 6.046295 8.201599 5.187988
12.144000000E-3 5.226135 7.343292 5.855560
Time PhtB count 1 PhtB count 2 PhtB count 3
0.860800000E-3 12.626648 13.580322 8.220673
1.289600000E-3 10.814667 21.381378 7.038116
2.718400000E-3 7.915497 17.261505 7.648468
3.147200000E-3 9.403229 21.266937 10.013580
分割时,最好有3个这样的df:
第一个df:
Lags Rep 1 Rep 2 Rep 3
12.500000000E-9 7671.039418 6605.763724 10144.873125
25.000000000E-9 -1.000000 -0.479659 1.454251
37.500000000E-9 31.978402 23.456005 29.678136
50.000000000E-9 5.315013 4.723746 0.227125
62.500000000E-9 1.806673 2.642384 2.681376
第二个df:
Time PhtA count 1 PhtA count 2 PhtA count 3
0.000000000E+0 42.743683 10.890961 12.454987
2.428800000E-3 14.533997 8.125305 7.534027
4.857600000E-3 8.621216 7.686615 7.133484
7.286400000E-3 5.779266 10.147095 6.561279
9.715200000E-3 6.046295 8.201599 5.187988
12.144000000E-3 5.226135 7.343292 5.855560
第三df
Time PhtB count 1 PhtB count 2 PhtB count 3
0.860800000E-3 12.626648 13.580322 8.220673
1.289600000E-3 10.814667 21.381378 7.038116
2.718400000E-3 7.915497 17.261505 7.648468
3.147200000E-3 9.403229 21.266937 10.013580
三个块的长度并不总是相同,这就是为什么我要寻求帮助以编程方式解决此问题的原因。我可以说的第一个df的一些细节是:
第一个块始终以一串值NaN(在示例中只有两个)的行结尾
还有另外两个以命名列标题开头的块(时间,PhtA计数1,PhtA计数2,...)
最后两个块没有任何NaN值
所有块的行数都是可变的,尽管标题始终相同
总是有一个空行分隔各个块
任何帮助将不胜感激。
谢谢。
答案 0 :(得分:2)
首先将所有数据读入保留空行的df中,然后在这些空行处将其拆分并转换为数字:
df = pd.read_csv('data.csv', sep='\s{2,}', skip_blank_lines=False, engine='python')
x = df[df.Lags.isnull()==True].index.values
df1 = df[0:x[0]].dropna().apply(pd.to_numeric)
df2 = df[x[0]+2:x[1]].apply(pd.to_numeric)
df2.columns=df.iloc[x[0]+1].values
df3 = df[x[1]+2:].apply(pd.to_numeric)
df3.columns = df.iloc[x[1]+1].values
print(df1);print(df2); print(df3)
的输出:
Lags Rep 1 Rep 2 Rep 3
0 1.250000e-08 7671.039418 6605.763724 10144.873125
1 2.500000e-08 -1.000000 -0.479659 1.454251
2 3.750000e-08 31.978402 23.456005 29.678136
3 5.000000e-08 5.315013 4.723746 0.227125
4 6.250000e-08 1.806673 2.642384 2.681376
Time PhtA count 1 PhtA count 2 PhtA count 3
9 0.000000 42.743683 10.890961 12.454987
10 0.002429 14.533997 8.125305 7.534027
11 0.004858 8.621216 7.686615 7.133484
12 0.007286 5.779266 10.147095 6.561279
13 0.009715 6.046295 8.201599 5.187988
14 0.012144 5.226135 7.343292 5.855560
Time PhtB count 1 PhtB count 2 PhtB count 3
17 0.000861 12.626648 13.580322 8.220673
18 0.001290 10.814667 21.381378 7.038116
19 0.002718 7.915497 17.261505 7.648468
20 0.003147 9.403229 21.266937 10.013580
df = pd.read_csv('data.csv', sep='\s{2,}', skip_blank_lines=False, engine='python', header=None)
x = [-1] + list(df[df.iloc[:,0].isnull()==True].index.values) + [len(df)]
for i in range(1,len(x)):
globals()[f'df{i}'] = df[x[i-1]+2:x[i]].dropna().apply(pd.to_numeric)
globals()[f'df{i}'].columns = df.iloc[x[i-1]+1].values