Question

I have a dataframe like this:

   X       Y        Z   num abscissa
0.93103 2.02765  0.00000 0   0.000
2.83744 1.92627  0.00000 0   1.909
5.09852 1.11521  0.00000 0   4.311
8.06897 1.16590  0.00000 0   7.282
0.93103 2.99078  0.00000 1   0.000
3.28079 3.90323  0.00000 1   2.521
4.83251 3.90323  0.00000 1   4.072
6.16256 3.14286  0.00000 1   5.604
7.22660 3.39631  0.00000 1   6.698
7.58128 2.17972  0.00000 1   7.966
0.66502 6.13364  0.00000 2   0.000
1.46305 5.72811  0.00000 2   0.895
2.57143 5.62673  0.00000 2   2.008
4.30049 5.42396  0.00000 2   3.749
5.23153 5.37327  0.00000 2   4.681
6.11823 5.32258  0.00000 2   5.570
0.75369 10.0833  0.00000 3   0.000
2.12808 10.0833  0.00000 3   1.374
4.07882 9.93056  0.00000 3   3.331
5.32020 9.62500  0.00000 3   4.610
6.78325 9.26852  0.00000 3   6.115
0.35468 10.4398  0.00000 4   0.000
3.85714 10.4398  0.00000 4   3.502
7.98030 10.2361  0.00000 4   7.631

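I would like to create a new dataframe by upsampling (or downsampling) each "num" group along the "abscissa" column at a fixed spacing that I choose. Here is a working example, but it uses a for loop, which makes execution very slow when the original file contains many rows.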

import pandas as pd
import numpy as np
from scipy import interpolate

fileIN = 'streamlines_file.csv'
fileOUT = 'streamlines_sampled.csv'

# Read the streamlines; header=0 replaces the file's header row with these names
stream = pd.read_csv(fileIN, header=0, names=['X', 'Y', 'Z', 'num', 'abscissa'])

# Sorted ids, so that position i matches the (sorted) groupby results below
numstream = sorted(set(stream['num']))

df_new = np.zeros(shape=(0, 5))
dist_sampling = 1.5  # target spacing along 'abscissa'
# Per-streamline length and number of resampled points
LL = list(stream.groupby('num')['abscissa'].max())
NP = list((stream.groupby('num')['abscissa'].max() / dist_sampling).astype(int) + 1)

for i, j in enumerate(numstream):
    rows = stream[stream['num'] == j]  # boolean mask over the whole frame, once per streamline
    # Interpolate all five columns against this streamline's abscissa
    f = interpolate.interp1d(np.array(rows['abscissa']), np.array(rows), axis=0)
    new = np.linspace(0.0, LL[i], NP[i])
    datanew = f(new)
    df_new = np.append(df_new, datanew, axis=0)  # copies the whole array each iteration

df_out = pd.DataFrame(df_new, columns=['X', 'Y', 'Z', 'num', 'abscissa'])
df_out.to_csv(fileOUT, index=False)

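By using "groupby" I was able to move the computation of the sampling intervals out of the loop, but I don't know how to perform the interpolation itself without a loop. My original file is large (30000 "num" values, each with several points), and each interpolation in the loop takes about 0.25 s, which adds up to more than 2 hours of computation time.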

Is there any way to speed this up? I know I have to get rid of the for loop, but I don't know how.

Thanks
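
For reference, here is a minimal sketch of one way the loop might be avoided. It is not a tested solution for the real file. The idea is to shift every streamline onto its own disjoint interval of a single global axis, then call `np.interp` once per column. The function name `resample_streamlines` and the gap of 1.0 between intervals are assumptions made for illustration. The sketch also assumes that "abscissa" is strictly increasing within each "num" group, and it swaps `scipy.interpolate.interp1d` for `np.interp` (both are linear here).

```python
import numpy as np
import pandas as pd


def resample_streamlines(stream, dist_sampling=1.5):
    """Resample every 'num' group along 'abscissa' without a Python loop.

    Hypothetical helper: assumes columns X, Y, Z, num, abscissa and an
    abscissa that is strictly increasing inside each group.
    """
    # Make each streamline contiguous with increasing abscissa.
    s = stream.sort_values(["num", "abscissa"]).reset_index(drop=True)

    # Per-streamline length and point count (same rule as in the question).
    length = s.groupby("num")["abscissa"].max()
    ids = length.index.to_numpy()
    L = length.to_numpy(dtype=float)
    n = ((length / dist_sampling).astype(int) + 1).to_numpy()

    # Place streamline g on [offset[g], offset[g] + L[g]] of a global axis;
    # the extra 1.0 keeps neighbouring intervals from touching.
    offset = np.concatenate(([0.0], np.cumsum(L + 1.0)[:-1]))
    group_pos = np.searchsorted(ids, s["num"].to_numpy())  # group index per input row
    x_global = s["abscissa"].to_numpy(dtype=float) + offset[group_pos]

    # New sample positions: the equivalent of linspace(0, L[g], n[g]) per group.
    g = np.repeat(np.arange(len(ids)), n)      # group index per output row
    start = np.cumsum(n) - n                   # first output row of each group
    k = np.arange(n.sum()) - start[g]          # position inside the group
    denom = np.maximum(n[g] - 1, 1)            # guard against 0/0 when n == 1
    new_abs = np.minimum(L[g] * k / denom, L[g])
    new_global = new_abs + offset[g]

    # One vectorized linear interpolation per coordinate column.
    out = {c: np.interp(new_global, x_global, s[c].to_numpy(dtype=float))
           for c in ["X", "Y", "Z"]}
    out["num"] = ids[g]
    out["abscissa"] = new_abs
    return pd.DataFrame(out, columns=["X", "Y", "Z", "num", "abscissa"])
```

With the names from the code above, this would be used as `df_out = resample_streamlines(stream, dist_sampling)` followed by `df_out.to_csv(fileOUT, index=False)`. Because the new sample points of each streamline stay inside that streamline's own interval, the interpolation should never mix points from two different "num" groups. Results may differ from the loop version in the last few floating-point digits.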

pandas groupby interpolation

0 answers: