In Pandas

Time: 2015-07-24 15:14:28

Tags: python pandas

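I have two dataframes with timestamped data. I want to select all rows where the timestamps of the two dataframes differ by less than some threshold.

For example, dataframes 1 and 2 look like this, except with different, unpredictable clock values: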

   clock      head        px        py        pz        qw         
0      0.000000 -0.316579  0.119198  0.149585  0.271688  0.987492 -0.002514   
1      0.200000 -0.316642  0.119212  0.149593  0.271678  0.987487 -0.002522   
2      1.200000 -0.316546  0.119199  0.149585  0.271669  0.987495 -0.002507   


   clock      head        px        py        pz        qw         
0      0.010000 -0.316579  0.119198  0.149585  0.271688  0.987492 -0.002514   
1      1.1040000 -0.316642  0.119212  0.149593  0.271678  0.987487 -0.002522   
2      2.4030000 -0.316546  0.119199  0.149585  0.271669  0.987495 -0.002507   

Assuming a threshold of 0.1, the resulting dataframe would look like this:

   clock      head1        head2        px1        px2        ...         
0      0.000000 -0.316579 -0.316579  0.119198  0.119198  ...
1      1.200000 -0.316546 -0.316642  0.119199  0.119212  ...

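My current approach is to create an identical "filler" column in both dataframes, merge on that column (which produces a dataframe of length len(dataframe1) * len(dataframe2)), and then filter on the columns I care about: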

# the timestamp column to compare and the tolerance (values taken from the example above)
keyword = 'clock'
threshold = 0.1
# rename the dataframe columns so that they are different
dataframe1.columns = [i + str(1) for i in dataframe1.columns.values]
dataframe1['filler'] = 0
dataframe2.columns = [i + str(2) for i in dataframe2.columns.values]
dataframe2['filler'] = 0
# merge requires a column to merge on, so merge on the filler
df_merged = dataframe1.merge(dataframe2, on='filler', how='left')
# pick out only the rows with the time differences within the threshold
mask = ((df_merged[keyword + str(1)] <= df_merged[keyword + str(2)] + threshold)
        & (df_merged[keyword + str(1)] > df_merged[keyword + str(2)] - threshold))
df_merged = df_merged[mask]
# delete the filler column
del df_merged['filler']
# reindex the dataframe
df_merged = df_merged.reset_index(drop=True)

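This is quite fast and gives me the output I want, but creating a "filler" column that I then have to delete feels silly. I'm wondering whether there is a more obvious solution that I've missed.

Merging directly on the keyword column (the clock) doesn't do what I need: it produces rows with complete data only when the times are exactly equal, with no tolerance for a time difference.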
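For reference, the cross-join-then-filter idea described in the question can be written without a temporary column. This is only a sketch under some assumptions: it relies on how="cross", which DataFrame.merge supports from pandas 1.2 onward; the helper name pairs_within_threshold is made up for illustration; the test frames are cut down to the clock and head columns; and it uses a symmetric abs(difference) < threshold test rather than the question's exact <= / > comparison.

```python
import pandas as pd


def pairs_within_threshold(left, right, key="clock", threshold=0.1):
    """Return every (left, right) row pair whose `key` values differ by less than `threshold`.

    Hypothetical helper, not part of pandas.
    """
    # how="cross" builds the Cartesian product directly (pandas >= 1.2),
    # so no filler column has to be created and deleted afterwards.
    merged = left.merge(right, how="cross", suffixes=("1", "2"))
    # Overlapping column names receive the suffixes, e.g. clock1 / clock2.
    diff = (merged[key + "1"] - merged[key + "2"]).abs()
    return merged[diff < threshold].reset_index(drop=True)


# Reduced version of the example data: only the clock and head columns.
df1 = pd.DataFrame({"clock": [0.0, 0.2, 1.2],
                    "head": [-0.316579, -0.316642, -0.316546]})
df2 = pd.DataFrame({"clock": [0.01, 1.104, 2.403],
                    "head": [-0.316579, -0.316642, -0.316546]})

result = pairs_within_threshold(df1, df2, key="clock", threshold=0.1)
print(result)
```

Like the filler-column version, this still materialises len(df1) * len(df2) rows before filtering, so it is a tidier spelling of the same approach rather than a cheaper one.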

1 answer:

Answer 0: (score: 0)

You can use np.where to change df2's clock column data to match df1 wherever it falls within the threshold for a fuzzy match.

import pandas as pd
import numpy as np

# THE TEST DATA YOU GAVE US -------------------------
columns = ['clock', 'head', 'px', 'py', 'pz', 'qw']
series1 = [(0.0, 0.1, 0.5),
           (-0.316579, -0.316642, -0.316546),
           (0.119198, 0.119212, 0.119199),
           (0.149585, 0.149593, 0.149585),
           (0.271688, 0.271678, 0.271669),
           (0.987492, 0.987487, 0.987495),
           (-0.002514, -0.002522, -0.002507)]
series2 = [(0.01, 0.104, 0.403),
           (-0.316579, -0.316642, -0.316546),
           (0.119198, 0.119212, 0.119199),
           (0.149585, 0.149593, 0.149585),
           (0.271688, 0.271678, 0.271669),
           (0.987492, 0.987487, 0.987495),
           (-0.002514, -0.002522, -0.002507)]
# THE TEST DATA YOU GAVE US -------------------------

df1 = pd.DataFrame(dict(zip(columns, series1)))
df2 = pd.DataFrame(dict(zip(columns, series2)))
threshold = 0.99

# where the clocks are within the threshold, overwrite df2's clock with df1's
df2['clock'] = np.where(
    abs(df1['clock'] - df2['clock']) < threshold, df1['clock'], df2['clock'])
merged_df = df1.merge(df2, on='clock', how='outer')
print(merged_df)

   clock    head_x      px_x      py_x      pz_x      qw_x    head_y      px_y      py_y      pz_y      qw_y
0    0.0 -0.316579  0.119198  0.149585  0.271688  0.987492 -0.316579  0.119198  0.149585  0.271688  0.987492
1    0.1 -0.316642  0.119212  0.149593  0.271678  0.987487 -0.316642  0.119212  0.149593  0.271678  0.987487
2    0.5 -0.316546  0.119199  0.149585  0.271669  0.987495 -0.316546  0.119199  0.149585  0.271669  0.987495

This has the benefit of not merging any rows that do not match within the threshold. So if your DataFrames also contained a row of data where df1['clock'] == 6 and df2['clock'] == 7 (outside the 0.99 threshold), you would end up with two rows: one with clock == 6 whose _y columns are all NaN, and one with clock == 7 whose _x columns are all NaN.
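One caveat about the np.where approach: the subtraction df1['clock'] - df2['clock'] pairs rows by index, so it only finds matches that happen to sit at the same position in both frames. If the clocks are unpredictable, a nearest-key join may be closer to what the question asks for. The sketch below uses pd.merge_asof with direction="nearest" and a tolerance. It is an illustration, not the answer author's method, and it assumes a pandas version that supports the direction argument. The test frames are cut down to the clock and head columns, and the clock2 copy column is added only so the matched right-hand timestamp stays visible. Note that merge_asof keeps at most one match per left row, unlike a cross join, which keeps every pair within the threshold.

```python
import pandas as pd

# Reduced version of the question's example data: only clock and head.
df1 = pd.DataFrame({"clock": [0.0, 0.2, 1.2],
                    "head": [-0.316579, -0.316642, -0.316546]})
df2 = pd.DataFrame({"clock": [0.01, 1.104, 2.403],
                    "head": [-0.316579, -0.316642, -0.316546]})

# merge_asof requires both frames to be sorted by the join key.
left = df1.sort_values("clock")
# Keep a copy of the right-hand clock; the join key itself takes the left values.
right = df2.sort_values("clock").assign(clock2=lambda d: d["clock"])

nearest = pd.merge_asof(
    left,
    right,
    on="clock",
    direction="nearest",   # closest key on either side
    tolerance=0.1,         # rows with no key within this distance get NaN
    suffixes=("1", "2"),   # head -> head1 / head2
)

# Drop the left rows that found no partner within the tolerance.
matched = nearest.dropna(subset=["clock2"]).reset_index(drop=True)
print(matched)
```

With the example clocks, the row at 0.2 has no partner within 0.1 and is dropped, leaving the rows at 0.0 and 1.2, which lines up with the expected result shown in the question.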