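I have two dataframes with timestamped data. I want to select all rows where the timestamps of the two dataframes are within some threshold of each other.
For example, dataframes 1 and 2 both look like this, except with different, unpredictable clock values: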
clock head px py pz qw
0 0.000000 -0.316579 0.119198 0.149585 0.271688 0.987492 -0.002514
1 0.200000 -0.316642 0.119212 0.149593 0.271678 0.987487 -0.002522
2 1.200000 -0.316546 0.119199 0.149585 0.271669 0.987495 -0.002507
clock head px py pz qw
0 0.010000 -0.316579 0.119198 0.149585 0.271688 0.987492 -0.002514
1 1.1040000 -0.316642 0.119212 0.149593 0.271678 0.987487 -0.002522
2 2.4030000 -0.316546 0.119199 0.149585 0.271669 0.987495 -0.002507
Assuming a threshold of 0.1, the resulting dataframe would look like this:
clock head1 head2 px1 px2 ...
0 0.000000 -0.316579 -0.316579 0.119198 0.119198 ...
1 1.200000 -0.316546 -0.316642 0.119199 0.119212 ...
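My current approach is to create an identical "filler" column in both dataframes, merge on that column (which creates a dataframe of length len(dataframe1) * len(dataframe2)), and then filter on the columns I care about: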
#rename the dataframe keys so that they are different
dataframe1.columns = [i+str(1) for i in dataframe1.columns.values]
dataframe1['filler'] = 0
dataframe2.columns = [i+str(2) for i in dataframe2.columns.values]
dataframe2['filler'] = 0
# merge requires a column to merge on, so merge on the filler
df_merged = dataframe1.merge(dataframe2,on='filler',how='left')
#pick out only the rows with the time differences within the threshold
mask = (df_merged[keyword+str(1)]<= df_merged[keyword+str(2)]+threshold) & (df_merged[keyword+str(1)]> df_merged[keyword+str(2)]-threshold)
df_merged = df_merged[mask]
#delete the filler column
del df_merged['filler']
#reindex the dataframe (reset_index avoids needing numpy's arange)
df_merged = df_merged.reset_index(drop=True)
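This is very fast and gives me the output I want, but creating a "filler" column only to delete it afterwards feels silly. I wonder if there is a more obvious solution that I have missed.
Merging on the keyword column does not do what I need: it only produces rows with complete data when the times are exactly equal, with no threshold on the time difference.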
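For reference, the filler-column approach described above can be sketched without the filler column. This is a minimal sketch, assuming pandas 1.2 or later, where merge accepts how="cross"; the function name cross_match and its parameters are hypothetical and only stand in for the keyword and threshold variables used in the code above.

```python
import pandas as pd

def cross_match(df1, df2, key="clock", threshold=0.1):
    """Pair every df1 row with every df2 row, then keep pairs within threshold."""
    # how="cross" builds the len(df1) * len(df2) product directly,
    # so no filler column has to be created and deleted.
    pairs = df1.add_suffix("1").merge(df2.add_suffix("2"), how="cross")
    # Same filter as the original mask: key2 - threshold < key1 <= key2 + threshold
    mask = (pairs[key + "1"] <= pairs[key + "2"] + threshold) & (
        pairs[key + "1"] > pairs[key + "2"] - threshold
    )
    return pairs[mask].reset_index(drop=True)

# Clock and head values taken from the sample tables in the question
frame1 = pd.DataFrame({"clock": [0.0, 0.2, 1.2],
                       "head": [-0.316579, -0.316642, -0.316546]})
frame2 = pd.DataFrame({"clock": [0.01, 1.104, 2.403],
                       "head": [-0.316579, -0.316642, -0.316546]})
cross_result = cross_match(frame1, frame2, threshold=0.1)
print(cross_result)
```

The intermediate product still has len(df1) * len(df2) rows, so memory use is the same as with the filler column; only the bookkeeping goes away.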
Answer 0 (score: 0)
You can use np.where to change the clock column of df2 to match df1 wherever the two values are within your fuzzy-matching threshold, and then merge on clock. Note that the comparison is made row by row, so row i of df1 is only ever compared with row i of df2.

import pandas as pd
import numpy as np

# Test data based on the sample in the question -------------------------
columns = ['clock', 'head', 'px', 'py', 'pz', 'qw']
series1 = [(0.0, 0.1, 0.5),
           (-0.316579, -0.316642, -0.316546),
           (0.119198, 0.119212, 0.119199),
           (0.149585, 0.149593, 0.149585),
           (0.271688, 0.271678, 0.271669),
           (0.987492, 0.987487, 0.987495),
           (-0.002514, -0.002522, -0.002507)]
series2 = [(0.01, 0.104, 0.403),
           (-0.316579, -0.316642, -0.316546),
           (0.119198, 0.119212, 0.119199),
           (0.149585, 0.149593, 0.149585),
           (0.271688, 0.271678, 0.271669),
           (0.987492, 0.987487, 0.987495),
           (-0.002514, -0.002522, -0.002507)]
# ------------------------------------------------------------------------

# zip() stops at the shorter input, so the seventh tuple (which has no
# column name) is dropped.
df1 = pd.DataFrame(dict(zip(columns, series1)))
df2 = pd.DataFrame(dict(zip(columns, series2)))

threshold = 0.99

# Where the row-wise clock difference is below the threshold, overwrite
# df2's clock with df1's clock so that the two rows merge on an exact match.
df2['clock'] = np.where(
    abs(df1['clock'] - df2['clock']) < threshold, df1['clock'], df2['clock'])

merged_df = df1.merge(df2, on='clock', how='outer')
print(merged_df)

   clock    head_x      px_x      py_x      pz_x      qw_x    head_y      px_y      py_y      pz_y      qw_y
0    0.0 -0.316579  0.119198  0.149585  0.271688  0.987492 -0.316579  0.119198  0.149585  0.271688  0.987492
1    0.1 -0.316642  0.119212  0.149593  0.271678  0.987487 -0.316642  0.119212  0.149593  0.271678  0.987487
2    0.5 -0.316546  0.119199  0.149585  0.271669  0.987495 -0.316546  0.119199  0.149585  0.271669  0.987495

The advantage of this approach is that rows that do not match within the threshold are not merged together. For example, if your DataFrames also contained rows with df1['clock'] == 6 and df2['clock'] == 7 (outside the 0.99 threshold), you would end up with two rows: one with clock == 6 whose _y columns are all NaN, and one with clock == 7 whose _x columns are all NaN.
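If the rows of the two frames do not line up position by position, one possible alternative is pd.merge_asof with direction="nearest" and a tolerance. The following is a minimal sketch rather than the method from this answer: the helper name match_within_threshold and the extra clock_2 column are hypothetical, both clock columns are assumed to be floats, and unlike the full cross product in the question, merge_asof pairs each df1 row with at most one df2 row (the nearest one).

```python
import pandas as pd

def match_within_threshold(df1, df2, key="clock", threshold=0.1):
    """Pair each df1 row with the nearest df2 row whose key is within threshold."""
    # merge_asof requires both frames to be sorted by the key column.
    left = df1.sort_values(key).reset_index(drop=True)
    right = df2.sort_values(key).reset_index(drop=True)
    # Keep a copy of df2's own timestamp; the merged key column holds df1's value.
    right = right.assign(**{key + "_2": right[key]})
    merged = pd.merge_asof(left, right, on=key, direction="nearest",
                           tolerance=threshold, suffixes=("1", "2"))
    # Rows of df1 with no df2 neighbour inside the tolerance come back as NaN.
    return merged.dropna(subset=[key + "_2"]).reset_index(drop=True)

# Clock and head values taken from the sample tables in the question
sample1 = pd.DataFrame({"clock": [0.0, 0.2, 1.2],
                        "head": [-0.316579, -0.316642, -0.316546]})
sample2 = pd.DataFrame({"clock": [0.01, 1.104, 2.403],
                        "head": [-0.316579, -0.316642, -0.316546]})
asof_result = match_within_threshold(sample1, sample2, threshold=0.1)
print(asof_result)
```

This avoids building the len(df1) * len(df2) intermediate frame, at the cost of returning only the closest match per row.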