I have two pandas dataframes, df1 and df2. Both contain time series data.
df1
Event Number Timestamp_A
A 1 7:00
A 2 8:00
A 3 9:00
df2
Event Number Timestamp_B
B 1 9:01
B 2 8:01
B 3 7:01
Basically, I want to determine which Event B is closest to each Event A and assign it correctly.
So I need to subtract every Timestamp_B in df2 from each Timestamp_A in df1, row by row. This yields a series of values, from which I want to take the minimum and put it into a new column in df1:
Event Number Timestamp_A Closest_Timestamp_B
A 1 7:00 7:01
A 2 8:00 8:01
A 3 9:00 9:01
I am not familiar with row-by-row operations in pandas. When I do:
for index, row in df1.iterrows():
    s = df1.Timestamp_A.values - df2["Timestamp_B"][:]
    Closest_Timestamp_B = s.min()
the result I get is a ValueError.
How do I correctly perform a row-by-row comparison between two pandas dataframes?
Answer 0 (score: 1)
There may be a better way to do this, but here is one way:
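import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Event': ['A', 'A', 'A'], 'Number': [1, 2, 3],
                    'Timestamp_A': ['7:00', '8:00', '9:00']})
df2 = pd.DataFrame({'Event': ['B', 'B', 'B'], 'Number': [1, 2, 3],
                    'Timestamp_B': ['7:01', '8:01', '9:01']})
# parse the B timestamps once, outside the loop
times_b = pd.to_datetime(df2.Timestamp_B)
closest = []
for index, row in df1.iterrows():
    # absolute time difference between this A and every B
    diffs = (times_b - pd.to_datetime(row.Timestamp_A)).abs()
    # argmin gives a position, so use .iloc to fetch the original string
    closest.append(df2.Timestamp_B.iloc[diffs.values.argmin()])
# assign the whole column at once instead of chained .iloc assignment
df1['Closest_timestamp_B'] = closest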
df1
Event Number Timestamp_A Closest_timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01
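Answer 1 (score: 1)
Your best bet is to use the underlying numpy data structures to build a matrix of Timestamp_A by Timestamp_B differences. Since you need to compare every event in A with every event in B, this is an O(N^2) computation, which suits a matrix well.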
import pandas as pd
import numpy as np
df1 = pd.DataFrame([['A', 1, '7:00'],
                    ['A', 2, '8:00'],
                    ['A', 3, '9:00']], columns=['Event', 'Number', 'Timestamp_A'])
df2 = pd.DataFrame([['B', 1, '9:01'],
                    ['B', 2, '8:01'],
                    ['B', 3, '7:01']], columns=['Event', 'Number', 'Timestamp_B'])
df1.Timestamp_A = pd.to_datetime(df1.Timestamp_A)
df2.Timestamp_B = pd.to_datetime(df2.Timestamp_B)
# create a matrix with the index of df1 as the row index, and the index
# of df2 as the column index
M = df1.Timestamp_A.values.reshape((len(df1), 1)) - df2.Timestamp_B.values
# for each row (each event A), use argmin along the columns to find the
# position of the smallest difference (after abs())
index_of_B = np.abs(M).argmin(axis=1)
# index the raw values so the result is assigned by position rather than
# re-aligned on df2's index
df1['Closest_timestamp_B'] = df2.Timestamp_B.values[index_of_B]
df1
# returns:
Event Number Timestamp_A Closest_timestamp_B
0 A 1 2017-07-05 07:00:00 2017-07-05 07:01:00
1 A 2 2017-07-05 08:00:00 2017-07-05 08:01:00
2 A 3 2017-07-05 09:00:00 2017-07-05 09:01:00
If you want to return to the original timestamp format, you can use:
df1.Timestamp_A = df1.Timestamp_A.dt.strftime('%H:%M').str.replace(r'^0', '', regex=True)
df1.Closest_timestamp_B = df1.Closest_timestamp_B.dt.strftime('%H:%M').str.replace(r'^0', '', regex=True)
df1
# returns:
Event Number Timestamp_A Closest_timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01
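To make the shape of the difference matrix and the choice of axis easier to see, here is a minimal sketch with unequal lengths (three A times, four B times). The extra 8:20 entry and the variable names `a`, `b`, and `nearest_pos` are assumptions added for illustration, as is the explicit `format="%H:%M"`.

```python
import numpy as np
import pandas as pd

# Illustrative data: three A times and four B times (8:20 is made up)
a = pd.to_datetime(pd.Series(["7:00", "8:00", "9:00"]), format="%H:%M")
b = pd.to_datetime(pd.Series(["9:01", "8:01", "7:01", "8:20"]), format="%H:%M")

# Shape (3, 1) minus shape (4,) broadcasts to a (3, 4) matrix:
# one row per A time, one column per B time
M = a.values[:, None] - b.values

# argmin along axis=1 picks, for each row (A), the column (B) with the
# smallest absolute difference
nearest_pos = np.abs(M).argmin(axis=1)
nearest_b = b.values[nearest_pos]

print(M.shape)
print(nearest_pos)
```

Using `axis=0` here would instead return four positions, one per B time, which is the nearest A for each B rather than the nearest B for each A.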
Answer 2 (score: 1)
How about using merge_asof to get the closest events?
Make sure your data types are correct:
df1.Timestamp_A = pd.to_datetime(df1.Timestamp_A)
df2.Timestamp_B = pd.to_datetime(df2.Timestamp_B)
Sort by time:
df1.sort_values('Timestamp_A', inplace=True)
df2.sort_values('Timestamp_B', inplace=True)
Now you can merge the two dataframes on the closest time:
df3 = pd.merge_asof(df2, df1,
                    left_on='Timestamp_B',
                    right_on='Timestamp_A',
                    direction='nearest',  # the default, 'backward', only looks at earlier rows
                    suffixes=('_df2', '_df1'))
# clean up the datetime formats by keeping only the time of day
for col in ['Timestamp_A', 'Timestamp_B']:
    df3[col] = df3[col].dt.time
# reverse the column order so the df1 columns come first
df3 = df3.iloc[:, ::-1]
print(df3)
Timestamp_A Number_df1 Event_df1 Timestamp_B Number_df2 Event_df2
0 07:00:00 1 A 07:01:00 3 B
1 08:00:00 2 A 08:01:00 2 B
2 09:00:00 3 A 09:01:00 1 B
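If the goal is to keep df1 as the base table and only attach the nearest Timestamp_B, the merge can be run the other way round. The following is a rough sketch under a few assumptions that are not in the answer above: the `format="%H:%M"` parsing, the 30-minute `tolerance`, and the names `left`, `right`, and `out` are all illustrative choices.

```python
import pandas as pd

df1 = pd.DataFrame({"Event": ["A"] * 3, "Number": [1, 2, 3],
                    "Timestamp_A": pd.to_datetime(["7:00", "8:00", "9:00"], format="%H:%M")})
df2 = pd.DataFrame({"Event": ["B"] * 3, "Number": [1, 2, 3],
                    "Timestamp_B": pd.to_datetime(["9:01", "8:01", "7:01"], format="%H:%M")})

# merge_asof requires both key columns to be sorted
left = df1.sort_values("Timestamp_A")
right = df2.sort_values("Timestamp_B")[["Timestamp_B"]]

# direction="nearest" looks both backward and forward in time;
# tolerance (assumed here to be 30 minutes) leaves NaT when nothing is close enough
out = pd.merge_asof(left, right,
                    left_on="Timestamp_A", right_on="Timestamp_B",
                    direction="nearest", tolerance=pd.Timedelta("30min"))

# Replace the merged key with a formatted string column
out["Closest_Timestamp_B"] = out.pop("Timestamp_B").dt.strftime("%H:%M")
print(out)
```

Because only the Timestamp_B column of df2 is passed in, no suffixes are needed and the df1 columns keep their original names.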
Answer 3 (score: 0)
Use apply to compare Timestamp_A on each row with every Timestamp_B, take the index of the row with the smallest difference, and then use that index to extract Timestamp_B.
df1['Closest_Timestamp_B'] = (
    df1.apply(lambda x: abs(pd.to_datetime(x.Timestamp_A).value -
                            df2.Timestamp_B.apply(lambda x: pd.to_datetime(x).value))
              .idxmin(), axis=1)
    .apply(lambda x: df2.Timestamp_B.loc[x])
)
df1
Out[271]:
Event Number Timestamp_A Closest_Timestamp_B
0 A 1 7:00 7:01
1 A 2 8:00 8:01
2 A 3 9:00 9:01
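The same apply-plus-idxmin idea can be written so that each column is parsed only once rather than once per row. This is just a sketch of that variation: the `format="%H:%M"` argument and the names `times_a`, `times_b`, and `closest_idx` are assumptions introduced here, and the data is the df1 and df2 from the question.

```python
import pandas as pd

df1 = pd.DataFrame({"Event": ["A"] * 3, "Number": [1, 2, 3],
                    "Timestamp_A": ["7:00", "8:00", "9:00"]})
df2 = pd.DataFrame({"Event": ["B"] * 3, "Number": [1, 2, 3],
                    "Timestamp_B": ["9:01", "8:01", "7:01"]})

# Parse both columns once, up front
times_a = pd.to_datetime(df1["Timestamp_A"], format="%H:%M")
times_b = pd.to_datetime(df2["Timestamp_B"], format="%H:%M")

# For each A time, idxmin returns the df2 index label with the smallest gap
closest_idx = times_a.apply(lambda t: (times_b - t).abs().idxmin())

# Look up the original strings by label, then assign by position
df1["Closest_Timestamp_B"] = df2["Timestamp_B"].loc[closest_idx.tolist()].to_numpy()
print(df1)
```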