How do I correctly perform all-vs-all row-by-row comparisons between Series in two pandas DataFrames?

Asked: 2017-07-05 12:04:14

Tags: python pandas

I have two pandas DataFrames, df1 and df2. Both contain time series data.

df1

Event   Number  Timestamp_A
A       1       7:00
A       2       8:00
A       3       9:00

df2

Event   Number  Timestamp_B
B       1       9:01
B       2       8:01
B       3       7:01

Basically, I want to determine which event B is closest to each event A and assign it correctly.

So I need to subtract every Timestamp_B in df2 from each Timestamp_A in df1, row by row. This yields a series of values, of which I want to take the minimum and put it into a new column in df1:

Event   Number  Timestamp_A Closest_Timestamp_B
A       1       7:00        7:01
A       2       8:00        8:01
A       3       9:00        9:01

I am not familiar with row-by-row operations in pandas. When I run:

for index, row in df1.iterrows():
    s = df1.Timestamp_A.values - df2["Timestamp_B"][:]
    Closest_Timestamp_B = s.min()

I get a ValueError.

How do I correctly perform row-by-row comparisons between two pandas DataFrames?

4 answers:

Answer 0 (score: 1)

There is probably a better way to do this, but here is one way:

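import pandas as pd

df1 = pd.DataFrame({'Event':['A','A','A'],'Number':[1,2,3],
                   'Timestamp_A':['7:00','8:00','9:00']})
df2 = pd.DataFrame({'Event':['B','B','B'],'Number':[1,2,3],
                   'Timestamp_B':['7:01','8:01','9:01']})
# start with an empty string column so the matched strings fit its dtype
df1['Closest_timestamp_B'] = ''
# parse Timestamp_B once instead of on every iteration
times_b = pd.to_datetime(df2.Timestamp_B, format='%H:%M')
for index, row in df1.iterrows():
    # label of the df2 row with the smallest absolute time difference
    closest = (times_b - pd.to_datetime(row.Timestamp_A, format='%H:%M')).abs().idxmin()
    # assign through .loc to avoid chained assignment
    df1.loc[index, 'Closest_timestamp_B'] = df2.Timestamp_B.loc[closest]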

df1
  Event  Number Timestamp_A Closest_timestamp_B
0     A       1        7:00                7:01
1     A       2        8:00                8:01
2     A       3        9:00                9:01

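Answer 1 (score: 1)

Your best bet is to use the underlying numpy data structures to build a Timestamp_A by Timestamp_B matrix. Since every event in A has to be compared with every event in B, this is an O(N^2) computation, which suits a matrix well.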

import pandas as pd
import numpy as np

df1 = pd.DataFrame([['A',1,'7:00'],
    ['A',2,'8:00'],
    ['A',3,'9:00']], columns=['Event', 'Number', 'Timestamp_A'])

df2 = pd.DataFrame([['B',1,'9:01'],
    ['B',2,'8:01'],
    ['B',3,'7:01']], columns=['Event', 'Number', 'Timestamp_B'])

df1.Timestamp_A = pd.to_datetime(df1.Timestamp_A)
df2.Timestamp_B = pd.to_datetime(df2.Timestamp_B)

# create a matrix with the index of df1 as the row index, and the index
# of df2 as the column index
M = df1.Timestamp_A.values.reshape((len(df1),1)) - df2.Timestamp_B.values

# use argmin along the columns (axis=1) to find, for each row of df1,
# the position in df2 of the lowest value (after abs())
index_of_B = np.abs(M).argmin(axis=1)

# use .values so the result is assigned by position, not aligned on df2's index
df1['Closest_timestamp_B'] = df2.Timestamp_B.values[index_of_B]

df1
# returns:
  Event  Number         Timestamp_A  Closest_timestamp_B
0     A       1 2017-07-05 07:00:00  2017-07-05 07:01:00
1     A       2 2017-07-05 08:00:00  2017-07-05 08:01:00
2     A       3 2017-07-05 09:00:00  2017-07-05 09:01:00

To return the timestamps to their original format, you can use:

df1.Timestamp_A = df1.Timestamp_A.dt.strftime('%H:%M').str.replace(r'^0', '', regex=True)
df1.Closest_timestamp_B = df1.Closest_timestamp_B.dt.strftime('%H:%M').str.replace(r'^0', '', regex=True)

df1
# returns:
  Event  Number Timestamp_A Closest_timestamp_B
0     A       1        7:00                7:01
1     A       2        8:00                8:01
2     A       3        9:00                9:01
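
When the two frames have different lengths, the direction of the reduction matters. The following is only a minimal sketch of the same broadcasting idea, assuming the timestamps are 'H:MM' strings; `closest_b` is a hypothetical helper name, not part of pandas or numpy.

```python
import numpy as np
import pandas as pd

def closest_b(times_a, times_b):
    """For each entry of times_a, return the entry of times_b closest in time.

    Hypothetical helper for illustration; both arguments are lists of 'H:MM' strings.
    """
    a = pd.to_datetime(pd.Series(times_a), format='%H:%M').to_numpy()
    b = pd.to_datetime(pd.Series(times_b), format='%H:%M').to_numpy()
    # Shape (len(a), len(b)): rows follow times_a, columns follow times_b
    diff = np.abs(a[:, None] - b[None, :])
    # Reduce across the columns (axis=1) to get one position in b per row of a
    pos = diff.argmin(axis=1)
    return [times_b[i] for i in pos]

# Three A events against four B events
result = closest_b(['7:00', '8:00', '9:00'], ['9:01', '8:01', '7:01', '8:40'])
print(result)
```

The intermediate matrix holds len(a) * len(b) differences, so memory use grows quadratically. That is fine for small frames but may become a concern for very long series.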

Answer 2 (score: 1)

How about using merge_asof to get the closest event?

First make sure your data types are correct:

df1.Timestamp_A = df1.Timestamp_A.apply(pd.to_datetime)
df2.Timestamp_B = df2.Timestamp_B.apply(pd.to_datetime)

Sort by time:

df1.sort_values('Timestamp_A', inplace=True)
df2.sort_values('Timestamp_B', inplace=True)

Now you can merge the two DataFrames on the closest time:

df3 = pd.merge_asof(df2, df1,
                left_on='Timestamp_B',
                right_on='Timestamp_A',
                suffixes=('_df2', '_df1'),
                direction='nearest')  # the default 'backward' only looks at earlier times
#clean up the datetime formats (keep only the time of day)
df3[['Timestamp_A', 'Timestamp_B']] = df3[['Timestamp_A', 'Timestamp_B']] \
                                          .apply(lambda col: col.dt.time)
#put df1 columns on the right
df3 = df3.iloc[:,::-1]

print(df3)
  Timestamp_A  Number_df1 Event_df1 Timestamp_B  Number_df2 Event_df2
0    07:00:00           1         A    07:01:00           3         B
1    08:00:00           2         A    08:01:00           2         B
2    09:00:00           3         A    09:01:00           1         B
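
If the goal is the layout asked for in the question (one row per event A plus a Closest_Timestamp_B column), df1 can be used as the left frame instead. This is a rough sketch under that assumption; the temporary `key` column and the names `left`, `right` and `matched` are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({'Event': ['A', 'A', 'A'], 'Number': [1, 2, 3],
                    'Timestamp_A': ['7:00', '8:00', '9:00']})
df2 = pd.DataFrame({'Event': ['B', 'B', 'B'], 'Number': [1, 2, 3],
                    'Timestamp_B': ['9:01', '8:01', '7:01']})

# merge_asof needs sorted datetime-like (or numeric) keys, so build a
# temporary parsed 'key' column on both sides and sort by it
left = (df1.assign(key=pd.to_datetime(df1.Timestamp_A, format='%H:%M'))
           .sort_values('key'))
right = (df2[['Timestamp_B']]
         .assign(key=pd.to_datetime(df2.Timestamp_B, format='%H:%M'))
         .sort_values('key'))

# direction='nearest' considers both earlier and later B timestamps
matched = (pd.merge_asof(left, right, on='key', direction='nearest')
             .drop(columns='key')
             .rename(columns={'Timestamp_B': 'Closest_Timestamp_B'}))
print(matched)
```

Because only the parsed key is used for matching, the original strings in Timestamp_A and Timestamp_B are carried through unchanged.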

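Answer 3 (score: 0)

Use apply to compare the Timestamp_A of each row against all Timestamp_B values, take the index of the row with the minimum difference, then use that index to extract Timestamp_B.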

df1['Closest_Timestamp_B'] = (
    df1.apply(lambda x: abs(pd.to_datetime(x.Timestamp_A).value - 
                            df2.Timestamp_B.apply(lambda x: pd.to_datetime(x).value))
                            .idxmin(),axis=1)
       .apply(lambda x: df2.Timestamp_B.loc[x])       
)

df1
Out[271]: 
  Event  Number Timestamp_A Closest_Timestamp_B
0     A       1        7:00                7:01
1     A       2        8:00                8:01
2     A       3        9:00                9:01